How does Marfeel extract metadata from articles?

Marfeel is a visual representation of how Googlebot sees your site. When a user visits a page and triggers an event to Marfeel, the Editorial crawler crawls the page and detects, extracts, and audits all the metadata of the canonical URL, including fields like:

  • Publication date
  • Title
  • Main image
  • Author(s)
  • Section(s)

Once the canonical URL of a page has been successfully registered, it is not crawled again. All the page’s information always comes from the canonical URL.

Compass analyzes only the canonical version of a page to retrieve its information. However, the crawler looks at every page of your site that receives visits and looks up its canonical URL each time.
Google AMP pages, Facebook Instant Articles, and native apps must all have a valid canonical link pointing to the original HTML page. Compass never extracts information directly from those kinds of pages.
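
As an illustration of the canonical lookup described above, here is a minimal sketch in Python that reads the canonical link from a page's HTML. It is not Marfeel's implementation; the names CanonicalFinder and find_canonical are hypothetical, and it assumes the canonical URL is declared with a standard link element.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attributes.get("href"):
            self.canonical = attributes["href"]


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the page, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()
    return finder.canonical
```

An AMP page that declares `<link rel="canonical" href="https://mywebsite.com/article.html">` would resolve to that URL, while a page without the link yields None.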

Publication date

The publication date is extracted from several possible sources, checked in the following order. Note that if a page declares multiple publication dates, no publication date is extracted.

  1. JSON-LD (for more details, see datePublished - Schema.org Property)
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "datePublished": "2021-08-01"
}
</script>
  2. Meta tag article:published_time
<meta property="article:published_time" content="2021-08-01T17:41:45+00:00" />
  3. Meta tag with the datePublished item property
<meta itemprop="datePublished" content="2021-08-01" id="date">
  4. Time element with the datePublished item property, read from the datetime attribute
<time itemprop="datePublished" datetime="2021-08-01T09:00Z">
  5. Time element with the datePublished item property, read from the content attribute
<time itemprop="datePublished" content="2021-08-01T09:00Z">
  6. Time element with the datePublished item property, read from the node value
<time itemprop="datePublished">2021-08-01T09:00Z</time>
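
The chain above can be sketched roughly as follows. This is an illustrative Python outline, not the crawler's actual code: the class DateSources, the source labels in PRIORITY, and the function extract_publication_date are all hypothetical names. It assumes the first source in the list that yields a value wins, and it does not model the multiple-dates rule, because the text does not say how duplicates are detected.

```python
import json
from html.parser import HTMLParser
from typing import Optional

# Source labels in the order the list above gives them (labels are hypothetical).
PRIORITY = ["json_ld", "meta_article", "meta_itemprop",
            "time_datetime", "time_content", "time_text"]


class DateSources(HTMLParser):
    """Collects the first candidate publication date seen for each source."""

    def __init__(self) -> None:
        super().__init__()
        self.found = {}
        self._in_jsonld = False
        self._in_time = False
        self._buffer = []

    def _record(self, source, value):
        value = value.strip() if isinstance(value, str) else ""
        if value and source not in self.found:
            self.found[source] = value

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and a.get("type") == "application/ld+json":
            self._in_jsonld, self._buffer = True, []
        elif tag == "meta" and a.get("property") == "article:published_time":
            self._record("meta_article", a.get("content"))
        elif tag == "meta" and a.get("itemprop") == "datePublished":
            self._record("meta_itemprop", a.get("content"))
        elif tag == "time" and a.get("itemprop") == "datePublished":
            self._record("time_datetime", a.get("datetime"))
            self._record("time_content", a.get("content"))
            self._in_time, self._buffer = True, []

    def handle_data(self, data):
        if self._in_jsonld or self._in_time:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        text = "".join(self._buffer)
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                doc = json.loads(text)
            except ValueError:
                return  # ignore malformed JSON-LD blocks
            if isinstance(doc, dict):
                self._record("json_ld", doc.get("datePublished"))
        elif tag == "time" and self._in_time:
            self._in_time = False
            self._record("time_text", text)


def extract_publication_date(html: str) -> Optional[str]:
    """Return the date from the highest-priority source present, else None."""
    sources = DateSources()
    sources.feed(html)
    sources.close()
    for name in PRIORITY:
        if name in sources.found:
            return sources.found[name]
    return None
```

The sketch returns the raw string it finds; parsing it into a date and validating its format is left out on purpose.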

Title

The title is extracted from the HTML title tag:

<title>Article title</title>

Main image

The main article image is extracted either from JSON-LD data (for more details, see image - Schema.org Property):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "image": "mainImage.jpg"
}
</script>

Or from the Open Graph og:image meta tag:

<meta property="og:image" content="https://mywebsite.com/images/mainImage.jpg" />
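
A small Python sketch of this lookup, working on already-parsed inputs, might look like the following. The function names are hypothetical, and preferring JSON-LD over og:image is an assumption: the text lists both sources without stating which one wins. The list and ImageObject shapes are the ones Schema.org allows for the image property.

```python
from typing import Optional


def image_from_json_ld(value) -> Optional[str]:
    """Normalise the shapes Schema.org allows for `image` to one URL string."""
    if isinstance(value, list):
        value = value[0] if value else None  # take the first image of a list
    if isinstance(value, dict):
        value = value.get("url")             # an ImageObject carries its URL in `url`
    return value if isinstance(value, str) and value else None


def extract_main_image(json_ld: dict, meta: dict) -> Optional[str]:
    """Try the JSON-LD image first, then og:image (this order is an assumption)."""
    return image_from_json_ld(json_ld.get("image")) or meta.get("og:image") or None
```

Here json_ld stands for the parsed JSON-LD object and meta for a dictionary of meta tag values keyed by property name.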

Author(s)

Authors are extracted from multiple tags in the following order.

  1. Meta tag mrf:authors
<meta property="mrf:authors" content="Author One;Author Two">
  2. JSON-LD (for more details, see author - Schema.org Property). The author can be a list of Person objects:
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "author": [
    {
      "@type":"Person",
      "name":"Author One"
    },
    {
      "@type":"Person",
      "name":"Author Two"
    }
  ]
}
</script>
     The author can also be a plain string:
<script type="application/ld+json">
{
  "author": "Author One"
}
</script>
  3. Meta tag article:author
<meta property="article:author" content="Author One">
  4. Meta tag name="author"
<meta name="author" content="Author One">
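
The author lookup order can be sketched as a simple fallback chain. This Python outline is illustrative only: the function names are hypothetical, meta is assumed to be a dictionary of meta tag values keyed by property or name, json_ld is the parsed JSON-LD object, and stopping at the first source that yields a name is an assumption about what "in the following order" means.

```python
from typing import List


def split_authors(value) -> List[str]:
    """Split the semicolon-separated mrf:authors value into names."""
    if not isinstance(value, str):
        return []
    return [part.strip() for part in value.split(";") if part.strip()]


def single_author(value) -> List[str]:
    """Wrap a single-author meta value in a list, dropping empty values."""
    return [value.strip()] if isinstance(value, str) and value.strip() else []


def authors_from_json_ld(author) -> List[str]:
    """Accept a plain string, a Person object, or a list mixing both."""
    items = author if isinstance(author, list) else [author]
    names = []
    for item in items:
        if isinstance(item, dict):
            item = item.get("name")
        if isinstance(item, str) and item.strip():
            names.append(item.strip())
    return names


def extract_authors(meta: dict, json_ld: dict) -> List[str]:
    """Return the names from the first source in the list that provides any."""
    candidates = [
        split_authors(meta.get("mrf:authors")),
        authors_from_json_ld(json_ld.get("author")),
        single_author(meta.get("article:author")),
        single_author(meta.get("author")),
    ]
    for names in candidates:
        if names:
            return names
    return []
```

With both mrf:authors and name="author" present, the sketch returns the mrf:authors names and ignores the later tag.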

Section(s)

Sections are extracted from multiple tags in the following order.

  1. Meta tag mrf:sections
<meta property="mrf:sections" content="Parent section;Child Section">
  2. JSON-LD (for more details, see articleSection - Schema.org Property)
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "articleSection": "News"
}
</script>
  3. Meta tag article:section
<meta property="article:section" content="Parent section;Child Section">

Tag(s)

Tags are extracted from several kinds of markup:

  1. Meta tag mrf:tags
<meta property="mrf:tags" content="tagGroup1:tag_name;tagGroup2:another_tag_name" />
  2. Meta tag keywords
<meta name="keywords" content="tag1, tag2, tag3">
  3. Keywords in structured data (JSON-LD)
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "keywords": "tag1, tag2, tag3"
}
</script>
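
The two value formats shown above (semicolon-separated group:name pairs for mrf:tags, comma-separated words for keywords) can be parsed along these lines. This is a hedged Python sketch based only on the examples; the Tag type and both function names are hypothetical, and how the crawler combines the three sources is not covered because the text does not specify it.

```python
from typing import List, Optional, Tuple

# (group, name); group is None when the tag has no group prefix.
Tag = Tuple[Optional[str], str]


def parse_mrf_tags(content: str) -> List[Tag]:
    """Parse 'group:name;group:name' as used in the mrf:tags example."""
    tags: List[Tag] = []
    for item in content.split(";"):
        item = item.strip()
        if not item:
            continue
        group, separator, name = item.partition(":")
        if separator:
            tags.append((group.strip(), name.strip()))
        else:
            tags.append((None, item))  # no group prefix present
    return tags


def parse_keywords(content: str) -> List[Tag]:
    """Parse a comma-separated keywords string into ungrouped tags."""
    return [(None, word.strip()) for word in content.split(",") if word.strip()]
```

The same parse_keywords helper would apply to both the keywords meta tag and a JSON-LD keywords string, since the examples use the same comma-separated form.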

In case of error

If the canonical URL of a page is not accessible, the crawler automatically retries extracting information from the page one hour later.

If the canonical URL is still not available after 10 attempts, the crawler does not attempt to read that page again.
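
The retry policy can be expressed as a small scheduling rule. The Python sketch below is an assumption-laden illustration: the function name next_attempt_time is hypothetical, it assumes the first crawl counts as attempt 1, and it assumes each retry is scheduled one hour after the previous attempt.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_ATTEMPTS = 10                 # from the text: stop after 10 attempts
RETRY_DELAY = timedelta(hours=1)  # from the text: retry one hour later


def next_attempt_time(attempts_made: int, last_attempt: datetime) -> Optional[datetime]:
    """Return when to retry, or None once the attempt budget is used up."""
    if attempts_made >= MAX_ATTEMPTS:
        return None
    return last_attempt + RETRY_DELAY
```

For example, after a first failed attempt at 09:00 the next attempt would be scheduled for 10:00, and after the tenth failed attempt the function returns None.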