How Marfeel extracts metadata from articles

Marfeel gives you a visual representation of how Googlebot sees your site. When a user visits a page and triggers an event to Marfeel, the Editorial crawler crawls the page and detects, extracts, and audits all the metadata of the canonical URL, including fields such as:

  • Publication date,
  • Title,
  • Main image,
  • Author(s),
  • Section(s)

The canonical URL of a page is crawled periodically, based on its update date.

Marfeel always tracks the traffic of all URLs on a site, whether or not they have valid structured data. URLs are only enriched with the information of their canonicals when the canonical has valid structured data.

Google AMP pages, Facebook Instant Articles, and Native apps must all have a valid canonical link pointing to the original HTML page. Compass never extracts information directly from AMP, FBIA or native applications.
Correctly placed canonical tags in the form of <link rel="canonical" href="{{my-canonical-url}}"/> are essential for both SEO and Marfeel's understanding of a site. If they are missing or incorrect, editorial metadata won't be correctly assigned to each article.
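
To check that a page exposes the canonical link described above, a small script can look for it in the HTML. The following is a minimal sketch using only the Python standard library; the names CanonicalFinder and find_canonical are hypothetical, and this is not Marfeel's crawler code.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel holds space-separated tokens; compare them case-insensitively.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attributes.get("href"):
            self.canonical = attributes["href"]


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the page, or None if it is missing."""
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()
    return finder.canonical
```

A result of None for an AMP or article page signals the situation described above, where editorial metadata cannot be assigned to the article.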

Title

The title is extracted from the following sources, in order of precedence.

  1. JSON-LD
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Title of a News Article"
}
</script>
  2. Microdata
<div itemprop="headline">Title of News Article</div>
  3. <title> tag. This is a weak fallback, since most sites add the site name to the page title.
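
The fallback order above can be sketched as follows. This is an illustration under stated assumptions, not Marfeel's implementation: the names TitleSources, headline_from_json_ld and extract_title are hypothetical, and the Microdata headline is assumed to sit on an element with plain text content and no nested markup.

```python
import json
from html.parser import HTMLParser


class TitleSources(HTMLParser):
    """Collects the first title candidate found for each of the three sources."""

    def __init__(self):
        super().__init__()
        self.found = {}        # source name -> first title found
        self._source = None    # source currently being captured
        self._end_tag = None   # tag name that closes the current capture
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._source is not None:
            return
        attributes = dict(attrs)
        if tag == "script" and attributes.get("type") == "application/ld+json":
            self._source = "json_ld"
        elif attributes.get("itemprop") == "headline":
            self._source = "microdata"
        elif tag == "title":
            self._source = "title_tag"
        else:
            return
        self._end_tag, self._buffer = tag, []

    def handle_data(self, data):
        if self._source is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._source is None or tag != self._end_tag:
            return
        text = "".join(self._buffer).strip()
        if self._source == "json_ld":
            text = headline_from_json_ld(text)
        if text:
            self.found.setdefault(self._source, text)
        self._source = None


def headline_from_json_ld(raw):
    """Return the headline of a JSON-LD document, or None if absent or invalid."""
    try:
        data = json.loads(raw)
    except ValueError:
        return None
    items = data if isinstance(data, list) else [data]
    for item in items:
        if isinstance(item, dict) and isinstance(item.get("headline"), str):
            return item["headline"].strip()
    return None


def extract_title(html):
    """Apply the chain: JSON-LD first, then Microdata, then the <title> tag."""
    sources = TitleSources()
    sources.feed(html)
    sources.close()
    for name in ("json_ld", "microdata", "title_tag"):
        if name in sources.found:
            return sources.found[name]
    return None
```

When only the <title> tag is present, the function returns its full text, site name included, which shows why that fallback is the least precise of the three.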

Author & Bylines

An Author is the person responsible for creating a piece of content, such as an article, a blog post, or any other form of written material. It normally matches the author who created the content in the CMS.

A Byline is the public text that acknowledges and identifies the author(s) responsible for creating an article. It normally matches the public byline declared in the structured data.

As an example, the Author John Smith might publish articles using these different bylines:

  1. John Smith
  2. JS
  3. J.S.
  4. John S.
  5. J. Smith
  6. Generic Editorial byline

Marfeel automatically tracks the byline of an article based on its public structured data. It also allows tracking the Author who wrote the article.

  1. JSON-LD (for more details, see the author property on Schema.org)
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "author": [
    {
      "@type":"Person",
      "name":"Author One"
    },
    {
      "@type":"Person",
      "name":"Author Two"
    }
  ]
}
</script>
The author can also be given as a plain string:
<script type="application/ld+json">
{
  "author": "Author One"
}
</script>
  2. Meta tag article:author
<meta property="article:author" content="Author One">
  3. Meta tag name="author"
<meta name="author" content="Author One">
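
The JSON-LD examples above show the author property in two shapes: a list of Person objects and a plain string. A reader of that data has to accept both. The sketch below, with the hypothetical helper name author_names, normalizes either shape into a list of bylines; it is an illustration, not Marfeel's code.

```python
import json


def author_names(json_ld_text):
    """Return the bylines declared in a JSON-LD document's author property."""
    data = json.loads(json_ld_text)
    author = data.get("author")
    if author is None:
        return []
    # A single author may appear on its own rather than inside a list.
    entries = author if isinstance(author, list) else [author]
    names = []
    for entry in entries:
        # Each entry is either an object with a "name" or a plain string.
        name = entry.get("name") if isinstance(entry, dict) else entry
        if isinstance(name, str) and name.strip():
            names.append(name.strip())
    return names
```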

Section(s)

Sections are extracted from multiple tags in the following order.

  1. Meta tag mrf:sections
<meta property="mrf:sections" content="Parent section;Child Section">
  2. JSON-LD (for more details, see the articleSection property on Schema.org)
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "articleSection": "News"
}
</script>
  3. Meta tag article:section
<meta property="article:section" content="Parent section;Child Section">
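
The section chain above can be sketched as a function that returns the first source yielding at least one section. The helper names split_sections and pick_sections are hypothetical. Treating the semicolon as the separator between parent and child sections is an assumption drawn from the example values above.

```python
def split_sections(raw):
    """Turn a raw section value into a clean list of section names."""
    if raw is None:
        return []
    # Assumption: ";" separates sections inside a single string value.
    values = raw if isinstance(raw, list) else str(raw).split(";")
    return [v.strip() for v in values if isinstance(v, str) and v.strip()]


def pick_sections(mrf_sections=None, json_ld_section=None, article_section=None):
    """Return the sections from the first source in the chain that has any."""
    for candidate in (mrf_sections, json_ld_section, article_section):
        sections = split_sections(candidate)
        if sections:
            return sections
    return []
```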

Publication date

The publication date is extracted from the following sources, in order of precedence. Note that if a page declares multiple publication dates, no publication date is extracted.

  1. JSON-LD (for more details, see the datePublished property on Schema.org)
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "datePublished": "2021-08-01"
}
</script>
  2. Meta tag article:published_time
<meta property="article:published_time" content="2021-08-01T17:41:45+00:00" />
  3. Meta tag with itemprop="datePublished"
<meta itemprop="datePublished" content="2021-08-01" id="date">
  4. <time> element with the date in the datetime attribute
<time itemprop="datePublished" datetime="2021-08-01T09:00Z">August 1, 2021</time>
  5. <time> element with the date in the content attribute
<time itemprop="datePublished" content="2021-08-01T09:00Z">August 1, 2021</time>
  6. <time> element with the date as its text
<time itemprop="datePublished">2021-08-01T09:00Z</time>
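
The sketch below shows one possible reading of the rules above. Each source contributes a list of raw date strings, the sources are visited in order, and a source that declares more than one distinct date makes the result empty. The text does not say whether conflicts are checked per source or across the whole page, so the per-source check is an assumption. The names parse_iso_date and pick_publication_date are hypothetical.

```python
from datetime import datetime
from typing import List, Optional


def parse_iso_date(value: str) -> Optional[datetime]:
    """Parse an ISO 8601 date or datetime string, returning None when invalid."""
    text = value.strip()
    if text.endswith("Z"):
        # datetime.fromisoformat() rejects the "Z" suffix before Python 3.11.
        text = text[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None


def pick_publication_date(sources: List[List[str]]) -> Optional[datetime]:
    """Walk the sources in order and return the first unambiguous date."""
    for values in sources:
        parsed = {d for d in map(parse_iso_date, values) if d is not None}
        if len(parsed) > 1:
            return None  # assumption: conflicting dates mean no date is extracted
        if parsed:
            return parsed.pop()
    return None
```

Here sources would hold, in order, the values found in JSON-LD, in article:published_time, and in the itemprop="datePublished" elements.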

Main image

The main article image is extracted from JSON-LD data (for more details, see the image property on Schema.org):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "image": "https://mywebsite.com/images/mainImage.jpg"
}
</script>

Or from the Open Graph og:image meta tag:

<meta property="og:image" content="https://mywebsite.com/images/mainImage.jpg" />
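
On Schema.org, the image property may hold a URL string, an ImageObject with a url field, or a list of either. The sketch below normalizes those shapes and falls back to the og:image value. Preferring JSON-LD over og:image is an assumption, since the text above does not state a precedence. The names image_from_value and pick_main_image are hypothetical.

```python
def image_from_value(value):
    """Return the first usable image URL found in a JSON-LD image value."""
    if isinstance(value, list):
        for item in value:
            url = image_from_value(item)
            if url:
                return url
        return None
    if isinstance(value, dict):
        # ImageObject: the address of the image lives in its "url" field.
        return image_from_value(value.get("url"))
    if isinstance(value, str) and value.strip():
        return value.strip()
    return None


def pick_main_image(json_ld_image=None, og_image=None):
    """Prefer the JSON-LD image, then fall back to the og:image content."""
    return image_from_value(json_ld_image) or image_from_value(og_image)
```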

Tag(s)

Tags are extracted from several kinds of markup:

  1. Meta tag mrf:tags
<meta property="mrf:tags" content="tagGroup1:tag_name;tagGroup2:another_tag_name" />
  2. Meta tag keywords
<meta name="keywords" content="tag1, tag2, tag3">
  3. Keywords in the JSON-LD structured data
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "keywords": "tag1, tag2, tag3"
}
</script>
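
The two value formats above differ: mrf:tags pairs each tag with a group, while keywords is a flat comma-separated list. A minimal sketch of parsing both follows. The helper names parse_mrf_tags and parse_keywords are hypothetical, and filing tags that have no group under an empty group name is an assumption made for illustration.

```python
def parse_mrf_tags(content):
    """Parse "group:tag;group:tag" into a dict mapping each group to its tags."""
    groups = {}
    for item in content.split(";"):
        group, separator, tag = item.partition(":")
        if not separator:
            # Assumption: a tag written without a group goes under "".
            group, tag = "", group
        group, tag = group.strip(), tag.strip()
        if tag:
            groups.setdefault(group, []).append(tag)
    return groups


def parse_keywords(content):
    """Parse a comma-separated keywords value into a list of tags."""
    return [keyword.strip() for keyword in content.split(",") if keyword.strip()]
```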

In case of error

If the canonical URL of a page is not accessible, the crawler automatically retries the extraction one hour later.

If the canonical is still not available after 10 attempts, the crawler does not attempt to read that page again.
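
The retry policy above can be sketched as a bounded loop. This is an illustration only: crawl_with_retries and the fetch callable are hypothetical, and counting the first request as one of the 10 attempts is an assumption, since the text does not specify it. The sleep function is injectable so the policy can be exercised without waiting.

```python
import time

MAX_ATTEMPTS = 10              # after this many failures the page is abandoned
RETRY_DELAY_SECONDS = 60 * 60  # one hour between attempts


def crawl_with_retries(fetch, sleep=time.sleep):
    """Call fetch() until it returns a page, waiting one hour between failures.

    fetch is expected to return the page HTML, or None when the canonical URL
    is not accessible. Returns None once all attempts have failed.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        html = fetch()
        if html is not None:
            return html
        if attempt < MAX_ATTEMPTS:
            sleep(RETRY_DELAY_SECONDS)
    return None
```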