Editorial Crawler Inspector

The Marfeel Editorial Crawler crawls a URL and builds the editorial profile of the page from its metadata and article body. It crawls in two phases:

  1. Metadata crawler: Crawls a URL and works through a chain of sources to extract metadata from:
    • Custom Marfeel tagging
    • Structured Data
    • Microdata
    • Open Graph og:*
    • RDFa
  2. Article body crawler: To extract content metrics such as the number of words, paragraphs, and images, or to detect the page's entities, Marfeel needs to identify the main content of the URL and strip out all navigation and boilerplate elements.
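
The chained lookup in phase 1 can be pictured with a minimal Python sketch. It is not Marfeel's implementation. It assumes that sources are tried in the order listed and that the first one with a value wins, that "Structured Data" refers to JSON-LD, and it covers only three of the five sources. The `example:title` tag and all function names are hypothetical, chosen purely for illustration.

```python
import json
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta> tags and JSON-LD blocks from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}      # property/name -> content (first occurrence wins)
        self.json_ld = []   # parsed JSON-LD objects
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content") is not None:
                self.meta.setdefault(key, attrs["content"])
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON-LD instead of failing the crawl


def from_custom_tag(page):
    # HYPOTHETICAL tag name; not Marfeel's real custom tagging.
    return page.meta.get("example:title")


def from_json_ld(page):
    for block in page.json_ld:
        if isinstance(block, dict) and block.get("headline"):
            return block["headline"]
    return None


def from_open_graph(page):
    return page.meta.get("og:title")


# Assumed precedence: earlier entries override later ones.
EXTRACTORS = [
    ("custom", from_custom_tag),
    ("json-ld", from_json_ld),
    ("open-graph", from_open_graph),
]


def extract_title(html):
    """Return (source, title) from the first source in the chain that has one."""
    page = MetaCollector()
    page.feed(html)
    page.close()
    for source, extractor in EXTRACTORS:
        value = extractor(page)
        if value:
            return source, value
    return None, None
```

Calling `extract_title(html)` on a page that carries both a JSON-LD `headline` and an `og:title` returns the JSON-LD value under these assumptions, because that source sits earlier in the chain.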

Crawler location to prevent geo-blocking

Marfeel crawlers can run from multiple locations. By default, they run from the location closest to the country of the time zone configured for the account. You can change the location manually:

  1. Go to Organization > Crawler settings
  2. Choose your preferred location: United States or Europe

Article body crawler

Marfeel needs to detect the body of an article for two reasons: to compute content metrics such as the number of words, the number of paragraphs, or the Flesch index, and to detect the entities of a URL so that it can be connected to the Marfeel knowledge graph and used for AI-based content recommendations.
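
As a rough illustration of what such metrics involve, the sketch below computes word, paragraph, and sentence counts plus the standard Flesch Reading Ease score from a list of paragraph strings. It is only an approximation under stated assumptions: the article does not say which Flesch variant Marfeel uses or how it tokenizes text, the syllable counter is a naive vowel-group heuristic, the word pattern only handles English letters, and the function names are hypothetical.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: number of consecutive-vowel groups, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def content_metrics(paragraphs):
    """Compute simple content metrics for a list of paragraph strings."""
    text = " ".join(paragraphs)
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    syllables = sum(count_syllables(w) for w in words)

    flesch = None
    if words:
        # Flesch Reading Ease:
        # 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
        flesch = (
            206.835
            - 1.015 * (len(words) / max(1, len(sentences)))
            - 84.6 * (syllables / len(words))
        )

    return {
        "words": len(words),
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        "flesch_reading_ease": flesch,
    }
```

The sketch makes the dependency visible: if sidebar or footer text leaks into `paragraphs`, every one of these numbers shifts, which is why accurate body detection matters.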

Out of the box, the Marfeel crawler detects the article body of a page with a mechanism similar to a browser's Reader View extension, which uses a set of heuristics to tell the content apart from UI elements such as sidebars, footers, or related-article modules.

Depending on the HTML markup, the Reader View may incorrectly keep or remove text from an article. It may treat modules as content, or content as modules, which leads to wrongly computed metrics and poorly detected entities.

To ensure that the article body is detected accurately, and that metrics and entities are computed correctly, we recommend giving the crawler hints that fine-tune article body detection for your site.

To set up the Editorial Crawler:

  1. Go to Organization > Crawler > Inspector

  2. Select a URL in the top-left corner to preview the full text that the Marfeel Editorial Crawler detects

  3. Define the main article body CSS selector as the parent node. The default setting is body, but it’s recommended to provide a selector with higher specificity.

  4. Add the CSS selectors of modules to remove from the parent node. Blacklisting is useful for removing in-article modules such as recommendations
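
The effect of steps 3 and 4 can be sketched in Python: pick one parent node, skip the excluded modules inside it, and keep the remaining text. This is a simplified illustration, not the crawler's actual code. To stay within the standard library it uses `xml.etree.ElementTree` with its limited XPath syntax in place of CSS selectors, so it only works on well-formed markup. `extract_body_text`, `SAMPLE`, and the class names in it are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<html><body>
<nav>Menu</nav>
<div class="article-body">
<p>First paragraph.</p>
<aside class="related">Read more</aside>
<p>Second paragraph.</p>
</div>
<footer>Footer</footer>
</body></html>"""


def _text_of(node, excluded):
    """Concatenate the text under node, skipping excluded elements.

    The tail text that follows an excluded element is kept, because it
    belongs to the surrounding content and not to the removed module.
    """
    parts = [node.text or ""]
    for child in node:
        if child not in excluded:
            parts.append(_text_of(child, excluded))
        parts.append(child.tail or "")
    return "".join(parts)


def extract_body_text(markup, body_path, excluded_paths=()):
    """Return the whitespace-normalized text of the chosen parent node."""
    root = ET.fromstring(markup)
    body = root.find(body_path)
    if body is None:
        return ""
    excluded = {el for path in excluded_paths for el in body.findall(path)}
    return " ".join(_text_of(body, excluded).split())


# Broad parent node, comparable to the default "body" setting:
everything = extract_body_text(SAMPLE, ".//body")

# Specific parent node plus one excluded in-article module:
article = extract_body_text(
    SAMPLE,
    ".//div[@class='article-body']",
    [".//aside[@class='related']"],
)
```

With the sample markup, `everything` also contains the navigation, related-module, and footer text, while `article` contains only the two paragraphs. That difference is what the Inspector preview lets you check before saving your selectors.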

Recrawl Articles in Bulk

From the inspector, you can see how many articles are being processed and trigger a recrawl for all articles that match your query.
