The Marfeel Editorial Crawler crawls a URL and builds the editorial profile of a page from its metadata and article body. It crawls in two phases:
- Metadata crawler: It crawls a URL and follows a chain process to extract metadata from:
  - Custom Marfeel tagging
  - Structured Data
  - Microdata
  - Open Graph og:*
  - RDFa
- Article body crawler: To extract content metrics such as the number of words, paragraphs, and images, or to detect the entities of a page, Marfeel needs to identify the main content of the URL by removing all navigation and boilerplate elements.
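The chain process of the metadata crawler can be pictured as a fallback lookup: for each field, take the first source that provides a value. The following is a minimal sketch under stated assumptions, not Marfeel's implementation. The source keys, the `extract_field` and `build_profile` helpers, and the idea that the list order above is a priority order are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical pre-parsed metadata, keyed by source name. A real crawler
# would parse each source out of the page's HTML first.
Page = Dict[str, Dict[str, str]]

# Assumption: the sources are consulted in the order they are listed above.
SOURCES: List[str] = [
    "marfeel_tagging",
    "structured_data",
    "microdata",
    "open_graph",
    "rdfa",
]


def extract_field(page: Page, field: str) -> Optional[str]:
    """Return the first non-empty value for `field`, walking the chain in order."""
    for source in SOURCES:
        value = page.get(source, {}).get(field)
        if value:
            return value
    return None


def build_profile(page: Page, fields: List[str]) -> Dict[str, Optional[str]]:
    """Resolve every requested field through the same chain."""
    return {field: extract_field(page, field) for field in fields}


page: Page = {
    "structured_data": {"title": "Headline from structured data"},
    "open_graph": {"title": "Open Graph title", "author": "Jane Doe"},
}
profile = build_profile(page, ["title", "author", "section"])
```

In this sketch the title comes from structured data because it appears earlier in the chain than Open Graph, the author falls through to Open Graph, and the section stays `None` because no source provides it.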
Crawler location to prevent geo-blocking
Marfeel crawlers can run from multiple locations. By default, they run from the location closest to the country of the time zone specified for the account. You can change the location manually:
- Go to Organization > Crawler settings
- Choose your preferred location: United States, Europe
Article body crawler
To compute content metrics such as the number of words, the number of paragraphs, or the Flesch index, and to detect the entities of a URL so that it can be connected to the Marfeel knowledge graph and used for AI-based content recommendations, Marfeel needs to be able to detect the body of an article.
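As an illustration of why the body text matters, here is a minimal sketch of how such metrics could be computed once the body is known. It assumes the index in question is the standard Flesch Reading Ease score for English, and it uses a rough vowel-group heuristic for syllables. The function names are hypothetical and the exact formulas Marfeel applies are not documented here.

```python
import re
from typing import Dict, Optional


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def content_metrics(body: str) -> Dict[str, Optional[float]]:
    """Compute word, sentence and paragraph counts plus Flesch Reading Ease."""
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", body) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    words = re.findall(r"[A-Za-z']+", body)
    syllables = sum(count_syllables(w) for w in words)

    flesch: Optional[float] = None
    if words and sentences:
        # Flesch Reading Ease: higher scores mean easier text.
        flesch = (
            206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words))
        )
    return {
        "words": len(words),
        "sentences": len(sentences),
        "paragraphs": len(paragraphs),
        "flesch": flesch,
    }


metrics = content_metrics("The cat sat. The dog ran.\n\nIt was fun.")
```

If navigation or related-article text leaks into `body`, every one of these counts shifts, which is why accurate body detection comes first.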
Out of the box, the Marfeel crawler detects the article body of a page using a Reader View-style browser extension, which applies a set of heuristics to differentiate the content from UI elements such as sidebars, footers, or related-articles modules.
Depending on the HTML markup, the Reader View might incorrectly keep or remove text from an article. It may detect modules as content or vice versa, resulting in wrongly computed metrics and poorly detected entities.
To ensure that the article body is detected accurately and that metrics and entities are computed correctly, it is recommended to give the crawler hints that fine-tune the article body detection for your site.
You can set up the Editorial Crawler as follows:
- Go to Organization > Crawler > Inspector
- Select a URL in the top left corner to preview the whole text that the Marfeel Editorial Crawler detects
- Define the main article body CSS selector as the parent node. The default setting is body, but it's recommended to provide a selector with higher specificity.
- Add CSS selectors of modules to remove from the parent node element. Blacklist modules are useful to remove in-article modules like recommendations.
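The effect of a parent node plus blacklist modules can be sketched with Python's standard `html.parser`. This is only an illustration of the idea, not how the Marfeel crawler is implemented. It matches elements by class name instead of full CSS selectors, it assumes well-formed HTML with explicit closing tags, and the `BodyExtractor` class and the `article-body` and `related-articles` class names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Iterable, List, Optional

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BodyExtractor(HTMLParser):
    """Collect text inside the parent node, skipping blacklisted modules."""

    def __init__(self, parent_class: str, blacklist: Iterable[str]) -> None:
        super().__init__()
        self.parent_class = parent_class
        self.blacklist = set(blacklist)
        # One marker per open element: "parent", "skip" or None.
        self.stack: List[Optional[str]] = []
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if classes & self.blacklist:
            self.stack.append("skip")
        elif self.parent_class in classes:
            self.stack.append("parent")
        else:
            self.stack.append(None)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Keep text only when inside the parent node and outside any blacklisted module.
        inside_parent = "parent" in self.stack
        inside_blacklisted = "skip" in self.stack
        if inside_parent and not inside_blacklisted and data.strip():
            self.chunks.append(data.strip())


html = """
<body>
  <nav class="menu">Home | News</nav>
  <div class="article-body">
    <p>First paragraph.</p>
    <div class="related-articles">You may also like</div>
    <p>Second paragraph.</p>
  </div>
  <footer>Copyright</footer>
</body>
"""
extractor = BodyExtractor("article-body", ["related-articles"])
extractor.feed(html)
text = " ".join(extractor.chunks)
```

With the default parent node (`body`), the navigation and footer text would be part of the extracted text. Narrowing the parent node and blacklisting the in-article module leaves only the two paragraphs.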
Recrawl Articles in Bulk
From the Inspector, you can see how many articles are being processed and trigger a recrawl for all articles that match your query.