About Marfeel crawlers

The Marfeel tracker does not send page metadata with every request; it sends only the essentials. This keeps it lightweight and minimizes bandwidth consumption. All other data is obtained through our crawlers.

Many URLs may point to the same content. Marfeel crawlers only crawl canonical URLs and their amphtml counterparts. All URLs pointing to the same canonical URL are stored as aliases.

Make sure both the canonical and amphtml link rel elements are correctly set in all your content so that Marfeel crawling works reliably.
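To check what a crawler would see on one of your pages, a minimal Python sketch like the following can list the canonical and amphtml links declared in the HTML. The class and function names, as well as the example.com URLs, are hypothetical and only serve as an illustration; this is not Marfeel's own parser.

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Collects the href of <link rel="canonical"> and <link rel="amphtml">."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        # rel may hold several space-separated tokens; compare case-insensitively
        for rel in (attr_map.get("rel") or "").lower().split():
            if rel in ("canonical", "amphtml") and href:
                # Keep the first occurrence of each rel value
                self.links.setdefault(rel, href)


def find_link_rels(html):
    """Return a dict such as {"canonical": url, "amphtml": url}."""
    parser = LinkRelParser()
    parser.feed(html)
    return parser.links


# Hypothetical sample page
sample = """
<html><head>
<link rel="canonical" href="https://example.com/news/story">
<link rel="amphtml" href="https://example.com/amp/news/story">
</head><body></body></html>
"""
print(find_link_rels(sample))
```

A page that returns an empty dict, or only one of the two keys when an AMP version exists, is worth reviewing.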

Learn more about how Marfeel reads your pages’ metadata.

Good citizen practices

All Marfeel bots follow these rules to be good web citizens:

  • Sites are not proactively crawled to identify new content. Marfeel only crawls URLs with active users.
  • Marfeel limits the number of concurrent requests to each client’s servers.
  • All assets are centrally cached so different bots may reuse them without having to fetch them separately.
  • Redirects are not followed unless necessary.

When a domain starts using Marfeel, crawling may be more intense during the first days because there is a lot of content to discover. It will, however, respect the pace of the servers and slow down over time.

Marfeel crawlers

Marfeel currently uses three types of crawlers.

Editorial crawler

This crawler obtains editorial information such as the title, images and many other inputs to build the editorial profile of the page. It crawls all pages with pageviews the first time they are visited, and every time the content is modified.

Content modifications are detected through the article:modified_time meta property or the dateModified field of the article’s structured data. Keep these fields correctly updated so that both Marfeel and Googlebot always see the latest version.
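As a way to confirm that both fields are present on a page, the following Python sketch extracts the article:modified_time meta property and a top-level dateModified from a JSON-LD block. It is an illustration under simplifying assumptions (it ignores @graph containers and JSON-LD arrays), the names are hypothetical, and the sample values are made up.

```python
import json
from html.parser import HTMLParser


class ModifiedTimeParser(HTMLParser):
    """Reads article:modified_time and a top-level JSON-LD dateModified."""

    def __init__(self):
        super().__init__()
        self.meta_modified = None
        self.jsonld_modified = None
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and attr_map.get("property") == "article:modified_time":
            self.meta_modified = attr_map.get("content")
        elif tag == "script" and attr_map.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                doc = json.loads("".join(self._buffer))
            except ValueError:
                return  # Malformed JSON-LD: nothing to read
            # Simplification: only a top-level object is inspected
            if isinstance(doc, dict) and self.jsonld_modified is None:
                self.jsonld_modified = doc.get("dateModified")


def find_modified_times(html):
    """Return (meta_value, jsonld_value); either may be None."""
    parser = ModifiedTimeParser()
    parser.feed(html)
    return parser.meta_modified, parser.jsonld_modified


# Made-up sample markup
sample = """
<head>
<meta property="article:modified_time" content="2024-05-01T10:00:00Z">
<script type="application/ld+json">
{"@type": "NewsArticle", "dateModified": "2024-05-01T10:00:00Z"}
</script>
</head>
"""
print(find_modified_times(sample))
```

If the two values disagree, or neither changes after an edit, a crawler relying on them has no signal that the content was modified.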

The user agent used by the editorial crawler is:

Mozilla/5.0 (compatible; NewsRoom.BI/0.1; +http://www.newsroom.bi/bot.html)

Whenever a crawl attempt fails, the crawler retries every hour, up to 10 times, before giving up on a URL.

Force an editorial recrawl

When the editorial information of an article is out of date, you can force Marfeel to recrawl it using the link in the “Updated x hours ago” dialog.

Audits crawler

To detect issues with structured data, meta tags, and many other aspects of our clients’ HTML, Marfeel periodically crawls all relevant URLs (the ones that have traffic) using the following user agents:

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA51N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Mobile Safari/537.36 (compatible; mrfCompass-Booldog/1.0)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 (compatible; mrfCompass-Booldog/1.0)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; mrfCompass-Marshall/1.0)
  • mrfCompass-Booldog first crawls each URL using a mobile user agent. If the response includes a Vary: User-Agent header, it crawls the URL again using a desktop user agent.
  • mrfCompass-Marshall crawls all amphtml links found by mrfCompass-Booldog.
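The mobile-then-desktop rule for mrfCompass-Booldog can be sketched in Python as follows. This only illustrates the decision described above; the function names are hypothetical and the real crawler may handle headers differently.

```python
def varies_by_user_agent(headers):
    """Return True when a Vary response header lists User-Agent."""
    for name, value in headers.items():
        # Header names and Vary tokens are case-insensitive
        if name.lower() == "vary":
            tokens = [token.strip().lower() for token in value.split(",")]
            if "user-agent" in tokens:
                return True
    return False


def user_agents_to_crawl(mobile_response_headers):
    """Always crawl as mobile; add desktop only if the response varies by UA."""
    plan = ["mobile"]
    if varies_by_user_agent(mobile_response_headers):
        plan.append("desktop")
    return plan


print(user_agents_to_crawl({"Vary": "Accept-Encoding, User-Agent"}))
print(user_agents_to_crawl({"Content-Type": "text/html"}))
```

In practice this means that a site serving different HTML to mobile and desktop should send Vary: User-Agent, otherwise only the mobile variant is audited.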

Flowcards crawler

Flowcards that load content directly from specific URLs also use a bot to fetch that content. This bot identifies itself with the following user agent:

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA51N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Mobile Safari/537.36 (compatible; mrfCompass-Jukebox/1.0)

The crawl frequency respects the Cache-Control header returned in the response.
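To see what refresh interval your endpoint advertises, a small Python sketch can read the max-age directive from a Cache-Control value. Which directives the Flowcards bot actually honors is not documented here, so treating max-age as the relevant one is an assumption, and the function name is hypothetical.

```python
def max_age_seconds(cache_control):
    """Return the max-age value in seconds, or None if absent or malformed."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        # Directive names are case-insensitive
        if name.strip().lower() == "max-age":
            value = value.strip().strip('"')
            return int(value) if value.isdecimal() else None
    return None


print(max_age_seconds("public, max-age=300"))
print(max_age_seconds("no-store"))
```

A short max-age suggests frequent refetching of the Flowcard content, while a long one suggests the content is refreshed less often.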

Whitelisting Marfeel crawlers

Many hosting and CDN providers include WAF services that may consider Marfeel bots potentially malicious and block them. To make sure Marfeel can access and monitor your website, you can either whitelist the user agents mentioned above or whitelist our list of static IPs, available here.
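When reviewing access logs or writing a user-agent rule, it can help to match on the product tokens found in the user agents listed above rather than on the full strings. The Python sketch below does this; the mapping and function name are hypothetical helpers, not part of any Marfeel tooling.

```python
# Product tokens taken from the user agent strings listed in this document
MARFEEL_UA_TOKENS = {
    "NewsRoom.BI/": "editorial",
    "mrfCompass-Booldog/": "audits",
    "mrfCompass-Marshall/": "audits (amphtml)",
    "mrfCompass-Jukebox/": "flowcards",
}


def marfeel_crawler_type(user_agent):
    """Return the crawler type for a Marfeel user agent, or None otherwise."""
    for token, crawler in MARFEEL_UA_TOKENS.items():
        if token in user_agent:
            return crawler
    return None


editorial_ua = "Mozilla/5.0 (compatible; NewsRoom.BI/0.1; +http://www.newsroom.bi/bot.html)"
print(marfeel_crawler_type(editorial_ua))
print(marfeel_crawler_type("Mozilla/5.0 (X11; Linux x86_64)"))
```

Because a user agent string can be set by any client, a match here is only a hint; the reverse DNS check in “Verifying Marfeel Crawlers” is the stronger test.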

Cloudflare

If you are using Cloudflare as your CDN provider, you can whitelist Marfeel crawlers’ IPs following these steps:

  1. In your Cloudflare console, click the Firewall icon and open the Tools tab.
  2. List Marfeel’s crawler IP addresses under IP Access Rules.
    a. Enter the IP address.
    b. Choose Whitelist as the action to apply.
    c. Choose the website to which the whitelisting rule applies.
  3. Click Add.
  4. Repeat for each IP.

Verifying Marfeel crawlers

All Marfeel crawler IP addresses have a reverse DNS record pointing to crawler.marfeel.com.
You can use it to verify the authenticity of Marfeel bots by following these steps:

  1. Run a reverse DNS lookup on the accessing IP address from your logs, using the host command.
  2. Verify that the domain name is crawler.marfeel.com.
  3. Run a forward DNS lookup on the domain name retrieved in step 1, again using the host command.
  4. Verify that one of the returned addresses matches the original accessing IP address from your logs.

For example:
$ host 162.55.235.182
182.235.55.162.in-addr.arpa domain name pointer crawler.marfeel.com.

$ host crawler.marfeel.com
crawler.marfeel.com is an alias for vampiresquid.het.mrf.io.
vampiresquid.het.mrf.io has address 162.55.235.186
vampiresquid.het.mrf.io has address 162.55.235.182
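The same reverse-then-forward check can be automated. The Python sketch below uses the standard socket module; the function name and the injectable lookup parameters are hypothetical conveniences that make the logic testable without network access. Note that socket.gethostbyname_ex only returns IPv4 addresses.

```python
import socket

EXPECTED_HOST = "crawler.marfeel.com"


def is_marfeel_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for a Marfeel crawler IP."""
    # Default to live DNS lookups; callers may inject replacements for testing
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])

    # Steps 1 and 2: the reverse lookup must point to crawler.marfeel.com
    try:
        hostname = reverse_lookup(ip)
    except OSError:
        return False
    if hostname.rstrip(".").lower() != EXPECTED_HOST:
        return False

    # Steps 3 and 4: the forward lookup must include the original IP
    try:
        addresses = forward_lookup(hostname)
    except OSError:
        return False
    return ip in addresses


# Live usage (performs real DNS queries):
# is_marfeel_crawler("162.55.235.182")
```

Both lookups are needed: the reverse record alone is controlled by whoever owns the IP range, so only the forward lookup confirms that the marfeel.com zone actually lists that address.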