The Marfeel tracker does not send page metadata with every request; it sends only the essentials. This keeps it lightweight and minimizes bandwidth use. All other data is obtained through our crawlers.
Many urls may point to the same content. Marfeel crawlers only crawl canonical urls and their amphtml counterparts, and all urls pointing to the same canonical are stored as aliases.
Make sure the canonical and amphtml link rel elements are correctly set in all your content for Marfeel crawling to work properly.
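As an illustration of what a crawler looks for in a page, the following minimal sketch extracts the canonical and amphtml link rel values from an HTML document. It uses only the Python standard library; the names LinkRelParser and find_link_rels are hypothetical and are not part of any Marfeel tooling.

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Collects the href of <link rel="canonical"> and <link rel="amphtml">."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel is case-insensitive and may hold several space-separated tokens
        rels = (attr.get("rel") or "").lower().split()
        for rel in ("canonical", "amphtml"):
            if rel in rels and attr.get("href"):
                # keep the first occurrence of each rel
                self.links.setdefault(rel, attr["href"])


def find_link_rels(html):
    """Return a dict with the 'canonical' and 'amphtml' hrefs that were found."""
    parser = LinkRelParser()
    parser.feed(html)
    return parser.links
```

Running find_link_rels on the HTML of one of your articles is a quick way to check that both elements are present before a crawler visits the page.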
All Marfeel bots follow these rules in order to be good web citizens:
- Sites are not proactively crawled to identify new content. Marfeel only crawls urls with active users.
- Marfeel limits the number of concurrent requests to each of our clients’ servers.
- All assets are centrally cached so different bots may reuse them without having to fetch them separately.
- Redirects are not followed unless necessary.
Marfeel currently uses 3 types of crawlers.
The editorial crawler obtains editorial information such as the title, images and many other inputs to build the editorial profile of the page. It crawls every page with pageviews the first time it is visited, and again every time the content is modified.
Modifications are detected through the article:modified_time meta property or the article's structured data field dateModified. Make sure these fields are correctly updated so that Marfeel and Googlebot are always up to date.
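To check which modification dates a page exposes, a minimal sketch such as the one below can read both the article:modified_time meta property and the dateModified field of a JSON-LD block. It is an illustration only: the names ModifiedTimeParser and find_modified_times are hypothetical, and it inspects only top-level JSON-LD objects, which may differ from how Marfeel's crawler parses structured data.

```python
import json
from html.parser import HTMLParser


class ModifiedTimeParser(HTMLParser):
    """Reads article:modified_time and the JSON-LD dateModified field."""

    def __init__(self):
        super().__init__()
        self.meta_modified = None
        self.jsonld_modified = None
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("property") == "article:modified_time":
            self.meta_modified = attr.get("content")
        elif tag == "script" and (attr.get("type") or "").lower() == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "script" or not self._in_jsonld:
            return
        self._in_jsonld = False
        try:
            doc = json.loads("".join(self._buffer))
        except ValueError:
            return  # ignore malformed JSON-LD
        # Assumption: only top-level objects (or a top-level list) are inspected
        items = doc if isinstance(doc, list) else [doc]
        for item in items:
            if isinstance(item, dict) and "dateModified" in item:
                if self.jsonld_modified is None:
                    self.jsonld_modified = item["dateModified"]


def find_modified_times(html):
    """Return the modification dates found in the meta tag and in JSON-LD."""
    parser = ModifiedTimeParser()
    parser.feed(html)
    return {"meta": parser.meta_modified, "json_ld": parser.jsonld_modified}
```

If the two values disagree, or either one is missing after an article is edited, a crawler that relies on them has no reliable signal that the content changed.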
The user agent used by the editorial crawler is:
Mozilla/5.0 (compatible; NewsRoom.BI/0.1; +http://www.newsroom.bi/bot.html)
Whenever a crawl attempt fails, the crawler will retry every hour, up to 10 times, before giving up on a url.
If the editorial information of an article is out of date, you can force Marfeel to crawl it using the link in the “Updated x hours ago” dialog.
In order to detect structured data, meta tags and many other potential issues in our clients’ HTML, Marfeel periodically crawls all relevant urls (the ones that have traffic) using the following user agents:
Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA51N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Mobile Safari/537.36 (compatible; mrfCompass-Booldog/1.0)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 (compatible; mrfCompass-Booldog/1.0)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; mrfCompass-Marshall/1.0)
mrfCompass-Booldog will crawl each url initially using a mobile user agent, and if a vary: User-Agent header is received in the response, it will crawl it using a desktop user agent as well.
mrfCompass-Marshall will crawl all amphtml links found by
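The vary: User-Agent check described above can be sketched as follows. This is an assumption-based illustration, not Marfeel's implementation: the function name needs_desktop_crawl is hypothetical, and the header is treated as a case-insensitive, comma-separated list as defined by HTTP.

```python
def needs_desktop_crawl(response_headers):
    """Return True if the response's Vary header lists User-Agent."""
    for name, value in response_headers.items():
        # header names are case-insensitive
        if name.lower() != "vary":
            continue
        # Vary holds a comma-separated list of header names
        tokens = [token.strip().lower() for token in value.split(",")]
        if "user-agent" in tokens:
            return True
    return False
```

You can use the same check on your own responses to predict whether a page will be fetched with both the mobile and the desktop user agent.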
Flowcards that load content directly from specific urls also use a bot to fetch that content. This bot identifies itself with the following user agent:
Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA51N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Mobile Safari/537.36 (compatible; mrfCompass-Jukebox/1.0)
The crawl frequency respects the cache-control header returned.
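How exactly the bot derives its schedule from cache-control is not documented here. As a rough sketch under that caveat, the function below (hypothetical name max_age_seconds) reads the max-age directive, which is the value a client would typically use to decide how long a fetched copy stays fresh.

```python
import re


def max_age_seconds(cache_control):
    """Return max-age in seconds, 0 for no-store/no-cache, None if unspecified."""
    directives = [d.strip().lower() for d in cache_control.split(",")]
    # Assumption: no-store and no-cache are treated as "refetch every time"
    if "no-store" in directives or "no-cache" in directives:
        return 0
    for directive in directives:
        match = re.fullmatch(r'max-age\s*=\s*"?(\d+)"?', directive)
        if match:
            return int(match.group(1))
    return None
```

For example, a response with cache-control: public, max-age=300 would suggest that the content is fetched again no sooner than every five minutes.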
Many hosting and CDN providers include WAF services that may consider Marfeel bots to be potentially malicious and block them.
To make sure Marfeel can access and monitor your website, you can either whitelist the user agents mentioned above or whitelist our list of static IPs available here.
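If your WAF or log tooling lets you match on user agents, a minimal sketch like the following recognizes the bot tokens listed in this document. The names MARFEEL_UA_TOKENS and marfeel_bot_name are hypothetical. Keep in mind that a user agent string can be spoofed, so this check identifies a claimed bot rather than proving its origin.

```python
# Tokens taken from the user agent strings listed in this document
MARFEEL_UA_TOKENS = (
    "NewsRoom.BI/",
    "mrfCompass-Booldog/",
    "mrfCompass-Marshall/",
    "mrfCompass-Jukebox/",
)


def marfeel_bot_name(user_agent):
    """Return the Marfeel bot token found in a user agent string, or None."""
    for token in MARFEEL_UA_TOKENS:
        if token in user_agent:
            # drop the trailing slash that precedes the version number
            return token.rstrip("/")
    return None
```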
If you are using Cloudflare as your CDN provider, you can whitelist Marfeel crawlers’ IPs following these steps:
- On your Cloudflare console, click on the firewall icon on the Tools tab.
- List the Marfeel crawlers’ IP addresses under the IP Access Rules.
   a. Enter the IP address
   b. Choose Whitelist as the action to apply
   c. Choose the website where to apply whitelisting rules
- Click add
- Repeat for each IP
All Marfeel crawler IP addresses offer a reverse DNS lookup pointing to crawler.marfeel.com.
You can use it to verify the authenticity of Marfeel bots by following these steps:
- Run a reverse DNS lookup on the accessing IP address from your logs, using the host command.
- Verify that the domain name is crawler.marfeel.com.
- Run a forward DNS lookup on the domain name retrieved in step 1, using the host command.
- Verify that one of the returned addresses is the same as the original accessing IP address from your logs.
$ host 220.127.116.11
18.104.22.168.in-addr.arpa domain name pointer crawler.marfeel.com.
$ host crawler.marfeel.com
crawler.marfeel.com is an alias for vampiresquid.het.mrf.io.
vampiresquid.het.mrf.io has address 22.214.171.124
vampiresquid.het.mrf.io has address 126.96.36.199
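The same reverse-then-forward verification can be automated. The sketch below is one possible approach using Python's standard socket module; the function name is_marfeel_crawler and its injectable reverse_lookup and forward_lookup parameters are hypothetical, added so the logic can be exercised without network access. It handles IPv4 addresses only.

```python
import socket

# Hostname the reverse lookup is expected to return, per this document
EXPECTED_HOST = "crawler.marfeel.com"


def is_marfeel_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Verify an IP with a reverse DNS lookup followed by a forward lookup."""
    if reverse_lookup is None:
        # gethostbyaddr returns (hostname, aliases, addresses)
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, IPv4 addresses)
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
        if hostname != EXPECTED_HOST:
            return False
        # the forward lookup must include the original IP address
        return ip in forward_lookup(hostname)
    except OSError:
        # DNS failures (socket.herror, socket.gaierror) count as unverified
        return False
```

Called as is_marfeel_crawler(ip) with an address taken from your access logs, it performs live DNS queries and returns True only when both lookups agree.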