The Marfeel tracker does not send page metadata with every request; it sends only the essentials. This keeps it lightweight and minimizes bandwidth consumption. All other data is obtained through our crawlers.
Many urls may point to the same content. Marfeel crawlers only crawl canonical urls and their amphtml counterparts. All urls pointing to the same canonical are stored as aliases.
Make sure the canonical and amphtml link rel elements are correctly set in all your content for Marfeel crawling to work properly.
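One way to check a page is to parse its HTML and look for the two link elements. The following is a minimal sketch using only the Python standard library; the helper name find_link_rels and the example.com urls are made up for illustration and are not part of any Marfeel tooling.

```python
from html.parser import HTMLParser


class LinkRelCollector(HTMLParser):
    """Collects the href of <link rel="canonical"> and <link rel="amphtml">."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel is case-insensitive and may hold several space-separated tokens
        rels = (attributes.get("rel") or "").lower().split()
        for rel in ("canonical", "amphtml"):
            if rel in rels and attributes.get("href"):
                # keep the first occurrence of each rel
                self.links.setdefault(rel, attributes["href"])


def find_link_rels(html):
    """Return a dict with the canonical and amphtml hrefs found in the HTML."""
    collector = LinkRelCollector()
    collector.feed(html)
    return collector.links


# Hypothetical page used only for this example
sample_page = """
<html><head>
  <link rel="canonical" href="https://example.com/news/story">
  <link rel="amphtml" href="https://example.com/news/story/amp">
</head><body></body></html>
"""

print(find_link_rels(sample_page))
```

A page with no such elements yields an empty dict, which is a sign that the crawler would have nothing to resolve the url against.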
Good citizen practices
All Marfeel bots follow these rules in order to be good web citizens:
- Sites are not proactively crawled to identify new content. Marfeel only crawls urls with active users.
- Marfeel limits the number of concurrent requests to each of our clients’ servers.
- All assets are centrally cached so different bots may reuse them without having to fetch them separately.
- Redirects are not followed unless necessary.
Marfeel crawlers
Marfeel currently uses 3 types of crawlers.
Editorial crawler
This crawler obtains editorial information such as the title, images and many other inputs to build the editorial profile of the page. It crawls all pages with pageviews the first time they are visited, and every time the content is modified.
Modifications are detected through the article:modified_time meta property or the article's structured data field dateModified. Make sure these fields are correctly updated so that Marfeel and Googlebot are always up to date.
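To confirm that a page exposes both fields, you can extract them from its HTML. This is a rough sketch with the Python standard library; the function name and the sample markup are illustrative assumptions, and it only handles JSON-LD whose top level is a single object (not arrays or @graph containers).

```python
import json
from html.parser import HTMLParser


class ModifiedTimeCollector(HTMLParser):
    """Reads article:modified_time and a top-level JSON-LD dateModified."""

    def __init__(self):
        super().__init__()
        self.meta_modified = None
        self.jsonld_modified = None
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("property") == "article:modified_time":
            self.meta_modified = attributes.get("content")
        elif tag == "script" and attributes.get("type") == "application/ld+json":
            # start buffering the JSON-LD payload
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                doc = json.loads("".join(self._buffer))
            except ValueError:
                return  # malformed JSON-LD: leave jsonld_modified unset
            if isinstance(doc, dict) and "dateModified" in doc:
                self.jsonld_modified = doc["dateModified"]


def find_modified_times(html):
    """Return (meta property value, structured data value); either may be None."""
    collector = ModifiedTimeCollector()
    collector.feed(html)
    return collector.meta_modified, collector.jsonld_modified


# Hypothetical markup used only for this example
sample_page = """
<head>
  <meta property="article:modified_time" content="2024-05-02T10:15:00+00:00">
  <script type="application/ld+json">
  {"@type": "NewsArticle", "dateModified": "2024-05-02T10:15:00+00:00"}
  </script>
</head>
"""

print(find_modified_times(sample_page))
```

If the two values disagree or stay unchanged after an edit, the page is a likely candidate for stale editorial information.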
The user agent used by the editorial crawler is:
Mozilla/5.0 (compatible; NewsRoom.BI/0.1; +http://www.newsroom.bi/bot.html)
Whenever a crawl attempt fails, the crawler retries every hour, up to 10 times, before giving up on a url.
If the editorial information of an article is out of date, you can force Marfeel to crawl it using the link in the “Updated x hours ago” dialog.
Audits crawler
To detect issues with structured data, meta tags and many other potential problems in our clients’ HTML, Marfeel periodically crawls all relevant urls (the ones that have traffic) using the following user agents:
Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA51N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Mobile Safari/537.36 (compatible; mrfCompass-Booldog/1.0)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 (compatible; mrfCompass-Booldog/1.0)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; mrfCompass-Marshall/1.0)
- mrfCompass-Booldog will crawl each url initially using a mobile user agent, and if a vary: User-Agent header is received in the response, it will crawl it using a desktop user agent as well.
- mrfCompass-Marshall will crawl all amphtml links found by mrfCompass-Booldog.
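The mobile-then-desktop rule can be illustrated with a small check on response headers. This is only a sketch of the decision as described above, not Marfeel's actual implementation; the function names are hypothetical.

```python
def varies_on_user_agent(headers):
    """Return True when the Vary response header lists User-Agent."""
    for name, value in headers.items():
        # header names are case-insensitive
        if name.lower() == "vary":
            tokens = [token.strip().lower() for token in value.split(",")]
            if "user-agent" in tokens:
                return True
    return False


def user_agents_to_crawl(headers):
    """Mobile is always crawled; desktop only when the response varies by User-Agent."""
    plan = ["mobile"]
    if varies_on_user_agent(headers):
        plan.append("desktop")
    return plan


print(user_agents_to_crawl({"Vary": "Accept-Encoding, User-Agent"}))
print(user_agents_to_crawl({"Content-Type": "text/html"}))
```

Running the same check against your own responses shows whether a url would be audited once or twice.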
Flowcards crawler
Flowcards that load content directly from specific urls also use a bot to fetch that content. This bot identifies itself with the following user agent:
Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA51N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Mobile Safari/537.36 (compatible; mrfCompass-Jukebox/1.0)
The crawl frequency respects the cache-control header returned.
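How the bot interprets each cache-control directive is not specified here. As an assumption, the sketch below reads the standard max-age directive, which is the usual way a server signals how long a response may be reused; the function name is hypothetical.

```python
def max_age_seconds(cache_control):
    """Return the max-age value in seconds, or None if absent or invalid."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.strip().lower() == "max-age":
            value = value.strip().strip('"')
            return int(value) if value.isdecimal() else None
    return None


print(max_age_seconds("public, max-age=600"))
print(max_age_seconds("no-store"))
```

Checking the header your endpoint returns gives a rough idea of how often the content is likely to be fetched again.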
Whitelisting Marfeel crawlers
Many hosting and CDN providers include WAF services that may consider Marfeel bots to be potentially malicious and block them.
To make sure Marfeel can access and monitor your website, you can either whitelist the user agents mentioned above or whitelist our static IPs, available here.
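If your allow rules are written in code rather than in a WAF console, matching on the product tokens in the user agents listed above is one option. The sketch below is illustrative only: the constant and function names are made up, and since a user agent string is easy to spoof, it is best combined with the DNS check described under Verifying Marfeel Crawlers.

```python
# Product tokens taken from the user agent strings listed in this document
MARFEEL_UA_TOKENS = (
    "NewsRoom.BI/",
    "mrfCompass-Booldog/",
    "mrfCompass-Marshall/",
    "mrfCompass-Jukebox/",
)


def marfeel_bot_name(user_agent):
    """Return the Marfeel bot token found in the user agent, or None."""
    for token in MARFEEL_UA_TOKENS:
        if token in user_agent:
            return token.rstrip("/")
    return None


editorial_ua = "Mozilla/5.0 (compatible; NewsRoom.BI/0.1; +http://www.newsroom.bi/bot.html)"
print(marfeel_bot_name(editorial_ua))
print(marfeel_bot_name("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"))
```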
Cloudflare
If you are using Cloudflare as your CDN provider, you can whitelist Marfeel crawlers’ IPs following these steps:
- On your Cloudflare console, click on the firewall icon on the Tools tab.
- List Marfeel’s crawler IP addresses under the IP Access Rules:
   a. Enter the IP address
   b. Choose Whitelist as the action to apply
   c. Choose the website to which the whitelisting rule applies
- Click add
- Repeat for each IP
Verifying Marfeel Crawlers
All Marfeel crawler IP addresses have a reverse DNS record pointing to crawler.marfeel.com.
You can use it to verify the authenticity of Marfeel bots by following these steps:
- Run a reverse DNS lookup on the accessing IP address from your logs, using the host command.
- Verify that the domain name is crawler.marfeel.com.
- Run a forward DNS lookup on the domain name retrieved in the first step, using the host command.
- Verify that the result includes the original accessing IP address from your logs.
$ host 162.55.235.182
182.235.55.162.in-addr.arpa domain name pointer crawler.marfeel.com.
$ host crawler.marfeel.com
crawler.marfeel.com is an alias for vampiresquid.het.mrf.io.
vampiresquid.het.mrf.io has address 162.55.235.186
vampiresquid.het.mrf.io has address 162.55.235.182
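The same reverse-then-forward check can be automated, for example when processing access logs. Below is a minimal sketch using Python's socket module; the function names are hypothetical, and the lookups are passed in as parameters so the logic can be exercised without network access. It assumes the resolver returns the PTR name itself, as the host command does.

```python
import socket


def reverse_lookup(ip):
    """Return the host name for an IP address (like `host <ip>`)."""
    hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    return hostname


def forward_lookup(hostname):
    """Return all IPv4 addresses for a host name (like `host <name>`)."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return addresses


def is_marfeel_crawler(ip, reverse=reverse_lookup, forward=forward_lookup):
    """Reverse lookup must give crawler.marfeel.com and resolve back to the IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if hostname != "crawler.marfeel.com":
            return False
        # the forward lookup may return several addresses
        return ip in forward(hostname)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses
        return False


# Stub resolvers replaying the host output shown above (no network needed)
def stub_reverse(ip):
    return "crawler.marfeel.com."


def stub_forward(hostname):
    return ["162.55.235.186", "162.55.235.182"]


print(is_marfeel_crawler("162.55.235.182", stub_reverse, stub_forward))
```

Calling is_marfeel_crawler(ip) without the stub arguments performs real DNS queries, so in a log-processing job it is worth caching the result per IP.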