clubmate.fi

Block bad bots in your robots.txt

Filed under: Webdev— Tagged with: seo, tools

This post shows how to block bad bots in your robots.txt file, and tries to figure out whether doing so achieves anything. Spoiler alert: probably not.

Blocking bots in your robots.txt doesn’t stop a human from scouring your site. It only deters some web-based tools and automated scraping attempts, and maybe really dumb script kiddies.

For example, blocking wget does actually stop its recursive downloads, but any human with more than three brain cells knows how to bypass the robots limitation:

$ wget -e robots=off --recursive https://example.com

I don’t actually have any data to show whether blocking bad bots hampers attempts to scrape or otherwise prod your site, just guesswork and conjecture.

List of bad bots

This is where it gets tricky immediately. There used to be a site called botreports.com, but the domain has been sniped and replaced by a placeholder page with affiliate links. I do have a copy of the bots listed on that site, though (from July 11th 2020).

// Copied from http://www.botreports.com/badbots/index.shtml at 2020-06-11.
const badBots = [
  'AITCSRobot/1.1',
  'Alexibot',
  'Aqua_Products',
  'Arachnophilia',
  'ASpider/0.09',
  'asterias',
  'asterias',
  'AURESYS/1.0',
  'b2w/0.1',
  'BackDoorBot',
  'BackDoorBot/1.0',
  'BackRub/.',
  'Baiduspider-video',
  'Big Brother',
  'Bizbot003',
  'BizBot04 kirk.overleaf.com',
  'Black Hole',
  'Black.Hole',
  'BlackWidow',
  'BLEXBot',
  'BlowFish',
  'BlowFish/1.0',
  'Bookmark search tool',
  'Bot mailto:craftbot@yahoo.com',
  'BotALot',
  'BotRightHere',
  'BSpider/1.0 libwww-perl/0.40',
  'BuiltBotTough',
  'Bullseye',
  'Bullseye/1.0',
  'BunnySlippers',
  'CACTVS Chemistry Spider',
  'Cegbfeieh',
  'ChangeDetection',
  'Checkbot/x.xx LWP/5.x',
  'CheeseBot',
  'CherryPicker',
  'CherryPickerElite/1.0',
  'CherryPickerSE/1.0',
  'ChinaClaw',
  'combine/0.0',
  'conceptbot/0.3',
  'Copernic',
  'CopyRightCheck',
  'cosmos',
  'Crescent',
  'Crescent Internet ToolPak HTTP OLE Control v.1.0',
  'Custo',
  'CyberPatrol SiteCat Webbot',
  'CyberSpyder/2.1',
  'Daumoa',
  'Deweb/1.01',
  'DISCo',
  'DISCo Pump 3.0',
  'DISCo Pump 3.2',
  'DISCoFinder',
  'DittoSpyder',
  'Download Demon',
  'Download Demon/3.2.0.8',
  'Download Demon/3.5.0.11',
  'dumbot',
  'eCatch',
  'eCatch/3.0',
  'EirGrabber',
  'EmailCollector',
  'EmailSiphon',
  'EmailWolf',
  'EnigmaBot',
  'EroCrawler',
  'es',
  'explorersearch',
  'Express WebPictures',
  'Express WebPictures (www.express-soft.com)',
  'ExtractorPro',
  'EyeNetIE',
  'FairAd Client',
  'FelixIDE/1.0',
  'fido/0.9 Harvest/1.4.pl2',
  'Fish-Search-Robot',
  'Flaming AttackBot',
  'FlashGet',
  'FlashGet WebWasher 3.2',
  'Foobot',
  'Freecrawl',
  'FrontPage',
  'FrontPage [NC,OR]',
  'Gaisbot',
  'gcreep/1.0',
  'GetRight',
  'GetRight/2.11',
  'GetRight/3.1',
  'GetRight/3.2',
  'GetRight/3.3',
  'GetRight/3.3.3',
  'GetRight/3.3.4',
  'GetRight/4.0.0',
  'GetRight/4.1.0',
  'GetRight/4.1.1',
  'GetRight/4.1.2',
  'GetRight/4.2',
  'GetRight/4.2b (Portuguxeas)',
  'GetRight/4.2c',
  'GetRight/4.3',
  'GetRight/4.5',
  'GetRight/4.5a',
  'GetRight/4.5b',
  'GetRight/4.5b1',
  'GetRight/4.5b2',
  'GetRight/4.5b3',
  'GetRight/4.5b6',
  'GetRight/4.5b7',
  'GetRight/4.5c',
  'GetRight/4.5d',
  'GetRight/4.5e',
  'GetRight/5.0beta1',
  'GetRight/5.0beta2',
  'GetURL.rexx v1.05',
  'GetWeb!',
  'Go!Zilla',
  'Go!Zilla (www.gozilla.com)',
  'Go!Zilla 3.3 (www.gozilla.com)',
  'Go!Zilla 3.5 (www.gozilla.com)',
  'Go-Ahead-Got-It',
  'Golem/1.1',
  'GrabNet',
  'Grafula',
  'Gromit/1.0',
  'grub',
  'HappyFunBot',
  'Harvest',
  'Harvest/1.5',
  'Hatena Antenna',
  "Hazel's Ferret Web hopper",
  'hloader',
  'HMView',
  'httplib',
  'HTTrack',
  'HTTrack 3.0',
  'HTTrack [NC,OR]',
  'Huaweisymantecspider',
  'humanlinks',
  'Hämähäkki/0.2',
  'Image Stripper',
  'Image Sucker',
  'inagist.com url crawler',
  'IncyWincy/1.0b1',
  'Indy Library',
  'Indy Library [NC,OR]',
  'InfoNaviRobot',
  'Informant',
  'INGRID/0.1',
  'InterGET',
  'Internet Ninja',
  'Internet Ninja 4.0',
  'Internet Ninja 5.0',
  'Internet Ninja 6.0',
  'Iron33/1.0.2',
  'IsraeliSearch/1.0',
  'ITI Spider',
  'JennyBot',
  'JetCar',
  'JOC Web Spider',
  'JubiiRobot',
  'jumpstation',
  'Katipo/1.0',
  'Kenjin Spider',
  'Kenjin.Spider',
  'Keyword Density/0.9',
  'Keyword.Density',
  'KIT-Fireball/2.0 libwww/5.0a',
  'LabelGrab/1.1',
  'larbin',
  'larbin (samualt9@bigfoot.com)',
  'larbin samualt9@bigfoot.com',
  'larbin_2.6.2 (kabura@sushi.com)',
  'larbin_2.6.2 (larbin2.6.2@unspecified.mail)',
  'larbin_2.6.2 (listonATccDOTgatechDOTedu)',
  'larbin_2.6.2 (vitalbox1@hotmail.com)',
  'larbin_2.6.2 kabura@sushi.com',
  'larbin_2.6.2 larbin2.6.2@unspecified.mail',
  'larbin_2.6.2 larbin@correa.org',
  'larbin_2.6.2 listonATccDOTgatechDOTedu',
  'larbin_2.6.2 vitalbox1@hotmail.com',
  'LeechFTP',
  'LexiBot',
  'libWeb/clsHTTP',
  'LinkextractorPro',
  'linklooker',
  'LinkScan/8.1a Unix',
  'LinkScan/8.1a.Unix',
  'LinkWalker',
  'LNSpiderguy',
  'lwp-trivial',
  'lwp-trivial/1.34',
  'Mass Downloader',
  'Mass Downloader/2.2',
  'Mata Hari',
  'Mata.Hari',
  'MediaFox/x.y',
  'MerzScope',
  'METAGOPHER',
  'Microsoft URL Control',
  'Microsoft URL Control - 5.01.4511',
  'Microsoft URL Control - 6.00.8169',
  'Microsoft.URL',
  'MIDown tool',
  'MIIxpc',
  'MIIxpc/4.2',
  'Mister PiX',
  'Mister Pix II 2.01',
  'Mister Pix II 2.02a',
  'Mister PiX version.dll',
  'Mister.PiX',
  'moget',
  'moget/2.1',
  'MOMspider/1.00 libwww-perl/0.40',
  'Motor/0.2',
  'MSIECrawler',
  'naver',
  'Navroad',
  'NearSite',
  'NeoScioCrawler',
  'Net Vampire',
  'Net Vampire/3.0',
  'NetAnts',
  'NetAnts/1.10',
  'NetAnts/1.23',
  'NetAnts/1.24',
  'NetAnts/1.25',
  'NetCarta CyberPilot Pro',
  'NetMechanic',
  'NetScoop/1.0 libwww/5.0a',
  'NetSpider',
  'NetZIP',
  'NetZip Downloader 1.0 Win32(Nov 12 1998)',
  'NetZip-Downloader/1.0.62 (Win32; Dec 7 1998)',
  'NetZippy+(http://www.innerprise.net/usp-spider.asp)',
  'NHSEWalker/3.0',
  'NICErsPRO',
  'Nomad-V2.x',
  'NPbot',
  'Nutch',
  'Occam/1.0',
  'Octopus',
  'Offline Explorer',
  'Offline Explorer/1.2',
  'Offline Explorer/1.4',
  'Offline Explorer/1.6',
  'Offline Explorer/1.7',
  'Offline Explorer/1.9',
  'Offline Explorer/2.0',
  'Offline Explorer/2.1',
  'Offline Explorer/2.3',
  'Offline Explorer/2.4',
  'Offline Explorer/2.5',
  'Offline Navigator',
  'Offline.Explorer',
  'OGspider',
  'Open Text Site Crawler V1.0',
  'Openbot',
  'Openfind',
  'Openfind data gatherer',
  'Oracle Ultra Search',
  'PageGrabber',
  'Papa Foto',
  'pavuk',
  'pcBrowser',
  'PerMan',
  'PGP-KA/1.2',
  'ProPowerBot/2.14',
  'ProWebWalker',
  'psbot',
  'Python-urllib',
  'QueryN Metasearch',
  'QueryN.Metasearch',
  'R6_CommentReader',
  'R6_FeedFetcher',
  'Radiation Retriever 1.1',
  'RealDownload',
  'RealDownload/4.0.0.40',
  'RealDownload/4.0.0.41',
  'RealDownload/4.0.0.42',
  'ReGet',
  'RepoMonkey',
  'RepoMonkey Bait & Tackle/v1.01',
  'Resume Robot',
  'RMA',
  'Roverbot',
  'SafetyNet Robot 0.1',
  'searchpreview',
  'Senrigan/xxxxxx',
  'SiteSnagger',
  'SlySearch',
  'SmartDownload',
  'SmartDownload/1.2.76 (Win32; Apr 1 1999)',
  'SmartDownload/1.2.77 (Win32; Aug 17 1999)',
  'SmartDownload/1.2.77 (Win32; Feb 1 2000)',
  'SmartDownload/1.2.77 (Win32; Jun 19 2001)',
  'Snooper/b97_01',
  'Solbot/1.0 LWP/5.07',
  'sootle',
  'SpankBot',
  'spanner',
  'Spanner/1.0 (Linux 2.0.27 i586)',
  'spyder3.microsys.com',
  'Sqworm/2.9.85-BETA (beta_release; 20011115-775; i686-pc-linux',
  'SuperBot',
  'SuperBot/3.0 (Win32)',
  'SuperBot/3.1 (Win32)',
  'SuperHTTP',
  'SuperHTTP/1.0',
  'Surfbot',
  'suzuran',
  'Szukacz/1.4',
  'tAkeOut',
  'Teleport',
  'Teleport Pro',
  'Teleport Pro/1.29',
  'Teleport Pro/1.29.1590',
  'Teleport Pro/1.29.1634',
  'Teleport Pro/1.29.1718',
  'Teleport Pro/1.29.1820',
  'Teleport Pro/1.29.1847',
  'TeleportPro',
  'Telesoft',
  'The Intraformant',
  'The.Intraformant',
  'TheNomad',
  'TightTwatBot',
  'Titan',
  'toCrawl/UrlDispatcher',
  'True_Robot',
  'True_Robot/1.0',
  'turingos',
  'TurnitinBot',
  'UCSD-Crawler',
  'UnisterBot',
  'UnwindFetchor/1.0',
  'URL Control',
  'URLSpiderPro',
  'urlck/1.2.3',
  'URLy Warning',
  'URLy.Warning',
  'Valkyrie/1.0 libwww-perl/0.40',
  'vBSEO',
  'VCI',
  'VCI WebViewer VCI WebViewer Win32',
  'VoidEYE',
  'Web Image Collector',
  'Web Sucker',
  'Web.Image.Collector',
  'WebAuto',
  'WebAuto/3.40 (Win98; I)',
  'WebBandit',
  'WebBandit/3.50',
  'WebCapture 2.0',
  'WebCopier',
  'WebCopier v.2.2',
  'WebCopier v2.5',
  'WebCopier v2.6',
  'WebCopier v2.7a',
  'WebCopier v2.8',
  'WebCopier v3.0',
  'WebCopier v3.0.1',
  'WebCopier v3.2',
  'WebCopier v3.2a',
  'WebCopy/',
  'WebCrawler/3.0 Robot libwww/5.0a',
  'WebEMailExtrac.*',
  'WebEnhancer',
  'WebFerret',
  'WebFetch',
  'webfetch/2.1.0',
  'WebFetcher/0.8,',
  'WebGo IS',
  'weblayers/0.0',
  'WebLeacher',
  'WebLinker/0.0 libwww-perl/0.1',
  'WebmasterWorld Extractor',
  'WebmasterWorld Extractor',
  'WebmasterWorldForumBot',
  'WebmasterWorldForumBot',
  'WebMoose/0.0.0000',
  'WebReaper',
  'WebReaper [info@webreaper.net]',
  'WebReaper [webreaper@otway.com]',
  'WebReaper v9.1 - www.otway.com/webreaper',
  'WebReaper v9.7 - www.webreaper.net',
  'WebReaper v9.8 - www.webreaper.net',
  'WebReaper vWebReaper v7.3 - www,otway.com/webreaper',
  'webs@recruit.co.jp',
  'WebSauger',
  'WebSauger 1.20b',
  'WebSauger 1.20j',
  'WebSauger 1.20k',
  'Website eXtractor',
  'Website Quester',
  'Website Quester - www.asona.org',
  'Website Quester - www.esalesbiz.com/extra/',
  'Website.Quester',
  'Webster Pro',
  'Webster.Pro',
  'WebStripper',
  'WebStripper/2.03',
  'WebStripper/2.10',
  'WebStripper/2.12',
  'WebStripper/2.13',
  'WebStripper/2.15',
  'WebStripper/2.16',
  'WebStripper/2.19',
  'WebVac',
  'webvac/1.0',
  'webwalk',
  'WebWalker',
  'WebWalker/1.10',
  'WebWatch',
  'WebWhacker',
  'WebZIP',
  'WebZIP/2.75 (http://www.spidersoft.com)',
  'WebZIP/3.65 (http://www.spidersoft.com)',
  'WebZIP/3.80 (http://www.spidersoft.com)',
  'WebZip/4.0',
  'WebZIP/4.0 (http://www.spidersoft.com)',
  'WebZIP/4.1 (http://www.spidersoft.com)',
  'WebZIP/4.21',
  'WebZIP/4.21 (http://www.spidersoft.com)',
  'WebZIP/5.0',
  'WebZIP/5.0 (http://www.spidersoft.com)',
  'WebZIP/5.0 PR1 (http://www.spidersoft.com)',
  'Wget',
  'wget',
  'Wget/1.4.0',
  'Wget/1.5.2',
  'Wget/1.5.3',
  'Wget/1.6',
  'Wget/1.7',
  'Wget/1.8',
  'Wget/1.8.1',
  'Wget/1.8.1+cvs',
  'Wget/1.8.2',
  'Wget/1.9-beta',
  'WhoWhere Robot',
  'Widow',
  'wired-digital-newsbot/1.5',
  'WWW Collector',
  'WWW-Collector-E',
  'www.freeloader.com.',
  'WWWOFFLE',
  'WWWWanderer v3.0',
  'Xaldon WebSpider',
  'Xaldon WebSpider 2.5.b3',
  'Xaldon_WebSpider',
  "Xenu's",
  "Xenu's Link Sleuth 1.1c",
  'XGET/0.7',
  'Yasaklibot',
  'yes',
  'YesupBot',
  'Yeti',
  'Zeus',
  'Zeus 11389 Webster Pro V2.9 Win32',
  'Zeus 11652 Webster Pro V2.9 Win32',
  'Zeus 18018 Webster Pro V2.9 Win32',
  'Zeus 26378 Webster Pro V2.9 Win32',
  'Zeus 30747 Webster Pro V2.9 Win32',
  'Zeus 32297 Webster Pro V2.9 Win32',
  'Zeus 39206 Webster Pro V2.9 Win32',
  'Zeus 41641 Webster Pro V2.9 Win32',
  'Zeus 44238 Webster Pro V2.9 Win32',
  'Zeus 51070 Webster Pro V2.9 Win32',
  'Zeus 51674 Webster Pro V2.9 Win32',
  'Zeus 51837 Webster Pro V2.9 Win32',
  'Zeus 63567 Webster Pro V2.9 Win32',
  'Zeus 6694 Webster Pro V2.9 Win32',
  'Zeus 82016 Webster Pro V2.9 Win32',
  'Zeus 82900 Webster Pro V2.9 Win32',
  'Zeus 84842 Webster Pro V2.9 Win32',
  'Zeus 90872 Webster Pro V2.9 Win32',
  'Zeus 94934 Webster Pro V2.9 Win32',
  'Zeus 95245 Webster Pro V2.9 Win32',
  'Zeus 95351 Webster Pro V2.9 Win32',
  'Zeus 97371 Webster Pro V2.9 Win32',
  'Zeus Link Scout',
  'ZyBorg'
]

export default badBots

Questions about that list:

  • Is it up to date? Probably not.
  • Does the author of the list know what they’re doing? I don’t know.
  • Should I feel safer after blocking these bots? Definitely not.
  • Does it actually do anything? Yes... probably...
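
The list also has some cruft in it: a few names appear twice (asterias, for instance), and a couple carry what looks like leftover Apache rewrite flags ([NC,OR]). Here's a minimal sketch of tidying it up before use. The helper name cleanBotList is my own invention, and the case-insensitive dedupe assumes crawlers compare names without regard to case, which is what RFC 9309 asks of them.

```javascript
// Hypothetical helper, not part of the original list's source: tidies a copied bot list.
function cleanBotList(bots) {
  const seen = new Set()
  const cleaned = []
  for (const bot of bots) {
    // Drop a trailing "[NC,OR]"-style flag block, which looks like .htaccess residue.
    const name = bot.replace(/\s*\[[A-Z,]+\]$/, '').trim()
    // Compare case-insensitively so 'Wget' and 'wget' collapse into one entry.
    const key = name.toLowerCase()
    if (name === '' || seen.has(key)) continue
    seen.add(key)
    cleaned.push(name)
  }
  return cleaned
}

// A few entries taken from the list above.
const sample = ['asterias', 'asterias', 'FrontPage', 'FrontPage [NC,OR]', 'Wget', 'wget']
console.log(cleanBotList(sample)) // [ 'asterias', 'FrontPage', 'Wget' ]
```

This keeps the first spelling it sees and leaves everything else (version numbers, email addresses in brackets) alone.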

Making the robots.txt file

The following allows everything:

User-agent: *
Allow: /

The syntax for disallowing is not much different:

User-agent: userAgentName
Disallow: /

Targeting more than one user agent? List them one per line before the Disallow: / statement:

User-agent: userAgentName
User-agent: userAgentName2
Disallow: /

See the official documentation at robotstxt.org.
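
To make the grouping rule concrete, here's a rough sketch of how a well-behaved crawler might read those lines. The function names parseGroups and isBlockedEverywhere are made up for this example, it only understands whole-site Allow: / and Disallow: / rules, and it matches names exactly, which real crawlers may not do. The point is that the check runs inside the crawler, so nothing forces a bot to perform it.

```javascript
// Hypothetical sketch of a crawler-side check; not a complete robots.txt parser.
function parseGroups(robotsTxt) {
  const groups = []
  let current = null
  let lastWasAgent = false
  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.split('#')[0].trim() // strip comments
    const colon = line.indexOf(':')
    if (colon === -1) continue // blank or malformed line
    const field = line.slice(0, colon).trim().toLowerCase()
    const value = line.slice(colon + 1).trim()
    if (field === 'user-agent') {
      // Consecutive User-agent lines share the rules that follow them.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] }
        groups.push(current)
      }
      current.agents.push(value.toLowerCase())
      lastWasAgent = true
    } else if ((field === 'allow' || field === 'disallow') && current) {
      current.rules.push({ field, path: value })
      lastWasAgent = false
    }
  }
  return groups
}

function isBlockedEverywhere(robotsTxt, userAgent) {
  const groups = parseGroups(robotsTxt)
  const name = userAgent.toLowerCase()
  // Prefer a group naming this agent; fall back to the "*" group.
  const group =
    groups.find(g => g.agents.includes(name)) ||
    groups.find(g => g.agents.includes('*'))
  if (!group) return false
  return group.rules.some(r => r.field === 'disallow' && r.path === '/')
}

const example = `User-agent: *
Allow: /

User-agent: userAgentName
User-agent: userAgentName2
Disallow: /`

console.log(isBlockedEverywhere(example, 'userAgentName2')) // true
console.log(isBlockedEverywhere(example, 'SomeOtherBot')) // false
```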

Here’s a simple Node script that builds an example robots.txt, one that allows everything and then disallows the bad bots:

// Assumes badBots.js default-exports the array shown earlier.
import badBots from './badBots.js'

// One "User-agent" line per bot; they all share the Disallow rule below.
const bots = badBots.map(badBot => `User-agent: ${badBot}`)
const robotsTxt = `User-agent: *
Allow: /

${bots.join('\n')}
Disallow: /`

export default robotsTxt
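
That script only exports a string, so something still has to put it in a file. Here's one possible variation, sketched with a made-up function name (buildRobotsTxt) and a two-bot sample instead of the full list. It prints to stdout so the shell can do the file writing, and it guards against an empty list, where a lone Disallow: / could end up being read as part of the * group.

```javascript
// Hypothetical variation of the script above, wrapped in a function.
function buildRobotsTxt(bots) {
  const lines = ['User-agent: *', 'Allow: /']
  // Guard: with no bots, skip the second group entirely.
  if (bots.length > 0) {
    lines.push('', ...bots.map(bot => `User-agent: ${bot}`), 'Disallow: /')
  }
  // End with a newline, as text files conventionally do.
  return lines.join('\n') + '\n'
}

// Print to stdout so the result can be redirected into a file.
process.stdout.write(buildRobotsTxt(['Wget', 'HTTrack']))
```

Saved as, say, build-robots.js (the file name is arbitrary), it could be run with node build-robots.js > robots.txt.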

What are the big guns doing?

Alexa top 25 global sites (as of November 18, 2020):

Site             Has robots     Blocks bots
google.com       /robots.txt
youtube.com      /robots.txt
tmall.com        /robots.txt
baidu.com        /robots.txt
qq.com           /robots.txt
facebook.com     /robots.txt
sohu.com         /robots.txt
taobao.com       /robots.txt
360.cn           /robots.txt
jd.com           /robots.txt
amazon.com       /robots.txt
yahoo.com        /robots.txt
wikipedia.org    /robots.txt
zoom.us          /robots.txt
weibo.com        /robots.txt
sina.com.cn      /robots.txt
live.com         /robots.txt
reddit.com       /robots.txt
xinhuanet.com    /robots.txt
netflix.com      /robots.txt
microsoft.com    /robots.txt
okezone.com      /robots.txt
office.com       /robots.txt
vk.com           /robots.txt
instagram.com    /robots.txt

Most of them block certain paths in their applications and a handful of well-known bots, but nothing major, which makes me think more and more that this is useless. Or they’re doing the blocking at the web server level (NGINX, Apache).
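
For comparison, here's roughly what blocking at the server level could look like, sketched in plain Node rather than NGINX or Apache config. The names blockedNames, isBlockedUserAgent and handleRequest are made up for this example, and the three blocked names are just a sample from the list. Unlike robots.txt, the server decides here, not the bot, though the User-Agent header is trivial to fake (wget has a --user-agent option for exactly that).

```javascript
// Hypothetical sketch: refuse requests whose User-Agent header contains a blocked name.
const blockedNames = ['wget', 'httrack', 'webzip'] // lower-cased sample from the list

function isBlockedUserAgent(userAgent, names = blockedNames) {
  // A missing header is treated as an empty string, so it is not blocked.
  const ua = (userAgent || '').toLowerCase()
  return names.some(name => ua.includes(name))
}

// Request handler in the shape Node's http.createServer expects.
function handleRequest(req, res) {
  if (isBlockedUserAgent(req.headers['user-agent'])) {
    res.writeHead(403, { 'Content-Type': 'text/plain' })
    res.end('Forbidden\n')
    return
  }
  res.writeHead(200, { 'Content-Type': 'text/plain' })
  res.end('Hello\n')
}

console.log(isBlockedUserAgent('Wget/1.21.3')) // true
console.log(isBlockedUserAgent('Mozilla/5.0 (X11; Linux x86_64)')) // false
```

To try it, pass the handler to Node's built-in http module, for example http.createServer(handleRequest).listen(3000), where the port is arbitrary.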

  • © 2021 Antti Hiljá