Search Engine Spiders identificeren
Geplaatst: 21 feb 2008, 20:59
Voor enkele weken vroeg ik hier om het script voor de installatie van extra bots. Helaas is dat off-line ivm problemen. Het bijhouden van bots is belangrijk omdat een bot die niet bekend is geen goede toegang krijgt zo blijkt in de praktijk. Als je bijvoorbeeld forums afschermt en pas laat lezen als een gebruiker inlogt dan heb je een probleem met de bereikbaarheid van je forum voor de bots die niet bekend zijn in je ACP van je forum. Ook het maken van een sitemap met bijvoorbeeld A1 (gedraagt zich als bot) gaat niet goed meer lukken. Als het je niet uitmaakt dat je goed in de searchengines komt en SEO is niet bekangrijk voor je dan kun je dit vanzelfsprekend allemaal overslaan en laten zitten.
Dus om goede bots hun werk goed te laten doen, hier de lijsten met BOTS die hun werk normaal doen.
Avondje tikken of misschien kan iemand het script aanpassen (dat er al is) met al deze Search Engine Spiders List als MOD. De eerste lijst is wat verouderd zo lijkt het, de tweede met 303 bots redelijk up-to-date.
303 bots met download optie Robots Database, check site , aantal inactief!
Dit is een interessante link als je wilt weten welke bots er rondwaren op site die je ondanks dit allemaal nog steeds niet hebt geïdentificeerd en dus als gast binnekomen.
Tot slot misschien wel de belangrijkste om niet te vergeten, de BAD BOTS, de spammers en degenen die je robots.txt negeren:
en geleend van webmasterworld A Close to perfect .htaccess ban list
Dus om goede bots hun werk goed te laten doen, hier de lijsten met BOTS die hun werk normaal doen.
Avondje tikken of misschien kan iemand het script aanpassen (dat er al is) met al deze Search Engine Spiders List als MOD. De eerste lijst is wat verouderd zo lijkt het, de tweede met 303 bots redelijk up-to-date.
Code: Selecteer alles
http://www.siteware.ch/webresources/useragents/spiders/
303 bots met download optie Robots Database, check site , aantal inactief!
Code: Selecteer alles
http://www.robotstxt.org/db.html
Code: Selecteer alles
http://www.crawltrack.fr/
Code: Selecteer alles
http://www.kloth.net/internet/badbots.php
en geleend van webmasterworld A Close to perfect .htaccess ban list
Code: Selecteer alles
<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org$
RewriteRule !^http://[^/.]\.your-site.com.* - [F]
Code: Selecteer alles
Here's my latest bad bot list. Post if there are more, or if any shouldn't be there.
I know there's a couple that are duplicates with the [NC} option inplace...
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Bullseye.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^BDFetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^bot\ mailto:craftbot@yahoo.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPickerSE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPickerElite [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent.*Internet.*ToolPak.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*[Ee]mail.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archive [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^internetseer.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InternetSeer.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^microsoft.url [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*Indy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MSFrontPage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*NEWT.*ActiveX [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Siphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Superbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SurfWalker.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport*28 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Telesoft [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^TV33_Mercator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Cc]ollector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb.*[Mm]ole [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Ww]easel [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [NC,OR]
RewriteCond %{REQUEST_URI} formmail.* [NC,OR]
RewriteCond %{REQUEST_URI} FormMail.*
RewriteRule ^.*$ byebye.php [L]