Web Site Robots.txt

Hundreds of web robots crawl the Internet to build search engine databases, but they generally follow the instructions in a site’s robots.txt file.

Web robots and spiders have been crawling the Internet since its inception. With the development of the web and the growth of search engines through the late 1990s, the number of automated robots wandering the web grew quickly; it was soon not unusual for a site to be visited by several dozen of them in the course of a day.

Recognizing the need to help control these automated visitors, Martijn Koster designed the file “robots.txt”, which a web site uses to tell search engines which directories and files to exclude from their robot scans, whether for functional or privacy reasons. Robots are not obliged to follow the instructions in a robots.txt file, but most do. For example, a web site might exclude one or more directories from one or more robots because it cannot afford the bandwidth the crawls consume, or because the directories contain highly dynamic information, such as topical news data, whose pages would be missing or changed by the time a user retrieved them later through a search engine.
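A robots.txt file is plain text, made up of records that name a robot (or all robots, with “*”) and the paths that robot should not visit. The sketch below is illustrative; the robot name and directory paths are invented for the example:

    # Keep all robots out of the CGI and news directories.
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /news/

    # Exclude one particular robot from the entire site.
    User-agent: ExampleBot
    Disallow: /

A record applies to the robot named in its User-agent line. An empty Disallow value permits everything, while “Disallow: /” blocks the whole site.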

You can read the robots.txt file for a site by opening the site name with “/robots.txt” appended, as in the examples below:
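    http://www.example.com/robots.txt

(The address above uses the reserved example domain; substitute any real site name.) A program can also read and apply these rules directly. The following sketch uses Python’s standard robotparser module (urllib.robotparser in Python 3); the site and robot name are placeholders:

    from urllib import robotparser

    # Fetch and parse the site's robots.txt file.
    parser = robotparser.RobotFileParser()
    parser.set_url("http://www.example.com/robots.txt")
    parser.read()

    # Ask whether a robot named "ExampleBot" may fetch two URLs.
    print(parser.can_fetch("ExampleBot", "http://www.example.com/news/"))
    print(parser.can_fetch("ExampleBot", "http://www.example.com/index.html"))

If a site has no robots.txt file, the parser treats every URL as permitted, which matches how most robots behave.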

Resources. The following sites provide more information about web robots:
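One long-standing starting point is Martijn Koster’s Web Robots Pages, which collect the robots.txt standard, a database of known robots, and a FAQ:

    http://www.robotstxt.org/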