Hundreds of web robots crawl the Internet and build
search engine databases, but they generally follow the instructions in
a site's robots.txt.
Web robots and spiders have been crawling the Internet since its inception.
With the development of the web and the growth of search engines through the
late 1990s, the number of automated robots wandering the web grew quickly,
so that it was not unusual for a site to find several dozen visiting in the
course of a day.
Recognizing the need to help control these automatic visitors, Martijn
Koster designed the "robots.txt" file, which a web site can use
to tell search engines which directories
and files to exclude from their robot scans, for either functional or privacy
reasons. Robots are not obliged to follow the instructions in a robots.txt
file, but most do. For example, a web site might exclude one or more directories
from one or more robots simply because the site cannot afford the bandwidth
those visits consume, or because the directories contain very dynamic
information, like topical news data, whose pages would be either missing or
changed by the time they were retrieved later through a search engine.
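To make the exclusion rules concrete, here is a minimal Python sketch that uses
the standard library's urllib.robotparser module to read a small robots.txt and
ask which pages a robot may fetch. The file contents, the directory names, the
robot names "ExampleBot" and "SomeCrawler", and the www.example.com addresses
are all made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The robot name and the directory
# names are assumptions chosen only for this illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /news/
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# A robot with no section of its own falls under the "*" rules.
print(parser.can_fetch("SomeCrawler", "http://www.example.com/index.html"))       # True
print(parser.can_fetch("SomeCrawler", "http://www.example.com/news/today.html"))  # False

# ExampleBot is excluded from the whole site by its own section.
print(parser.can_fetch("ExampleBot", "http://www.example.com/index.html"))        # False
```

The check is only advisory: a well-behaved robot performs it before each
request, but nothing in the file forces a robot to do so.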
You can read the robots.txt file for a site by opening the site's address with
"/robots.txt" appended, such as in the examples below:
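As a small illustration of how that address is formed, the Python sketch below
derives the robots.txt location from any page address on a site. The helper
name robots_url and the example.com and example.org addresses are placeholders
invented for this sketch, not real sites to visit.

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url.

    robots.txt sits at the top level of a site, so the path of the
    page is replaced by "/robots.txt".
    """
    return urljoin(page_url, "/robots.txt")


# Placeholder addresses, used only for illustration.
print(robots_url("http://www.example.com/some/page.html"))  # http://www.example.com/robots.txt
print(robots_url("https://example.org"))                     # https://example.org/robots.txt
```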
Resources. The following sites provide more information about web robots: