A web spider, also known as a web crawler, is a program that crawls through websites and indexes their pages to feed a search engine's database.
Depending on its proprietary algorithm, a spider checks every web page, weighs it, assigns a page rank based on the page's relevancy, and writes that rank to the search engine's index. The search engine then uses this index to list web pages for every search conducted through its search portal.
Robots.txt is a file kept in the root of a website and is used to instruct search engine spiders as to which areas of the website they are allowed, and not allowed, to crawl.
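As a rough illustration of how a crawler might consult such a file, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the bot name `ExampleBot`, and the `example.com` URLs are made up for this example and do not describe any particular site or search engine.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every spider is asked to stay out of
# the /private/ and /tmp/ areas; everything else is left open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

# Parse the rules from text instead of fetching them over the network.
robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

# Ask whether a (hypothetical) spider may crawl two different URLs.
public_ok = robots.can_fetch("ExampleBot", "https://example.com/index.html")
private_ok = robots.can_fetch("ExampleBot", "https://example.com/private/report.html")

print("public page allowed: ", public_ok)
print("private page allowed:", private_ok)
```

A real crawler would normally download the file from the site's root (for example with `set_url()` followed by `read()`) before making these checks; parsing from a string here just keeps the sketch self-contained.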
Spiders crawl a website based on the instructions written in the Robots file. The Robots file is built from specific commands that a spider looks for and that lay down directions for the crawler. The spider then crawls the website, collecting information, indexing the words on its pages and following every link found within the website. Spiders can read almost all text on a web page, including headings, alt tags, link titles, keywords and hidden text, but not text embedded in images, videos or Flash. They can even follow URLs that lead to another domain outside the current website.
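The collecting step can be sketched in a few lines of Python using only the standard library. This is a simplified illustration, not how any real search engine spider works: the `PageScanner` class, the sample HTML, and the `example.com` and `other.example.org` addresses are all hypothetical. It gathers visible words, alt text and link titles, and records every link, separating those that stay on the site from those that lead to another domain.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageScanner(HTMLParser):
    """Collects words and links from one page (illustrative only)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.words = []   # text the spider could index
        self.links = []   # absolute URLs found on the page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))
            if attrs.get("title"):
                self.words.extend(attrs["title"].split())
        elif tag == "img" and attrs.get("alt"):
            # The image itself is opaque, but its alt text is readable.
            self.words.extend(attrs["alt"].split())

    def handle_data(self, data):
        # Plain text between tags (headings, paragraphs, link text).
        self.words.extend(data.split())


SAMPLE_HTML = """
<html><body>
<h1>Welcome</h1>
<img src="logo.png" alt="Company logo">
<a href="/about" title="About us">About</a>
<a href="https://other.example.org/page">Partner</a>
</body></html>
"""

scanner = PageScanner("https://example.com/")
scanner.feed(SAMPLE_HTML)

site = urlparse(scanner.base_url).netloc
internal = [u for u in scanner.links if urlparse(u).netloc == site]
external = [u for u in scanner.links if urlparse(u).netloc != site]

print("words:   ", scanner.words)
print("internal:", internal)
print("external:", external)
```

A fuller crawler would queue the collected links, check each one against the site's Robots file before fetching it, and repeat the process page by page.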
Together, the Robots file and spiders help a search engine analyze and fetch accurate page information for every website.