Robots.txt is a plain-text file placed in the root directory of a website. It tells search engine spiders which pages of the website they may crawl and which pages they should ignore.
The file is built from specific directives that a spider looks for and that give the crawler its directions.
Let's analyze what is contained in a robots.txt file:
• User-agent
• Disallow
• The special values "/" (the whole website) and "*" (every spider)
User-agent: Names the search engine spider that the rules following it apply to.
For example:
User-agent: googlebot, applies the rules that follow only to Google's spider.
User-agent: *, applies the rules that follow to every search engine's spider.
Disallow: Used to specify folders or files which a spider is not allowed to crawl.
For example:
Disallow: /, specifies that none of the files on the website should be crawled.
Disallow: /images/, specifies that any file within the folder 'images' should not be crawled.
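One way to see how a spider might read these two directives is to run them through Python's standard urllib.robotparser module. This is only a minimal sketch: the helper name can_crawl, the spider name examplebot, and the example.com URLs are assumptions made for illustration, and real search engine spiders may interpret rules somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_lines, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from memory instead of fetching them
    return parser.can_fetch(agent, url)


# Hypothetical rules and URLs, for illustration only.
rules = ["User-agent: *", "Disallow: /images/"]
image_ok = can_crawl(rules, "examplebot", "http://www.example.com/images/sample.jpg")
page_ok = can_crawl(rules, "examplebot", "http://www.example.com/index.html")
print(image_ok, page_ok)
```

Under these rules the parser reports the file inside 'images' as blocked and the page outside that folder as allowed.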
A few rule sets that can be useful to a webmaster when creating robots.txt are given below:
Allow all search engine spiders to index all files
User-agent: *
Disallow:
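Allow only Google's search engine spiders to crawl the website (the second group is needed because a group that names only googlebot places no restriction on other spiders)
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /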
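Because a User-agent line only scopes the rules that follow it, it can help to check what a rule set means for spiders it does not name. The sketch below again uses Python's standard urllib.robotparser module; the helper name allowed, the spider name otherbot, and the example.com URL are hypothetical. It compares a rule set containing only a googlebot group with one that also has a catch-all group.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, agent, url="http://www.example.com/index.html"):
    """Return True if `agent` may fetch `url` under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# A group that names only googlebot and disallows nothing.
googlebot_group_only = ["User-agent: googlebot", "Disallow:"]
# The same group followed by a catch-all group that blocks everything.
with_catch_all = googlebot_group_only + ["", "User-agent: *", "Disallow: /"]

for name, rules in [("googlebot group only", googlebot_group_only),
                    ("with catch-all group", with_catch_all)]:
    # "otherbot" is a hypothetical spider that no group mentions by name.
    print(name, allowed(rules, "googlebot"), allowed(rules, "otherbot"))
```

With this parser, the first rule set leaves both spiders allowed, while the second keeps googlebot allowed and blocks the unnamed spider.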
To ignore all files in a specific directory
User-agent: *
Disallow: /images/
To ignore only a specific file
User-agent: *
Disallow: /images/sample.jpg
If you don't want any search engine to crawl any files on your website, use the following:
User-agent: *
Disallow: /
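The rule sets above all follow the same shape: a User-agent line followed by one or more Disallow lines. As a rough sketch, a small helper could generate the file from a list of groups. The function name build_robots_txt and its input format are assumptions chosen for this example, not part of any standard tool.

```python
def build_robots_txt(groups):
    """Build robots.txt text from (user_agent, disallowed_paths) pairs.

    An empty list of paths produces an empty Disallow line, which
    places no restriction on that spider.
    """
    blocks = []
    for agent, paths in groups:
        lines = [f"User-agent: {agent}"]
        lines += [f"Disallow: {path}" for path in paths] or ["Disallow:"]
        blocks.append("\n".join(lines))
    # Separate groups with a blank line and end the file with a newline.
    return "\n\n".join(blocks) + "\n"


# Hypothetical usage: block one directory for every spider.
print(build_robots_txt([("*", ["/images/"])]))
```

The resulting text can be saved as robots.txt in the root directory of the website.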