Robots.txt Guide
Posted by Mike | Filed under Miscellaneous, SEO
What is robots.txt?
Robots.txt is a plain TEXT FILE that tells search engine spiders, or crawlers, which parts of your site they may visit. Its rules can apply to all spiders or only to specific ones. Just by tweaking this file, you can ask well-behaved spiders to crawl or not to crawl certain files and directories in your site.
For example, suppose you would like Google to crawl and index everything in your wonderful site except the directory called /private, because you don’t want people who use Google as their search engine to find your stuff and waste your bandwidth. Well, your robots.txt will take care of that!
What is its Use in SEO?
Your robots.txt file plays an important role in SEO. If you set the wrong values in it, some search engines may not be able to crawl and index your site at all! What’s the use of building links, targeting good keywords and all that if YOUR robots.txt file is preventing the search engine spiders from crawling and indexing your files and pages?
Creating and Tweaking the robots.txt File
Let’s start by creating a text file. Name it “robots.txt” and don’t use any other name! After creating the file, upload it to your site’s root directory, because spiders only look for it there.
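Since the file has to sit in the root directory, its address can be worked out from any page on the same site. Here is a minimal Python sketch of that idea; the function name robots_txt_url and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page used only as an example.
location = robots_txt_url("https://www.example.com/blog/post.html?x=1")
```

If you open the resulting address in a browser and see your rules, the file is in the right place.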
Now that you’ve created and uploaded it, it’s time to enter some stuff into your robots file. Here are some entries, together with descriptions, which you can put in your robots.txt.
This will allow all spiders to crawl and index everything in your website. The asterisk is a symbol for “all robots”.
User-agent: *
Disallow:
This blocks all spiders from crawling and indexing anything in your website. The slash matches every path on the site, so when you enter this in your robots.txt file, your site will not be searchable.
User-agent: *
Disallow: /
This blocks all spiders from crawling and indexing the directories /mystuff, /secret, and /cgi-bin. It also stops them from crawling the page hmm.html in the directory /huh.
User-agent: *
Disallow: /mystuff
Disallow: /secret
Disallow: /cgi-bin
Disallow: /huh/hmm.html
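If you want to check rules like these before uploading them, Python’s standard urllib.robotparser module can read them from a string. The sketch below is just one way to do it: the helper is_allowed and the bot name AnyBot are invented for this example, and real crawlers may interpret edge cases differently from Python’s parser.

```python
from urllib.robotparser import RobotFileParser

# The same entries as in the example above.
RULES = """\
User-agent: *
Disallow: /mystuff
Disallow: /secret
Disallow: /cgi-bin
Disallow: /huh/hmm.html
"""


def is_allowed(rules_text, user_agent, path):
    """Report whether user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, path)


# "AnyBot" is a placeholder name; the "*" entry applies to it.
blocked_dir = is_allowed(RULES, "AnyBot", "/mystuff/page.html")
blocked_page = is_allowed(RULES, "AnyBot", "/huh/hmm.html")
open_page = is_allowed(RULES, "AnyBot", "/huh/other.html")
```

Keep in mind that each Disallow value is matched as a path prefix, so “Disallow: /mystuff” also covers a file such as /mystuff2.html.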
This blocks every spider except Yahoo’s bot from crawling and indexing your site. The Yahoo bot is allowed to crawl everything except the directory called /mystuff.
User-agent: *
Disallow: /

User-agent: Yahoo-slurp
Disallow: /mystuff
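To see how separate entries for different bots play out, you can ask Python’s urllib.robotparser the same question on behalf of several bots. This is only a rough sketch: crawl_report and SomeOtherBot are names made up here, and the bot name Yahoo-slurp is taken from the entry above as written.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /

User-agent: Yahoo-slurp
Disallow: /mystuff
"""


def crawl_report(rules_text, user_agents, path):
    """Map each user agent to whether it may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in user_agents}


# "SomeOtherBot" stands in for any bot without its own entry.
bots = ["Yahoo-slurp", "SomeOtherBot"]
home = crawl_report(RULES, bots, "/index.html")
private = crawl_report(RULES, bots, "/mystuff/notes.html")
```

A bot that has its own entry follows that entry and ignores the “*” one, which is why the Yahoo bot can still reach the home page here.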
Some crawlers, such as Googlebot, support the “Allow:” directive, which explicitly permits the crawler to fetch the files and folders that you specify. The following entry blocks every crawler except Google’s from crawling and indexing your site.
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
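If you maintain several entries, you may prefer to generate the file rather than type it by hand. The following Python sketch builds the text from a list of entries and then checks the result with the standard urllib.robotparser module. The helper build_robots_txt and the bot name SomeOtherBot are assumptions made for this example, not part of any standard tool.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(groups):
    """Render (user_agent, rules) pairs as robots.txt text.

    Each rule is a (directive, path) pair such as ("Disallow", "/").
    """
    blocks = []
    for user_agent, rules in groups:
        lines = [f"User-agent: {user_agent}"]
        lines += [f"{directive}: {path}" for directive, path in rules]
        blocks.append("\n".join(lines))
    # A blank line separates one bot's entry from the next.
    return "\n\n".join(blocks) + "\n"


robots_txt = build_robots_txt([
    ("*", [("Disallow", "/")]),
    ("Googlebot", [("Allow", "/")]),
])

# Check the generated text with Python's own parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
googlebot_ok = parser.can_fetch("Googlebot", "/index.html")
otherbot_ok = parser.can_fetch("SomeOtherBot", "/index.html")
```

Writing robots_txt to a file named robots.txt and uploading it to the root directory would then give the same entry as above.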
List of Bot Names
Now that you know how to instruct the search engines’ crawlers, you must know what to put after “User-agent:”! Here’s a list of bot names for your use.
- Ask/Teoma: Teoma
- Alexa: ia_archiver
- DMOZ: Robozilla
- GigaBlast: Gigabot
- Google: Googlebot
- Google Image: Googlebot-Image
- MSN: Msnbot
- MSN Pic-Search: PSbot
- Scrub The Web: Scrubby
- Yahoo!: Yahoo-slurp
- Yahoo! Blogs: Yahoo-blogs/v3.9
- Yahoo! MM: Yahoo-MMcrawler
Robots META Tag
Sometimes, web hosts do not allow users to upload a robots.txt file to the root directory. Don’t fret! There’s a META tag for crawlers, which you place in the head section of a page to tell them whether to index that page and follow its links. Here are the META tags that you can use for your files.
This will tell the crawlers to NOT index the file and to NOT follow the links on the specific file.
<META name="robots" content="noindex,nofollow" />
This will tell the crawlers to NOT index the file but follow the links that can be found in it.
<META name="robots" content="noindex,follow" />
This will tell the crawlers to index the file but not follow the links that can be found in it.
<META name="robots" content="index,nofollow" />
This is just another META tag that instructs the crawlers to NOT index the file and NOT follow the links in it. It’s basically the same as “noindex,nofollow”.
<META name="robots" content="none" />
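To double-check which of these values a page actually carries, you can read the tag back with Python’s standard html.parser module. This is a small sketch under a few assumptions: the class RobotsMetaParser and the function robots_meta are names invented here, a page with no robots META tag is treated as “index, follow”, and only the values covered in this guide are handled.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the values of every robots META tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() != "robots":
            return
        for token in (attr_map.get("content") or "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta(html):
    """Return whether the page may be indexed and its links followed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    found = parser.directives
    blocked_all = "none" in found  # same as "noindex,nofollow"
    return {
        "index": not (blocked_all or "noindex" in found),
        "follow": not (blocked_all or "nofollow" in found),
    }


page = '<html><head><META name="robots" content="noindex,follow" /></head></html>'
result = robots_meta(page)
```

For the page above, the result says that the page should not be indexed but its links may be followed.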
I have never edited my robots.txt in the almost 7 months I have had my website. Thanks for your tips.
It surprises me how powerful the robots.txt file is, yet not many people actually know how it works.
This article is what i have been looking for. I was looking at Google Webmasters Tools the other day and it said something about these robots.txt files and that there is a problem with both of my websites. I have printed this out to help me fix the problem. Thanks so much for this info, it is just what i needed and it seems to be written in simple english so a newbie like myself may even be able to fix the problem. Good on you.
Nice read.
robots.txt is absolutely necessary and should be the first step to start your SEO work.
I’m still really new to this whole “blogging” thing and how to “optimize” my website, etc. I have found various things regarding robots.txt and wordpress, so we’ll see how it all plays out. I still don’t have a pagerank yet, so it’s hard to see any changes.
PublicRecordsGuy: Don’t worry, PR isn’t that much relevant (anymore) in terms of your website’s search engine rankings.
Creating good content and following SEO techniques will give our site a good ranking on search engines.
Some very good tips here and a great way to help yourself. Personally I really like the way you put a lot of information into this and how clear it is. I knew a fair bit about Robots.txt having edited and added a lot in it but this did help and for a newbie would be great. A great new insight to this for anyone and I recommend for people to read it. I think that if you want to do some successful SEO to any of your sites a Robots.txt is not only a must, but also something that will help and probably get you off to the best start possible.