Preventing web crawlers from indexing everything



OK, so we’ve seen how to password-protect directories to keep the web crawlers out, but what if I don’t want to go through that? I want to keep the page open to visitors, but I don’t want it spidered and indexed by the bots.

There are several ways to do this. The most commonly accepted and respected way of telling a bot not to crawl certain areas of a website is a robots.txt file. It goes in the root of your site (usually the same folder as your main site index), because crawlers only look for it at /robots.txt. It looks like this.

User-agent: *
Disallow: /

The above will keep all well-behaved robots out of your site. This might be too heavy-handed though. Let’s say the msnbot has been a bit too voracious with your downloads area:

User-agent: msnbot
Disallow: /downloads/

That should be enough to keep it out of that folder. Here’s another example, which keeps all robots out of two folders.

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
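
If you want to check what a rule set blocks before uploading it, Python’s standard urllib.robotparser module can evaluate rules against sample paths. This is only a sketch: the helper name `allowed` and the sample bot names and paths are made up for illustration, and a real crawler may treat edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text.

    `allowed` is a hypothetical helper for this post, not a library function.
    """
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parse from memory, no network access
    return parser.can_fetch(agent, path)


# The three rule sets from the examples above.
BLOCK_ALL = "User-agent: *\nDisallow: /"
BLOCK_MSNBOT_DOWNLOADS = "User-agent: msnbot\nDisallow: /downloads/"
BLOCK_TWO_FOLDERS = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /images/"

if __name__ == "__main__":
    # Bot names and paths below are assumptions chosen for the demo.
    print(allowed(BLOCK_ALL, "somebot", "/index.html"))                      # False
    print(allowed(BLOCK_MSNBOT_DOWNLOADS, "msnbot", "/downloads/file.zip"))  # False
    print(allowed(BLOCK_MSNBOT_DOWNLOADS, "somebot", "/downloads/file.zip")) # True
    print(allowed(BLOCK_TWO_FOLDERS, "somebot", "/images/logo.png"))         # False
    print(allowed(BLOCK_TWO_FOLDERS, "somebot", "/about.html"))              # True
```

Note that this only tells you what a rule-following bot would do. robots.txt is a request, not an access control.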

You can get more complicated than this if you need to.
Google’s own robots.txt file (at google.com/robots.txt) is a good example of a longer one.
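
As a rough sketch of what “more complicated” might look like, the rules below combine a group for one bot with a catch-all group, and use an Allow line to open up one subfolder. The folder names and bot names are assumptions for the demo, and the check again uses Python’s urllib.robotparser, which applies the first matching rule in a group (some crawlers pick the longest match instead, so putting Allow before the broader Disallow keeps both interpretations in agreement).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for msnbot, one for every other bot.
RULES = """\
User-agent: msnbot
Disallow: /downloads/

User-agent: *
Allow: /images/public/
Disallow: /images/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())  # parse from memory, no network access


def check(agent: str, path: str) -> bool:
    """Hypothetical helper: may `agent` fetch `path` under RULES?"""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("msnbot", "/downloads/file.zip"),
        ("msnbot", "/images/private.png"),
        ("somebot", "/images/public/logo.png"),
        ("somebot", "/images/private.png"),
    ]:
        print(agent, path, check(agent, path))
```

One thing this shows: a bot that has its own group follows only that group. Here msnbot is kept out of /downloads/ but is not covered by the /images/ rules in the catch-all group, so any rule you want it to obey has to be repeated in its own group.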

To exclude a specific file from being indexed, you might try the following meta tag in the head of your document.

<meta name="robots" content="noindex, nofollow">

You can also use index and follow to fine-tune what you want to restrict or allow.

I’m not certain how widely the meta tag is respected. robots.txt is more likely to be followed.

To exclude just the Googlebot, you might try this…

<meta name="googlebot" content="noindex, nofollow">

According to Google’s page on removing pages from the index, Google apparently will respect that tag, and it would still allow other bots through.
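
To see how a bot might read these tags, here is a minimal sketch using Python’s standard html.parser module. It assumes the usual form of the tag, `<meta name="robots" content="noindex, nofollow">`, with `googlebot` as the name for the Google-only version. The class name `RobotsMetaParser`, the function `may_index`, and the sample pages are all made up for illustration, and real crawlers handle more cases than this.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from robots meta tags that apply to one bot.

    Hypothetical helper class for this post, not a library API.
    """

    def __init__(self, bot_name: str):
        super().__init__()
        # A bot reads the generic "robots" tag plus any tag carrying its own name.
        self.names = {"robots", bot_name.lower()}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in self.names:
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def may_index(html: str, bot_name: str) -> bool:
    """Return False if the page asks `bot_name` not to index it."""
    parser = RobotsMetaParser(bot_name)
    parser.feed(html)
    return "noindex" not in parser.directives


# Sample pages (assumed markup for the demo).
PAGE_ALL_BOTS = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
PAGE_GOOGLE_ONLY = '<html><head><meta name="googlebot" content="noindex, nofollow"></head></html>'

if __name__ == "__main__":
    print(may_index(PAGE_ALL_BOTS, "msnbot"))        # False
    print(may_index(PAGE_GOOGLE_ONLY, "googlebot"))  # False
    print(may_index(PAGE_GOOGLE_ONLY, "msnbot"))     # True
```

The last line is the point of the googlebot tag: the page stays out of Google’s index while a bot with a different name sees no restriction at all.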
