Preventing web crawlers from indexing everything
OK, so we’ve seen how to password-protect directories to keep the web crawlers out, but I don’t want to go through that. I want to keep the page open, but I don’t want it spidered and indexed by the bots.
There are ways of doing this too; in fact, there are several. The most commonly accepted and respected way of telling a bot not to crawl certain areas of a website is with what’s called a robots.txt file. Bots look for it at the top level of the site, which is usually the same folder as your main site index, and it looks like this:
User-agent: *
Disallow: /
The above asks all robots to stay out of your entire site. This might be too heavy-handed, though. Let’s say the msnbot has been a bit too voracious with your downloads area:
User-agent: msnbot
Disallow: /downloads/
That should be enough to keep it out of that folder. Here’s another example.
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
You can get more complicated than this if you need to.
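If you want to check what a rule set actually blocks before you upload it, here is a minimal sketch using Python's standard urllib.robotparser module. It treats the three examples above as three separate robots.txt files and parses them in memory. The bot name "examplebot" and the paths being tested are made up for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The three rule sets from this post, each treated as its own robots.txt.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
MSNBOT_DOWNLOADS = "User-agent: msnbot\nDisallow: /downloads/\n"
COMMON_DIRS = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /images/\n"


def allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parsed in memory, no network access
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "examplebot" and the paths below are hypothetical test values.
    print(allowed(BLOCK_ALL, "examplebot", "/index.html"))             # False
    print(allowed(MSNBOT_DOWNLOADS, "msnbot", "/downloads/file.zip"))  # False
    print(allowed(MSNBOT_DOWNLOADS, "examplebot", "/downloads/file.zip"))  # True
    print(allowed(COMMON_DIRS, "examplebot", "/images/logo.png"))      # False
    print(allowed(COMMON_DIRS, "examplebot", "/about.html"))           # True
```

Keep in mind this only tells you what a well-behaved bot should do. Nothing in robots.txt forces a crawler to obey it.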
For a bigger real-world example, have a look at Google’s own robots.txt file.
To exclude a specific file from being indexed, you might try the following meta tag in the head of your document.
<meta name="robots" content="noindex, nofollow">
You can also use index and follow to fine-tune what you want to restrict or allow.
I’m not certain how widely bots respect the meta tag. robots.txt is more likely to be followed.
To exclude just the Googlebot, you might try this:
<meta name="googlebot" content="noindex, nofollow">
According to Google’s page on removing pages from the index, Google will apparently respect that tag, and it still allows other bots through.
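To see how a bot might read these tags, here is a rough sketch using Python's standard html.parser module. It assumes the usual tag form, a meta element whose name is either "robots" (for all bots) or a bot-specific name such as "googlebot", with a comma-separated content value like "noindex, nofollow". The class and function names are my own, and this is not how Google or any other crawler actually implements the check.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="..." content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # lowercased meta name -> set of lowercased tokens

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content") or ""
        if name:
            tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
            self.directives.setdefault(name, set()).update(tokens)


def may_index(html: str, bot_name: str) -> bool:
    """Return False if a generic or bot-specific meta tag says noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    generic = parser.directives.get("robots", set())
    specific = parser.directives.get(bot_name.lower(), set())
    return "noindex" not in (generic | specific)


if __name__ == "__main__":
    # Hypothetical pages used only for illustration.
    everyone_out = '<head><meta name="robots" content="noindex, nofollow"></head>'
    google_out = '<head><meta name="googlebot" content="noindex, nofollow"></head>'
    print(may_index(everyone_out, "msnbot"))    # False
    print(may_index(google_out, "googlebot"))   # False
    print(may_index(google_out, "msnbot"))      # True
```

The last line is the point of the bot-specific tag: the page opts out for one crawler while staying indexable for the rest.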