The Google Problem Part 2

If you know me…. you know I have a HARD time putting down an unsolved problem. Even if it’s a problem that really doesn’t have a solution (in my control, at least), I have a tendency to look and analyze, turn it over, and find out as much as I can about it. Maybe it’s because I’m so used to being able to find solutions to problems, or at least workarounds, by gathering enough information. Anyway, after saying I was tired of trying to figure out why Google doesn’t like a site, and tired of trying to fix things “for Google”….. well, I’ve spent more time “investigating”… or should I say “wasted” more time… I’m not sure which, but I did discover a couple of interesting things.

Well… for starters, I did a few variations on the site: searches I’ve been testing. So far, site: has consistently shown the same results (~1350 pages), but when I did a search for the other variation, I saw only ~76 results. (This varied as low as 32 at one point this morning, and as high as 77 or so, but now is constant.) I suspect the variation in an IDENTICAL search can likely be blamed on different datacenters coming up/going offline; I read on one site that there are some datacenters that are not in the mix 24 hours a day. One site: variation was also different from another (although there was no difference in adding/substituting the http:// prefix).

What’s odd to me in all the above is that while MSN and Yahoo do show variations with the trailing / added or missing, and www vs. no subdomain, the variation is not nearly so wide: 3997 pages with the slash, 4111 without, at MSN, and adding the www in front has results falling somewhere between 3998 and 3968, a very small variation in the big scheme of things. Yahoo is even more stable in this regard, with 2340 results for the www variations and 2350 for no-subdomain site: searches (trailing slash made no difference).
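To put a number on how much wider Google’s swing is than MSN’s or Yahoo’s, here’s a quick back-of-the-envelope check using the approximate counts quoted above (just the figures from my own searches, nothing more):

```python
# Relative spread of reported index counts: (max - min) / max.
# A rough "how unstable is this engine's count" number.
def spread(counts):
    return (max(counts) - min(counts)) / max(counts)

google = [1350, 76]                # the two site: variations at Google
msn    = [3997, 4111, 3998, 3968]  # slash / no-slash / www variations at MSN
yahoo  = [2340, 2350]              # www vs. no-subdomain at Yahoo

for name, counts in (("Google", google), ("MSN", msn), ("Yahoo", yahoo)):
    print(f"{name}: {spread(counts):.1%} spread")
# Google swings ~94%, MSN ~3.5%, Yahoo well under 1%.
```

Roughly two orders of magnitude more variation at Google than at Yahoo, which is why the MSN/Yahoo wobble feels like normal noise and the Google one doesn’t.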

What this leads me to suspect is that Google is having problems right now recognizing that the different addresses are the SAME page and not different pages. It’s been said that duplicate content is something they’re working to get rid of. Well, if they’re having a hard time getting their algorithm to distinguish between a true duplication and merely two ways of arriving at the same place, that could explain why many pages on my site (and others) have disappeared.

So…. as to the most noticeable “problem” I’ve seen from all of this: drastically lower traffic. Initially I chalked it up to turning up lower in the search results for terms that had previously done well. After all, I did a search for “usual keyword” and didn’t see my page, just drastically different results… I dug several pages deep and still couldn’t “find myself”. So, I set about to see how many pages were actually indexed. Back in December of last year, it was fairly easy to search for pages here. The permalinks have year/month/day… so for instance, if I publish this today, it should have 2006/05/16 in the URL. So I used a bit of the old Google magic…. inurl:2005/12 – I could even pick a specific day to see if Google had “caught up” to my current posts…. inurl:2005/12/23 – ok, so using that technique I started working backwards to see if Google has spidered ANY of my May 2006 content… inurl:2006/05 – nothing…. April, March, February, ALL nothing. January 25th was the “most recent” page from my site in the index. Now, this doesn’t mean they don’t have a more recent cache of either the main page or some of the OTHER subpages (a category summary, for instance).

If I back up and just look for 2006 in the URL, inurl:2006, I get 48 pages (or 64 if I leave out the www). However, admittedly, one of those had 2006 in the article title (not the actual date). ALSO, there are many “comments” pages cached (in other words, the article feeds…), so I adjusted my search yet again to screen out anything with feed in the URL…. inurl:2006 -inurl:feed gets 32 results now. (An hour ago it was 14.) About 5 of the 32 were 2005 articles talking about Mandriva 2006 or something else with 2006 in the article name, and I saw 1 chronological archive page, something like 2006/01/page6…. So, in total, 26 articles from this site have been crawled by Google in all of 2006. (At a rough guess, that would be around 10% of the articles posted in 2006.)
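The month-by-month queries above are easy enough to generate rather than type by hand. A minimal sketch, assuming a date-based permalink structure like /2006/05/16/post-name; “example.com” here is a stand-in, not the actual domain:

```python
# Build the Google queries used above: site: restricted to one domain,
# inurl: matched against the year/month portion of the permalink,
# optionally excluding the feed URLs that pad the count.
def month_query(domain, year, month, exclude_feeds=False):
    q = f"site:{domain} inurl:{year}/{month:02d}"
    if exclude_feeds:
        q += " -inurl:feed"
    return q

# Walk backwards through 2006 looking for the most recent crawled month:
for m in range(5, 0, -1):
    print(month_query("example.com", 2006, m))

# The refined full-year search with feeds screened out:
print(month_query("example.com", 2006, 1, exclude_feeds=True))
```

Paste each line into the search box (or script the fetch) and the first month that returns results is the crawler’s high-water mark.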

What’s frustrating, of course, is that before February 1….. most everything through January 20th had been crawled and indexed. I pursued my searching back into 2005 and saw only partial coverage (maybe 50%, as a rough estimate) of those articles. I’m not sure at this point if I should be reassured that I’m not getting traffic from Google because the pages aren’t there, or if I’d rather they be there, just 5-10 pages deep in the rankings. I mean, IF they were at least in the index, then I could make some use of the Google Sitesearch tool; unfortunately, as it stands, that is pointless. (Which is why I’ve tried to place the MSN sitesearch tool in every location that had the Google tool.)

So, I’ve managed to establish that Google has REALLY poor coverage of the pages on this site compared to several months back (January). What else? Well, I noticed something else interesting in looking at my referrer logs. I had a visit from, so I did some searching there and…. inurl:2006 turns up about 263 results, which is about what I would expect (including at least one post from earlier today). Unfortunately, the two other sites I’ve had headaches with do not appear in either place, in spite of PageRanks of 5 and 1 respectively (which at least seems to indicate they’re not banned for some reason).

So, at the moment I’m down to a couple of theories. 1) The duplicate content filter is OVERACTIVE. Perhaps the archive pages, which carry the first paragraph of each article, are seen as being TOO similar to the article pages. That would make sense; category pages seem to have been crawled. The only problem I see with that is that not all the category pages have been crawled. (Maybe I should look for things that show up in just one category; there is some duplication of paragraphs where articles appear in two or more categories…) 2) The deep crawler is just not working, or is overwhelmed. Maybe the deep crawler as previously conceived doesn’t even exist anymore. (Previously Google had the quick crawler that would check known pages for updates, then the deep crawler that would discover new links and fetch NEW pages.) Judging from what I’ve seen, the quick crawler concept seems to be working; there are relatively recent (May 13th) visits to cached pages that it already knows about. But there doesn’t seem to be AS frequent DEEP crawling to fill in the links that it doesn’t know about. Of course, that leaves the question of WHY so many pages suddenly VANISHED. Pages that WERE indexed in late January simply are not now; you would think that if it were a “deep crawl” issue, the old pages would still be there and new pages just wouldn’t be added. Which leads me back to the duplicate content issue.

I read one forum post where a frustrated e-commerce site owner said he had found that his phone number and address were duplicated at the bottom of each of his pages, so he removed them in the hopes Google would like it better. (!?!) IF DUPLICATION like that could cause pages to vanish from Google, they have a SERIOUS problem with the duplication threshold… Likewise, IF a paragraph on the main page prevents the DETAIL post page from getting crawled, that’s a problem as well. Of course, I don’t know, they may have some serious architectural challenges going on right now. Maybe the storage on the main database (or the proxy cache that it’s crawling through) is in dire straits. That’s kind of hard to believe, though.

Another little annoyance I’ve noticed: when I do the search, the title gets returned as Parker, Avery J. – ….. which I wonder might be part of the difficulty I’m having in “finding my main site” by searching for avery j parker (the site comes up early in the search results, but given that the name is right there in the domain, you would expect it to be a bit more relevant). IF I search under averyjparker, you would expect the domain to come first… in actuality, I get LOTS of places where I’ve been referred to, posted, linked to, etc., before the site itself (page 3).

So, what to do? Not much, I suppose, until there’s a clearer picture of what’s going on. At this point, my attitude is that it’s Google’s issue, and over time things will change. What’s really strange is that Googlebot shows up in the traffic logs as spidering pages that don’t exist in the index (much as it’s visiting the complete sites which are missing). Ultimately, I don’t think there is anything TO do but wait, and perhaps give feedback to Google. I really don’t see the wisdom in designing a site specifically to make one search engine happy. I didn’t do that before, when the pages were being spidered by Google, and I would be VERY reluctant to do it now.

But that’s the little bit more light I was able to shed on the issue.

Oh – one other thing I’ve noticed. The Mediabot hasn’t really “refreshed” the main page in at least a week now, because many of the ads even today mention Epson scanners…. I had an article about an Epson scanner issue on the 9th that quickly got pushed from the home page (it’s now on page 4). This, I would think, would not bode well for Google’s bread and butter: pay-per-click advertising. Supposedly there are higher click-throughs when the ads are MORE relevant, and thus MORE revenue for Google. Well, if the ad-content crawler isn’t keeping up??? Less relevant ads, less revenue… etc.
