Yahoo! Telling Porky Pies About Index? Google thinks so..

Source Title:
In This Battle, Size Does Matter: Google Responds to Yahoo Index Claims
Story Text:

This is kind of funny, as I was talking to a chap at Google who "joked" that maybe Yahoo! had just counted all the URLs in their DB before de-duping them. Now, I see John's been talking to GOOG and they're "officially baffled".

I spent an hour or so on the phone with a group of Google folks, and they shared a lot of information about how they measure index size, how they deal with issues of duplicate URLs and documents, and why they are baffled by Yahoo's claim.


"Our scientists are not seeing the increase claimed in the Yahoo! index. The data we have doesn't support the 19.2 (billion page) claim and we're confused by that."

I've got to say that I find 20bn a hard figure to swallow also, but Google's comments do strike a certain "sour grapes" chord at the same time.

Now, the question is, are Yahoo! stuffing socks down their trousers, or is it really a whopper?


Searching for "the"

Searching for "the" produces 11 billion results. Of course, there are non-English pages to consider, images and other non-linguistic objects.... but are there really another 8 billion of those?

Is there a group of Japanese webmasters with massive recursive dynamic sites clogging up Yahoo!'s crawl control? Or have Y started to crack the "deep web" problem and genuinely stolen a march on Google?

It's over a year old but is

It's over a year old but is linked from John's blog in the comments, and it's still a damn good read: the Technology Review article called "Google and Akamai: Cult of Secrecy vs. Kingdom of Openness".

Could the new version be called "Google, Akamai and Yahoo: Cult of Secrecy" ?

Personally I don't care how big the index is, but I do believe it matters to Wall Street and Joe Public, where bigger is generally perceived as better.

Fact is...

...that no-one can be arsed to count them all. So it's a cool claim. Maybe not accurate, but plausible. I don't really buy it, but what the hey... ;-]

the calculation is simple...

Spider all domains, multiply the result by 1.8 for sites that don't redirect non-WWW to WWW, multiply that result by 2 to account for all the sites that allow duplicate URL strings, have the PR department intern slip a decimal point, and voila, 19.2 billion docs.

Seriously, I am amazed that Y is claiming that number. I have a site that has been online since March 26th, and has 4400 pages indexed by G, 1005 indexed by MSN, and only the home page indexed in Y. You'd think they would have been hammering any and all sites they came across for the past several months in order to reach this volume, yet there certainly hasn't been any increase in spidering that would have foreshadowed this announcement.

Remember the month or two before G announced their new index? Everyone's logs were hammered with gBot tracks.

whenever I do a backlink

Whenever I do a backlink search on Yahoo! it always returns many more results than Google. I guess that proves Yahoo! is bigger ;) hehehe

About 15 times bigger...


Subscription Content

They wouldn't be counting their new Subscription Content, too, would they?

Would they!?!

Comparing raw hits to raw hits

NOTE: I am not using the quote marks shown below in the queries. These links go to FIND ALL queries, not EXACT FIND queries. I do not compete for any of these terms on any of the Web sites that I control or assist people with. These queries are, from my point of view, random.

Yahoo! for "real estate": 597,000,000
Google for "real estate": 110,000,000

Yahoo! for "travel": 1,410,000,000
Google for "travel": 400,000,000

Yahoo! for "britney spears": 69,000,000
Google for "britney spears": 4,600,000

Yahoo! for "news": 4,120,000,000
Google for "news": 1,670,000,000

Yahoo! for "university": 983,000,000
Google for "university": 855,000,000

(NOTE: Harvard is now beating out Stanford on that search. I don't know when that happened.)

Yahoo! for "napoleon bonaparte": 2,610,000
Google for "napoleon bonaparte": 685,000

Yahoo! for "care of elephants": 3,050,000
Google for "care of elephants": 698,000

Yahoo! for "specimen": 16,300,000
Google for "specimen": 6,050,000

Yahoo! for "brazen hussy": 63,000
Google for "brazen hussy": 17,800

Yahoo! for "horticultural exchange": 708,000
Google for "horticultural exchange": 346,000

Yahoo! for "experimental design change": 8,360,000
Google for "experimental design change": 10,100,000

Yahoo! for "corporate headquarters": 18,800,000
Google for "corporate headquarters": 12,000,000

Yahoo! for "kiddie rides": 586,000
Google for "kiddie rides": 142,000

Yahoo! for "spontaneous combustion": 933,000
Google for "spontaneous combustion": 409,000

Yahoo! for "course curriculum": 36,400,000
Google for "course curriculum": 28,100,000

Yahoo! for "our wedding": 250,000,000
Google for "our wedding": 25,600,000

Yahoo! for "i wrote this song": 32,000,000
Google for "i wrote this song": 7,470,000

Yahoo! for "dog and pony show": 3,210,000
Google for "dog and pony show": 802,000

You make the call.
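For what it's worth, the reported counts above can be crunched in a few lines of Python. This is just a sketch over the raw hit estimates quoted in the comment (which both engines guess, as noted below), so the resulting ratio says little about true index size:

```python
from statistics import median

# (query, Yahoo! reported hits, Google reported hits) as quoted above
counts = [
    ("real estate", 597_000_000, 110_000_000),
    ("travel", 1_410_000_000, 400_000_000),
    ("britney spears", 69_000_000, 4_600_000),
    ("news", 4_120_000_000, 1_670_000_000),
    ("university", 983_000_000, 855_000_000),
    ("napoleon bonaparte", 2_610_000, 685_000),
    ("care of elephants", 3_050_000, 698_000),
    ("specimen", 16_300_000, 6_050_000),
    ("brazen hussy", 63_000, 17_800),
    ("horticultural exchange", 708_000, 346_000),
    ("experimental design change", 8_360_000, 10_100_000),
    ("corporate headquarters", 18_800_000, 12_000_000),
    ("kiddie rides", 586_000, 142_000),
    ("spontaneous combustion", 933_000, 409_000),
    ("course curriculum", 36_400_000, 28_100_000),
    ("our wedding", 250_000_000, 25_600_000),
    ("i wrote this song", 32_000_000, 7_470_000),
    ("dog and pony show", 3_210_000, 802_000),
]

# Yahoo!/Google ratio per query; the median damps the outliers
ratios = sorted(y / g for _query, y, g in counts)
print(f"median Yahoo!/Google ratio: {median(ratios):.2f}")
```

The median lands around 3.5, with per-query ratios swinging from under 1 ("experimental design change") to 15 ("britney spears") — which mostly demonstrates how noisy these estimates are.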

Yahoo! makes up URLs for some

Yahoo! makes up URLs for some of my domains.

and more

all of which don't exist, never have, and never will. Since I have a mod_rewrite rule going, any file or folder that doesn't exist automatically shows the site map. So if you typed in .com/seomike-rules/ you'd get a page. As I discover these made-up URLs I add rules to trigger 404s, yet they still exist in the Yahoo! index... odd.
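A minimal .htaccess sketch of the setup described above (the filenames and the example path are placeholders from this comment, not anything verified against the actual site):

```apache
RewriteEngine On

# Any request for a file or directory that doesn't exist on disk
# gets rewritten to the site map -- so every made-up URL "works".
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^ /sitemap.html [L]

# Rule added later to force a 404 once a fabricated URL is discovered
# ("/seomike-rules/" is the example path mentioned above).
RewriteRule ^seomike-rules/?$ - [R=404,L]
```

Note the 404 rule has to come before the catch-all, otherwise the catch-all's [L] flag stops processing and the fabricated URL keeps returning the site map.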

I think Yahoo! is way off on its index count.

Maybe Google's not counting

Maybe Google's not counting all those sandboxed websites (ducks and runs for cover)


The cool thing is that Y only has about 60% of all my pages.

So, if they ever get to the rest, and my situation is representative, their number could easily climb to 32 billion. Whoa.


MM's queries just prove that
1. Google doesn't index garbage.
2. Both engines guess the number of matches.
3. The real answer is 42.

It is not possible to estimate the size of an SE index by querying anything. Period.

Although I doubt it, Yahoo! may have crawled 20 billion pages in infinite loops on session IDs and other unproductive cycles. Probably every fetched 'page' (plus all embedded objects) got a UUID assigned; counting the UUIDs then gives that useless number. I bet only a fraction of those crawled pages made it into the index. Google seems to count indexed pages which can appear in searches, but their published number of searchable pages isn't accurate either.
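To illustrate the de-duplication point, here is a toy URL canonicalizer in Python. The host handling and session-parameter names are assumptions for the sketch — neither engine has published how it de-dupes — but it shows how several crawled URLs can collapse to a single indexable document:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed session-ID parameter names; real crawlers use longer heuristic lists.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sessionid", "sid"}

def canonicalize(url: str) -> str:
    """Collapse trivial URL variants onto one canonical form."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower()
    # Treat example.com and www.example.com as the same host.
    if netloc.startswith("www."):
        netloc = netloc[4:]
    # Drop session-ID parameters that spawn endless duplicate URLs.
    params = [(k, v) for k, v in parse_qsl(query)
              if k.lower() not in SESSION_PARAMS]
    # Ignore a trailing slash (except for the root path).
    path = path.rstrip("/") or "/"
    return urlunsplit((scheme.lower(), netloc, path, urlencode(params), ""))

crawled = [
    "http://www.example.com/page?PHPSESSID=abc123",
    "http://example.com/page",
    "http://EXAMPLE.com/page/",
]
unique = {canonicalize(u) for u in crawled}
# All three crawled URLs collapse to http://example.com/page
print(len(crawled), "crawled ->", len(unique), "unique")
```

Count the left-hand side and you get a crawl number; count the right-hand side and you get an index number. The gap between the two is exactly what the 19.2 billion argument is about.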


It's easy, just count each post in all the darn "Google Dance" threads ever made as a page and yer at 15 bil. easy. That leaves 5 bil. for all the rest of the internet people use.

Look at the 2nd result in

Look at the 2nd result in this query

Sergey says it's cobblers

Article in the NYT quotes the great man. Funny thing is G will prostitute themselves talking to the NYT!

Sergey Brin, Google's co-founder, suggested that the Yahoo index was inflated with duplicate entries in such a way as to cut its effectiveness despite its large size.

"The comprehensiveness of any search engine should be measured by real Web pages that can be returned in response to real search queries and verified to be unique," he said on Friday. "We report the total index size of Google based on this approach."

But Yahoo executives stood by their earlier statement. "The number of documents in our index is accurate," Jeff Weiner, senior vice president of Yahoo's search and marketplace group, said on Saturday. "We're proud of the accomplishments of our search engineers and scientists and look forward to continuing to satisfy our users by delivering the world's highest-quality search experience."

Nick google has it and

Nick, Google has it, and that one

