The Critical Anatomy of Google
A press release was issued on September 02, 2004, and within a couple of months Google addressed the issue it raised. The exact reason that triggered the update of Google’s database is not apparent, but the index grew by a staggering 3,772,844,877 pages. That certainly raised a few eyebrows, not only among Web marketing companies but also among the people at Google.
The press release brought to light the number of web pages Google had indexed over time, and it made many things clearer.
According to it, between Aug 04, 2003 and Aug 25, 2003 (just 21 days), Google added a little over 1.2 billion web pages to its index. From then until the press release, the figure Google published did not grow by a single web page (at least according to Google itself).
This should be apparent from the table below:
Date | No. of pages indexed
2 months after the press release (Nov 10, 2004) | 8,058,044,651
At the time of the press release (Sep 02, 2004) | 4,285,199,774
1 year before the press release (Aug 25, 2003) | 4,285,199,774
21 days before that (Aug 04, 2003) | 3,083,324,652
So, what does this mean? It means either Google has been lying to us all, or they have been dropping as many pages as they have been adding.
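The figures in the table can be checked with a few lines of arithmetic. The sketch below only subtracts the published counts from one another; the dictionary keys and variable names are our own labels, not anything Google published.

```python
# Index sizes as quoted in the table above (date -> pages indexed).
index_sizes = {
    "2003-08-04": 3_083_324_652,
    "2003-08-25": 4_285_199_774,
    "2004-09-02": 4_285_199_774,
    "2004-11-10": 8_058_044_651,
}

# Growth during the 21 days in August 2003.
added_aug_2003 = index_sizes["2003-08-25"] - index_sizes["2003-08-04"]
# Growth during the following year, up to the press release.
added_following_year = index_sizes["2004-09-02"] - index_sizes["2003-08-25"]
# Growth in the two months after the press release.
added_after_release = index_sizes["2004-11-10"] - index_sizes["2004-09-02"]

print(f"{added_aug_2003:,}")        # 1,201,875,122
print(f"{added_following_year:,}")  # 0
print(f"{added_after_release:,}")   # 3,772,844,877
```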
Our guess is that on Aug 25, 2003, Google’s index was full. Why do we say this? Because Google’s white papers were freely available to anyone. You could read the actual documents published by Google’s founders before the company went public and get a glimpse of how Google was built. According to these documents, Google was written in C and C++, using ANSI C, and ran on Linux. The database was constructed using a Document_ID associated with each web page, and the Document_ID was published as being a 4-byte unsigned long integer. In other words, every single web page in Google’s index is identified by its own ID. But like everything, this has a limit: a 4-byte unsigned integer can take only 4,294,967,296 (2^32) distinct values, the largest being 4,294,967,295.
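To make the limit concrete, here is a minimal sketch that encodes an ID into exactly four bytes using Python's standard `struct` module. The function names `pack_doc_id` and `fits_in_four_bytes` are hypothetical and only illustrate the size constraint; they say nothing about how Google actually stores its IDs.

```python
import struct

MAX_DOC_ID = 2**32 - 1  # largest value a 4-byte unsigned integer can hold


def pack_doc_id(doc_id: int) -> bytes:
    """Encode doc_id as a 4-byte little-endian unsigned integer."""
    return struct.pack("<I", doc_id)


def fits_in_four_bytes(doc_id: int) -> bool:
    """Return True if doc_id can be stored in 4 unsigned bytes."""
    try:
        pack_doc_id(doc_id)
        return True
    except struct.error:
        # struct refuses values outside 0 .. 4,294,967,295
        return False


print(fits_in_four_bytes(MAX_DOC_ID))      # True
print(fits_in_four_bytes(MAX_DOC_ID + 1))  # False
```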
Does this number suggest a remarkable correlation with the second and third entries of the table above? I am sure it does: the published index size of 4,285,199,774 sits just below that limit. So, if no changes were made to the database structure, Google had probably reached this threshold, and as new pages are added, old pages are removed. Quite alarming, isn’t it?
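How close is the published figure to the ceiling? A quick calculation, using only the two numbers quoted above (the variable names are ours):

```python
ID_SPACE = 2**32                     # distinct values of a 4-byte unsigned integer
REPORTED_INDEX_SIZE = 4_285_199_774  # figure Google displayed for over a year

remaining_ids = ID_SPACE - REPORTED_INDEX_SIZE
fraction_used = REPORTED_INDEX_SIZE / ID_SPACE

print(f"{remaining_ids:,} IDs left")   # 9,767,522 IDs left
print(f"{fraction_used:.2%} in use")   # 99.77% in use
```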
This may also be one of the reasons pages appear to be dropping from Google’s index at an alarming rate (I can show this happening in tens of thousands of search results). Google may have already run out of space, so that a Document_ID is no longer associated with the content stored in the database, which in turn returns an empty result for a particular URL.
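The scenario we are guessing at can be sketched with a toy model. Everything here is an assumption made for illustration: the class `ToyIndex`, its fields, and the 3-bit ID width (chosen so the ID space runs out after eight pages) are hypothetical and do not describe Google's real storage layer.

```python
ID_BITS = 3            # toy width; the article's claim concerns 32 bits
ID_SPACE = 2**ID_BITS  # only 8 distinct IDs in this toy model


class ToyIndex:
    """Hypothetical index whose ID counter wraps when the ID space is used up."""

    def __init__(self):
        self.next_id = 0
        self.url_to_id = {}      # URL -> Document_ID
        self.content_by_id = {}  # Document_ID -> (URL, content)

    def add(self, url, content):
        doc_id = self.next_id % ID_SPACE  # wraps back to 0 once IDs run out
        self.next_id += 1
        self.url_to_id[url] = doc_id
        self.content_by_id[doc_id] = (url, content)  # may overwrite an old page

    def lookup(self, url):
        doc_id = self.url_to_id.get(url)
        if doc_id is None:
            return None  # URL was never indexed
        stored_url, content = self.content_by_id[doc_id]
        # A stale mapping points at another page's content: an empty result.
        return content if stored_url == url else ""


index = ToyIndex()
for i in range(ID_SPACE):
    index.add(f"http://example.com/{i}", f"content {i}")
index.add("http://example.com/new", "new content")  # reuses Document_ID 0

print(repr(index.lookup("http://example.com/0")))    # ''
print(repr(index.lookup("http://example.com/new")))  # 'new content'
```

In this toy model the old URL is still listed, but its ID now belongs to a newer page, so the lookup comes back empty.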
Can this problem be corrected? Sure it can, but Google has 15,000+ Linux servers and 4.2 billion Document_IDs to convert. This is not going to be an easy task at this point, and it would add to Google’s expenses. Also, every single word in the inverted index is associated with Document_IDs, so the conversion would probably take months, if not a great deal longer.
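To give a feel for what such a conversion involves, here is a minimal sketch that re-encodes the posting lists of a tiny inverted index from 4-byte to 8-byte IDs. The layout (one packed byte string per word) and the helper names are assumptions made for this example, not Google's actual format.

```python
import struct


def encode_postings(doc_ids, fmt):
    """Pack a posting list; fmt "I" means 4-byte IDs, "Q" means 8-byte IDs."""
    return struct.pack(f"<{len(doc_ids)}{fmt}", *doc_ids)


def decode_postings(blob, fmt):
    """Unpack a posting list written by encode_postings with the same fmt."""
    width = struct.calcsize(f"<{fmt}")
    return list(struct.unpack(f"<{len(blob) // width}{fmt}", blob))


def widen_index(index_32):
    """Re-encode every posting list from 4-byte to 8-byte document IDs."""
    return {
        word: encode_postings(decode_postings(blob, "I"), "Q")
        for word, blob in index_32.items()
    }


# Hypothetical two-word index using 4-byte IDs.
index_32 = {
    "google": encode_postings([1, 7, 42], "I"),
    "index": encode_postings([7, 4_294_967_295], "I"),
}
index_64 = widen_index(index_32)

print(decode_postings(index_64["index"], "Q"))            # [7, 4294967295]
print(len(index_32["google"]), len(index_64["google"]))   # 12 24
```

Every posting list has to be read and rewritten, and the storage for the IDs doubles, which is why we expect the job to be slow at the scale of billions of pages.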
Google returns currently “popular” pages at the top of its search results, which suggests that Google is unfairly penalizing newly created pages that are not yet “popular.” While this statement may be an exaggeration, it does contain an alarming bit of truth.
While Google takes more than 100 different factors into account in determining the final ranking of a web page, the core of their ranking algorithm is based on a metric called PageRank, which is nothing more than a “link popularity” metric. It is important to understand the distinction between the “importance or quality” of a web page and its “popularity”.
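For readers who have not seen a link popularity metric in action, here is a simplified PageRank-style calculation on a made-up four-page web. It is a textbook power iteration, not Google's production algorithm; the damping factor of 0.85 is the commonly cited value, and the page names are invented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical web: three pages link to "hub"; nothing links to "new".
links = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "new": ["hub"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

The page nobody links to ends up at the bottom regardless of what it contains, which is exactly the gap between quality and popularity.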
Since popular pages are repeatedly returned by Google as top results, they are also the easiest for users to discover, which increases their popularity even further.
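This feedback loop can be illustrated with a small deterministic simulation. The click counts per result position and the starting popularity figures are invented for the example; only the mechanism (higher position, more visits, higher position again) comes from the argument above.

```python
# Hypothetical visits per round for result positions 1, 2 and 3.
CLICKS_BY_POSITION = [600, 300, 100]


def simulate(popularity, rounds):
    """Each round, rank pages by popularity and award clicks by position."""
    popularity = dict(popularity)  # work on a copy
    for _ in range(rounds):
        ranking = sorted(popularity, key=popularity.get, reverse=True)
        for position, page in enumerate(ranking):
            popularity[page] += CLICKS_BY_POSITION[position]
    return popularity


start = {"established": 120, "middling": 100, "new": 90}
after = simulate(start, rounds=10)

print(after)
# {'established': 6120, 'middling': 3100, 'new': 1090}
```

A head start of 30 visits turns into a gap of more than 5,000 after ten rounds, and the order of the pages never changes.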
As is evident from many sources, 98% of Google’s revenue comes from its advertisers, mostly through AdWords and AdSense. But all it would take is for a firewall company, a virus protection company, AOL, or Microsoft to create a Google ad blocker, and it would be the end of Google overnight. These companies, as well as Google itself, already provide pop-up and pop-under blockers, and writing a Google ad blocker would be even simpler.
Google was built on, and still uses, cheap Linux desktop machines (about 15,000 of them) and open source C and C++ as well as Python. These were, and most likely still are, 32-bit CPU machines. In effect you have 32 bits of data to play around with, and every document has a unique representation, the “DocID”. Unfortunately, in 32 bits you cannot represent fractions or numbers greater than 4,294,967,295 (2^32-1).
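What happens when a fixed-width C integer is pushed past that value? It does not fail; it silently wraps around. The sketch below uses Python's standard `ctypes` module to mimic a C unsigned 32-bit variable; the helper name `as_uint32` is ours.

```python
import ctypes

MAX_UINT32 = 2**32 - 1


def as_uint32(value: int) -> int:
    """Return value as a C unsigned 32-bit integer would store it."""
    return ctypes.c_uint32(value).value


print(as_uint32(MAX_UINT32))      # 4294967295
print(as_uint32(MAX_UINT32 + 1))  # 0
print(as_uint32(MAX_UINT32 + 2))  # 1
```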
Just search Google for almost anything and I am sure you’ll find empty pages in the results.
Google keeps an incredible number of pointless pages created only to spam it and probably to earn some click-through business (including AdSense), while content-rich, tightly focused pages sometimes disappear. If Google’s capacity to identify links had a ceiling, why would they keep so many duplicates in their index at different domains?
Beyond content duplication, Google is the only engine which can afford to display aliases (http://domain and http://www.domain) for sites which respond on both paths. Would a search engine near the limit of its index capacity accumulate pages that don’t exist anymore, broken links, different versions of the same URL, and the like? Would it eradicate pages with hundreds or even thousands of inbound links and keep tons of pages from totally unpopular sites?
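An index that was short of IDs could save many of them by collapsing such aliases into one entry. The following is a minimal sketch of that idea using Python's standard `urllib.parse`; the rule of dropping a leading `www.` is our own simplification and is not safe for every site, and `canonical_url` is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Collapse http://domain and http://www.domain into a single form."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]               # assumption: www alias serves the same site
    if parts.port:
        host = f"{host}:{parts.port}"  # keep an explicit port
    path = parts.path or "/"           # treat an empty path as the root
    # Fragments are dropped: they never reach the server.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


print(canonical_url("http://domain"))      # http://domain/
print(canonical_url("http://www.domain"))  # http://domain/
```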
Do the Google Guys have a technically reasonable explanation that would not ruin their 32 bit theory?
A few questions that summarize the whole story so far:
- Why, after several months, does Google still proudly display 4,285,199,774 web pages, yet seem to have the time to update its logo on a daily basis?
- Why are still-valid, active pages dropped after being in Google’s index for years?
- Why does Google give us empty results for the same URLs for months at a time?
It’s not as if we are asking Google to share trade secrets or anything like that. We just want to know why Google is lying and hurting so many businesses that rely on Google traffic. Google placed itself in the public eye and asked us all to invest in it, which we have. Google should now answer to the public and tell us why it is destroying our businesses.
Continued in Part II.