The Critical Anatomy of Google
A few lines from Google’s Webmaster Guidelines:
- If no other site links to yours, it may be difficult for our crawler to find you. Conversely, if many sites link to your site, there is a good chance we will find you.
- If we have not picked up your site and it has been several months, then it is likely that our spiders are not able to find your site. If you increase the links pointing to the page, Google is likely to find your site in the future.
What these two points add up to: even though Google says NOT to request links in order to increase your PageRank, it also states that increasing your links may be the only way to be included in Google.
So, if you are a new site with tons of useful information, chances are you won’t be found in Google until a site that is already in Google’s index links to you. Is that what makes a great search engine?
If Google’s capacity to identify links had a limit, why would they keep so many duplicates of the same page in their index under different domains?
- According to one of their publications: “It is important for our crawler to visit ‘important’ pages first, so that the fraction of the web that is visited (and kept up to date) is more meaningful.”
So, basically, only a “fraction” of the index is kept up to date, and only the “popular” pages get refreshed often.
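To make the “important pages first” idea concrete, here is a minimal sketch of a crawler that can only afford a fixed number of visits per cycle. The URLs, the importance scores, and the `budget` parameter are all hypothetical; this illustrates the ordering principle in the quote, not how Google actually schedules its crawls.

```python
import heapq

def pages_to_visit(importance, budget):
    """Pick the `budget` highest-scoring URLs, most important first."""
    return heapq.nlargest(budget, importance, key=importance.get)

# Hypothetical importance scores (for example, inbound link counts).
scores = {
    "http://popular.example/": 120,
    "http://average.example/": 15,
    "http://new-site.example/": 0,
}

# With room for only two visits, the unlinked new site is never reached.
visited = pages_to_visit(scores, budget=2)
print(visited)
```

Under these assumptions, a page with no inbound links stays at the back of the line for as long as the budget is smaller than the list of known pages.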
In addition, new pages are discovered during those crawls and are placed in a queue. This method of crawling can cause chaos simply because a single web page can often be reached in many different ways. Example: http://123.456.789.012, http://www.somedomain.com, http://somedomain.com. Let’s take this example and walk through how Google would discover and index these.
Google starts out, possibly, by finding http://123.456.789.012 first. It doesn’t matter here how Google discovered this page; all that matters is that Google is about to visit it. Google now visits this page and indexes it. Days, weeks, or even months may go by before Google discovers http://www.somedomain.com. By the time Google visits this page, the author has made some text changes, maybe something as small as the copyright year in the footer. An MD5 checksum of this page does NOT flag it as a clone or duplicate of http://123.456.789.012, simply because the content is now different. And because only “popular” pages are revisited often, http://123.456.789.012 may not be crawled or re-indexed for months or even years.
Next, Google discovers http://somedomain.com, but, as with the second URL, the author has made some text changes in the meantime. Because of this, Google does not detect that this is a duplicate of either of the first two pages it has already indexed. Its index now stores three different versions of the same page. And if you continue to make changes, the index may never get cleaned up and the duplicates may never be removed.
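The effect is easy to reproduce. The sketch below assumes duplicate detection is an exact MD5 comparison, as described above; the page body and the years are made up for illustration, and this is not Google’s actual code.

```python
import hashlib

def md5_of(page_text):
    """Return the hex MD5 digest of a page body."""
    return hashlib.md5(page_text.encode("utf-8")).hexdigest()

# Hypothetical page body; only the footer year differs between crawls.
TEMPLATE = "<html><body><h1>Welcome</h1><p>Copyright {year}</p></body></html>"

# The same page, fetched under three URLs at three different times.
crawled = {
    "http://123.456.789.012": TEMPLATE.format(year=2003),
    "http://www.somedomain.com": TEMPLATE.format(year=2004),
    "http://somedomain.com": TEMPLATE.format(year=2005),
}

digests = {url: md5_of(body) for url, body in crawled.items()}
for url, digest in digests.items():
    print(url, digest)

# Count the distinct digests an exact-match check would see.
print(len(set(digests.values())))
```

A one-character change in the footer is enough to give each copy its own checksum, so an exact-match check treats the three URLs as three unrelated pages.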
The problem can still exist even if you never make changes to the web page. Why? Google could easily consider two of the above pages to be clones. It will then decide, based on PageRank and content computations, which one to treat as the original, and it may well pick a page that is not the original and deliver that one in the results. And because Google does NOT actually delete duplicate content, all three URLs, while really the same page, are still in Google’s index, and only the one with the highest PageRank ever gets revisited.
- According to another Google publication: “Furthermore, advertising income often provides an incentive to provide poor quality search results. For example, we noticed a major search engine would not return a large airline’s homepage when the airline’s name was given as a query. It so happened that the airline had placed an expensive ad, linked to the query that was its name. A better search engine would not have required this ad, and possibly resulted in the loss of the revenue from the airline to the search engine. In general, it could be argued from the consumer point of view that the better the search engine is, the fewer advertisements will be needed for the consumer to find what they want. This of course erodes the advertising supported business model of the existing search engines. However, there will always be money from advertisers who want a customer to switch products, or have something that is genuinely new. But we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm.”
- So how could it be that their advertising revenue has risen so dramatically if Google always returns the most relevant results? It’s very possible that the dropping of URLs has something to do with this dramatic increase in revenue. You be the judge. Either way you look at it, I feel Google should be required to publicly address these problems and tell us the real reason behind these sums.
- On most broad commercial topics we search, the results often reek of link manipulation. On the other hand, very narrow topics with rather uncommon words often show pages that use “natural” text for links.
- In 2004, during the Olympics, a researcher wanted to find everything he possibly could about the Olympics and turned to Google for assistance. Sites like cnn.com and usatoday.com would certainly contain a lot of authoritative information about the subject. So, instead of doing a blind search and just entering Olympics in the search box, he restricted his search to specific sites. The first one he tried was usatoday.com, by entering the following query:
site:www.usatoday.com Olympics
- And 50% of the top 10 results were empty, with no titles, descriptions, or excerpts. The site’s robots.txt explicitly stated that these pages were off limits to all search engines and robots: User-agent: * Disallow: /Olympics.
- So Google didn’t index them, which is a good thing. But why in the world does Google show the URLs in its search results, and why were they treated as more important than the other 29,590 results Google says are available? Didn’t usatoday.com already tell Google NOT to do anything with these URLs?
- Now, this in itself is not a big deal for usatoday.com, but it is a big deal to the researcher. Instead of getting 10 results he could quickly scan to decide whether to visit them, he had to click the “next” button to see more content, which of course displayed more Google Ads (which were much better targeted to the query). As a web site owner, would you like Google showing URLs that you told search engines not to fetch?
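For reference, this is how a well-behaved robot would read that robots.txt rule. The sketch uses Python’s standard `urllib.robotparser`; the rule is the one quoted above, while the two page paths and the `allowed` helper are hypothetical and only there to show one blocked and one permitted URL.

```python
from urllib.robotparser import RobotFileParser

# The rule quoted above, written one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /Olympics
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="Googlebot"):
    """True if `agent` may fetch `url` under the rules parsed above."""
    return parser.can_fetch(agent, url)

# Hypothetical paths, chosen only to illustrate the rule.
print(allowed("http://www.usatoday.com/Olympics/results.htm"))
print(allowed("http://www.usatoday.com/news/front.htm"))
```

The first check comes back disallowed and the second allowed. Note that the rule only governs fetching; it says nothing about whether a search engine may list a URL it learned about from links elsewhere, which is exactly the gap the questions below are about.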
A few questions that summarize the whole story so far:
- Why does Google give us results for empty pages of the same URLs for months at a time?
- Why do empty pages, which Google claims not to have indexed and which show up as empty results, rank higher than pages that have been indexed?
- Why does Google crawl sites that are clearly off limits to robots?
- Why does Google include URLs in its SERPs that, again, are off limits to all robots, including Google’s?
It’s not as if we are asking Google to share its trade secrets or anything like that. We just want to know why Google is lying and hurting so many businesses that rely on Google traffic. Google went public and asked us all to invest in it, which we have. Google should now answer to the public and tell us why it is destroying our businesses.