I was wondering how a Web Crawler [wikipedia.org] actually works.

Did a search and found out…



How Google Works

If you aren’t interested in learning how Google creates the index and the database of documents that it accesses when processing a query, skip this description. I adapted the following overview from Chris Sherman and Gary Price’s wonderful description of How Search Engines Work in Chapter 2 of The Invisible Web (CyberAge Books, 2001).

Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations can be performed simultaneously, significantly speeding up data processing. Google has three distinct parts:

* Googlebot, a web crawler that finds and fetches web pages.
* The indexer that sorts every word on every page and stores the resulting index of words in a huge database.
* The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.
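As a rough illustration of the indexer and query processor parts, here is a minimal sketch of an inverted index in Python. It is only a toy under my own assumptions: the names `build_index` and `search` and the sample pages are made up, and it says nothing about how Google really stores or ranks documents.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents, only for illustration.
docs = {
    "page1": "googlebot fetches web pages",
    "page2": "the indexer sorts every word",
    "page3": "web pages are sorted by the indexer",
}
index = build_index(docs)
print(sorted(search(index, "web pages")))
```

A real query processor would also rank the matching documents by relevance; this sketch only finds which documents contain all the query words.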

Let’s take a closer look at each part.

1. Googlebot, Google’s Web Crawler

Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn’t traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google’s indexer.

Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it’s capable of doing.
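The idea of deliberately slowing down requests to each individual web server can be sketched as a per-host delay. This is only a sketch under my own assumptions: the class `PolitenessScheduler`, the 2 second delay and the example hosts are hypothetical, and Googlebot's real pacing rules are not described in the text.

```python
from urllib.parse import urlsplit

class PolitenessScheduler:
    """Track, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest allowed time for the next request

    def wait_time(self, url, now):
        """Seconds to wait before requesting url, given the current time."""
        host = urlsplit(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_request(self, url, now):
        """Remember that a request to url's host was sent at time now."""
        host = urlsplit(url).netloc
        self.next_allowed[host] = now + self.min_delay

# Hypothetical usage with made-up timestamps (seconds).
scheduler = PolitenessScheduler(min_delay_seconds=2.0)
scheduler.record_request("http://a.example/page1", now=100.0)
print(scheduler.wait_time("http://a.example/page2", now=100.5))  # same host, must wait
print(scheduler.wait_time("http://b.example/", now=100.5))       # other host, no wait
```

The time is passed in explicitly so the behaviour is easy to check; a real crawler would use a clock such as `time.monotonic()`.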

Googlebot finds pages in two ways: through an add URL form, www.google.com/addurl.html, and through finding links by crawling the web.

[Screenshot: the web page for adding a URL to Google]

Unfortunately, spammers figured out how to create automated bots that bombarded the add URL form with millions of URLs pointing to commercial propaganda. Google rejects those URLs submitted through its Add URL form that it suspects are trying to deceive users by employing tactics such as including hidden text or links on a page, stuffing a page with irrelevant words, cloaking (aka bait and switch), using sneaky redirects, creating doorways, domains, or sub-domains with substantially similar content, sending automated queries to Google, and linking to bad neighbors. So now the Add URL form also has a test: it displays some squiggly letters designed to fool automated “letter-guessers”; it asks you to enter the letters you see — something like an eye-chart test to stop spambots.

When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for subsequent crawling. Googlebot tends to encounter little spam because most web authors link only to what they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can quickly build a list of links that can cover broad reaches of the web. This technique, known as deep crawling, also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can reach almost every page in the web. Because the web is vast, this can take some time, so some pages may be crawled only once a month.
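Culling the links from a fetched page and adding them to a queue could look roughly like the sketch below, using only the Python standard library. The class `LinkExtractor`, the helper `extract_links` and the example URLs are my own illustrative names, not anything from Googlebot itself.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links into absolute URLs.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Hypothetical page content; a real crawler would download it first.
page = '<a href="/about.html">About</a> <a href="http://other.example/">Other</a>'
queue = deque()
queue.extend(extract_links("http://site.example/index.html", page))
print(list(queue))
```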

Although its function is simple, Googlebot must be programmed to handle several challenges. First, since Googlebot sends out simultaneous requests for thousands of pages, the queue of “visit soon” URLs must be constantly examined and compared with URLs already in Google’s index. Duplicates in the queue must be eliminated to prevent Googlebot from fetching the same page again. Second, Googlebot must determine how often to revisit a page. On the one hand, it’s a waste of resources to re-index an unchanged page. On the other hand, Google wants to re-index changed pages to deliver up-to-date results.
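Eliminating duplicates from the "visit soon" queue can be sketched with a queue plus a set of URLs that have already been seen. The class name `Frontier` and its methods are my own assumptions for illustration; the real system presumably does this at a far larger scale and with different data structures.

```python
from collections import deque

class Frontier:
    """A 'visit soon' queue that never holds the same URL twice."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()  # every URL ever queued, including already fetched ones

    def add(self, url):
        """Queue url unless it was seen before; return True if it was added."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_url(self):
        """Return the next URL to fetch, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None

# Hypothetical usage.
frontier = Frontier()
frontier.add("http://site.example/")
frontier.add("http://site.example/about.html")
frontier.add("http://site.example/")  # duplicate, ignored
print(len(frontier.queue))
```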

To keep the index current, Google continuously recrawls popular, frequently changing web pages at a rate roughly proportional to how often the pages change. Such crawls keep an index current and are known as fresh crawls. Newspaper pages are downloaded daily; pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the two types of crawls allows Google to both make efficient use of its resources and keep its index reasonably current.
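One simple way to make the recrawl rate follow how often a page changes is to shorten the revisit interval when a change is found and lengthen it when the page is unchanged. This is a generic heuristic I am assuming for illustration, not Google's documented method; the function `next_interval` and its limits (1 hour to 720 hours) are made up.

```python
def next_interval(current_hours, changed, min_hours=1.0, max_hours=720.0):
    """Return the next revisit interval in hours.

    Halve the interval if the page changed since the last visit,
    double it if the page was unchanged, and clamp to the given limits.
    """
    factor = 0.5 if changed else 2.0
    return min(max_hours, max(min_hours, current_hours * factor))

# Hypothetical usage: a page visited daily that keeps changing.
interval = 24.0
for _ in range(3):
    interval = next_interval(interval, changed=True)
print(interval)
```

With this rule, a page that changes on every visit drifts toward the minimum interval, while a page that never changes drifts toward about once a month.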

Hm.. next question was..

How often does the crawler actually come and do its work?
.. someone had been monitoring it and reported the following:
Below you will find the 10 latest Google Web Crawler events on Google Dance Tool after August 20, 2003 (the date the script was installed). This will help you develop ideas of your own concerning how Google operates, their crawling patterns, and maybe help figure out the specific function of each google crawler.

As time goes on more information will be collected from Google’s crawlers and more information on Google’s crawlers will be posted… :) so check back frequently for updates.

| Date Crawled | IP Address | Crawler |
| --- | --- | --- |
| May 15, 2009, 6:53pm | 66.249.72.52 | crawl-66-249-72-52.googlebot.com |
| May 15, 2009, 3:16pm | 66.249.72.52 | crawl-66-249-72-52.googlebot.com |
| May 15, 2009, 2:15pm | 66.249.72.52 | crawl-66-249-72-52.googlebot.com |
| May 15, 2009, 11:20am | 66.249.72.52 | crawl-66-249-72-52.googlebot.com |
| May 15, 2009, 10:04am | 66.249.72.52 | crawl-66-249-72-52.googlebot.com |
| May 15, 2009, 9:19am | 66.249.72.52 | crawl-66-249-72-52.googlebot.com |
| May 15, 2009, 6:04am | 66.249.72.52 | crawl-66-249-72-52.googlebot.com |
| May 15, 2009, 5:02am | 66.249.72.52 | crawl-66-249-72-52.googlebot.com |
| May 15, 2009, 4:38am | 213.199.128.149 | tide75.microsoft.com |
| May 15, 2009, 3:02am | 88.54.127.81 | host81-127-static.54-88-b.business.telecomitalia.it |

Third question.. what does AdSense have to do with the crawler?

How do your crawler and site diagnostic reports work?

Here are some facts about our crawler and Site Diagnostic reports that you may find helpful:

* The crawler report is updated weekly.
The crawl is performed automatically and we’re not able to accommodate requests for more frequent crawling.
* The AdSense crawler is different from the Google crawler
The two crawlers are separate, but they do share a cache. We do this to avoid both crawlers requesting the same pages, thereby helping publishers conserve their bandwidth. Similarly, the Sitemaps crawler is separate.
* Resolving AdSense crawl issues will not resolve issues with the Google crawl.
Resolving the issues listed on your Site Diagnostics page will have no impact on your placement within Google search results. For more information on your site’s ranking on Google, review our entry on getting included in Google search results.
* The crawler indexes by URL.
Our crawler will access site.com and www.site.com separately. However, our crawler will not count site.com and site.com/#anchor separately.
* The crawler won’t access pages or directories prohibited by a robots.txt file.
Both the Google and AdSense Mediapartners crawlers honor your robots.txt file. In case it prohibits access to certain pages or directories, they will not be crawled.
* The crawler will attempt to access only URLs that request Google ads.
Only pages displaying Google ads should be sending requests to our systems and being crawled.
* The crawler will attempt to access pages that redirect.
When you have “original pages” that redirect to other pages, our crawler must access the original pages to determine that a redirect is in place. Therefore, our crawler’s visit to the original pages will appear in your access logs.
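Two of the points above, indexing by URL and honoring robots.txt, can be tried out with the Python standard library. This is only a sketch under my own assumptions: the helper `crawl_key`, the sample robots.txt rules and the example paths are made up, and the real crawlers may normalize URLs differently.

```python
from urllib.parse import urldefrag
from urllib.robotparser import RobotFileParser

def crawl_key(url):
    """Drop the #fragment so a page and its anchors count as one URL."""
    return urldefrag(url).url

# site.com and www.site.com stay separate; the #anchor does not.
print(crawl_key("http://site.com/#anchor") == crawl_key("http://site.com/"))
print(crawl_key("http://site.com/") == crawl_key("http://www.site.com/"))

# Hypothetical robots.txt content, parsed from lines instead of fetched.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(robots.can_fetch("Mediapartners-Google", "http://site.com/private/a.html"))
print(robots.can_fetch("Mediapartners-Google", "http://site.com/public/a.html"))
```

In a live crawler the rules would be loaded from the site's own robots.txt file (for example with `set_url()` and `read()`) before any page is requested.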

Fourth.. how do I properly monitor Google crawler stats and all?

I use the following:
1. For realtime.. I just look at “User Online” and see if Google Bot is there.
2. Out of curiosity.. I sometimes examine the access.log of the httpd server itself (I wonder if other webadmins do the same..).
3. Nice graphs from awstats.
4. Hmm.. a tool provided by Google (Google Webmaster Tools) at http://www.google.com/webmasters/
I submitted a sitemap and it generates all the nice graphs.
5. Manually querying the expected keyword in Google search and checking the result (sometimes I ask other users to search for a keyword from their own location and compare the results.. interestingly, they vary..).
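Examining the access.log by hand (item 2) can be automated a little. Below is a minimal sketch that counts lines mentioning a crawler's user-agent. The function `count_crawler_hits`, the marker strings and the sample log lines are my own assumptions for illustration; real log formats vary, and a user-agent string can be faked, so treat the counts as a rough hint only.

```python
from collections import Counter

def count_crawler_hits(log_lines, markers=("Googlebot", "Mediapartners-Google")):
    """Count log lines whose text contains one of the crawler markers."""
    hits = Counter()
    for line in log_lines:
        for marker in markers:
            if marker in line:
                hits[marker] += 1
                break  # count each line once
    return hits

# Made-up sample lines in a combined-log-like format.
sample = [
    '66.249.72.52 - - [15/May/2009:18:53:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.72.52 - - [15/May/2009:15:16:00 +0000] "GET /post HTTP/1.1" 200 900 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.1 - - [15/May/2009:15:20:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [15/May/2009:15:30:00 +0000] "GET /ads HTTP/1.1" 200 300 "-" '
    '"Mediapartners-Google"',
]
hits = count_crawler_hits(sample)
print(hits["Googlebot"], hits["Mediapartners-Google"])
```

To run it on a real log, pass an open file instead of the sample list, for example `count_crawler_hits(open("access.log"))`, where the path is whatever your httpd server uses.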

Fifth.. does how often I update my blog/site affect the Google crawler?

1. From ahmadazwan.com: he said the crawler will take longer to come back and crawl my blog if I haven’t updated it for a long time.
More frequent updates invite more frequent crawler visits..

2. I think some engine in WordPress notifies the crawler whenever a new post is created and published.
Avoid editing immediately after publishing.. it will take longer for the post to be fully cached and indexed..
Make use of drafts.. if the content is not ready for your readers, do not publish it (though I tend to publish immediately anyway..).

3. The WordPress scheduler seems to send some sort of early notification, but the post is not visible before its publishing date..

4. My visitors sometimes report the time a post got indexed by Google.. so I can guess that about 2 hours is normal..

From my observation.. the fastest indexing I ever got was 5 minutes after an update… it was a post about something from news.google.com.
It was midnight.. and the topic was quite hot..
The normal time was about 2 hours.. and with editing after publishing and all sorts of nonsense.. it took 8 hours.. hahaha..

5. If I put the word Google somewhere in the title or the text.. the post tends to get indexed faster. Hahaha.. this is just an assumption.

p/s: if the right search words bring a user to the exact page where the information lies.. it tends to invite nice comments afterward.. and also spambots (especially if I use twitter as a keyword..) .. 8-)

References :

1. http://www.googleguide.com/google_works.html
2. http://www.google-dance-tool.com/google_crawler_history.html
3. http://en.wikipedia.org/wiki/Web_crawler
4. http://www.ahmadazwan.com