The Truth About the Web
By Z Smith
We've been crawling the "public" Web: the sites you don't need a password or payment to view. We set about building our crawler in August 1996; it became operational October 1st, and by October 18th we were crawling at 440 MB/hr over our T-1 line. In January we added a second T-1 line. When gathering just HTML, our crawler fetches about 4 million pages per day, comparable to Scooter, the famed AltaVista crawler.
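As a sanity check on that rate, a T-1 runs at 1.544 Mbps, so 440 MB/hr is a substantial but sustainable fraction of the line's capacity. A quick calculation (assuming MB here means 10^6 bytes, which the article does not specify):

```python
# Compare the reported 440 MB/hr crawl rate against a T-1's raw capacity.
t1_bits_per_sec = 1.544e6                             # T-1 line rate
t1_mb_per_hour = t1_bits_per_sec * 3600 / 8 / 1e6     # about 695 MB/hr
utilization = 440 / t1_mb_per_hour                    # about 63%

print(f"T-1 capacity: {t1_mb_per_hour:.0f} MB/hr")
print(f"utilization at 440 MB/hr: {utilization:.0%}")
```

This ignores protocol overhead, so the effective utilization of the line was likely somewhat higher still.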
While one could start at any well-connected site (say, Yahoo!) and just follow the links, we had some data that gave us a head start: donated text crawls from two text search-engine companies and a university research project. We also looked at the lists of URLs served by a major third-level cache (18 million Web-object requests).
Adding these sources together, we were able to build a master list of sites with well over one million site names. We then used DNS to find out how many of these names were actually valid sites and how many were aliases.
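A minimal sketch of that DNS step, using Python's standard `socket` module: names that fail to resolve are treated as invalid, and names that resolve to the same canonical host are grouped as aliases. The function name and grouping strategy are illustrative, not the actual tooling we used.

```python
import socket

def classify_hostnames(names):
    """Resolve each candidate site name via DNS.

    Returns (valid, invalid), where valid maps each canonical
    hostname to the set of input names that resolve to it
    (aliases collapse into one entry), and invalid is the list
    of names that failed to resolve.
    """
    valid = {}
    invalid = []
    for name in names:
        try:
            canonical, _aliases, _ips = socket.gethostbyname_ex(name)
        except socket.gaierror:
            invalid.append(name)
            continue
        valid.setdefault(canonical, set()).add(name)
    return valid, invalid
```

In practice the master list held over a million names, so a real run would batch the lookups and tolerate DNS timeouts rather than resolve serially as shown here.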
What follows is a list of interesting facts about the Web, mostly based on data gathered by the Internet Archive, but augmented with some statistics from Larry Page of Stanford University and public documents from the Web.
We estimate there are 80 million HTML pages on the public Web as of January 1997. The figure is fuzzy because some sites are entirely dynamic (a database generates pages in response to clicks or queries). The typical Web page has 15 links (HREFs) to other pages or objects and five sourced objects (SRC), such as sounds or images.
The upshot of this data is that it takes about 400 GB to store the text of a snapshot of the public Web and about 2000 GB (2 TB) to store nontext files.
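The implied per-page averages can be recovered from those totals. The figures below are derived from the article's own numbers (80 million pages, 400 GB of text, 2 TB of nontext, five sourced objects per page), not independently measured:

```python
# Back-of-envelope check of the article's storage estimates.
pages = 80_000_000        # estimated public HTML pages, January 1997
text_total_gb = 400       # snapshot of all HTML text
nontext_total_gb = 2000   # snapshot of all nontext (images, sounds, etc.)
objects_per_page = 5      # sourced (SRC) objects on a typical page

avg_page_kb = text_total_gb * 1e6 / pages                          # GB -> KB
avg_object_kb = nontext_total_gb * 1e6 / (pages * objects_per_page)

print(f"average HTML page:     {avg_page_kb:.0f} KB")
print(f"average sourced object: {avg_object_kb:.0f} KB")
```

Both work out to roughly 5 KB apiece, which is consistent with the small pages and heavily compressed images typical of the 1997 Web.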
Z is vice president of engineering for Internet Archive. He can be contacted at email@example.com.