"The Internet Is Becoming a Mall"?
An Empirical Investigation Through Web Directories


Overview

As use of the Internet grew during the 1990s, various analysts and scholars noted its apparently increasing emphasis on commercial content (Lessig, Nesson). Relative to the Internet's 1970s origins as a defense and government communications system, there can be no doubt that Internet content came to include an increased number of business sites. However, a review of web sites over time suggests that the trend of commercialization may be somewhat less pronounced than Lessig and Nesson had suggested and feared.

In recent research, I have tracked changes in the size of category pages in the leading web directories Yahoo, Mozilla Open Directory, and Google Directory. From these data, it is possible to draw inferences about trends in the availability of certain types of web content - inferences that speak to the claims of commercialization of the Internet.

 

Methodology

The web directories Yahoo, Mozilla Open Directory, and Google Directory each classify substantial portions of the web into hierarchical categories. Using scripts written by the author and run on a number of occasions over time, together with data available from the archive.org "Wayback Machine," it is possible to determine the number of sites and categories in the various top-level categories of each directory at multiple points in time. By comparing the sizes of particular categories across time, it is possible to draw inferences about the rate of change of available content of each type.
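As a rough illustration of the comparison step, the sketch below assumes that category counts have already been collected into per-date snapshots. The snapshot structure, the function name growth_between, and all of the counts are hypothetical placeholders; they are not the author's scripts or data.

```python
from datetime import date

# Hypothetical snapshots: {snapshot date: {category name: number of listed sites}}.
# The counts are made-up placeholders, not measurements from any directory.
snapshots = {
    date(2001, 1, 1): {"Arts": 1000, "Business": 2000},
    date(2002, 1, 1): {"Arts": 1500, "Business": 2200},
}


def growth_between(snapshots, start, end):
    """Return fractional growth per category between two snapshot dates."""
    before, after = snapshots[start], snapshots[end]
    return {
        category: (after[category] - before[category]) / before[category]
        for category in before
        # Skip categories that are missing later or were empty at the start.
        if category in after and before[category] > 0
    }


growth = growth_between(snapshots, date(2001, 1, 1), date(2002, 1, 1))
for category, rate in sorted(growth.items()):
    print(f"{category}: {rate:+.1%}")
```

Categories that were renamed, merged, or split between snapshots would need to be reconciled before a comparison of this kind is meaningful.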

 

Data & Results

Directory              | Data Range | Data & Charts
                       |            | Category sizes over time | Category growth over time | Year-over-year growth
Yahoo                  | 1997-2002  | Table / Chart            | Table / Chart             | Table / Chart
Mozilla Open Directory | 1999-2002  | Table / Chart            | Table / Chart             | Table / Chart
Google Directory       | 2001-2002  | Table / Chart            | Table / Chart             | Omitted - more info

A portion of the changes visible in these data is properly attributed not to changes in the Internet itself but to changes in the structure of the web directories. Learn more about this data interpretation problem.

 

Analysis & Conclusions

These data speak to a number of questions raised by Lessig, Nesson, and others:

Is commercial content coming to dominate the rest of the web?

These data suggest that commercial content is in fact decreasing as a proportion of total web sites. According to Yahoo, the proportion of business sites on the web peaked in 2000 at 47%, but has since declined to 44%. Google Directory likewise shows a decline between 2001 and 2002, and while the Mozilla Open Directory shows increases through the start of 2001, it too shows a fall during 2001. See the chart of the proportion of business sites to total sites.
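The proportion in question is simply one category's listings divided by all listings in the same snapshot. The minimal sketch below uses made-up counts and a hypothetical helper, category_share, to show how a category can grow in absolute terms while its share of the whole falls.

```python
def category_share(sizes, category):
    """Return one category's fraction of all listed sites in a snapshot."""
    total = sum(sizes.values())
    if total == 0:
        raise ValueError("snapshot contains no listed sites")
    return sizes[category] / total


# Made-up counts, chosen only to make the arithmetic easy to follow.
earlier = {"Business": 500, "Arts": 200, "Recreation": 300}
later = {"Business": 600, "Arts": 400, "Recreation": 500}

# Business grows from 500 to 600 listings, yet its share of the total falls.
print(f"earlier share: {category_share(earlier, 'Business'):.0%}")
print(f"later share:   {category_share(later, 'Business'):.0%}")
```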

Has non-commercial content "caught up" with commercial content on the web?

It depends -- in two different respects.

First, it depends on whom you ask. According to Yahoo and Google, Business sites remain more prevalent than those of any other single category. (See Yahoo, Google.) In contrast, the Mozilla Open Directory lists more sites in Arts than in Business.

Second, it depends on what you mean by "caught up." The preceding answer speaks to absolute category size; measured in that way, Business remains larger than any other single category in Yahoo and Google. But certain other categories now seem to be growing faster. Taking as a baseline Yahoo as it stood in 1997, Recreation, News & Media, Society & Culture, Arts, and Reference have each grown more quickly than Business. (See Yahoo Growth.)
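The distinction between absolute size and growth relative to a baseline can be sketched as follows. The function outpacing and the counts are hypothetical, chosen so that one category remains the largest while another grows faster from a smaller base.

```python
def outpacing(baseline, current, reference="Business"):
    """List categories whose growth ratio since the baseline exceeds the reference's."""
    ratio = {
        category: current[category] / baseline[category]
        for category in baseline
        if category in current and baseline[category] > 0
    }
    return sorted(c for c, r in ratio.items() if r > ratio[reference])


# Made-up counts: Business stays largest, but Arts grows faster from its baseline.
baseline = {"Business": 1000, "Arts": 200, "Reference": 100}
current = {"Business": 3000, "Arts": 900, "Reference": 250}

print(outpacing(baseline, current))
```

Here Business triples while Arts grows 4.5-fold, so Arts is reported as outpacing Business even though Business still has more listings.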

Doesn't today's Internet have more commercial content than in the past?

Yes, for two different reasons.

First, the original Internet was an academic and research network -- one that lacked commercial content altogether. So any amount of commercial content represents an increase over the Internet's 1970s origins. (More Internet history from Yahoo.)

Second, as the Internet has grown, there is more content of every type than was previously the case. Given the huge increase in the number of site listings in, for example, Yahoo, an increase in commercial content is to be expected.

 

Assumptions, Improvements, and Future Work

The discussion above seeks to draw conclusions about web content availability on the basis of listings in leading web directories. As a result, these conclusions require the assumption that web directories do not systematically or sporadically underrepresent relevant portions of the web. The use of three separate directories addresses a portion of this concern, and it is notable that the separate sources generally yield the same implications. In future work, it would nonetheless be desirable to verify that the results described above truly reflect actual web site availability, rather than merely capturing artifacts of web directory design, reorganization, or selection effects. In particular, future work might extend the analysis to other similar or related data sources -- providing further verification that web content availability is accurately summarized by trends in directory size.

The discussion above is relatively loose in its definition of "content." To improve the specificity of results, it would be desirable to investigate and quantify the connection between listings in web directories, registrations of domain names, provision of actual content on the web, and retrieval and use of that content. To the extent that these quantities are not directly proportional -- and they surely are not, since sites listed in directories are likely to be more frequently accessed than sites not listed -- measurements of one quantity do not perfectly map to measurements of another, limiting possible inferences from the data analyzed here.

Certain critiques of online content have focused not on an alleged dearth of non-commercial content but instead on an alleged shortage of high-quality content suitable for use in, for example, schools. To the extent that directory staff and volunteers serve as gatekeepers to their respective directories, denying listings to sites with perceived low-quality content, listings in web directories may be closely related to the availability of high-quality content.

 


Ben Edelman
Last Updated: July 2, 2002

This page is hosted on a server operated by the Berkman Center for Internet & Society at Harvard Law School, using space made available to me in my capacity as a Berkman Center affiliate for academic and other scholarly work. The work is my own, and the Berkman Center does not express a position on its contents.