Empirical Analysis of Internet Filtering in China
Jonathan Zittrain* and Benjamin Edelman**
Berkman Center for Internet & Society
Harvard Law School
The authors are collecting data on the methods, scope, and depth of selective barriers to Internet access through Chinese networks. Tests from May 2002 through November 2002 indicate at least four distinct and independently operable methods of Internet filtering, with a documentable leap in filtering sophistication beginning in September 2002. The authors document thousands of sites rendered inaccessible using the most common and longstanding filtering practice. These sites were found through connections to the Internet by telephone dial-up link and through proxy servers in China. Once so connected, the authors attempted to access approximately two hundred thousand web sites. The authors tracked 19,032 web sites that were inaccessible from China on multiple occasions while remaining accessible from the United States. Such sites contained information about news, politics, health, commerce, and entertainment. The authors conclude (1) that the Chinese government maintains an active interest in preventing users from viewing certain web content, both sexually explicit and non-sexually explicit; (2) that it has managed to configure overlapping nationwide systems to effectively -- if at times irregularly -- block such content from users who do not regularly seek to circumvent such blocking; and (3) that such blocking systems are becoming more refined even as they are likely more labor- and technology-intensive to maintain than cruder predecessors.
The government of the People's Republic of China has a longstanding set of policies restricting the information to which citizens are exposed, and that which they may themselves publicly say. The Internet poses a new challenge to such censorship, both because of the sheer breadth of content typically available, and because sources of content are so often remote from Chinese jurisdiction, and thus much more difficult to penalize for breaching restrictions on permissible materials. There is some evidence that the government has attempted to prevent the spread of unwanted material by preventing the spread of the Internet itself, but a concomitant desire to capture the economic benefits of networked computing has led to a variety of strategies to split the difference. For example, the government might encourage Internet access through cybercafes rather than in private spaces so that customers' surfing can be physically monitored by others in the cafe. As a technical matter, anecdotal reports have described a shifting set of barriers to surfing the web from Chinese points of access -- sites that are reported unavailable or domain names that are unknown to the system or that lead to unexpected destinations, individual pages that are blocked, and the use of search keywords that results in temporary limits to further searches.
As with most filtering regimes, whether implemented at the client, ISP, or government level, no list is made available of the sites blocked or of the methodologies used to block them. Further, while the government-connected Internet Society of China (not a chapter of the international Internet Society) has asked Internet service providers and content creators to sign a pledge including self-filtering, few official statements document the existence of government-maintained web filtering, much less the criteria employed and thresholds necessary to elicit a block. We therefore sought to investigate the growing set of methods by which Internet filtering is accomplished, and to collect and distribute a list of blocked sites and pages -- a list that is large in absolute terms even if small relative to the size of the Internet and to the total amount of blocked content, and a list that is diverse even if not perfectly representative of all blocked content. Such a list allows us and others to begin to assess the nature and scope of filtering within China, with particular attention to non-sexually explicit web sites rendered inaccessible there.
Having requested 204,012 distinct web sites, we found more than 50,000 to be inaccessible from at least one point in China on at least one occasion. Adopting a more conservative standard to separate sites that were intentionally blocked from sites that were unreachable solely due to temporary glitches, we find that 18,931 sites were inaccessible from at least two distinct proxy servers within China on at least two distinct days. We conclude that China does indeed block a range of web content beyond that which is sexually explicit. For example, we found blocking of thousands of sites offering information about news, health, education, and entertainment, as well as 3,284 sites from Taiwan. A look at the list beyond sexually explicit content yields insight into the particular areas the Chinese government appears to find most sensitive.
This report is intended as a milepost, part of an ongoing empirical investigation documenting filtering levels and methods over time. As we continue to collect data on the evolving accessibility of a diversified "basket" of web sites, we will seek to say more about overall trends in Chinese web filtering, and further see if such trends are credibly linked to government statements of Internet policy and, for particular categories of sensitive sites, whether shifts in the Chinese government's substantive policy (for example, a noted change in tension levels with Taiwan) are reflected in levels of web filtering. This, in turn, can shed light on how important a priority web filtering is to the government.
In other work, the authors will expand analysis to Internet filtering systems in other countries and will generate additional URLs to test based on queries invoked in the local language. The authors are also developing a distributed application for use by Internet users worldwide in testing, analyzing, and documenting respective Internet filtering regimes. The authors previously provided access to a web-based system to test web filtering in China, which remains available. Finally, the authors prepared screenshots documenting the September 2002 redirection of requests for google.com to other search engines.
Testing Methodology & Technical Notes on Chinese Filtering Systems
Our testing relied on two separate methods of data collection. From March 20 to May 6, 2002, we connected by modem with an international telephone call to dialup accounts with several Chinese ISPs. After May 6 our modems were unable to negotiate a "handshake" with modems answering at any Chinese ISPs, a failure consistent across multiple phone lines and locations, and multiple ISPs and points of presence in China. From August 14 to November 12, 2002, we connected to open proxy servers in China. We selected open proxies with assistance from Ronald F. Guilmette, and we determined their respective listed locations for tabulation purposes using IP-WHOIS.
We conducted testing of only one URL per web host based on our background knowledge, confirmed in subsequent testing, that when the default page of a site was filtered, the entirety of that site was typically filtered. Our appendix contains more about this hypothesis, its support, and our level of confidence in it. As a result, when we report a site as inaccessible, it is typically the case that the entirety of that site was inaccessible -- not just the site's default page or "front page."
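The one-URL-per-host rule described above can be illustrated with a short sketch. This is not the authors' tooling; the function name `one_url_per_host` and the example URLs are hypothetical, and keeping the first URL seen for each host is an assumption made for illustration.

```python
from urllib.parse import urlsplit

def one_url_per_host(urls):
    """Keep the first URL seen for each host name (hypothetical helper)."""
    chosen = {}
    for url in urls:
        # urlsplit().hostname is lowercased, so WWW.EXAMPLE.ORG and
        # www.example.org count as the same host; it is None if no host is present.
        host = urlsplit(url).hostname
        if host and host not in chosen:
            chosen[host] = url
    return chosen

# Hypothetical candidate list mixing several pages of the same site.
candidates = [
    "http://www.example.org/",
    "http://www.example.org/news/today.html",
    "http://WWW.EXAMPLE.ORG/about",
    "http://news.example.net/index.html",
]
sample = one_url_per_host(candidates)
```

Here `sample` holds two entries, one per distinct host, so each site would be requested once per test run.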
On the basis of our testing, both automated and manual, we have reached an increased understanding of the design of filtering systems used to restrict Internet access in China. Our appendix discusses the details of this filtering, including the details we have inferred as to the implementation of filtering systems and the prospects for circumventing them, as well as possible regional variations in filtering and their impact on concluding that a given site is "blocked in China."
Specific Sites Found to be Blocked
During testing, we requested 204,012 distinct sites drawn from various web indices (such as sites listed within Yahoo! Taiwan's directory categories) and search results (such as Google's top 100 results for a search on "China freedom"). Most sites were accessible from China just as from our standard Internet connection in the United States, but we found that certain URLs were consistently unavailable. By attempting to retrieve these sites repeatedly over time, from multiple locations within China, we drew inferences on which specific sites among them were intentionally blocked by Chinese network staff. Our subsequent analysis considers a site to be blocked if it was found to be inaccessible by our testing system on at least two distinct occasions from at least two distinct testing locations in China, and if at those times it was simultaneously reachable from our main testing location in the United States.
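The definition of "blocked" given above can be expressed as a small classification routine. The sketch below is illustrative only: the record layout (`Observation`), the location labels, and the reading of "occasion" as a calendar day are assumptions, not a description of the authors' testing system.

```python
from collections import defaultdict
from typing import NamedTuple

class Observation(NamedTuple):
    host: str
    location: str      # hypothetical label for a proxy or dial-up point in China
    day: str           # date of the test, e.g. "2002-09-01"
    ok_in_china: bool  # did the request through China succeed?
    ok_in_us: bool     # did the simultaneous control request from the US succeed?

def blocked_hosts(observations, min_locations=2, min_days=2):
    """Return hosts that failed from enough locations on enough days."""
    locations = defaultdict(set)
    days = defaultdict(set)
    for obs in observations:
        # Count a failure only when the site was reachable from the US
        # at the same time; otherwise the site itself may have been down.
        if not obs.ok_in_china and obs.ok_in_us:
            locations[obs.host].add(obs.location)
            days[obs.host].add(obs.day)
    return {
        host for host in locations
        if len(locations[host]) >= min_locations and len(days[host]) >= min_days
    }

observations = [
    # Fails from two locations on two days while reachable from the US.
    Observation("a.example", "proxy-1", "2002-09-01", False, True),
    Observation("a.example", "proxy-2", "2002-09-08", False, True),
    # Fails twice, but only ever from one location.
    Observation("b.example", "proxy-1", "2002-09-01", False, True),
    Observation("b.example", "proxy-1", "2002-09-08", False, True),
    # Second failure coincides with a failure from the US, so it is not counted.
    Observation("c.example", "proxy-1", "2002-09-01", False, True),
    Observation("c.example", "proxy-2", "2002-09-08", False, False),
    # Always reachable.
    Observation("d.example", "proxy-1", "2002-09-01", True, True),
]
blocked = blocked_hosts(observations)
```

With these hypothetical records only `a.example` meets the standard. The two thresholds are parameters so that the looser "at least one point on at least one occasion" count can be reproduced by passing `min_locations=1, min_days=1`.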
Filtering of Sexually Explicit Content
A preliminary round of testing examined 795 distinct URLs containing sexually explicit images. These URLs had been used as the basis for a portion of one author's expert testimony in Multnomah County Public Library, et al. v. United States, 201 F.Supp.2d 401 (E.D.Pa., 2002). He generated this list by collecting all 797 results from Google in response to an October 2001 web search using the search criteria "free adult sex," less two pages removed because they were found not to include sexually explicit images. Of the 752 of these pages still providing content at the time of this testing, 101 (13.4%) were blocked in China. In contrast, the authors previously found 695 (86.2%) of these same sites to be blocked in Saudi Arabia, and one author previously found that leading commercial filtering applications blocked 70% to 90% of these sites.
Filtering of Other Content
Our main testing examined 203,217 web sites drawn from categories other than sexually explicit content. We seeded this list of sites from multiple sources. For example, we extracted from Yahoo all web sites in certain categories (including those specifically about education, entertainment, news, major world governments, and politics) as well as all sites in the non-English regional versions of Yahoo that specifically concern China and Taiwan. We conducted searches on terms likely to yield sensitive results and thus candidates for blocking, both in English and in Chinese, using the Google search engine, and placed the top results into our list of URLs to test. We tracked approximately 5,000 additional sites submitted to our Real-Time Testing System through September 2002, and we received email suggestions of further sites to test. The result of these data sources was a list of 203,217 distinct host names.
Using the definition of "blocked" specified above, we found that a total of 18,931 of these sites (9.3%) were blocked in China. Given the large number of sites blocked, we have organized our listing of specific blocked pages into highlights -- some blocked pages that are well known or otherwise of possible interest -- followed by the full list. Where available, each page's listing includes its HTML title, its META keywords and description, and its Yahoo Directory and Google Directory classifications. These details are as retrieved in November 2002.
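As a consistency check, the counts reported in this section appear to reconcile with the totals quoted elsewhere in the report. The following sketch simply redoes the arithmetic; the variable names are ours, and treating the 19,032 figure as the sum of the two blocked counts is an inference from the numbers rather than a statement in the text.

```python
# Figures as reported in the text.
explicit_tested = 795        # sexually explicit URLs in the preliminary round
explicit_live = 752          # of those, still providing content when tested
explicit_blocked = 101
other_tested = 203_217       # hosts in the main round
other_blocked = 18_931

total_tested = explicit_tested + other_tested      # 204,012 sites requested
total_blocked = explicit_blocked + other_blocked   # 19,032 sites tracked as blocked

def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

explicit_rate = percent(explicit_blocked, explicit_live)  # 13.4
other_rate = percent(other_blocked, other_tested)         # 9.3
```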
Specific web sites blocked in China
Highlights of blocked sites - sites that are well known or otherwise of particular interest
Complete list of 18,931 blocked sites, sorted alphabetically by URL
Content Not Filtered
Within the context of the large number of sites found to be restricted in China, many other sites are not blocked in China, whether because they have yet to be passed upon by the authorities that determine blocks or because they have been affirmatively found to be non-sensitive. Sites not blocked may assist in drawing inferences about what content among the blocked sites is responsible for the differential treatment. For example, filtering of the United States Federal Courts (uscourts.gov and all subdomains) might indicate a desire to prevent access to information about the American judicial system, its processes, and its rulings -- but Findlaw, LexisNexis, and Westlaw all remain accessible. Similarly, blocking of well-known sexually explicit sites such as Playboy and Penthouse suggests a purposeful decision to restrict sexually-explicit material -- yet the well-known sites of Hustler Magazine and whitehouse.com were consistently accessible in the authors' testing.
Additional hosts tested but not found to be inaccessible (.ZIP file)
Analysis & Summary Statistics
Among the specific blocked pages are the following categories of content:
The authors obtained selected sites from Google searches on designated keywords. The graph and tables linked below report the proportion of sites found to be blocked in response to searches on specific keywords.
Blocking of search results by Google Keyword
Blocking of search results by Google Keyword - with blocked site details
Blocking of search results by Google Keyword - chart
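The per-keyword proportions reported in the linked tables amount to a simple ratio: for each keyword, the share of distinct result hosts that also appear on the blocked list. A minimal sketch follows, with placeholder keywords and host names standing in for real data; the function name is hypothetical.

```python
def blocked_share_by_keyword(results_by_keyword, blocked):
    """Map each keyword to the fraction of its distinct result hosts that are blocked."""
    shares = {}
    for keyword, hosts in results_by_keyword.items():
        unique = set(hosts)   # a host may appear more than once in search results
        if unique:            # skip keywords that returned no results
            shares[keyword] = len(unique & blocked) / len(unique)
    return shares

# Placeholder inputs; not drawn from the authors' data.
results_by_keyword = {
    "keyword-one": ["a.example", "b.example", "c.example", "d.example", "a.example"],
    "keyword-two": ["e.example", "f.example"],
    "keyword-three": [],
}
blocked_list = {"a.example", "c.example", "z.example"}
shares = blocked_share_by_keyword(results_by_keyword, blocked_list)
```

In this toy input half of the distinct hosts returned for `keyword-one` are blocked and none for `keyword-two`; `keyword-three` is omitted because it has no results to divide by.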
Blocking was found to vary across locations in China. However, the authors lack sufficient data to draw conclusions about systematic variations in blocking across geographic locations; current data is consistent both with intentional variations in blocking and with delays in updating block lists in certain regions.
The authors previously made available to the public a real-time testing site whereby interested Internet users could submit URLs for immediate testing through Chinese filtering systems. Between August 28 and November 21, 2002, this system received a total of 100,563 requests to test 13,569 distinct URLs on 12,335 distinct host names. More than 5,000 of these hosts had not previously been selected by the authors for testing.
Having previously examined Internet filtering in Saudi Arabia, the authors tested through Chinese filtering systems all sites previously tested in Saudi Arabia. The authors had previously tested 49,586 distinct hosts through filtering systems in Saudi Arabia and had found that country to restrict access to at least certain content on a total of 582 of these hosts (1.2% of sampled hosts). China also filtered 101 of these hosts (0.2%), while China filtered 5,903 additional hosts (11.9% of the sample) not filtered in Saudi Arabia. The chart below depicts the extent of overlap between filtering in China and in Saudi Arabia. (Note that the representation of hosts not blocked is not to scale, relative to the rest of the chart.)
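The overlap figures can be recomputed directly from the counts above. This sketch assumes that the 101 hosts filtered by China are among the 582 filtered in Saudi Arabia (that is, that 101 is the size of the overlap); the text supports this reading but does not state it outright, and the variable names are ours.

```python
sample = 49_586        # hosts tested in both countries
saudi_blocked = 582    # hosts filtered in Saudi Arabia
both = 101             # assumed: hosts filtered in both countries
china_only = 5_903     # hosts filtered in China but not in Saudi Arabia

saudi_only = saudi_blocked - both                # filtered only in Saudi Arabia
china_blocked = both + china_only                # all hosts in the sample filtered in China
neither = sample - saudi_only - both - china_only

def percent_of_sample(n):
    """Share of the 49,586-host sample, rounded to one decimal place."""
    return round(100 * n / sample, 1)

rates = {
    "saudi": percent_of_sample(saudi_blocked),   # 1.2
    "both": percent_of_sample(both),             # 0.2
    "china_only": percent_of_sample(china_only), # 11.9
}
```

Under that assumption, 481 hosts were filtered only in Saudi Arabia, 6,004 hosts in the sample were filtered in China, and 43,101 were filtered in neither country.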
From our data, it appears that the set of sites blocked in China is by no means static: whoever maintains the lists is actively updating them, and certain general-interest high-profile sites whose content changes frequently appear to be blocked and unblocked as those changes are evaluated. (This is particularly noticeable with news sites such as CNN and Slashdot.) Some new sites with sensitive content do not appear to take long to be blocked. However, even some longstanding sites of apparent sensitivity remain unblocked. This is most easily noticed in our data with respect to sexually-explicit sites -- we found blocking of only 13.4% of our sample of well-known sexually-explicit sites -- but is also anecdotally apparent elsewhere in our data: for example, some US intelligence sites are blocked while others are not. Further data collection will be geared toward determining the extent to which the basket of sites blocked reflects shifting substantive government policies -- whether, for example, a sea change in relations with Taiwan, positive or negative, is reflected in blocking, and if so, how quickly.
China's Internet filtering efforts remain opaque, and in the absence of government cooperation or admission of filtering methods, data probing of the sort used in our study remains a useful tool in determining the scope of filtering. The authors have previously studied filtering in Saudi Arabia and in American public libraries; in these locations, blockage of a web page leads to an error message clearly explaining that the requested page is unavailable due to intentional blockage. In contrast, China's systems make it difficult for a user to distinguish between an intentional block and a temporary network or server glitch. This may be intentional or may reflect technical happenstance -- that this implementation was easier or cheaper, given the size and design of China's network infrastructure. But some newer forms of Chinese filtering -- namely, redirection of a request for a sensitive web site to another web site -- can be either more or less obvious to the user than an apparent network glitch, depending on whether the substitution is noticed.
The primary and most longstanding means of blocking is at the router level, and on the basis of IP address -- the crudity of which means that those implementing filtering must choose between blocking an entire site on the basis of a small portion of its content, or tolerating such content. This would explain why, for example, the www.mit.edu server is sometimes wholly inaccessible even though Chinese officials likely have no objection to most content on that server. To the extent that the entirety of that server is nonetheless inaccessible, China's filtering system is properly considered to be overblocking, and we believe our data indicates extensive overblocking of this form. This may account for the rise of still-rare forms of blocking that allow more refined content filtering -- such as blocking by keywords or phrases in any particular HTML page requested by a user, whether or not the site hosting the page is present on an ex ante block list. Such blocking is likely far more technology-intensive, in principle even slowing overall network response time as packets are analyzed by sniffers and the results passed to filters. Aside from allowing more refined content filtering, such newer forms of blocking appear to be linked to disabling Internet access for an arbitrary amount of time for a user who requested a page with forbidden content -- enabling a penalty for attempting access to sensitive material beyond simply denying the very material requested. Other nascent but growing forms of filtering appear to be targeted to limit the information that can be gleaned from search engines -- enabling the automated blocking of search results that may not (yet) have been filtered through human placement on a "forbidden" list.
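The difference in granularity between the two approaches can be seen in a toy model. The sketch below does not describe how any Chinese filtering system is actually implemented; the host names, addresses (taken from the reserved documentation range), page texts, and forbidden phrase are all hypothetical.

```python
# Hypothetical name-to-address table: two hosts share one server address.
dns = {
    "www.univ.example": "192.0.2.10",
    "lab.univ.example": "192.0.2.10",
    "other.example": "192.0.2.20",
}
blocked_ips = {"192.0.2.10"}                # IP-level block list
forbidden_phrases = ("forbidden phrase",)   # keyword list for content inspection

def blocked_by_ip(host):
    """IP-level rule: every page on every host at a listed address is refused."""
    return dns[host] in blocked_ips

def blocked_by_keyword(page_text):
    """Content-level rule: refuse only pages whose text contains a listed phrase."""
    text = page_text.lower()
    return any(phrase in text for phrase in forbidden_phrases)

# Hypothetical pages keyed by (host, path).
pages = {
    ("www.univ.example", "/physics"): "lecture notes on optics",
    ("www.univ.example", "/club"): "an essay containing a Forbidden Phrase",
    ("lab.univ.example", "/"): "laboratory opening hours",
    ("other.example", "/"): "weather report",
}

ip_blocked = sorted(p for p in pages if blocked_by_ip(p[0]))
keyword_blocked = sorted(p for p in pages if blocked_by_keyword(pages[p]))
```

In this model the IP rule refuses three pages, two of which contain nothing objectionable, while the keyword rule refuses only the one page that carries the phrase. The price is that the keyword rule must read the text of every page requested, which is consistent with the higher technology cost noted above.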
The Chinese government and associated network authorities are clearly continuing to experiment with different forms of blocking, indicating that -- unlike Saudi Arabia, which appears to have a single, declared method of blocking and a much more constant (and apparently smaller) list of non-sexually-explicit blocked sites -- Chinese network filtering is an important instrument of state Internet policy, and one to which significant technical and human resources continue to be devoted.
Additional details on data collection and interpretation are available in the technical appendix. The authors have also indexed related work by others.
The authors are grateful to Ronald F. Guilmette for assistance with locating proxy servers in China, to Joshua Rosenzweig of the Dui Hua Foundation for assistance in locating routing glitches, and to Nongji Zhang of the Harvard Law School Library for assistance with Chinese translations.
A version of this document was included in the March/April 2003 edition of IEEE Internet Computing.
* Jack N. and Lillian R. Berkman Assistant Professor of Entrepreneurial Legal Studies, Harvard Law School.
** J.D. Candidate, Harvard Law School, 2005.
Support for this project was provided by the Berkman Center for Internet & Society at Harvard Law School.
Last Updated: March 20, 2003