Empirical Analysis of Google SafeSearch
Benjamin Edelman - Berkman Center for Internet & Society - Harvard Law School
This research is part of a series of projects with Professor Jonathan Zittrain.


Abstract

Google offers interested users a version of its search engine restricted by a service it calls SafeSearch, intended to omit references to sites with "pornography and explicit sexual content." However, testing indicates that SafeSearch blocks at least tens of thousands of web pages without any sexually-explicit content, whether graphical or textual. Blocked results include sites operated by educational institutions, non-profits, news media, and national and local governments. Among searches on sensitive topics such as reproductive health, SafeSearch blocks results in a way that seems essentially random; it is difficult to construct a rational non-arbitrary basis for which pages are allowed and which are omitted. See highlights of pages omitted from SafeSearch seemingly inconsistent with SafeSearch's stated filtering policy.

 


Background

As Internet use increased over the past decade, Internet users have become increasingly concerned with unexpected and unintentional exposure to sexually-explicit content. Such exposure may result in part from ambiguous or misleading domain names, in part from domain names being converted to porn sites after domain expiration (ref: "Tina's Free Live Webcam"), and in part from mistyped domain names (ref: Zuccarini typographic variations).

Whatever the causes, Internet pornography has become a sufficiently serious problem to prompt three U.S. federal laws (ref: CDA, COPA, CIPA) and numerous U.S. state laws. Some countries, like Saudi Arabia, attempt to block Internet pornography at a national level (ref: Documentation of Internet Filtering in Saudi Arabia), and private parties have also attempted to address the problem through filters to be installed on individual computers or on central network infrastructure (ref: N2H2 and others). Common to these filtering efforts are serious questions of filtering accuracy, for filters have been found to prevent access to large bodies of content beyond the sexually-explicit materials intended to be blocked (ref: Sites Blocked by Internet Filtering Programs). Filters have therefore faced continued controversy and, when mandated by the U.S. government, even constitutional challenges (ref: ALA CIPA Archive).

Notwithstanding the difficulty of accurate filtering, search engines face pressure to be "family friendly" -- making their services more useful to children and to anyone who would prefer not to receive unexpected and undesired sexually-explicit content. Such filtering by a search engine is conceptually distinct from actual blocking by a program like N2H2 -- for while N2H2 and its kin affirmatively deny access to designated web sites, Google merely fails to list such sites in its results, though the affected sites remain accessible if specifically requested by a knowledgeable user. Nonetheless, the practical effect may be similar -- the failure of a user to reach, or even know about, certain kinds of content.

Facing calls for exclusion of sexually-explicit content from its search results, Google in 2000 implemented a feature called SafeSearch, intended to "eliminate ... sites that contain pornography and explicit sexual content" (ref: Google Help). Google's help site suggests that SafeSearch is primarily driven by automated systems -- computer classification of sites that are sexually-explicit versus not, in contrast to the human review purportedly used by companies like N2H2 (ref: N2H2's "Human Review Advantage"). For lack of human review of SafeSearch's filtering decisions, Google readily admits that its system is not completely accurate. But while a number of researchers have previously evaluated the accuracy of client-side and network-based filtering, the author knows of no independent investigation of Google's SafeSearch filtering. This project seeks to be a first step in such an evaluation -- raising issues for further study, discussion, and improvement, and bringing into focus the apparent overblocking by SafeSearch of at least tens of thousands of pages, and more likely hundreds of thousands or even millions, that do not meet SafeSearch's stated filtering criteria.

 

Affected Pages

This section provides a sampling of web pages (URLs) found to be omitted from Google searches using the SafeSearch feature.

Some of the URLs in the listings below might be construed as sexually explicit even though they serve other purposes -- for example, providing information about health or sexual education, or describing efforts to regulate pornography. However, most of the URLs listed seem to be misclassified by Google as sexually explicit. For example, it is unlikely that there is sexually-explicit content on thomas.loc.gov (the Library of Congress's index of federal legislation), pmo.gov.il (the Israeli Prime Minister's Office), nmsa.org (the National Middle School Association), or neu.edu (the main index page of Northeastern University), but all four are excluded from Google searches using SafeSearch (screenshots: thomas.loc.gov, pmo.gov.il, nmsa.org, neu.edu).

The results below reflect testing through the ordinary Google site and through Google with SafeSearch enabled, conducting searches for approximately 2,500 search terms chosen by the author, admittedly in ad hoc fashion. Ordinarily, SafeSearch results precisely match regular Google results. However, when SafeSearch blocks a URL, that URL appears in the ordinary results but is missing from the SafeSearch results. For each search term, this system therefore checked for URLs listed in ordinary results but absent from SafeSearch results; any such URLs are properly considered omitted from SafeSearch.
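The comparison described above amounts to a set difference between two ranked result lists. The following is a minimal sketch of that idea, not the author's actual test harness: the function name and the example.* URLs are hypothetical, and fetching the two result lists from Google is left out entirely.

```python
def omitted_from_safesearch(ordinary_results, safesearch_results):
    """Return (rank, url) pairs for URLs that appear in the ordinary results
    but are absent from the SafeSearch results.

    Ranks are 1-based positions in the ordinary (unfiltered) result list.
    """
    safe = set(safesearch_results)
    return [
        (rank, url)
        for rank, url in enumerate(ordinary_results, start=1)
        if url not in safe
    ]


# Hypothetical result lists for a single keyword.
ordinary = ["http://a.example/", "http://b.example/", "http://c.example/"]
safe = ["http://a.example/", "http://c.example/"]

print(omitted_from_safesearch(ordinary, safe))
```

Keeping the ordinary rank alongside each omitted URL makes it possible to report later whether an omission fell within the top 10 or top 100 results.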

The links below report a total of 15,796 distinct URLs found to be omitted from Google SafeSearch. In at least some instances, the entire corresponding site is excluded from SafeSearch. Reporting focuses on URLs likely to be wrongly omitted from SafeSearch, i.e. omitted in a manner inconsistent with the stated blocking criteria.

Omitted URLs: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z numbers (sorted by domain name)

A sample of omitted URLs is flagged in the more concise listing linked below. These URLs are among those likely to be of particular interest because they are produced by official government agencies or other well-known organizations or because it is particularly unlikely that they contain sexually-explicit content.

Omitted URLs - Highlights

The following links report keywords for which SafeSearch results omit URLs included in ordinary Google results.

Keywords with Omitted URLs: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

The following table reports keywords with SafeSearch omissions, grouping keywords by substantive categories. The right two columns report the proportion of keywords (within each grouping) for which SafeSearch results omit at least one of the top 10 and top 100 results, respectively, from an ordinary Google search on the corresponding keyword.

Proportion of keywords with at least one SafeSearch omission, by keyword group:

| Keyword group | Among top 10 results | Among top 100 results |
| --- | --- | --- |
| American newspapers of national scope (list from Newslink) | 54.2% | 100.0% |
| American presidents | 23.3% | 97.7% |
| American states and state capitals | 16.0% | 98.0% |
| Countries | 25.8% | 99.0% |
| Fortune 1000 companies (list) | 23.7% | 92.6% |
| Most selective American colleges and universities (list from Morgan Park Academy) | 20.0% | 96.0% |
| House and Senate committees (lists from House of Representatives and Senate) | 8.3% | 97.9% |
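The two proportion columns can be computed from the rank of the first omitted result for each keyword. The sketch below is illustrative only: the function name and input layout are assumptions, and apart from Iowa (whose first omission at rank 10 is reported in the analysis below) the sample ranks are hypothetical.

```python
def omission_rates(first_omitted_rank, cutoffs=(10, 100)):
    """Percentage of keywords with at least one omission within each cutoff.

    first_omitted_rank maps keyword -> 1-based rank (in ordinary results) of
    the first URL omitted by SafeSearch, or None if nothing was omitted.
    """
    total = len(first_omitted_rank)
    if total == 0:
        raise ValueError("no keywords supplied")
    rates = {}
    for cutoff in cutoffs:
        hits = sum(
            1
            for rank in first_omitted_rank.values()
            if rank is not None and rank <= cutoff
        )
        rates[cutoff] = round(100.0 * hits / total, 1)
    return rates


# Hypothetical sample; only the Iowa rank comes from the text.
sample = {"Iowa": 10, "Des Moines": 42, "Ohio": None, "Columbus": 7}
rates = omission_rates(sample)
print(rates)
```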

Screenshots preserve and display the omission of selected URLs from SafeSearch via side-by-side comparisons of SafeSearch results and ordinary Google search results.

Thomas - Legislative Information on the Internet (congress.gov and thomas.loc.gov)
National Middle School Association (nmsa.org)
Israeli Prime Minister's Office (pmo.gov.il)
Northeastern University (neu.edu)
Musicals Dot Net (musicals.net)
MP3.COM Easy Listening (mp3.com)

 

Submit a Keyword or URL for Testing

This site provides two separate methods for testing omission from Google SafeSearch. Use the "Type of test" drop-down box to choose between testing a single specified URL, or testing all pages resulting from a specified keyword search.

Note that uses of this system are logged for future study, analysis, and publication. The results of selected requests may be merged with the author's prior data and reported via the links above.


 

 

Analysis and Summary Statistics

Omission of web pages without obvious sexual content.

Testing indicates that SafeSearch causes the omission of numerous web pages with no apparent sexually-explicit content.

Omitted pages include US government sites (congress.gov, thomas.loc.gov, shuttle.nasa.gov), sites operated by other governments (Hong Kong Department of Justice, Canadian Northwest Territories Minister of Justice, Israeli Prime Minister's Office, Malaysian National Vocational Training Council), political sites (Vermont Republican Party, Stonewall Democrats of Austin, Texas), news reports (including New York Times articles about blogs, deflation, and US military strategy, as well as other articles published by the BBC, c|net news.com, the Washington Post, and Wired), educational institutions (a chemistry class at Middlebury College, Vietnam War materials at Berkeley, University of Baltimore Law School, Northeastern University), and religious sites (Biblical Studies Foundation, Modern Literal Bible, Kosher for Passover).

Of omitted sites without obvious sexual content, a few seem to be blocked based on ambiguous words in their titles (like Hardcore Visual Basic Programming), but most lack any indication as to the rationale for exclusion.

Omission of web pages affirmatively targeted at or helpful to children.

Of the omitted URLs without obvious sexual content, many are legitimately targeted at children. Such URLs include numerous entries from Grolier Encyclopedia as well as numerous primary and secondary schools.

Frequency of omission of useful content.

Users of SafeSearch are likely to face omissions in their search results frequently when conducting research in many fields unrelated to sex. The specific rate of omissions varies according to search genre, as detailed in the chart above of omission frequencies by keyword category.

For example, among searches for American states and state capitals, 16% of searches yielded at least one omission among the first ten results, and fully 98% had at least one omission within the first 100 results. Reviewing the specific URLs omitted from searches on states and state capitals, it is clear that the overwhelming majority of omitted URLs are not sexually explicit; for Iowa, for example, omitted URLs include the following: Virtual Hospital: University of Iowa Family Practice Handbook (#10), Iowa Geological Survey Bureau (#35), samuel.igsb.uiowa.edu/ (#64), Iowa Sex Offender Registry - Iowa Sex Offender Registry (#73), US District Court Northern District of Iowa (#95).

In addition, among top-10 results, rates of SafeSearch omission are higher in other fields tested (including colleges, countries, and newspapers) than among U.S. states and capitals.

Arbitrary listing and omission of sensitive web pages that might be construed as sexually explicit.

Among URLs that might be considered "borderline," Google SafeSearch seems to lack a principled or rational basis for allowing certain pages while blocking others. For example, six of the top ten "sexuality" results are blocked while four are allowed, and it is difficult to understand why the peer-reviewed Electronic Journal of Human Sexuality should be omitted while the Society for Scientific Study of Sexuality remains listed. Similarly, nine of the top ten results for an ordinary Google search on "pornography" are omitted from SafeSearch reporting; a National Academy of Sciences report on Internet pornography is listed while a book published by the National Academies Press on the same subject is omitted. A manual review of additional sensitive search results indicates that this apparent arbitrariness extends to a large number of search terms including searches about sexual health, pornography, and gay rights.

Omission of controversial web pages that are not sexually explicit.

SafeSearch omits certain other URLs that, while controversial, are not sexually explicit and do not seem to meet SafeSearch's stated blocking criteria ("explicit sexual content"). Such URLs include information about drugs (Drug Free America and White House Office of National Drug Control Policy, as well as sites advocating drug use), online gambling sites (1-800-Gambler, Online Gambling & Casino Guide, NCAA Gambling), and sites providing term papers or essays for download (Apex Term Papers, Term Paper Assistance). Also blocked are sites that discuss pornography (Adult Sites Against Pornography, Pittsburgh Coalition Against Pornography), efforts to restrict and filter pornography (including the author's prior work studying filtering in China and Saudi Arabia, as well as related work by the Censorware Project), and software programs that attempt to block pornography (N2H2 Internet Filtering, Websense).

Failure to exclude all explicit content.

Testing indicates that Google's SafeSearch fails to block all sexually-explicit content. For certain keywords, SafeSearch simply blocks results altogether -- for example, a SafeSearch query for 'playboy' yields the error message "No standard web pages containing all your search terms were found. Suggestions: Try different keywords" (sic). However, for most keywords used to seek out sexually explicit content, SafeSearch blocks some results while leaving others.

Considered under SafeSearch's stated blocking criteria ("explicit sexual content" -- whether graphical or textual), many of the results returned in response to searches such as "penis" and "condom" may be considered improperly allowed where they should have been blocked. SafeSearch also offers numerous sites with sexually-explicit content in response to searches that unambiguously seek such materials, even as the majority of sexually-explicit content does seem to be blocked.

Possible one-sided focus in SafeSearch tradeoffs, emphasizing avoiding underblocking rather than avoiding overblocking.

The author's prior research on Internet filtering (in the context of a lawsuit challenging Internet filtering in public libraries) indicates a tradeoff between overblocking (preventing access to content that does not meet filtering criteria) and underblocking (failing to prevent access to content that does meet filtering criteria). Research indicates that a given filtering system typically cannot reduce its overblocking rate without increasing its rate of underblocking, and vice versa.
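One way to picture this tradeoff is a filter that assigns each page a score and blocks pages at or above a threshold. This is an assumed model for illustration only; nothing in Google's documentation says SafeSearch works this way, and the scores and labels below are invented toy values rather than measurements.

```python
def block_rates(scored_pages, threshold):
    """Return (overblocking_rate, underblocking_rate) for a score threshold.

    scored_pages is a list of (score, is_explicit) pairs; a page is blocked
    when its score is at or above the threshold.
    """
    innocent = [score for score, explicit in scored_pages if not explicit]
    explicit = [score for score, is_explicit in scored_pages if is_explicit]
    over = sum(score >= threshold for score in innocent) / len(innocent)
    under = sum(score < threshold for score in explicit) / len(explicit)
    return over, under


# Toy data: four non-explicit pages and four explicit pages.
pages = [
    (0.1, False), (0.3, False), (0.6, False), (0.8, False),
    (0.4, True), (0.7, True), (0.9, True), (0.95, True),
]

for threshold in (0.35, 0.65, 0.85):
    print(threshold, block_rates(pages, threshold))
```

In this toy data, lowering the threshold removes underblocking only by blocking more non-explicit pages, and raising it does the reverse.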

Google's SafeSearch system seems to reflect a focus on avoiding underblocking (i.e. avoiding linking to sexually-explicit content) at the expense of increased overblocking (omitting listings of content that is not sexually-explicit). This focus is borne out in Google's SafeSearch documentation, which advises of a procedure for requesting that Google exclude a sexually-explicit site mistakenly included in SafeSearch results, but which fails to mention any procedure for requesting that SafeSearch include a non-explicit site mistakenly excluded. Of course, as described in the transparency section below, it is considerably more difficult for users to find out what SafeSearch excludes than what it includes -- for a SafeSearch user would ordinarily never see those sites that SafeSearch excludes.

Update, 4/10/03: SafeSearch documentation has been modified to add a procedure for reporting a site mistakenly omitted by SafeSearch.

SafeSearch's emphasis on avoiding underblocking is also borne out in research, described above, as to which sites are blocked and which are not -- i.e. the blocking of the overwhelming majority of sexually-explicit content along with a significant amount of non-sexual content.

SafeSearch's emphasis on avoiding underblocking, even at the expense of increased overblocking, is likely consistent with the stated preferences of those who use SafeSearch. However, such users may express this preference without a full understanding of the possible scope of overblocking. In future research, the author hopes to survey SafeSearch users (or would-be users) to determine their perceptions as to an acceptable level of overblocking in exchange for omission of all sexually explicit content.

Transparency of filtering implementation as to affected users.

When a user enables SafeSearch, an additional phrase at the top of every Google results page notes that SafeSearch is engaged and active. This transparency is particularly notable because certain other filtering systems affirmatively avoid reporting their existence or effect to users (ref: Op-ed by the author as to lack of transparency in filtering systems used in China). In this sense, then, Google's SafeSearch is significantly more transparent than other filtering systems.

But in another sense, Google's SafeSearch is less transparent: While traditional filtering systems like N2H2 provide error pages when a user requests a blocked site, Google SafeSearch simply moves up the next-highest ranked site. (If the ordinary number two entry for a given search is blocked, then the third site takes position two, four becomes three, and so forth.) A user is never notified of this specific invocation of SafeSearch exclusions. A more transparent implementation would report the extent of SafeSearch invocation and could even offer an immediate option to bypass the filtering ("Three results were excluded by SafeSearch. Click here to view the full results.").
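The more transparent behavior suggested above could be sketched as follows. The function name and the is_blocked callback are hypothetical stand-ins; how SafeSearch actually decides what to block is not known.

```python
def filter_with_notice(ranked_results, is_blocked):
    """Drop blocked results (lower-ranked results move up) and build a notice.

    Returns (shown_results, notice), where notice is None if nothing was
    excluded.
    """
    shown = [url for url in ranked_results if not is_blocked(url)]
    excluded = len(ranked_results) - len(shown)
    notice = None
    if excluded:
        noun = "result was" if excluded == 1 else "results were"
        notice = (
            f"{excluded} {noun} excluded by SafeSearch. "
            "Click here to view the full results."
        )
    return shown, notice


# Hypothetical block list for illustration.
blocked = {"http://b.example/"}
results = ["http://a.example/", "http://b.example/", "http://c.example/"]
shown, notice = filter_with_notice(results, lambda url: url in blocked)
print(shown)
print(notice)
```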

Transparency of filtering implementation as to affected sites.

Like commercial Internet filtering companies, Google does not warn affected sites of their exclusion from SafeSearch, nor offer such sites any opportunity to dispute SafeSearch classification status.

In addition, while commercial filtering companies typically offer interested sites the ability to check whether their URLs are blocked by companies' filtering software (ref: N2H2 URL Checker, SmartFilterWhere), Google offers no such system. Commercial filtering companies' site checkers in many ways fall short of content providers' desires -- conducting testing only on a URL-by-URL basis, not allowing inspection of the entire block list; failing to provide notification of updates or changes; failing to provide a designation of the scope, reason, or duration of blocking. But the complete absence of such a testing system for SafeSearch leaves affected web sites even worse off than under the simple testing systems typically provided by commercial Internet filtering companies.

Effect of robots.txt and Google caching failures on SafeSearch listings.

Empirical analysis and comments from Google staff indicate that when Google cannot or fails to retain a cached copy of a web page, SafeSearch will omit that page from its results. Such cache failures can result from at least four distinct factors:

Robots.txt. Some web sites use a configuration file ("robots.txt") to instruct web "bots" (used by search engines) not to visit the site. Under these circumstances, the site may nonetheless remain in ordinary Google listings, but Google will fail to cache it and its pages will therefore disappear from SafeSearch. This result is surprising for at least two reasons:

First, if a robots.txt file instructs web bots not to visit a site, it is unclear how Google came to index that site in the first place. Google's documentation indicates that the company supports and abides by robots.txt, so if a site uses a robots.txt file to exclude systems like Google's, it is unclear why the site would nonetheless remain listed in Google.

Update (4/10/03): Google staff report that sites with strict robots.txt files may nonetheless remain in Google thanks to references and descriptions in links from other sites and/or in the Google Directory / Mozilla Open Directory.

Second, Google's documentation of SafeSearch and its instructions to webmasters lack any mention that certain robots.txt files will cause a site to be omitted from SafeSearch results even as it remains listed in ordinary Google results. Since an affected site remains listed in the ordinary Google search results, its designers are likely to think that there is no problem -- that the site is correctly and fully indexed by Google -- even as the site remains omitted from SafeSearch.

Other reasons for caching failure. Google staff report that Google may also decline to cache a page or be unable to cache a page when Google fails to crawl that page due to low pagerank, when the page's web server is unreachable, when the page bears a "noindex" instruction in its meta tags, when the page is of 0 bytes in length, or when it redirects users to another page or site.
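A webmaster concerned about the robots.txt interaction described above can check what a given robots.txt tells a crawler using Python's standard urllib.robotparser module. This is a minimal sketch: the helper name and example URLs are hypothetical, and the "Googlebot" user-agent token is an assumption not taken from the text. A False result means only that the crawler is asked to stay away; the omission from SafeSearch that may follow is the behavior reported in this study, not something the parser can confirm.

```python
from urllib.robotparser import RobotFileParser


def crawler_may_fetch(robots_txt, user_agent, url):
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A strict file that excludes all bots, and a permissive one that excludes none.
STRICT = "User-agent: *\nDisallow: /\n"
PERMISSIVE = "User-agent: *\nDisallow:\n"

page = "http://www.example.org/index.html"
print(crawler_may_fetch(STRICT, "Googlebot", page))
print(crawler_may_fetch(PERMISSIVE, "Googlebot", page))
```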

To assess the rate at which caching failures cause omissions from SafeSearch, the author determined which of the web pages listed above are included in Google's cache. Of the URLs listed, approximately 26.7% were not included in Google's cache. These URLs are designated in web page result listings with the wording "Site may be omitted due to caching failure by Google, redirect, or 0-byte response."
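Several of the caching-failure conditions reported by Google staff can be detected from a page's HTTP response alone. The sketch below is a hypothetical helper, not the author's tooling: it takes an already-fetched status code and body, and it cannot detect a failure to crawl due to low pagerank, which is internal to Google.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Record whether a <meta name="robots"> tag contains "noindex"."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            if "noindex" in attr.get("content", "").lower():
                self.noindex = True


def cache_failure_reasons(status, body, reachable=True):
    """List observable reasons a page might be missing from a cache."""
    if not reachable:
        return ["server unreachable"]
    reasons = []
    if 300 <= status < 400:
        reasons.append("redirect")
    if len(body) == 0:
        reasons.append("0-byte response")
    else:
        parser = _RobotsMetaParser()
        parser.feed(body)
        if parser.noindex:
            reasons.append("noindex meta tag")
    return reasons


html = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
print(cache_failure_reasons(200, html))
print(cache_failure_reasons(302, ""))
```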

SafeSearch's requirement of cached copies of the pages at issue likely results from SafeSearch's need to analyze content on that site in order to determine, to the best of its ability, whether the site should be blocked by SafeSearch. But this requirement is both undocumented and unrelated to the reason why users rely on SafeSearch. Because users typically depend on SafeSearch for accurate and comprehensive results, web pages are listed in the results linked above even if their omission from SafeSearch results from, for example, a robots.txt configuration rather than from other errors made by SafeSearch.

Scope of testing and levels of SafeSearch configuration.

Analysis in this project considers only the "Filter Using SafeSearch" option offered in Google's ordinary Advanced Search form, a feature which is not engaged by default.

Google's Global Preferences dialog box offers an additional level of filtering, namely filtering that purports to filter explicit images but not explicit text, which is Google's default setting for users who do not specifically request an alternative. However, enabling this "moderate filtering ... images only" mode had no apparent effect even on searches for sexually-explicit content. For example, the top ten results for "playboy" were identical, and included numerous sites with extensive sexually-explicit images, whether SafeSearch was disabled or was set to "moderate" (see screenshot).

Scope of omissions.

When a result is omitted from Google SafeSearch, the omission sometimes reflects omission of the entire domain name hosting that result, as if SafeSearch had deemed all content on that host to be pornographic. In other instances, omission reflects apparent targeting of a specific page, while other pages on its web host remain listed in SafeSearch and sometimes even in the search results at issue. When SafeSearch mistakenly omits a page from a given host but includes other results from that host, the omission is in a certain sense less serious -- for an interested user might click through the other results and thereby reach the page wrongly omitted. However, the initial exclusion nonetheless reduces the quality of results provided to the user -- denying the user direct access to the result that the ordinary Google considered most relevant to the search at issue.
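The distinction between host-wide and page-level omission can be approximated by checking whether any URL from the same host still appears in SafeSearch results. This is a rough sketch with hypothetical names and example URLs. A host with no surviving listings in the sampled results is only possibly excluded as a whole, since the sample may simply not include its other pages.

```python
from urllib.parse import urlsplit


def omission_scope(omitted_urls, safesearch_urls):
    """Classify the host of each omitted URL.

    A host is "page-level" if at least one of its URLs is still listed in the
    SafeSearch results, and "possibly host-wide" otherwise.
    """
    listed_hosts = {urlsplit(url).hostname for url in safesearch_urls}
    scope = {}
    for url in omitted_urls:
        host = urlsplit(url).hostname
        scope[host] = "page-level" if host in listed_hosts else "possibly host-wide"
    return scope


# Hypothetical URLs for illustration.
omitted = ["http://www.example.edu/chem/", "http://news.example.com/story1"]
listed = ["http://news.example.com/story2"]
scope = omission_scope(omitted, listed)
print(scope)
```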

 

Conclusions

Google does not claim perfection of SafeSearch, instead commenting that "no filter is 100% accurate" (ref: Google Help). But in making SafeSearch available to the public, with the knowledge that many thousands of Internet users will rely on it day in and day out, Google puts its name and reputation behind this specific implementation of content filtering. Given Google's large market share among search engines, its perceived leadership position, and its vague statements to date as to the specific methods used and the specific level of accuracy sought or achieved, evaluation of the accuracy of SafeSearch seems particularly desirable.

Initial research indicates cause for concern via the omission from SafeSearch of substantial bodies of valuable web content -- such that ordinary Internet users relying on SafeSearch will systematically miss relevant web sites due to SafeSearch's errors. Further research will attempt to better quantify the scope of these errors. For example, future research might compare SafeSearch's erroneous omissions to the rate at which sites are unreachable due to random network-level or server-level failures. Research might also compare the frequency of SafeSearch's errors to the rate of overblocking by commercial blocking programs or to the proportion of content filtered by countries that seek to limit Internet access.

Some omissions from SafeSearch may result from unanticipated and undocumented interaction between Google's policies and web publishers' configurations (e.g. robots.txt). Additional documentation by Google may help address these problems. Google might inform users in greater detail as to the level of accuracy they can expect from SafeSearch and as to the kinds of omissions likely to be made by SafeSearch. Google might also inform webmasters as to steps they can take to assist with the proper categorization of their content.

SafeSearch's errors confirm the author's sense, reflected in prior research, that accurate Internet filtering is an extraordinarily difficult task still well beyond the reach of current algorithms and methods.

 

Support and Extend This Work

Partial support for this project was provided by the Berkman Center for Internet & Society at Harvard Law School. Additional financial support will assure the continuation of this and related projects. Please contact the author with suggestions.

This project developed in part from discussions with Jay Bregman of the Oxford Internet Institute.

 


Last Updated: April 14, 2003