Google Web Toolbar Data: Applications for Herdict

Introduction

This spring, Google provided to Herdict a collection of anonymized Google Web Toolbar usage metrics. This data spanned from mid-November 2011 to May 1, 2012. The data Google provided included the toolbar's records of aggregate visits, DNS errors, connection errors, and 404 errors sorted by domain and country, and binned by minute, hour, and day when each bin was valued at greater than 125.

This document summarizes our work in interpreting this data. Although Google informed us that the data was not useful to researchers at Georgia Tech, our experience was quite different. We found the data to be extremely useful for identifying both site outages and non-outage-related changes in site traffic. We believe that should Google resume collection of this data, even if it is provided only on a weekly basis, it would be incredibly beneficial to Herdict and to Internet-health research more generally.

Due to data collection problems inherent in the first two-thirds of the data, we focused our analysis on only the 48-day period from mid-March to the first of May 2012. Additionally, we used only the data bucketed by minute, as this data appeared the most regular.

Our objective for this data was to create a system for identifying anomalies--deviations in traffic or errors from what would be expected. To do this, we developed an algorithm in the R language that uses trends to estimate expected data and identifies when the actual data exhibits statistically significant deviations. We then manually reviewed these anomalies and located external sources that corroborated that we were identifying actual events and not just aberrations in the data collection. Next, we classified corroborated anomalies in order to build a rough anomaly profile, which aids in subsequent identification.

This document highlights many of these anomalies. Every single anomaly listed below was automatically identified by our algorithm. We have included only a selection of anomalies, some of which we have corroborated and some of which we have not. The fact that some anomalies have not been corroborated is not a failing of the algorithm; instead, it highlights our ability to identify short outages that end before they reach a significant level of public consciousness. The data we have selected highlights the validity, value, and flexibility of our approach through the inclusion of a broad sample of anomaly patterns and the various events they represent.

How to Read the Graphs

The graphs below typically plot one of the metrics that Google provided (e.g., visits, 404s, connection errors, or DNS errors). On all graphs, black lines represent the raw time-series data. Our algorithm adds in blue and red highlights, indicating areas in which the data deviates from expectations. Blue indicates areas in which the numbers are abnormally low (e.g., a drop in visits), and red where they are abnormally high (e.g., an increase in DNS errors). The more the data deviates from expected values, the more intense the color coding.

In the data, Google used country code "ZZ" as a placeholder when the country of request origin is unknown. Similarly, Google used "UNKNOWN" when the requested domain is unknown (and therefore represents an aggregation of requests to multiple domains).

Table of Contents

  1. Introduction

  2. How to Read the Graphs

  3. Table of Contents

  4. Results

    1. Site Disruptions

      1. Facebook Down in India

      2. Sahibinden Down in Turkey

      3. The Pirate Bay Down in Multiple Locales

      4. Wer Kennt Wen Down in Germany

    2. Network Disruptions

      1. Turk Telekom Disruption

      2. Russian FastVPS DNS DoS Disruption

      3. Anonymous-related Disruptions in Eastern Europe

      4. Unknown Trinidad and Tobago Disruption

    3. Persistent Issues

      1. Facebook Vietnam

      2. Issues for Unreported Domains

    4. National Holidays

      1. Good Friday and Family Day in South Africa

      2. Russian Spring and Labor Day (and the extra work day to make up for it)

      3. The Combined Holidays of Women's Day and Children's Day in Taiwan

    5. National Events

      1. Large Storm Hits Japan

      2. Picasa Spike

      3. US Republican Primary

    6. Data Collection Errors

      1. April 9

  5. Procedure

  6. Further Work & Conclusion

Results

Below we review selected anomalies and, where applicable, provide links to external sources that corroborate these anomalies, linking the events that our algorithm detected with real world network disruptions. To see a complete list of discovered anomalies, please see this file.

Site Disruptions

Herdict requested this data from Google in the belief that browser data would provide a critical source of additional information regarding site outages, one that could be used to corroborate or direct the crowdsourced reporting on Herdict. Our work confirms that toolbar data could be a tremendous asset for Herdict in this respect.

Our algorithm was successful at identifying both significant site outages that attracted media and other public attention and smaller disruptions. For any given domain, we looked for anomalous drops in visits that corresponded to an anomalous increase in any of the errors. Using this approach, we identified more than 20 likely site disruptions within the 48-day period we analyzed.
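The co-occurrence test described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the analysis. The function name, the data layout (sets of anomalous hour indices), and the sample values are all assumptions made for the example.

```python
from typing import Dict, Set


def likely_site_disruptions(
    low_visit_hours: Set[int],
    high_error_hours: Dict[str, Set[int]],
) -> Dict[int, Set[str]]:
    """Map each hour with anomalously low visits to the error types that
    were anomalously high during that same hour (hypothetical layout)."""
    hits: Dict[int, Set[str]] = {}
    for error_type, hours in high_error_hours.items():
        # Keep only the hours in which both anomalies co-occur.
        for hour in low_visit_hours & hours:
            hits.setdefault(hour, set()).add(error_type)
    return hits


# Made-up hour indices for one domain-country pair.
low_visits = {10, 11, 40}
high_errors = {"dns": {10, 11, 12}, "connection": {11}, "404": {75}}

disruptions = likely_site_disruptions(low_visits, high_errors)
print(disruptions)
```

In this example, hour 40 is not reported because the drop in visits there has no matching rise in errors, and hour 75 is not reported because the rise in 404 errors has no matching drop in visits.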

Facebook Down in India

At the beginning of our targeted time window, we saw Facebook become largely inaccessible in India. The Indian tech blog Gizbot corroborated the outage.

DNS errors anomalously high while visits are anomalously low

Sahibinden Down in Turkey

We detected anomalous traffic patterns to www.sahibinden.com (a large Turkish classifieds site). The site itself confirmed the outage.

Difficulty accessing www.sahibinden.com from Turkey

The Pirate Bay Down in Multiple Locales

We saw the Pirate Bay (a torrent search engine) go down for an extended period of time. This appears to be a common occurrence, and this particular event is noted in a post on TorrentFreak.

Two outage events, only one of which is highlighted in the visit data (due to model tuning)

Wer Kennt Wen Down in Germany

We detected an outage for the German social networking site Wer Kennt Wen, which was confirmed by the (now defunct) availability tracker DownScount.

A short, not immediately visually obvious outage

Network Disruptions

As opposed to site disruptions, which affect only a single site, network disruptions are more systemic and can impact an entire ISP or even a country. As a result, we can identify network disruptions by looking for times in which a number of site disruptions co-occur or when multiple sites demonstrate unusual levels of errors. During our limited data window, network disruptions occurred less often than site disruptions.
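One way to express this grouping is sketched below in Python (not the R code used for the analysis). The function name, the tuple layout, and the cutoff of three domains are assumptions chosen for illustration; the report does not state a specific cutoff.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def likely_network_disruptions(
    site_disruptions: Iterable[Tuple[str, str, int]],
    min_domains: int = 3,  # assumed cutoff, not taken from the report
) -> Dict[Tuple[str, int], Set[str]]:
    """Group (country, domain, hour) site disruptions by (country, hour) and
    keep the groups in which at least `min_domains` domains are affected."""
    by_country_hour: Dict[Tuple[str, int], Set[str]] = defaultdict(set)
    for country, domain, hour in site_disruptions:
        by_country_hour[(country, hour)].add(domain)
    return {
        key: domains
        for key, domains in by_country_hour.items()
        if len(domains) >= min_domains
    }


# Made-up input: three domains disrupted in the same hour in one country,
# and one isolated site disruption in another country.
events = [
    ("TR", "www.facebook.com", 100),
    ("TR", "www.google.com", 100),
    ("TR", "www.youtube.com", 100),
    ("IN", "www.facebook.com", 5),
]
network_events = likely_network_disruptions(events)
print(network_events)
```

The isolated event is left out of the result, which matches the distinction drawn above between a single-site problem and a systemic one.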

Turk Telekom Disruption

The Turkish ISP Turk Telekom had a short issue on April 27 that affected a large number of users. It's unclear whether this was a technical issue or an attack, but news reports confirm that the event occurred, and it is additionally corroborated by Google's own transparency report data, which shows a drop-off at the same time.

DNS errors increasing and visits decreasing simultaneously for www.facebook.com
DNS errors increasing and visits decreasing simultaneously for www.google.com
DNS errors increasing and visits decreasing simultaneously for www.youtube.com

Russian FastVPS DNS DoS Disruption

Requests for unknown domains from Russian users yielded an anomalously high number of DNS errors on March 29. The likely cause was that the DNS servers of one of Russia's large VPS providers were under a DoS attack during this period.

A large spike in DNS errors for requests to unknown URLs

Anonymous-related Disruptions in Eastern Europe

We detected similar anomalies in DNS error data from several different countries in eastern Europe. It is possible that these errors are related to DNS attacks that the online group "Anonymous" had planned for this same period. It was previously impossible to confirm the effectiveness of these attacks; this data is significant in that it provides a new way to measure the impact of the attacks, and it suggests that they did in fact have noticeable and identifiable repercussions.

Three days of increased DNS errors in response to requests from the Czech Republic
Three days of increased DNS errors in response to requests from Romania
Three days of increased DNS errors in response to requests from Austria
Three days of increased DNS errors in response to requests from Poland

Unknown Trinidad and Tobago Disruption

A disruption was seen in Trinidad and Tobago with a profile similar to that of Turkey's outage (described above). We could not corroborate this event.

Increased DNS errors in response to requests for unknown domains from Trinidad and Tobago
Increased DNS errors in response to requests for www.facebook.com from Trinidad and Tobago

Persistent Issues

Our algorithm can clearly detect site and network disruptions because they correlate with significant changes in traffic and errors over a period of minutes or hours. In places with persistent issues, our algorithm would not identify anomalies because a lack of access is the norm, not a change. However, by adjusting our approach, this data allows us to identify places with persistent issues, even though there is not a dramatic change in the data. This is important because it allows us to create a fingerprint of filtering or connectivity issues and provides data to support theories about where there is persistent filtering.

To find persistent issues, we ranked site-country pairs by their ratio of average errors per minute to average visits per minute. The rankings varied slightly depending upon which error we used in the ratio (DNS, connection, or 404 errors); however, countries strongly suspected of ongoing filtering (Vietnam, Malaysia, Indonesia, Pakistan, and the Philippines) were always found at the top of the list. Thus, we have identified a visual fingerprint for persistent filtering.
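The ranking can be sketched as below. This is an illustrative Python version, not the R code used for the analysis. The function name, the input layout, and all of the numbers are made up for the example.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (domain, country)


def rank_by_error_ratio(
    series: Dict[Pair, Tuple[Sequence[float], Sequence[float]]],
) -> List[Tuple[Pair, float]]:
    """Rank domain-country pairs by (average errors per minute) divided by
    (average visits per minute), highest ratio first."""
    ratios = []
    for pair, (visits, errors) in series.items():
        avg_visits = mean(visits)
        if avg_visits == 0:
            continue  # ratio undefined when there are no visits
        ratios.append((mean(errors) / avg_visits, pair))
    ratios.sort(reverse=True)
    return [(pair, ratio) for ratio, pair in ratios]


# Made-up per-minute counts: (visits, errors of one type).
sample = {
    ("www.facebook.com", "VN"): ([200, 220, 180], [300, 310, 290]),
    ("www.facebook.com", "US"): ([1000, 1000, 1000], [10, 20, 30]),
    ("UNKNOWN", "PH"): ([400, 400, 400], [100, 100, 100]),
}
ranking = rank_by_error_ratio(sample)
for pair, ratio in ranking:
    print(pair, round(ratio, 3))
```

Running the same ranking once per error type (DNS, connection, 404) would show how stable the top of the list is across the three ratios.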

Facebook Vietnam

Access to Facebook from Vietnam had by far the highest ratio of errors to visits. This correlates rather well with observations on the ground. The images below compare a relatively "healthy" national network (the US) with ones that use filtering or have other persistent issues; the difference is immediately apparent.

Visit and error data for Facebook requests from the US vs. requests from Vietnam

Issues for Unreported Domains

The other domain-country pairs with the highest error-to-visit ratios were requests for unreported/unknown domains. These ratios were far lower than that seen above (where errors actually outpaced successful visits), but they were still far higher than that of requests for unknown domains from the United States.

Visit and error data for requests for unreported domains from the US vs. requests from the Philippines
Visit and error data for requests for unreported domains from the US vs. requests from Indonesia
Visit and error data for requests for unreported domains from the US vs. requests from Malaysia
Visit and error data for requests for unreported domains from the US vs. requests from Pakistan

National Holidays

Although the data has significant potential for identifying site and network outages, it has potential utility beyond just that. For instance, the data sheds light on how people use the Internet. We can see how national holidays affect web behavior by identifying times when the number of visits increases or decreases simultaneously across a number of different URLs. Financial websites, in particular, appear to be a good indicator of holidays. By corroborating such events, we can confirm that the algorithm is identifying actual real-world events and does not reflect artifacts of collection.

Our method identified dozens of these anomalies and most were externally corroborated. We have provided the most visually obvious or interesting examples.

Good Friday and Family Day in South Africa

South Africa had an extra long weekend in early April, which led to a drop in traffic to most sites accessed from the country. Google identified a similar pattern.

Anomalously low visit numbers on Friday and Monday for unknown domains
Anomalously low visit numbers on Friday and Monday for www.facebook.com
Anomalously low visit numbers on Friday and Monday for www.google.co.za

Russian Spring and Labor Day (and the extra work day to make up for it)

Russia celebrates Spring and Labor Day on May 1 (which is not included in our data set), but this year April 30 was also given as a holiday. To make up for the extra day off, the Saturday before was a work day.

The extra work day and the public holiday in e.mail.ru visit data
The extra work day and the public holiday in mail.yandex.ru visit data

The Combined Holidays of Women's Day and Children's Day in Taiwan

Markets were closed and most people were off work on April 4 in Taiwan for Women's Day and Children's Day. What is interesting about this data is that certain sites experienced lower than normal traffic (Yahoo! Finance and Yahoo!'s general site) while more recreational sites (YouTube and Facebook) experienced higher than normal traffic.

Very low visits to Taiwan's version of Yahoo finance
Low visits to Taiwan's version of Yahoo
A co-occurring increase in visits to Facebook from Taiwan
A co-occurring increase in visits to YouTube from Taiwan

National Events

Our algorithm can identify events of national importance. We can identify these events by looking for anomalous traffic increases that occur for only a single domain or a small number of domains within a country. Our algorithm detected hundreds of these events, but they are difficult to corroborate, in part because of language issues.

Large Storm Hits Japan

Japan saw its largest storm since 1959 on April 3. Visits to transportation and weather sites increased, as well as visits to online social platform Pigg (perhaps suggesting some people took the day off from work).

A big jump in visits to Yahoo Weather Japan
A big jump in visits to Yahoo Japan's travel planning service
A moderate increase in visits to social platform Pigg

Picasa Spike

Visits to Picasa saw an unusual spike toward the end of April. While this doesn't appear to correspond with any real-world events, Google reports an almost 80-fold increase in requests from Malaysia during the same period.

Large increase in Picasa requests from unknown countries

US Republican Primary

On April 24, the website for US TV network MSNBC saw an increase in traffic from the United States. This corresponds with the US Republican primary.

A notable increase in visits to www.msnbc.msn.com

Data Collection Errors

The most common anomaly by far occurred on April 9 around 20:00 GMT. This anomaly was detected in different request results, across different domains, and for different countries. As it was so widespread, we believe this to be a data collection error, and as such, April 9 was discarded from most of our analysis.

April 9

Visits to unreported domains from the US drop off completely
Visits to www.facebook.com from unreported countries drop off completely
Visits to unreported domains from Brazil drop off completely
DNS errors for unreported domains from Mexico drop off completely

Procedure

  1. Identify good data.

    We chose March 13, 2012 00:00 to May 1, 2012 05:00 as the time window with the best data and used only the minute-by-minute data from this window.

  2. Trim the fat.

    We took the top 1000 sites by total visits as those sites had the densest data coverage.

  3. Clean up the data.

    We filled in zeros in the data with either 0 or 124 depending upon the application.

  4. Regularize the data.

    We needed a dataset with a consistent sample rate, so we filled in values for missing minutes using linear interpolation.

  5. Decompose the data.

    We used a number of different methods for decomposing the data (ARMA, auto-fitting ARIMA) before deciding upon the simple STL (Seasonal-Trend decomposition by Loess) algorithm. This algorithm is fast, conceptually much simpler than other methods, and works. We assumed a periodicity of 1 week. This process turned the one time series into three: a seasonal component (the weekly pattern), a trend component, and a remainder. We were interested in getting the remainder, as this contains deviations for which the seasonal and trend components cannot account. The first four graphs in Figure 18 below are the output of this algorithm.

  6. Smooth the data.

    We took the remainder component of the above process and binned it into one hour segments, and took the mean of each of those bins. This is the fifth panel in Figure 18.

  7. Filter the data.

    We removed any bins from the above process whose mean deviated from zero by less than a certain threshold (3 times the standard deviation of the time series). The threshold and discarded data are represented graphically in Figure 18 by red lines and grayed-out line segments, respectively.

  8. Output the data.

    Each of the remaining bins was considered significantly anomalous. These were output as both CSV and PNG (as graphs). See the last panel of Figure 18. The line matches that of the first panel, but anomalous highlights have been added.

  9. Identify patterns in the data.

    For each anomaly, we knew whether it was higher than expected or lower than expected and the request result time series in which it occurred (successful visits, connection errors, DNS errors, not found errors). Using these traits, we could roughly categorize anomalies:

    • Site Outage: lower-than-expected visits that co-occurred with higher-than-expected errors for a single domain from a single country typically indicated a site outage.
    • ISP/Cable Disruption: lower-than-expected visits that co-occurred with higher-than-expected errors for multiple domains from a single country typically indicated an ISP issue or cable breakage.
    • National Holiday: higher-or-lower-than-expected visits for an extended period of time (6 or more hours) that did not co-occur with anomalous error rates and co-occurred with similar anomalies for different domains within the same country typically indicated a national public holiday.
    • National Event: anomalies that involved higher-than-expected visits to specific sites or sites that fall within the same category (weather, sports, finance, etc.) typically indicated a national event, such as severe weather, an election, or a major sporting event.
    • Data Collection Error: Extreme anomalousness that occurred across multiple domains and countries typically indicated what we believe to be data collection errors.
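The rough categories above can be written as a small set of rules. The sketch below is in Python rather than the R used for the analysis. The field names, the rule order, and the cutoffs for "many" domains and countries are assumptions for illustration; only the 6-hour holiday duration comes from the list above.

```python
from dataclasses import dataclass


@dataclass
class Anomaly:
    low_visits: bool    # visits lower than expected
    high_visits: bool   # visits higher than expected
    high_errors: bool   # any error series higher than expected
    domains: int        # domains showing the anomaly in the country
    countries: int      # countries showing the anomaly at the same time
    hours: int          # duration of the anomaly


def categorize(a: Anomaly, many_domains: int = 3, many_countries: int = 5) -> str:
    """Apply the rough rules from the list above (cutoffs are assumed)."""
    if a.domains >= many_domains and a.countries >= many_countries:
        return "data collection error"
    if a.low_visits and a.high_errors:
        return "site outage" if a.domains == 1 else "ISP/cable disruption"
    visits_anomalous = a.low_visits or a.high_visits
    if visits_anomalous and not a.high_errors and a.hours >= 6 and a.domains >= 2:
        return "national holiday"
    if a.high_visits and not a.high_errors:
        return "national event"
    return "uncategorized"


print(categorize(Anomaly(True, False, True, domains=1, countries=1, hours=2)))
print(categorize(Anomaly(True, False, False, domains=3, countries=1, hours=24)))
```

Because the rules are only rough, an anomaly that matches none of them is returned as "uncategorized" rather than forced into a category.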
The process of detecting anomalies in visits to thepiratebay.se from unknown countries
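Steps 4 through 7 of the procedure can be sketched end to end as follows. This is an illustrative Python version, not the R code used for the analysis, and it makes several assumptions that should be read as such. It swaps in a simple classical decomposition (moving-average trend plus per-phase seasonal mean) as a stand-in for STL, it measures the threshold against the standard deviation of the remainder, and it compares the absolute value of each bin mean to that threshold. The function names and the toy series are hypothetical. The toy series uses a period of 24 samples and bins of 4 samples so that it stays small; the report used a one-week period and one-hour bins on minute data.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple


def interpolate(values: List[Optional[float]]) -> List[float]:
    """Step 4: fill missing samples (None) by linear interpolation."""
    known = [i for i, v in enumerate(values) if v is not None]
    out = list(values)
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    # Gaps at either end have only one neighbour, so hold its value.
    for i in range(known[0]):
        out[i] = values[known[0]]
    for i in range(known[-1] + 1, len(out)):
        out[i] = values[known[-1]]
    return out


def remainder(values: List[float], period: int) -> List[float]:
    """Step 5 (simplified stand-in for STL): subtract a moving-average trend
    and a per-phase seasonal mean, returning what is left over."""
    n, half = len(values), period // 2
    trend = [mean(values[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]
    detrended = [v - t for v, t in zip(values, trend)]
    seasonal = [mean(detrended[p::period]) for p in range(period)]
    return [d - seasonal[i % period] for i, d in enumerate(detrended)]


def anomalous_bins(rem: List[float], bin_size: int, k: float = 3.0) -> List[Tuple[int, float]]:
    """Steps 6 and 7: average the remainder per bin, then keep the bins whose
    mean is at least k standard deviations (of the remainder) from zero."""
    threshold = k * pstdev(rem)
    bins = [mean(rem[i:i + bin_size]) for i in range(0, len(rem) - bin_size + 1, bin_size)]
    return [(i, m) for i, m in enumerate(bins) if abs(m) >= threshold]


# Toy series: 8 periods of 24 samples, one missing sample, one short outage.
raw: List[Optional[float]] = [1000 + 10 * (i % 24) for i in range(192)]
raw[50] = None                 # missing sample to be interpolated
for i in range(100, 104):      # simulated outage: visits drop to zero
    raw[i] = 0

filled = interpolate(raw)
flagged = anomalous_bins(remainder(filled, period=24), bin_size=4)
print(flagged)
```

On the toy series only the bin that contains the simulated outage is flagged, and its mean is negative, which corresponds to a blue (abnormally low) highlight in the graphs. The moving-average trend here is written for clarity rather than speed, so a real minute-level series with a one-week period would call for a proper STL implementation.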

Further Work & Conclusion

We are excited about what we identified in less than 50 days' worth of usable data. We have created an algorithm that we believe can identify significant anomalies with a high degree of accuracy. As the examples above demonstrate, we have tied much of what our algorithm detected to real-world events, confirming its utility.

Having additional data would allow us to further refine our approach and ensure its accuracy. 40+ days' worth of data is a relatively small sample. With a larger data set we could see if the patterns we've identified hold true over an extended period of time. Moreover, we could test whether the algorithm picks up known outages by looking for anomalies where there has been a significant filtering event (for example, in Egypt, Libya, or Pakistan in the period following the "Innocence of Muslims" video). If the algorithm matched known events, it would further confirm its effectiveness.

This data would be extremely useful for Herdict, even if it were not collected live. If Google could provide this data on a daily or even weekly basis, it could still help improve Herdict's ability to detect outages. While many Herdict users search for real-time data, many also look for historical trends. For the latter category of users, supplementing our crowdsourced data with detected anomalies would help confirm (or disprove) our crowd-reported data, providing a necessary check on the variability of data from an unpredictable source. Moreover, this data would allow us to better tailor the queues of sites that we ask our users to test. Thus, even if not reported live, this data could make Herdict far more useful and responsive.

Additionally, this data can help provide quantitative confirmation for places suspected of having persistent filtering and connection issues. As noted above, we've been able to construct a visual fingerprint for those kinds of issues, which provides objective confirmation for what has previously been a largely subjective and somewhat abstract determination. Of perhaps even greater importance, by tracking such fingerprints over an extended period of time, this data could be a way of identifying when countries make changes to their filtering regimes, something that has been challenging to identify up until this point.

Outside of the filtering and Herdict contexts, this data can provide unique insights into how people use the Internet. As we noted above, our algorithm can detect national holidays and when people are concerned about an incoming storm. Although such information is not directly relevant to Herdict's mission, it could be extremely useful for research projects (such as Berkman's Internet Monitor project) that are looking into quantification of how people use the Internet more generally.

Going forward, we believe there is a tremendous amount of work that could be done with more data like this. We have tried to provide here a sense of some of the opportunities, and we look forward to discussing with you these and other approaches.

All code required to reproduce the analysis and images contained in this report can be found here.