This spring, Google provided Herdict with a collection of anonymized Google Web Toolbar usage metrics. The data spanned mid-November 2011 to May 1, 2012 and included the toolbar's records of aggregate visits, DNS errors, connection errors, and 404 errors, broken down by domain and country and binned by minute, hour, and day; a bin was reported only when its value exceeded 125.
This document summarizes our work in interpreting this data. Although Google informed us that the data was not useful to researchers at Georgia Tech, our experience was quite different. We found the data to be extremely useful at identifying both site outages and non-outage related changes in site traffic. We believe that should Google resume collection of this data, even if it is provided only on a weekly basis, it would be incredibly beneficial to Herdict and Internet-health research more generally.
Due to data collection problems inherent in the first two-thirds of the data, we focused our analysis on only the 48-day period from mid-March to the first of May 2012. Additionally, we used only the data bucketed by minute, as this data appeared the most regular.
Our objective for this data was to create a system for identifying anomalies: deviations in traffic or errors from what would be expected. To do this, we developed an algorithm in the R language that uses trends to estimate expected data and identifies when the actual data exhibits statistically significant deviations. We then manually reviewed these anomalies and located external sources confirming that we were identifying actual events and not just aberrations in the data collection. Next, we classified corroborated anomalies in order to build a rough anomaly profile that aids in subsequent identification.
This document highlights many of these anomalies. Every single anomaly listed below was automatically identified by our algorithm. We have included only a selection of anomalies, some of which we corroborated, some we have not. The fact that some anomalies have not been corroborated is not a failing of the algorithm; instead, it highlights our ability to identify short outages that end before they reach a significant level of public consciousness. The data we have selected highlights the validity, value, and flexibility of our approach through the inclusion of a broad sample of anomaly patterns and the various events they represent.
The graphs below typically plot one of the metrics that Google provided (e.g., visits, 404s, connection errors, or DNS errors). On all graphs, black lines represent the raw time-series data. Our algorithm adds in blue and red highlights, indicating areas in which the data deviates from expectations. Blue indicates areas in which the numbers are abnormally low (e.g., a drop in visits), and red where they are abnormally high (e.g., an increase in DNS errors). The more the data deviates from expected values, the more intense the color coding.
In the data, Google used country code "ZZ" as a placeholder when the country of request origin is unknown. Similarly, Google used "UNKNOWN" when the requested domain is unknown (and therefore represents an aggregation of requests to multiple domains).
Below we review selected anomalies and, where applicable, provide links to external sources that corroborate these anomalies, linking the events that our algorithm detected with real world network disruptions. To see a complete list of discovered anomalies, please see this file.
Herdict requested this data from Google in the belief that browser data would provide a critical additional source of information regarding site outages, one that could be used to corroborate or direct the crowdsourced reporting on Herdict. Our work confirms that toolbar data could be a tremendous asset for Herdict in this respect.
Our algorithm was successful at identifying both significant site outages that attracted media and other public attention and smaller disruptions. For any given domain, we looked for anomalous drops in visits that corresponded to an anomalous increase in any of the error metrics. Using this approach, we identified more than 20 likely site disruptions within the 48-day period we analyzed.
At the beginning of our targeted time window, we saw Facebook become largely inaccessible in India. The Indian tech blog Gizbot corroborated the outage.
We detected anomalous traffic patterns to www.sahibinden.com (a large Turkish classifieds site). The site itself confirmed the outage.
We saw the Pirate Bay (a torrent search engine) go down for an extended period of time. This appears to be a common occurrence, and this particular event is noted in a post on TorrentFreak.
We detected an outage for German social networking site Wer Kennt Wen which was confirmed by (now defunct) availability tracker DownScount.
As opposed to site disruptions, which affect only a single site, network disruptions are more systemic and can impact an entire ISP or even a country. As a result, we can identify network disruptions by looking for times in which a number of site disruptions co-occur or when multiple sites demonstrate unusual levels of errors. During our limited data window, network disruptions occurred less often than site disruptions.
The Turkish ISP Turk Telekom had a short issue on April 27 that affected a large number of users. It's unclear whether this was a technical issue or an attack, but news reports confirm that the event occurred, and it is additionally corroborated through Google's own transparency report data, which evidences a drop-off at the same time.
We detected similar anomalies in DNS error data from several different countries in eastern Europe. It is possible that these errors are related to DNS attacks that the online group "Anonymous" had planned for this same period. It was previously impossible to confirm the effectiveness of these attacks; this data is significant in that it provides a new way to measure the impact of the attacks, and it suggests that they did in fact have noticeable and identifiable repercussions.
A disruption was seen in Trinidad and Tobago with a profile similar to that of Turkey's outage (described above). We could not corroborate this event.
Our algorithm can clearly detect site and network disruptions because they correlate with significant changes in traffic and errors over a period of minutes or hours. In places with persistent issues, our algorithm would not identify anomalies because a lack of access is the norm, not a change. However, by adjusting our approach, this data allows us to identify places with persistent issues, even though there is not a dramatic change in the data. This is important because it allows us to create a fingerprint of filtering or connectivity issues and provides data to support theories about where there is persistent filtering.
To find persistent issues, we ranked site-country pairs by their ratio of average errors per minute to average visits per minute. The rankings varied slightly depending upon which error we used in the ratio (DNS, connection, or 404 errors); however, countries with highly suspected ongoing filtering issues (Vietnam, Malaysia, Indonesia, Pakistan, and the Philippines) were always found at the top of the list. Thus, we have identified a visual fingerprint for persistent filtering.
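As a minimal sketch of that ranking (in Python rather than the R used for the analysis), assuming the per-pair averages have already been computed; the dictionary shape mapping (domain, country) pairs to per-minute averages is a hypothetical data format:

```python
def rank_by_error_ratio(stats):
    """Rank (domain, country) pairs by average errors per visit, worst first.

    `stats` maps (domain, country) -> (avg_errors_per_min, avg_visits_per_min).
    This shape is an assumption for illustration, not the original data format.
    """
    return sorted(stats,
                  key=lambda pair: stats[pair][0] / stats[pair][1],
                  reverse=True)
```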
Access to Facebook from Vietnam had far and away the highest ratio of errors to visits. This correlates rather well with observations on the ground. The images below compare a relatively "healthy" national network (the US) with ones that use filtering or have other persistent issues; the difference is immediately apparent.
The other domain-country pairs with the highest error-to-visit ratios were requests for unreported/unknown domains. These ratios were far lower than that seen above (where errors actually outpaced successful visits), but they were still far higher than that of requests for unknown domains from the United States.
Although the data has significant potential for identifying site and network outages, it has potential utility beyond just that. For instance, the data sheds light on how people use the Internet. We can see how national holidays can impact web behavior by identifying times when the number of visits increases or decreases simultaneously across a number of different URLs. Financial websites, in particular, appear to be a good indicator of holidays. By corroborating such events, we can confirm that the algorithm is identifying actual real-world events and does not reflect artifacts of collection.
Our method identified dozens of these anomalies and most were externally corroborated. We have provided the most visually obvious or interesting examples.
South Africa had an extra long weekend in early April, which led to a drop in traffic to most sites accessed from the country. Google identified a similar pattern.
Russia celebrates Spring and Labor Day on May 1 (which is not included in our data set), but this year April 30 was also given as a holiday. To make up for the extra day off, the Saturday before was a work day.
Markets were closed and most people were off work on April 4 in Taiwan for Women's Day and Children's Day. What is interesting about this data is that certain sites experienced lower than normal traffic (Yahoo! Finance and Yahoo!'s general site) while more recreational sites (YouTube and Facebook) experienced higher than normal traffic.
Our algorithm can identify events of national importance. We can identify these events by looking for anomalous traffic increases that occur for only a single or a small number of domains within a country. Our algorithm detected hundreds of these events, but they are difficult to corroborate, in part because of language issues.
Japan saw its largest storm since 1959 on April 3. Visits to transportation and weather sites increased, as did visits to the online social platform Pigg (perhaps suggesting some people took the day off from work).
Visits to Picasa saw an unusual spike toward the end of April. While this doesn't appear to correspond with any real-world events, Google reports an almost 80-fold increase in requests from Malaysia during the same period.
On April 24, the website for US TV network MSNBC saw an increase in traffic from the United States. This corresponds with the US Republican primary.
The most common anomaly by far occurred on April 9 around 20:00 GMT. This anomaly was detected in different request results, across different domains, and for different countries. As it was so widespread, we believe this to be a data collection error, and as such, April 9 was discarded from most of our analysis.
April 9
Identify good data.
We chose March 13, 2012 00:00 to May 1, 2012 05:00 as the time window with the best data and used only the minute-by-minute data from this window.
Trim the fat.
We took the top 1000 sites by total visits as those sites had the densest data coverage.
Clean up the data.
We filled in missing values in the data with either 0 or 124, depending upon the application.
Regularize the data.
We needed a dataset with a consistent sample rate, so we filled in values for missing minutes using linear interpolation.
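That interpolation step can be sketched as below; the pipeline itself was written in R, and the Python function and argument names here are illustrative.

```python
def interpolate_minutes(samples, start, end):
    """Fill missing minutes in [start, end] via linear interpolation.

    `samples` maps minute index -> value. Missing minutes are filled by
    interpolating between the nearest known neighbours; minutes with no
    neighbour on one side are left out. Names are illustrative only.
    """
    known = sorted(samples)
    out = {}
    for t in range(start, end + 1):
        if t in samples:
            out[t] = samples[t]
            continue
        lo = max((k for k in known if k < t), default=None)
        hi = min((k for k in known if k > t), default=None)
        if lo is None or hi is None:
            continue  # cannot interpolate at the edges of the window
        frac = (t - lo) / (hi - lo)
        out[t] = samples[lo] + frac * (samples[hi] - samples[lo])
    return out
```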
Decompose the data.
We used a number of different methods for decomposing the data (ARMA, auto-fitting ARIMA) before deciding upon the simple STL (Seasonal-Trend decomposition by Loess) algorithm. This algorithm is fast, conceptually much simpler than other methods, and works. We assumed a periodicity of 1 week. This process turned the one time series into three: a seasonal component (the weekly pattern), a trend component, and a remainder. We were interested in getting the remainder, as this contains deviations for which the seasonal and trend components cannot account. The first four graphs in Figure 18 below are the output of this algorithm.
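The analysis itself used R's stl() with a one-week period. Purely as an illustration of the additive seasonal/trend/remainder idea, and not a substitute for STL (no loess, no robustness iterations), the sketch below estimates the seasonal component from per-phase means and the trend from a centered moving average:

```python
from statistics import mean

def decompose(series, period):
    """Crude additive decomposition into seasonal, trend, and remainder.

    Far simpler than STL; illustrative of the decomposition idea only.
    """
    n = len(series)
    # Seasonal: average of all points sharing the same phase, centered on zero.
    phase_means = [mean(series[i::period]) for i in range(period)]
    level = mean(phase_means)
    seasonal = [phase_means[i % period] - level for i in range(n)]
    # Trend: centered moving average roughly one period wide.
    half = period // 2
    trend = [mean(series[max(0, i - half):i + half + 1]) for i in range(n)]
    # Remainder: whatever seasonality and trend cannot account for.
    remainder = [series[i] - seasonal[i] - trend[i] for i in range(n)]
    return seasonal, trend, remainder
```

On a perfectly periodic series, the remainder is zero away from the edges, which is exactly why the remainder component is where anomalies show up.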
Smooth the data.
We binned the remainder component of the above process into one-hour segments and took the mean of each bin. This is the fifth panel in Figure 18.
Filter the data.
We removed any bins from the above process whose mean fell below a certain threshold (3 times the standard deviation of the time series, in absolute value). The threshold and discarded data are represented graphically in Figure 18 by red lines and grayed-out line segments, respectively.
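The smoothing and filtering steps can be sketched together as follows. The 3-standard-deviation cutoff matches the description above, while the function names, the 60-minute bin size parameter, and the toy data layout are illustrative:

```python
from statistics import mean, stdev

def anomalous_bins(remainder, bin_size=60, n_sigma=3.0):
    """Bin a per-minute remainder series and keep only anomalous bins.

    A bin survives when the absolute value of its mean exceeds n_sigma
    times the standard deviation of the whole series. Returns a list of
    (bin_index, bin_mean) pairs. Names are illustrative, not from the
    original R code.
    """
    threshold = n_sigma * stdev(remainder)
    bins = [remainder[i:i + bin_size]
            for i in range(0, len(remainder), bin_size)]
    return [(i, mean(b)) for i, b in enumerate(bins)
            if abs(mean(b)) > threshold]
```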
Output the data.
Each of the remaining bins was considered significantly anomalous. These were output as both CSV files and PNG graphs. See the last panel of Figure 18. The line matches that of the first panel, but anomalous highlights have been added.
Identify patterns in the data.
For each anomaly, we knew whether it was higher or lower than expected and the request-result time series in which it occurred (successful visits, connection errors, DNS errors, not-found errors). Using these traits, we could roughly categorize anomalies.
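As one illustration, such a categorization might be expressed as a small rule table. The rules below are a simplified reading of the patterns described earlier in this report (site disruptions, network disruptions, holidays, national events), not the authors' actual category list:

```python
def classify_anomaly(visits_low, visits_high, errors_high, many_domains):
    """Heuristic labels from anomaly direction, metric, and breadth.

    Inputs are booleans describing one (country, time-window) anomaly;
    the rule set is illustrative, not the report's real categorization.
    """
    if visits_low and errors_high and not many_domains:
        return "possible site disruption"
    if visits_low and errors_high and many_domains:
        return "possible network disruption"
    if visits_low and many_domains:
        return "possible holiday or shared downtime"
    if visits_high and not many_domains:
        return "possible event of national interest"
    return "unclassified"
```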
We are excited about what we identified in less than 50 days' worth of usable data. We have created an algorithm that we believe can identify significant anomalies with a high degree of accuracy. As the examples above demonstrate, we have tied much of what our algorithm detected to real-world events, confirming its utility.
Having additional data would allow us to further refine our approach and ensure its accuracy. 40+ days' worth of data is a relatively small sample. With a larger data set we could see whether the patterns we've identified hold true over an extended period of time. Moreover, we could test whether the algorithm picks up known outages by looking for anomalies where there has been a significant filtering event (for example, in Egypt, Libya, or Pakistan in the period following the "Innocence of Muslims" video). If the algorithm matched known events, it would further confirm its effectiveness.
This data would be extremely useful for Herdict even if it were not collected live. If Google could provide it on a daily or even weekly basis, it could still improve Herdict's ability to detect outages. While many Herdict users search for real-time data, many also look for historical trends. For the latter category of users, supplementing our crowdsourced data with detected anomalies would help confirm (or disprove) our crowd-reported data, providing a necessary check on the variability of data from an unpredictable source. Moreover, this data would allow us to better tailor the sites that we direct our users to test through our various testing queues. Thus, even if not reported live, this data could make Herdict far more useful and responsive.
Additionally, this data can help provide quantitative confirmation for places suspected of having persistent filtering and connection issues. As noted above, we've been able to construct a visual fingerprint for those kinds of issues, which provides objective confirmation for what has previously been a largely subjective and somewhat abstract determination. Of perhaps even greater importance, by tracking such fingerprints over an extended period of time, this data could be a way of identifying when countries make changes to their filtering regimes, something that has been challenging to identify up until this point.
Outside of the filtering and Herdict contexts, this data can provide unique insights into how people use the Internet. As we noted above, our algorithm can detect national holidays and when people are concerned about an incoming storm. Although such information is not directly relevant to Herdict's mission, it could be extremely useful for research projects (such as Berkman's Internet Monitor project) that are looking into quantification of how people use the Internet more generally.
Going forward, we believe there is a tremendous amount of work that could be done with more data like this. We have tried to provide here a sense of some of the opportunities, and we look forward to discussing with you these and other approaches.
All code required to reproduce the analysis and images contained in this report can be found here.