Intro
-----

In this stage we're not trying to create final test lists - we're trying to build up the full block of marble from which we can carve out test lists. Hopefully, 99% of that carving can then be done automatically, so that individuals with varying amounts of domain expertise have decent places to start when compiling, updating, or reviewing lists. We're operating under the assumption that it's easier for individuals of varying skill levels to cut down a large list of candidates than it is for them to start with a blank slate.

The questions at this phase are how best to put together a large dataset that captures all the useful candidate URLs we can - along with a ton of useless, irrelevant material - and then how best to remove most of that irrelevant material.

Method
------

We took all the URLs that were on the old lists for each country and category combination, and for each of these, we followed all the outbound links. We then followed all the outbound links from those pages as well. For every page we visited, we kept track of all the links between pages to create a network graph. We then calculated various metrics on the network graph in an attempt to find the most "important" pages for each country-category combination. The default metric by which the URLs are sorted is decent, but all the metrics have been provided because sorting by other columns can often bring more useful results to the top.

Results
-------

The directory structure is /. The country codes are the 2-letter ISO codes, which are listed here: https://en.wikipedia.org/wiki/ISO_3166-2, and the category codes are listed here: https://github.com/citizenlab/test-lists/blob/master/lists/00-LEGEND-category_codes.csv.

Each directory should contain four files:

* __seeds.txt
* __errors.log
* __results.csv
* __graph.gexf

The seeds.txt file contains all the URLs that were on the old list for that country and category combination.
The errors.log file contains all the errors we encountered while scraping the pages. The results.csv file contains the ranked list of all the URLs we encountered; for each, we attempted to identify the language, and we calculated a number of metrics. The graph.gexf file contains the network graph we constructed, in Gephi (GEXF) format.

Observations
------------

From looking through a few of the lists quickly, it appears that the quality varies pretty widely. Categories that have more seeds tend to have better results, as do categories containing communities that are more likely to link to each other (human rights movements link to each other more often than dating sites do, for example).

We kept a lot of URLs off the lists. For example, a number of spammy sites, and all the sites on the global list (https://github.com/citizenlab/test-lists/blob/master/lists/global.csv), were excluded from the results. This kept a lot of popular but irrelevant URLs off the lists, but the scrapers still got caught up in popular or spammy sites that have lots of links between subdomains they own. These sites often end up at the tops of lists. For these lists, it helps to filter those sites out of the results, or to scroll past them to find more meaningful results. It also sometimes helps to sort by one of the other metric columns to find other potentially useful URLs.

The results CSV contains all the results, but it's rarely useful to look past the first five-or-so screens' worth. Sorting by other columns or filtering by language is often more useful.

Feedback
--------

If you find patterns that help you find more useful URLs or eliminate cruft, please share them so we can code them up and improve the process. If there are sites that keep showing up but are always irrelevant, please share those as well so we can automatically filter them out. Contact Justin Clark with feedback unless you've been given a better contact.
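The crawl-and-rank step described in Method and Results can be sketched end to end. This is a minimal, stdlib-only illustration, not the actual pipeline: the `links` pairs stand in for crawled outbound links, and PageRank plus in-degree are assumed stand-ins for whatever metrics the real run computed. It writes a ranked results.csv and a Gephi-readable graph.gexf, mirroring the per-directory outputs described above.

```python
# Hedged sketch: build a directed link graph from crawled (source, target)
# pairs, score each page, and emit results.csv and graph.gexf.
import csv
import xml.etree.ElementTree as ET

# Hypothetical crawl output: links followed outward from the seed URLs.
links = [
    ("http://seed-a.example", "http://site-b.example"),
    ("http://seed-a.example", "http://site-c.example"),
    ("http://site-b.example", "http://site-c.example"),
]

nodes = sorted({url for edge in links for url in edge})
out_links = {u: [] for u in nodes}
in_degree = {u: 0 for u in nodes}
for src, dst in links:
    out_links[src].append(dst)
    in_degree[dst] += 1

# PageRank by power iteration; dangling pages spread their rank evenly.
damping, n = 0.85, len(nodes)
rank = {u: 1.0 / n for u in nodes}
for _ in range(50):
    dangling = sum(rank[u] for u in nodes if not out_links[u])
    new = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
    for src in nodes:
        for dst in out_links[src]:
            new[dst] += damping * rank[src] / len(out_links[src])
    rank = new

# Ranked results, sorted by the (assumed) default metric.
with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "pagerank", "in_degree"])
    for url in sorted(nodes, key=lambda u: rank[u], reverse=True):
        writer.writerow([url, round(rank[url], 4), in_degree[url]])

# Minimal GEXF so the graph opens in Gephi.
gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
graph = ET.SubElement(gexf, "graph", defaultedgetype="directed")
xml_nodes = ET.SubElement(graph, "nodes")
ids = {u: str(i) for i, u in enumerate(nodes)}
for url, node_id in ids.items():
    ET.SubElement(xml_nodes, "node", id=node_id, label=url)
xml_edges = ET.SubElement(graph, "edges")
for i, (src, dst) in enumerate(links):
    ET.SubElement(xml_edges, "edge", id=str(i),
                  source=ids[src], target=ids[dst])
ET.ElementTree(gexf).write("graph.gexf", encoding="UTF-8",
                           xml_declaration=True)
```

In this toy graph, the page with the most inbound links ends up at the top of results.csv, which is the "importance" behavior the ranking is meant to capture.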
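The triage suggested in Observations - filtering out recurring irrelevant domains, restricting to one language, and re-sorting by a different metric column - can be sketched against a results file. The rows and the column names ("language", "in_degree") below are hypothetical assumptions about the results.csv schema, made up for illustration.

```python
# Hedged sketch: drop blocklisted sites, keep one language, re-sort by
# a different metric column than the default.
import csv
import io

# Hypothetical excerpt of a results.csv.
results_csv = """url,language,pagerank,in_degree
http://spam.example,en,0.45,40
http://news-a.example,fa,0.30,12
http://blog-b.example,fa,0.10,3
"""

# Sites that keep showing up but are always irrelevant.
blocklist = {"http://spam.example"}

rows = [r for r in csv.DictReader(io.StringIO(results_csv))
        if r["url"] not in blocklist and r["language"] == "fa"]
rows.sort(key=lambda r: int(r["in_degree"]), reverse=True)
print([r["url"] for r in rows])
```

On this made-up data, the spammy top-ranked row is removed and the two remaining same-language rows come back ordered by the alternate metric.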