Intro
-----

In this stage we're not trying to create final test lists - we're trying to build up the full block of marble from which we can carve out test lists. Hopefully, 99% of that carving can then be done automatically, so that individuals with varying amounts of domain expertise have decent places to start when compiling, updating, or reviewing lists. We're operating under the assumption that it's easier for individuals of varying skill levels to cut down a large list of candidates than it is for them to start with a blank slate.

The questions at this phase are how best to put together a large dataset that captures all the useful candidate URLs we can - along with a ton of useless, irrelevant material - and then how best to remove most of that irrelevant material.

Method
------

We took all the URLs that were on the old lists for each country and category combination, and for each of these, we followed all the outbound links. We then followed all the outbound links from those pages as well. For every page we visited, we kept track of all the links between pages to create a network graph. We then calculated various metrics on the network graph in an attempt to find the most "important" pages for each country-category combination. The default metric by which the URLs are sorted is decent, but all the metrics have been provided because sorting by other columns can often bring more useful results to the top.

Results
-------

The directory structure is /. The country codes are the 2-letter ISO codes, which are listed here: https://en.wikipedia.org/wiki/ISO_3166-2, and the category codes are listed here: https://github.com/citizenlab/test-lists/blob/master/lists/00-LEGEND-category_codes.csv.

Each directory should contain four files:

* __seeds.txt
* __errors.log
* __results.csv
* __graph.gexf

The seeds.txt file contains all the URLs that were on the old list for that country and category combination.
The errors.log file contains all the errors we encountered while scraping the pages. The results.csv file contains the ranked list of all the URLs we encountered; for each, we attempted to identify the language, and we calculated a number of metrics. The graph.gexf file contains the network graph we constructed, in Gephi (GEXF) format.

Observations
------------

From looking through a few of the lists quickly, it appears that the quality varies pretty widely. Categories that have more seeds tend to have better results, as do categories containing communities that are more likely to link to each other (human rights movements link to each other more often than dating sites do, for example).

We kept a lot of URLs off the lists. For example, a number of spammy sites, and all the sites on the global list (https://github.com/citizenlab/test-lists/blob/master/lists/global.csv), were excluded from the results. This kept a lot of popular but irrelevant URLs off the lists, but the scrapers still got caught up in popular or spammy sites that have lots of links between subdomains they own. These sites often end up at the tops of lists. For these lists, it helps to filter those sites out of the results, or to scroll past them to find more meaningful results. It also sometimes helps to sort by one of the other metric columns to find other potentially useful URLs.

The results CSV contains all the results, but it's rarely useful to look past the first five-or-so screens' worth. Sorting by other columns or filtering by language is often more useful.

Feedback
--------

If you find patterns that help you find more useful URLs or eliminate cruft, please share them so we can code them up and improve the process. If there are sites that keep showing up but are always irrelevant, please share those as well so we can automatically filter them out. Contact Justin Clark with feedback unless you've been given a better contact.
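The crawl-and-rank step described in Method and Results can be sketched end to end. This is a minimal, stdlib-only illustration, not the actual pipeline: the `links` pairs stand in for crawled outbound links, and PageRank plus in-degree are assumed stand-ins for whatever metrics the real run computed. It writes a ranked results.csv and a Gephi-readable graph.gexf, mirroring the per-directory outputs described above.

```python
# Hedged sketch: build a directed link graph from crawled (source, target)
# pairs, score each page, and emit results.csv and graph.gexf.
import csv
import xml.etree.ElementTree as ET

# Hypothetical crawl output: links followed outward from the seed URLs.
links = [
    ("http://seed-a.example", "http://site-b.example"),
    ("http://seed-a.example", "http://site-c.example"),
    ("http://site-b.example", "http://site-c.example"),
]

nodes = sorted({url for edge in links for url in edge})
out_links = {u: [] for u in nodes}
in_degree = {u: 0 for u in nodes}
for src, dst in links:
    out_links[src].append(dst)
    in_degree[dst] += 1

# PageRank by power iteration; dangling pages spread their rank evenly.
damping, n = 0.85, len(nodes)
rank = {u: 1.0 / n for u in nodes}
for _ in range(50):
    dangling = sum(rank[u] for u in nodes if not out_links[u])
    new = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
    for src in nodes:
        for dst in out_links[src]:
            new[dst] += damping * rank[src] / len(out_links[src])
    rank = new

# Ranked results, sorted by the (assumed) default metric.
with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "pagerank", "in_degree"])
    for url in sorted(nodes, key=lambda u: rank[u], reverse=True):
        writer.writerow([url, round(rank[url], 4), in_degree[url]])

# Minimal GEXF so the graph opens in Gephi.
gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
graph = ET.SubElement(gexf, "graph", defaultedgetype="directed")
xml_nodes = ET.SubElement(graph, "nodes")
ids = {u: str(i) for i, u in enumerate(nodes)}
for url, node_id in ids.items():
    ET.SubElement(xml_nodes, "node", id=node_id, label=url)
xml_edges = ET.SubElement(graph, "edges")
for i, (src, dst) in enumerate(links):
    ET.SubElement(xml_edges, "edge", id=str(i),
                  source=ids[src], target=ids[dst])
ET.ElementTree(gexf).write("graph.gexf", encoding="UTF-8",
                           xml_declaration=True)
```

In this toy graph, the page with the most inbound links ends up at the top of results.csv, which is the "importance" behavior the ranking is meant to capture.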
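The triage suggested in Observations - filtering out recurring irrelevant domains, restricting to one language, and re-sorting by a different metric column - can be sketched against a results file. The rows and the column names ("language", "in_degree") below are hypothetical assumptions about the results.csv schema, made up for illustration.

```python
# Hedged sketch: drop blocklisted sites, keep one language, re-sort by
# a different metric column than the default.
import csv
import io

# Hypothetical excerpt of a results.csv.
results_csv = """url,language,pagerank,in_degree
http://spam.example,en,0.45,40
http://news-a.example,fa,0.30,12
http://blog-b.example,fa,0.10,3
"""

# Sites that keep showing up but are always irrelevant.
blocklist = {"http://spam.example"}

rows = [r for r in csv.DictReader(io.StringIO(results_csv))
        if r["url"] not in blocklist and r["language"] == "fa"]
rows.sort(key=lambda r: int(r["in_degree"]), reverse=True)
print([r["url"] for r in rows])
```

On this made-up data, the spammy top-ranked row is removed and the two remaining same-language rows come back ordered by the alternate metric.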