Data Integration DB

From Berkman Klein Google Summer of Code Wiki

Summer of Code project

Project: Create a database and data integration system that allows for discovery of hidden connections between existing data sets

Project rationale:

The Berkman Klein Center and affiliated researchers are collecting a great deal of information and unique knowledge about Web sites around the world, including where they are located, whether they are filtered or blocked, and whether they are connected to oppositional political parties in authoritarian regimes, as well as demographic and other data on bloggers. For example, the Internet & Democracy Project has created social network maps of the Persian, Arabic, Russian, and US blogospheres, and has had humans identify the types of sites political bloggers prefer and the YouTube videos they preferentially link to, along with a wealth of data on who bloggers are in various foreign countries and what types of issues they care about. The OpenNet Initiative has data on which Web sites are filtered around the world and, in many cases, what content they host that might lead them to be filtered (such as prohibited sexual, religious, or political content or affiliation). Berkman spinoffs have data on which Web sites are affiliated with badware, as well as location and other data.

Currently, this information is stored within project silos, in different formats, and is effectively (but unintentionally) walled off from other researchers. There is a pressing and long-overdue need for a central repository for Berkman-collected data that uses best practices for metadata representation and cross-referencing.

The ultimate goal of this effort is to allow researchers from various Berkman Klein Center projects to gain insights that they could not have gained before, and to fully utilize the large amount of (largely untapped) knowledge we are collecting about the global Internet and its users. We would also like to give researchers outside of Berkman access to as much data as we can safely release, so others can build on the work we have started. This effort will ultimately help the Berkman Klein Center make meaningful connections between sets of data and make discoveries that are essentially right in front of us but invisible due to project silos.

Project requirements:

  • creation of a relational database for storing various types of information about given URLs, including (but not limited to): location, language, whether the site is blocked or filtered, blogger demographics (male, female, etc.), connection to badware, political orientation, whether it is an oppositional news site, etc.
  • a user interface that allows non-technical researchers to search and sort datasets
  • conformity with common metadata standards, such as Dublin Core
  • user- and group-level access control
  • security: encryption, limiting access, granular permissions, etc.
  • use of an open source database system (MySQL or PostgreSQL)
  • methods for importing datasets from various formats (e.g. CSV, XML)
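
To make the first and last items above more concrete, here is a minimal sketch of one possible design. It is not a prescribed schema. It assumes each incoming dataset is a CSV file with a `url` column, and it uses Python's built-in SQLite as a stand-in for MySQL or PostgreSQL so that it runs without a server. The table names, the `normalize_url`, `import_csv`, and `sites_matching` helpers, and the sample rows are all hypothetical and invented for illustration. The `title` and `creator` columns are named after the Dublin Core elements of the same name. Access control and encryption are left out.

```python
import csv
import io
import sqlite3

# Hypothetical schema: one row per site, one row per dataset, and one row
# per (site, dataset, attribute) fact, so new attributes need no new columns.
SCHEMA = """
CREATE TABLE dataset (
    id      INTEGER PRIMARY KEY,
    title   TEXT NOT NULL UNIQUE,   -- named after Dublin Core 'title'
    creator TEXT NOT NULL           -- named after Dublin Core 'creator'
);
CREATE TABLE site (
    id  INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE
);
CREATE TABLE observation (
    site_id    INTEGER NOT NULL REFERENCES site(id),
    dataset_id INTEGER NOT NULL REFERENCES dataset(id),
    attribute  TEXT NOT NULL,       -- e.g. 'language', 'filtered'
    value      TEXT NOT NULL,
    PRIMARY KEY (site_id, dataset_id, attribute)
);
"""


def normalize_url(url):
    """Crude normalization so the same site matches across datasets."""
    return url.strip().lower().rstrip("/")


def import_csv(conn, title, creator, csv_file):
    """Load a CSV with a 'url' column; every other column becomes an attribute."""
    conn.execute(
        "INSERT OR IGNORE INTO dataset (title, creator) VALUES (?, ?)",
        (title, creator),
    )
    dataset_id = conn.execute(
        "SELECT id FROM dataset WHERE title = ?", (title,)
    ).fetchone()[0]
    for row in csv.DictReader(csv_file):
        url = normalize_url(row.pop("url"))
        conn.execute("INSERT OR IGNORE INTO site (url) VALUES (?)", (url,))
        site_id = conn.execute(
            "SELECT id FROM site WHERE url = ?", (url,)
        ).fetchone()[0]
        for attribute, value in row.items():
            conn.execute(
                "INSERT OR REPLACE INTO observation VALUES (?, ?, ?, ?)",
                (site_id, dataset_id, attribute, value),
            )


def sites_matching(conn, criteria):
    """Return URLs that satisfy every attribute=value pair, whichever dataset supplied it."""
    where = " AND ".join(
        "id IN (SELECT site_id FROM observation WHERE attribute = ? AND value = ?)"
        for _ in criteria
    ) or "1"
    params = [item for pair in criteria.items() for item in pair]
    query = "SELECT url FROM site WHERE " + where + " ORDER BY url"
    return [row[0] for row in conn.execute(query, params)]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Two made-up datasets from two hypothetical projects.
blog_csv = io.StringIO(
    "url,language,orientation\n"
    "http://blog-a.example.org/,fa,reformist\n"
    "http://blog-b.example.org,fa,conservative\n"
)
filter_csv = io.StringIO(
    "url,filtered\n"
    "http://Blog-A.example.org,yes\n"
    "http://blog-c.example.org,yes\n"
)
import_csv(conn, "Blog survey (made up)", "Project A", blog_csv)
import_csv(conn, "Filtering tests (made up)", "Project B", filter_csv)

# A cross-dataset question that neither source could answer alone.
result = sites_matching(conn, {"orientation": "reformist", "filtered": "yes"})
print(result)  # ['http://blog-a.example.org']
```

Storing each fact as an attribute/value row tied to its source dataset is only one option. It keeps provenance visible and lets new datasets add attributes without schema changes, at the cost of weaker typing than dedicated columns would give.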


Desired skills:

  • interest in helping cutting-edge research teams discover previously unknown patterns and unique connections between data sets
  • knowledge of relational database design
  • familiarity with XML and (preferably) metadata standards