MediaCloud: Difference between revisions

From Berkman Klein Google Summer of Code Wiki



Revision as of 14:47, 15 March 2017

Media Cloud is an open source platform for studying media ecosystems.

By tracking hundreds of millions of stories published online or broadcast via television, our suite of tools lets researchers follow how stories and ideas spread through media, and how different corners of the media ecosystem report on stories.

Our platform is designed to aggregate, analyze, deliver and visualize information, answering complex quantitative and qualitative questions about the content of online media.

Aggregate. We have aggregated billions of online stories from an ever-growing set of 25,000 digital media sources. We ingest data via RSS feeds and a set of robots that spider the web to fetch information from a variety of sources in near real-time.

Analyze. To query our extensive library of data, we have developed a suite of analytical tools that allow you to explore relationships between professional and citizen media, and between online and offline sources.

Deliver and Visualize. Our suite of tools provides opportunities to present data in formats that you can visualize in your own interfaces. These include graphs, geographic maps, word clouds, and network visualizations.


Projects

Build a tool to do some cool visualizations

Problem Statement. Since 2008, we have collected more than half a billion news articles that we have post-processed and indexed. We know quite a lot about them -- which news articles were the most linked to from other similar articles, which articles were the most and least popular / influential (based on shares on Facebook, tweet count, or clicks on an article's Bit.ly shortened link), the specific language and terms used to describe the subject matter in each article, etc. -- and there's a lot of potential to learn much more. Can you use your design and coding skills to help us visualize some of this data, e.g. by creating a cool network map visualization tool?

Development Tasks
  • Build any visualization tool based on our extensive data and tool set:
    • Figure out what you'd like to visualize and how you are going to do it
    • Use Gephi, a tool of your choice, or create your very own tool to implement your visualization
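
As one possible starting point for a network map, the sketch below counts how often each story is linked to and writes the links as a Source,Target edge list in CSV, a format Gephi can import. The sample URLs and the function names `inlink_counts` and `to_edge_list_csv` are hypothetical; how you obtain the link data from Media Cloud is left open here.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data: (linking story, linked-to story).
links = [
    ("a.example/1", "b.example/9"),
    ("c.example/4", "b.example/9"),
    ("a.example/1", "c.example/4"),
    ("a.example/1", "b.example/9"),  # duplicate link, counted once
]


def inlink_counts(edges):
    """Count how many distinct stories link to each target story."""
    return Counter(target for _, target in set(edges))


def to_edge_list_csv(edges):
    """Serialize the distinct edges as a Source,Target CSV edge list."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(sorted(set(edges)))
    return buf.getvalue()


if __name__ == "__main__":
    # Most linked-to stories first.
    for story, count in inlink_counts(links).most_common():
        print(story, count)
    print(to_edge_list_csv(links))
```

The in-link count could drive node size in the resulting map, so the most linked-to stories stand out.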


Create PostgreSQL-based job queue

Problem Description. In the more than eight (or is it nine by now?) years that we've been running Media Cloud, we have tried multiple job queue tools (e.g. Gearman) for dividing and conquering our workload. Unfortunately, all of these tools (including the current one -- go look into the codebase to figure out which one it is now) have left us deeply unhappy for one reason or another. If there's one tool which hasn’t let us down, it’s PostgreSQL. So we'd like to try running our job queue on Postgres as well. Can you implement it for us?

Development Tasks
  • Write a spec, complete with code samples, on how to implement the following job queue:
    • Preferably programming language-agnostic, i.e. should run as a bunch of PL/pgSQL functions.
      • Maybe that's a bad idea, I don't know, you tell us.
  • Features:
    • Add jobs with names and JSON arguments
    • Cancel jobs by their ID
    • Track a job's progress (and log?) by its ID
    • Get job ID using its JSON parameters
    • Merge jobs with identical JSON arguments into a single job
    • See job stats per task, i.e. how many jobs are queued for every task
    • Retry failed jobs
    • Report job failure, complete with error messages
    • Proper locking (for inspiration, see https://github.com/chanks/que)
    • Doesn't catch fire with tens of millions of queued jobs
  • (Bonus points) Actually implement the queue! If you don't get to this over the summer, that's fine; we would be happy with a proven spec.
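
To make the feature list above concrete, here is a minimal in-memory model of the intended queue semantics. It is not Postgres code and not a proposed design: the class `JobQueue`, its method names, and the job states are all hypothetical. In a real spec each method would presumably become a PL/pgSQL function over a jobs table, and `claim` would need proper row locking (for example `SELECT ... FOR UPDATE SKIP LOCKED`, available since PostgreSQL 9.5, or advisory locks), which this single-process sketch does not model.

```python
import json
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    id: int
    name: str
    args_key: str               # canonical JSON of the arguments
    state: str = "queued"       # queued | running | done | failed | cancelled
    progress: float = 0.0
    error: Optional[str] = None
    attempts: int = 0


class JobQueue:
    def __init__(self):
        self._jobs = {}
        self._by_key = {}       # (name, args_key) -> ID of the unfinished job
        self._next_id = 1

    @staticmethod
    def _canon(args):
        # Sorted keys make {"a": 1, "b": 2} and {"b": 2, "a": 1} identical.
        return json.dumps(args, sort_keys=True, separators=(",", ":"))

    def add(self, name, args):
        """Add a job; identical name + JSON arguments merge into one job."""
        key = (name, self._canon(args))
        if key in self._by_key:
            return self._by_key[key]
        job = Job(self._next_id, name, key[1])
        self._jobs[job.id] = job
        self._by_key[key] = job.id
        self._next_id += 1
        return job.id

    def find(self, name, args):
        """Get the ID of an unfinished job from its JSON arguments."""
        return self._by_key.get((name, self._canon(args)))

    def get(self, job_id):
        return self._jobs[job_id]

    def claim(self, name):
        """Hand the oldest queued job of this task to a worker."""
        for job in sorted(self._jobs.values(), key=lambda j: j.id):
            if job.name == name and job.state == "queued":
                job.state = "running"
                job.attempts += 1
                return job
        return None

    def set_progress(self, job_id, fraction):
        self._jobs[job_id].progress = fraction

    def _release(self, job):
        self._by_key.pop((job.name, job.args_key), None)

    def complete(self, job_id):
        job = self._jobs[job_id]
        job.state, job.progress = "done", 1.0
        self._release(job)

    def cancel(self, job_id):
        job = self._jobs[job_id]
        if job.state not in ("queued", "running"):
            return False
        job.state = "cancelled"
        self._release(job)
        return True

    def fail(self, job_id, error):
        """Report a failure; the job keeps its key so it can be retried."""
        job = self._jobs[job_id]
        job.state, job.error = "failed", error

    def retry(self, job_id):
        job = self._jobs[job_id]
        if job.state != "failed":
            return False
        job.state, job.error, job.progress = "queued", None, 0.0
        return True

    def stats(self):
        """Number of queued jobs per task name."""
        return dict(Counter(j.name for j in self._jobs.values()
                            if j.state == "queued"))
```

Merging here relies on a canonical serialization of the arguments. In Postgres, a unique index over the task name and a `jsonb` arguments column might serve the same purpose, but that is an assumption a spec would have to validate, especially at tens of millions of rows.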