=Overview=
[https://mediacloud.org/ Media Cloud] is an open source platform for studying media ecosystems. Media Cloud is a joint project between [https://cyber.harvard.edu The Berkman Klein Center for Internet and Society at Harvard University] and [https://civic.mit.edu The Center for Civic Media] at MIT's Media Lab.


By tracking hundreds of millions of stories published online or broadcast via television, our suite of tools allows researchers to track how stories and ideas spread through media, and how different corners of the media ecosystem report on stories.


Our platform is designed to aggregate, analyze, deliver and visualize information, answering complex quantitative and qualitative questions about the content of online media.


*'''Aggregate''': We have aggregated billions of online stories from an ever-growing set of 1,200,000+ digital media sources. We ingest data via RSS feeds and a set of robots that spider the web to fetch information from a variety of sources in near real-time.
*'''Analyze''': To query our extensive library of data, we have developed a suite of analytical tools that allow you to explore relationships between professional and citizen media, and between online and offline sources.
*'''Deliver and Visualize''': Our suite of tools provides opportunities to present data in formats that you can visualize in your own interfaces. These include graphs, geographic maps, word clouds, and network visualizations.


'''Project URL''': https://mediacloud.org/


'''Project on GitHub''': https://github.com/berkmancenter/mediacloud


'''Project Mentors''': [mailto:linas@media.mit.edu Linas Valiukas], [mailto:hroberts@cyber.law.harvard.edu Hal Roberts]


=Project Ideas=


==Create a self-contained, browser-based page HTML -> article HTML extractor==
'''Problem Statement:'''
For every fetched news article, we have to figure out which part of the page HTML contains the article body itself. We currently use readability-lxml (https://github.com/buriy/python-readability) for that task. However, readability-lxml is aging fast and is not necessarily still the best library around for extracting the article body from an HTML page. Also, more and more articles get loaded using JavaScript due to an ongoing "frontend everywhere!" frenzy, and our Python extractor doesn't execute or support JavaScript. Lastly, various CDNs, e.g. Cloudflare, block our crawler just because our user agent doesn't have JavaScript enabled.


I think inevitably we'll have to switch to running a headless browser, loading each and every downloaded story in it, and then applying a well-supported third-party library, e.g. Mozilla's Readability (https://github.com/mozilla/readability), to extract article title, author and body.


===Development Tasks===
*Set up a chromeless browser
*Set up Readability
*Develop an HTTP service that accepts a URL parameter (and/or HTML body), loads it in the browser, runs Readability's magic, and returns the extracted HTML back to the requester.
*Package everything in a nice Docker image


There are similar existing projects, e.g. https://github.com/schollz/readable, so you'll need to do some research into whether such a thing already exists before submitting a proposal. Maybe an existing tool could be improved instead of redoing everything from scratch?
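
To give a sense of what the HTTP service described in the tasks above could look like, here is a rough sketch built on pyppeteer (headless Chromium driven from Python) plus aiohttp, with a local copy of Mozilla's Readability.js. The endpoint name, parameters, and response shape are illustrative assumptions, not a settled design.

<syntaxhighlight lang="python">
# Rough sketch only: load a page in headless Chromium, inject Mozilla's
# Readability.js, and return whatever it extracts. Endpoint and response
# shape are placeholders.
from aiohttp import web          # pip install aiohttp
from pyppeteer import launch     # pip install pyppeteer

READABILITY_JS = 'Readability.js'  # assumed local copy of Mozilla's library

routes = web.RouteTableDef()


@routes.get('/extract')
async def extract(request: web.Request) -> web.Response:
    url = request.query.get('url')
    if not url:
        raise web.HTTPBadRequest(text='Missing ?url= parameter')

    browser = await launch(headless=True, args=['--no-sandbox'])
    try:
        page = await browser.newPage()
        # Let the page run its JavaScript before extracting anything.
        await page.goto(url, waitUntil='networkidle2')
        await page.addScriptTag(path=READABILITY_JS)
        # Readability mutates the DOM, so run it on a clone of the document.
        article = await page.evaluate(
            '() => new Readability(document.cloneNode(true)).parse()')
    finally:
        await browser.close()

    # article is {'title': ..., 'byline': ..., 'content': <article HTML>, ...}
    return web.json_response(article)


app = web.Application()
app.add_routes(routes)

if __name__ == '__main__':
    web.run_app(app, port=8080)  # e.g. curl 'localhost:8080/extract?url=...'
</syntaxhighlight>

A real version would also need to accept raw HTML input, reuse a browser instance across requests, and get packaged into the Docker image mentioned above.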


==Write a spec for a new generation of our API==


'''Problem Statement:'''
Create a specification for a new version of our API (https://github.com/berkmancenter/mediacloud/tree/release/doc/api_2_0_spec). Our existing API (implemented in Perl) is inconsistent among its different major parts and is goofily un-REST-ish in several places. We would like to reimplement it in Python and use a modern framework for API specification (OpenAPI), implementation, and testing.


===Development Tasks===
*With the help of the team, identify which API calls can be renamed to more sensible names, extended or deprecated
*With the help of the team, rewrite API call descriptions to make them more comprehensible
*Rewrite the API spec using a chosen tool (e.g. OpenAPI), following RESTful best practices
*Set up an API demo (e.g. using Swagger UI)
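
As one possible direction for the implementation half of this task (an assumption, not the agreed design), here is a tiny sketch of a single made-up endpoint written with FastAPI, which generates the OpenAPI document and a Swagger UI demo automatically from the type annotations:

<syntaxhighlight lang="python">
# Minimal sketch, not the real API: one hypothetical endpoint whose OpenAPI
# spec and Swagger UI demo are generated by FastAPI from the annotations.
from typing import List, Optional

from fastapi import FastAPI       # pip install fastapi uvicorn
from pydantic import BaseModel

app = FastAPI(title='Media Cloud API', version='3.0-draft')


class Story(BaseModel):
    # Illustrative fields only; the real schema is what this task defines.
    stories_id: int
    title: str
    url: str
    language: Optional[str] = None


@app.get('/stories', response_model=List[Story], tags=['stories'])
async def list_stories(q: Optional[str] = None, limit: int = 20) -> List[Story]:
    """List stories, optionally filtered by a query string."""
    # Placeholder data; a real implementation would query the database.
    return [Story(stories_id=1, title='Example story',
                  url='https://example.com/story', language='en')]

# Run with `uvicorn this_module:app`; the OpenAPI JSON is then served at
# /openapi.json and the interactive Swagger UI demo at /docs.
</syntaxhighlight>

Whether the spec is hand-written in OpenAPI YAML or generated from code like this is exactly the kind of decision the proposal should argue for.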


==Rewrite Ultimate Sitemap Parser to yield results instead of returning them==
'''Problem Statement:'''
Ultimate Sitemap Parser (https://github.com/berkmancenter/mediacloud-ultimate_sitemap_parser) is our Python module that we use to fetch trees of website sitemaps (https://www.sitemaps.org/). The current implementation fetches all of the sitemap links and returns them to the caller in a single easy-to-use object. However, it turns out that some websites have *massive* sitemap trees! In those cases, the sitemap parser uses up lots of RAM, and the client is forced to wait a long time for the fetching + parsing results. For those reasons, we’d like the sitemap parser to “yield” links found in a sitemap instead of “returning” them, while also maintaining a nice, comprehensible interface: https://github.com/berkmancenter/mediacloud-ultimate_sitemap_parser/issues/2


===Development Tasks===
* Rewrite the sitemap parser to yield found sitemap links instead of returning them, to conserve memory and make results usable sooner
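
The gist of the change, illustrated with stand-in helpers rather than the module's real internals (the actual interface is up for discussion in the linked issue):

<syntaxhighlight lang="python">
# Minimal sketch of "yield, don't return", using stand-in helpers instead of
# the module's real fetching/parsing code.
from typing import Iterator, List


def _stub_sub_sitemaps(homepage_url: str) -> List[str]:
    """Stand-in for sitemap discovery (robots.txt, sitemap indexes, ...)."""
    return [f'{homepage_url}/sitemap-{i}.xml' for i in range(3)]


def _stub_parse(sitemap_url: str) -> List[str]:
    """Stand-in for fetching and parsing one sub-sitemap."""
    return [f'{sitemap_url}#story-{i}' for i in range(2)]


def sitemap_links_eager(homepage_url: str) -> List[str]:
    """Current style: build the whole result in memory, then return it."""
    links = []
    for sub_sitemap in _stub_sub_sitemaps(homepage_url):
        links.extend(_stub_parse(sub_sitemap))
    return links  # one huge list for huge sitemap trees


def sitemap_links_lazy(homepage_url: str) -> Iterator[str]:
    """Proposed style: yield each link as soon as it has been parsed."""
    for sub_sitemap in _stub_sub_sitemaps(homepage_url):
        yield from _stub_parse(sub_sitemap)  # caller consumes immediately


if __name__ == '__main__':
    # Memory use stays flat and the first links arrive right away.
    for link in sitemap_links_lazy('https://example.com'):
        print(link)
</syntaxhighlight>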


==Make Ultimate Sitemap Parser use asyncio==
'''Problem Statement:'''
Ultimate Sitemap Parser (https://github.com/berkmancenter/mediacloud-ultimate_sitemap_parser) is our Python module that we use to fetch trees of website sitemaps (https://www.sitemaps.org/). In production, fetching and parsing XML sitemaps is mostly a CPU-intensive operation, as most of the time is spent gunzipping sitemaps, parsing XML, and creating objects out of them, but my guess is that the sitemap parser could be made 10-20% faster by doing the I/O (namely the fetching part) asynchronously.


===Development Tasks===
*Rewrite sitemap parser to fetch sitemaps asynchronously
*Find other ways where I/O could be made asynchronous
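
A rough sketch of what asynchronous fetching could look like with asyncio and aiohttp, keeping the CPU-heavy gunzip/parse steps synchronous; the URLs and the surrounding structure are placeholders, not the module's real code:

<syntaxhighlight lang="python">
# Sketch only: fetch several sub-sitemaps concurrently, then hand the bodies
# to the existing synchronous XML parsing code.
import asyncio
import gzip
from typing import Dict, List

import aiohttp  # pip install aiohttp


async def fetch_sitemap(session: aiohttp.ClientSession, url: str) -> bytes:
    """Fetch one (possibly gzipped) sitemap; the I/O overlaps across URLs."""
    async with session.get(url) as response:
        body = await response.read()
    if url.endswith('.gz'):
        body = gzip.decompress(body)  # CPU-bound part stays synchronous
    return body


async def fetch_all(urls: List[str]) -> Dict[str, bytes]:
    async with aiohttp.ClientSession() as session:
        bodies = await asyncio.gather(*(fetch_sitemap(session, u) for u in urls))
    return dict(zip(urls, bodies))


if __name__ == '__main__':
    sitemap_urls = ['https://example.com/sitemap1.xml',       # placeholder URLs
                    'https://example.com/sitemap2.xml.gz']
    results = asyncio.run(fetch_all(sitemap_urls))
    # XML parsing of `results` would then proceed exactly as it does today.
</syntaxhighlight>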


==Detect sitemap if it’s not linked from robots.txt in Ultimate Sitemap Parser==
'''Problem Statement:'''
Ultimate Sitemap Parser (https://github.com/berkmancenter/mediacloud-ultimate_sitemap_parser) is our Python module that we use to fetch trees of website sitemaps (https://www.sitemaps.org/). Most sitemaps are linked from a website's robots.txt, but some are not. We would like to try common sitemap paths (e.g. /sitemap.xml[.gz]) on every site nonetheless: https://github.com/berkmancenter/mediacloud-ultimate_sitemap_parser/issues/8


=== Development Tasks ===
* Update the module in such a way that it tries common sitemap locations independently from robots.txt
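
One possible shape for that fallback, sketched with the requests library; the list of candidate paths and the HEAD-request strategy are assumptions to be refined:

<syntaxhighlight lang="python">
# Sketch of probing common sitemap locations that robots.txt never mentioned.
from typing import List
from urllib.parse import urljoin

import requests  # pip install requests

COMMON_SITEMAP_PATHS = [
    '/sitemap.xml', '/sitemap.xml.gz',
    '/sitemap_index.xml', '/sitemap_index.xml.gz',
    '/sitemap_news.xml',
]


def guess_unlisted_sitemaps(homepage_url: str) -> List[str]:
    """Return candidate sitemap URLs that respond with HTTP 200."""
    found = []
    for path in COMMON_SITEMAP_PATHS:
        candidate = urljoin(homepage_url, path)
        try:
            response = requests.head(candidate, allow_redirects=True, timeout=10)
        except requests.RequestException:
            continue
        if response.status_code == 200:
            found.append(candidate)
    return found


# guess_unlisted_sitemaps('https://example.com/')
# -> e.g. ['https://example.com/sitemap.xml']
</syntaxhighlight>

A real implementation would also want to sanity-check each response (content type, whether it actually parses as a sitemap) to avoid soft-404 false positives.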


==Add RSS / Atom sitemap support to our Ultimate Sitemap Parser==
'''Problem Statement:'''
Ultimate Sitemap Parser (https://github.com/berkmancenter/mediacloud-ultimate_sitemap_parser) is our Python module that we use to fetch trees of website sitemaps (https://www.sitemaps.org/). Most of those sitemaps are implemented using the Sitemap XML protocol, but a small number are published as RSS / Atom feeds, and we’d like to support those too: https://github.com/berkmancenter/mediacloud-ultimate_sitemap_parser/issues/3


=== Development Tasks ===
*Add RSS / Atom support to Ultimate Sitemap Parser
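
For illustration (the element names below cover only the common RSS 2.0 and Atom cases, and edge cases like namespaced RSS or paged feeds are ignored), extracting story URLs from a feed boils down to something like this:

<syntaxhighlight lang="python">
# Sketch: treat an RSS 2.0 or Atom feed as a flat list of story URLs, using
# only the standard library.
from typing import List
from xml.etree import ElementTree

ATOM_NS = '{http://www.w3.org/2005/Atom}'


def links_from_feed(feed_xml: str) -> List[str]:
    """Extract story URLs from an RSS 2.0 or Atom document."""
    root = ElementTree.fromstring(feed_xml)
    links = []

    # RSS 2.0: <rss><channel><item><link>URL</link></item>...
    for item in root.findall('./channel/item'):
        link = item.findtext('link')
        if link:
            links.append(link.strip())

    # Atom: <feed><entry><link rel="alternate" href="URL"/></entry>...
    for entry in root.findall(f'./{ATOM_NS}entry'):
        for link in entry.findall(f'{ATOM_NS}link'):
            href = link.get('href')
            if href and link.get('rel', 'alternate') == 'alternate':
                links.append(href.strip())

    return links
</syntaxhighlight>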


==Build a tool to do some cool visualizations==
'''Problem Statement:''' Since 2008, we have collected more than a half billion news articles that we have post-processed and indexed. We know quite a lot about them -- which news articles were the most linked to from other similar articles, the most and least popular / influential articles (based on shares on Facebook, tweet count, or clicks on an article's Bit.ly shortened link), specific language and terms used to describe the subject matter in each of the articles, etc., and there's a lot of potential to learn much more. Can you use your design and coding skills to help us out in visualising some of this data, e.g. create a cool network map visualization tool?


===Development Tasks===
*Build any visualization tool based on our extensive data and tool set:
**Figure out what you'd like to visualise and how you are going to do it
**Use Gephi, a tool of your choice, or create your very own tool to implement your visualisation
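
As a taste of what is possible (the data below is fabricated, and Gephi is only one option), a handful of lines of networkx are enough to turn a "who links to whom" table into a GEXF file that Gephi can lay out and style:

<syntaxhighlight lang="python">
# Sketch with made-up numbers: build a small link graph of media sources and
# export it as GEXF for Gephi.
import networkx as nx  # pip install networkx

# Hypothetical (source, target, link_count) tuples; real counts would come
# from the Media Cloud API or database.
edges = [
    ('nytimes.com', 'washingtonpost.com', 42),
    ('washingtonpost.com', 'nytimes.com', 37),
    ('someblog.example', 'nytimes.com', 5),
]

graph = nx.DiGraph()
for source, target, weight in edges:
    graph.add_edge(source, target, weight=weight)

# A simple node attribute to size nodes by: how much each source is linked to.
for node in graph.nodes:
    graph.nodes[node]['inlinks'] = graph.in_degree(node, weight='weight')

nx.write_gexf(graph, 'media_link_graph.gexf')  # open this file in Gephi
</syntaxhighlight>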


==Create PostgreSQL-based job queue==
'''Problem Statement:''' In the more than eight (or is it nine by now?) years that we've been running Media Cloud, we have tried multiple job queue tools (e.g. Gearman) for dividing and conquering our workload. Unfortunately, all of them (including the current one -- go look into the codebase to figure out which one it is now) have left us deeply unhappy for one reason or another. If there's one tool which hasn't let us down, it's PostgreSQL. So we'd like to try running our job queue on Postgres as well. Can you implement it for us?


===Development Tasks===
*Write a spec, complete with code samples, on how to implement the following job queue:
**Preferably programming language-agnostic, i.e. should run as a bunch of PL/pgSQL functions.
***Maybe that's a bad idea, I don't know, you tell us.
*Features:
**Add jobs with names and JSON arguments
**Cancel jobs by their ID
**Track a job's progress (and log?) by its ID
**Get job ID using its JSON parameters
**Merge jobs with identical JSON arguments into a single job
**See job stats per task, i.e. how many jobs are queued for every task
**Retry failed jobs
**Report job failure, complete with error messages
**Proper locking (for inspiration, see https://github.com/chanks/que)
**Doesn't catch fire with tens of millions of queued jobs
*(Bonus points) Actually implement the queue! If you don't get to doing this over the summer, it's fine, we would be happy with a proven spec.
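
To make the locking point above concrete, here is a very small sketch of the "claim one job" step built on PostgreSQL's FOR UPDATE SKIP LOCKED (9.5+) via psycopg2. The table layout and column names are assumptions for illustration; defining the real ones is exactly what the spec is for.

<syntaxhighlight lang="python">
# Sketch only: a bare-bones queue table plus an atomic "claim one job" step.
import psycopg2  # pip install psycopg2-binary

SCHEMA = """
CREATE TABLE IF NOT EXISTS job_queue (
    job_id      BIGSERIAL PRIMARY KEY,
    task_name   TEXT        NOT NULL,
    args        JSONB       NOT NULL,
    state       TEXT        NOT NULL DEFAULT 'queued',  -- queued/running/done/failed
    error       TEXT        NULL,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
"""

CLAIM_ONE = """
UPDATE job_queue
   SET state = 'running'
 WHERE job_id = (
        SELECT job_id FROM job_queue
         WHERE state = 'queued' AND task_name = %s
         ORDER BY job_id
         LIMIT 1
         FOR UPDATE SKIP LOCKED
       )
RETURNING job_id, args;
"""


def claim_next_job(conn, task_name: str):
    """Atomically claim one queued job, or return None if there is none."""
    with conn, conn.cursor() as cursor:   # the `with conn` commits the claim
        cursor.execute(CLAIM_ONE, (task_name,))
        return cursor.fetchone()          # (job_id, args) or None


if __name__ == '__main__':
    conn = psycopg2.connect('dbname=mediacloud')  # placeholder DSN
    with conn, conn.cursor() as cursor:
        cursor.execute(SCHEMA)
        cursor.execute('INSERT INTO job_queue (task_name, args) VALUES (%s, %s)',
                       ('extract_story', '{"stories_id": 1}'))
    print(claim_next_job(conn, 'extract_story'))
</syntaxhighlight>

SKIP LOCKED is what lets many workers claim jobs concurrently without stepping on each other, which is the same trick que (linked above) relies on.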
 
 
 
==Implement a method to detect subtopics of a topic==
'''Problem Statement:''' As described elsewhere, a "topic" is a subject discussed by the media that we are researching. Almost every big topic contains subtopics, e.g. the matters of immigration, racism, email server security and a plethora of other subjects were all discussed during the last US election. We would like to investigate ways to automatically detect those subtopics, possibly using the [https://en.wikipedia.org/wiki/Louvain_Modularity Louvain method].
 
===Development Tasks===
Develop a proof of concept (un)supervised ML tool for detecting subtopics of a chosen subject ("topic").
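
For a sense of what the Louvain approach involves (the graph below is fabricated; a real one would be built from story links or term co-occurrence within a topic), the python-louvain package makes the clustering step itself nearly a one-liner:

<syntaxhighlight lang="python">
# Toy sketch: Louvain community detection on a fabricated co-occurrence graph.
# pip install networkx python-louvain
import community as community_louvain  # the python-louvain package
import networkx as nx

# Nodes are terms within a topic; edge weights say how often they co-occur.
graph = nx.Graph()
graph.add_weighted_edges_from([
    ('immigration', 'border', 10), ('immigration', 'visa', 8), ('border', 'visa', 6),
    ('email', 'server', 9), ('email', 'security', 7), ('server', 'security', 5),
    ('immigration', 'email', 1),   # weak tie between the two clusters
])

# Louvain modularity maximization: maps every node to a community id.
partition = community_louvain.best_partition(graph, weight='weight')

subtopics = {}
for node, community_id in partition.items():
    subtopics.setdefault(community_id, []).append(node)

for community_id, members in sorted(subtopics.items()):
    print(f'subtopic {community_id}: {sorted(members)}')
# Expected grouping: {border, immigration, visa} vs. {email, security, server}.
</syntaxhighlight>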
 
 
 
==Do your own freehand project==
'''Problem Statement:''' If you had more than half a billion (!) news articles from all around the world stored in a single place, extracted from HTML into text, split into sentences and words, and made searchable, what would you do? Propose something we didn't think of, and we will surely consider it!
 
===Development Tasks===
Left as an exercise to the student.
 
 
 
=Skill Requirements for Potential Candidates=
*Working knowledge of Perl or Python
*Familiarity with relational databases, preferably PostgreSQL
*Some pedantry
*Willingness to propose, debate and object to ideas
*Keen to work with us on writing your GSoC project proposal, as opposed to just submitting a long shot without any feedback and hoping for the best
*Demonstrated effort to learn what Media Cloud is all about; some ideas:
**Make a pull request to our main code repository (https://github.com/berkmancenter/mediacloud),
**Craft us an email with a smart question or two,
**Try out our tools (see https://mediacloud.org/),
**Run Media Cloud yourself and collect some news articles (see https://github.com/berkmancenter/mediacloud/blob/master/doc/vagrant.markdown),
**Sign up and check out our API (see https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/api_2_0_spec.md, https://github.com/berkmancenter/mediacloud/blob/master/doc/api_2_0_spec/topics_api_2_0_spec.md, and the API client at https://pypi.python.org/pypi/mediacloud/).
