The goal of this notebook is to extract n-grams (bigrams, to start) from the stories of hyperpartisan sites. These hyperpartisan sites were identified by Craig Silverman of BuzzFeed (646 sources) and are available in this collection.
The right-leaning end (471 sources) is available in this collection, while the left-leaning end (175 sources) is available in this collection.
We want to extract the n-grams that are most significant for these hyperpartisan sites relative to a neutral baseline. The neutral baseline we'll use here is all stories from the "green" middle zone of the election study.
I have all the stories from each of these collections in an Elasticsearch cluster on Media Cloud's infrastructure. I have already done some filtering by excluding any bigram that contains one of the following tokens:
'twitter', 'tweet', 'facebook', 'reddit', 'youtub',
'follow', 'join', 'comment', 'content', 'like', 'relat', 'transcript', 'share',
'subscrib', 'categor', 'stori', 'articl', 'sourc', 'bookmark', 'imag',
'post', 'trend', 'pm', 'by', '2015', '2016', '2017', 'http', '\.com'
I picked those by looking at preliminary results and pulling out boilerplate that I gauged as uncontroversial.
Outline
Quick Summary of Results
This method picks up a lot of boilerplate that the Media Cloud crawler left in the article text. Once we filter some of it out (which I think has to be done manually, though the choices are easy to justify), the results look about like what we'd expect.
The results for the left are available here and the right are here.
doc_count is the number of docs in the sampled foreground set containing the bigram. bg_count is the number of docs in the (unsampled) background set containing the bigram. score is the significance score computed with the chi-squared heuristic. If is_filtered_out is True, the phrase should probably be filtered out; False does not necessarily mean it should be kept. I've only filtered manually until the top 50 non-filtered rows look decent.
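To make those columns concrete, here's a minimal sketch of reading one of the exported CSVs and keeping only the rows that survived filtering. The column layout matches the export cells at the end of this notebook; the three bigram rows below are made up for illustration.

```python
import io
import pandas as pd

# Toy stand-in for hyperpartisan_bigrams_right.csv; the bigram is the index.
csv_text = """bigram,doc_count,bg_count,score,is_filtered_out
some bigram,400,9000,12.5,True
anoth bigram,150,200,98.1,False
third bigram,300,40,250.0,True
"""
df = pd.read_csv(io.StringIO(csv_text), index_col='bigram')

# Keep only rows that survived the manual + bg_count filtering,
# highest significance score first.
kept = df[~df['is_filtered_out']].sort_values('score', ascending=False)
```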
%matplotlib inline
import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, A, Q
from find_ngrams import make_search
client = Elasticsearch(timeout=600)
INDEX = 'hyperpartisan_ngrams'
Let's see how many stories we have from each collection.
s = Search(using=client, index=INDEX)
s.aggs.metric('ideologies', A('terms', field='ideology'))
resp = s.execute()
_ = pd.Series({a['key']: a['doc_count'] for a in resp.aggs.ideologies.buckets}).plot.bar()
Let's see how the top terms vary with sample size. Once there's some stability, we know our sample size is large enough. I'll try 5k, 20k, 50k, and 100k, and we'll see from there.
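One way to quantify that stability is the overlap between the top-N bigrams at consecutive sample sizes. This is a sketch (the helper name is mine); it works on any of the score-sorted result frames built below, e.g. `top_n_overlap(right_5k, right_20k)`. Once the overlap is close to 1.0 across sizes, the sample is big enough.

```python
import pandas as pd

def top_n_overlap(a, b, n=50):
    """Jaccard overlap between the top-n bigrams (by score) of two result frames."""
    top_a = set(a.sort_values('score', ascending=False).head(n).index)
    top_b = set(b.sort_values('score', ascending=False).head(n).index)
    return len(top_a & top_b) / len(top_a | top_b)
```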
search_config = {
'shard_size': 5000,
'max_docs_per_value': 200,
'size': 500,
'min_doc_count': 100,
'excluded_tokens': [
'twitter', 'tweet', 'facebook', 'reddit', 'youtub',
'follow', 'join', 'comment', 'content', 'like', 'relat', 'transcript', 'share',
'subscrib', 'categor', 'stori', 'articl', 'sourc', 'bookmark', 'imag',
'post', 'trend', 'pm', 'by', '2015', '2016', '2017', 'http', '\.com']
}
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
right_5k = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
.drop('key', axis='columns').sort_values('score', ascending=False)
right_5k.head(20)
search_config = {
'shard_size': 20000,
'max_docs_per_value': 800,
'size': 500,
'min_doc_count': 100,
'excluded_tokens': [
'twitter', 'tweet', 'facebook', 'reddit', 'youtub',
'follow', 'join', 'comment', 'content', 'like', 'relat', 'transcript', 'share',
'subscrib', 'categor', 'stori', 'articl', 'sourc', 'bookmark', 'imag',
'post', 'trend', 'pm', 'by', '2015', '2016', '2017', 'http', '\.com']
}
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
right_20k = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
.drop('key', axis='columns').sort_values('score', ascending=False)
right_20k.head(20)
search_config = {
'shard_size': 50000,
'max_docs_per_value': 1000,
'size': 500,
'min_doc_count': 100,
'excluded_tokens': [
'twitter', 'tweet', 'facebook', 'reddit', 'youtub',
'follow', 'join', 'comment', 'content', 'like', 'relat', 'transcript', 'share',
'subscrib', 'categor', 'stori', 'articl', 'sourc', 'bookmark', 'imag',
'post', 'trend', 'pm', 'by', '2015', '2016', '2017', 'http', '\.com']
}
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
right_50k = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
.drop('key', axis='columns').sort_values('score', ascending=False)
right_50k.head(20)
search_config = {
'shard_size': 100000,
'max_docs_per_value': 2000,
'size': 500,
'min_doc_count': 100,
'excluded_tokens': [
'twitter', 'tweet', 'facebook', 'reddit', 'youtub',
'follow', 'join', 'comment', 'content', 'like', 'relat', 'transcript', 'share',
'subscrib', 'categor', 'stori', 'articl', 'sourc', 'bookmark', 'imag',
'post', 'trend', 'pm', 'by', '2015', '2016', '2017', 'http', '\.com']
}
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
right_100k = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
.drop('key', axis='columns').sort_values('score', ascending=False)
right_100k.head(20)
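The four cells above repeat the same DataFrame construction. As a sketch, that post-processing factors into one helper (the name is mine, and it assumes the buckets are passed as plain dicts, i.e. `buckets_to_frame([p.to_dict() for p in phrases])`):

```python
import pandas as pd

def buckets_to_frame(buckets):
    """Turn significant-terms buckets (as plain dicts with a 'key' field)
    into a DataFrame indexed by bigram, sorted by score, descending."""
    rows = {b['key']: dict(b) for b in buckets}
    return (pd.DataFrame.from_dict(rows, orient='index')
              .drop('key', axis='columns')
              .sort_values('score', ascending=False))
```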
Observations
max_docs_per_value is the maximum number of docs to sample per media outlet. I'm capping that so we don't end up oversampling the larger media sources.
There are a number of ways to calculate significant terms across two sets of text. Here's what Elasticsearch supports:
JLH was created by Mark Harwood at Elasticsearch as a decent balance between precision and recall. Discussed here.
Chi-squared is what Gentzkow and Shapiro, and Martin and Yurukoglu, use.
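For intuition, the chi-squared score for a single bigram comes down to the standard statistic over a 2x2 contingency table: (foreground / background set) by (contains / doesn't contain the bigram). This is a sketch of that standard statistic in plain Python; Elasticsearch's chi_square heuristic may differ in details (e.g. the include_negatives and background_is_superset options).

```python
def chi_square_2x2(fg_hits, fg_size, bg_hits, bg_size):
    """Standard chi-squared statistic for a 2x2 contingency table.

    Rows: foreground set vs. background set.
    Columns: docs containing the bigram vs. docs not containing it.
    """
    a, b = fg_hits, fg_size - fg_hits  # foreground row
    c, d = bg_hits, bg_size - bg_hits  # background row
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```

A bigram that appears at the same rate in both sets scores 0; the more lopsided its appearance toward the foreground, the higher the score.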
search_config = {
'shard_size': 50000,
'max_docs_per_value': 1000,
'size': 500,
'min_doc_count': 100,
'excluded_tokens': [
'twitter', 'tweet', 'facebook', 'reddit', 'youtub',
'follow', 'join', 'comment', 'content', 'like', 'relat', 'transcript', 'share',
'subscrib', 'categor', 'stori', 'articl', 'sourc', 'bookmark', 'imag',
'post', 'trend', 'pm', 'by', '2015', '2016', '2017', 'http', '\.com']
}
search_config['mutual_information'] = {"include_negatives": False, "background_is_superset": False}
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
right_mi = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
.drop('key', axis='columns').sort_values('score', ascending=False)
right_mi.head(20)
Observations
del search_config['mutual_information']
search_config['chi_square'] = {"include_negatives": False, "background_is_superset": False}
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
right_chi = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
.drop('key', axis='columns').sort_values('score', ascending=False)
right_chi.head(20)
Observations
We can do some of the filtering automatically via bg_count, but most of it looks like it has to be manual.
if 'chi_square' in search_config: del search_config['chi_square']
search_config['gnd'] = {"background_is_superset": False}
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
right_gnd = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
.drop('key', axis='columns').sort_values('score', ascending=False)
right_gnd.head(20)
Observations
The doc_counts look lower. chi_square had things like "hillari clinton" and "fox new" that don't end up here.
if 'gnd' in search_config: del search_config['gnd']
search_config['percentage'] = {}
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
right_pct = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
.drop('key', axis='columns').sort_values('score', ascending=False)
right_pct.head(20)
Observations
I like the look of chi_square the most right now. A bunch of people use it, and the results look fairly decent. gnd looks a little better to me, but no one uses it, so that feels iffy.
Let's do some work filtering the phrases.
The easiest place to start is looking at the bg_count and filtering out ones that are too low (likely related to a single source, like "newsmax right") or too high (too common to make for a meaningful search, like "presid trump").
_ = right_chi['bg_count'].plot.hist(logy=True)
right_chi['bg_count'].describe()
right_chi.sort_values('bg_count').tail(50)
right_chi.sort_values('bg_count')[120:140]
Observations
I'll test out applying those filters below. I'll also manually filter out bigrams that I think are artifacts of the process rather than true signifiers of differing language. After poking through the data, it looks like those artifacts fall into three categories: media source names, author names, and other boilerplate.
I've pulled out bigrams from those three categories until the top 50 bigrams on the left and right looked fairly clean. I have not filtered beyond the top 50.
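The same bg_count band filter gets applied three times below (right, left, and the larger right sample). As a sketch, it factors into one helper whose defaults mirror the constants defined next:

```python
import pandas as pd

def bg_count_band_filter(df, exclude, min_bg_doc_count=30, max_stddev_from_mean=3):
    """Boolean mask: keep bigrams whose bg_count is neither suspiciously rare
    (likely single-source artifacts) nor far above the mean (too common to be
    a meaningful signal), and which aren't on the manual exclusion list."""
    upper = df['bg_count'].mean() + df['bg_count'].std() * max_stddev_from_mean
    return (df['bg_count'] >= min_bg_doc_count) & \
           (df['bg_count'] <= upper) & \
           (~df.index.isin(exclude))
```

With it, each of the three filter cells reduces to something like `right_chi[bg_count_band_filter(right_chi, right_exclude)].head(50)`.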
MIN_BG_DOC_COUNT = 30
MAX_STDDEV_FROM_MEAN = 3
media_source_names = [
'daili caller', 'son liberti', 'free beacon', 'liberti media', 'washington examin', 'gatewai pundit',
'fox new', 'washington free', 'conserv tribun', 'freedom daili', 'right scoop', 'daili wire', 'washington time',
'caller new', 'western journal', 'nation review', 'daili signal', 'video video.foxnews.com', 'caller report',
'breitbart report', 'beacon report', 'breitbart new', 'via breitbart', 'accord breitbart', 'told breitbart',
'conserv fire', 'breitbart texa', 'american new', 'reviv america'
]
author_names = [
'jim hoft', 'steven hayward', 'angri patriot', 'video jim', 'jazz shaw', 'ed morrissei', 'jack davi'
]
boilerplate = [
'h t', 'keep read', 'continu below', 'also washington', 'america sorri', 'get bumper', 'gmt 4'
]
right_exclude = media_source_names + author_names + boilerplate
right_filter = (right_chi['bg_count'] >= MIN_BG_DOC_COUNT) & \
(right_chi['bg_count'] <= right_chi['bg_count'].mean() + (right_chi['bg_count'].std() * MAX_STDDEV_FROM_MEAN)) & \
(~right_chi.index.isin(right_exclude))
right_chi[right_filter].head(50)
search_config['chi_square'] = {"include_negatives": False, "background_is_superset": False}
s = make_search("ideology:left", "ideology:center OR ideology:right", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
left_chi = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
.drop('key', axis='columns').sort_values('score', ascending=False)
media_source_names = [
'york time', 'juan cole', 'mother jone', 'media matter', 'democraci now', 'told thinkprogress',
'everydai femin', 'tomdispatch regular', 'mintpress new', 'common dream', 'wing watch'
]
author_names = [
'ami goodman', 'kevin drum', 'ed kilgor', 'thank b'
]
boilerplate = [
'save favorit', 'load player', 'click reus', 'reus option', 'file to', 'ad polici', 'via flickr'
]
left_exclude = media_source_names + author_names + boilerplate
left_filter = (left_chi['bg_count'] >= MIN_BG_DOC_COUNT) & \
(left_chi['bg_count'] <= left_chi['bg_count'].mean() + (left_chi['bg_count'].std() * MAX_STDDEV_FROM_MEAN)) & \
(~left_chi.index.isin(left_exclude))
left_chi[left_filter].head(50)
Let's take one pass at these filtered lists with a bigger sample size to make sure nothing changes.
search_config.update({
'shard_size': 100000,
'max_docs_per_value': 3000,
'chi_square': {"include_negatives": False, "background_is_superset": False}
})
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
big_right_chi = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
.drop('key', axis='columns').sort_values('score', ascending=False)
big_right_filter = (big_right_chi['bg_count'] >= MIN_BG_DOC_COUNT) & \
(big_right_chi['bg_count'] <= big_right_chi['bg_count'].mean() + (big_right_chi['bg_count'].std() * MAX_STDDEV_FROM_MEAN)) & \
(~big_right_chi.index.isin(right_exclude))
big_right_chi[big_right_filter].head(50)
Observations
right_chi.assign(is_filtered_out=~right_filter).to_csv('hyperpartisan_bigrams_right.csv')
left_chi.assign(is_filtered_out=~left_filter).to_csv('hyperpartisan_bigrams_left.csv')