Hyperpartisan n-grams

The goal here is to extract n-grams (bigrams at first) from the stories of hyperpartisan sites. These hyperpartisan sites were identified by Craig Silverman of BuzzFeed (646 sources) and are available in this collection.

The right-leaning portion (471 sources) is available in this collection, while the left-leaning portion (175 sources) is available in this collection.

We want to extract the n-grams that are most significant for these hyperpartisan sites relative to a neutral baseline. The neutral baseline we'll be using in this case is all stories from the "green" middle zone of the election study.

I have all the stories from each of these collections in an Elasticsearch cluster on Media Cloud's infrastructure. I have already done some filtering by excluding bigrams that include one of the following tokens:

'twitter', 'tweet', 'facebook', 'reddit', 'youtub',
'follow', 'join', 'comment', 'content', 'like', 'relat', 'transcript', 'share',
'subscrib', 'categor', 'stori', 'articl', 'sourc', 'bookmark', 'imag',
'post', 'trend', 'pm', 'by', '2015', '2016', '2017', 'http', '\.com'

I picked those by looking at preliminary results and pulling out boilerplate that I gauged as uncontroversial.
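The exclusion rule behaves like a per-token match: a bigram is dropped if either of its stemmed tokens matches one of the entries. The actual matching happens inside Elasticsearch via make_search, so the following is just a rough Python illustration of the rule (note most entries are literal stems while '\.com' is a regex fragment, so every entry is treated as a pattern here):

```python
import re

# Same token list passed to make_search below; treat each entry as a pattern
# since '\.com' is a regex fragment while the rest are literal stems.
EXCLUDED_TOKENS = [
    'twitter', 'tweet', 'facebook', 'reddit', 'youtub',
    'follow', 'join', 'comment', 'content', 'like', 'relat', 'transcript', 'share',
    'subscrib', 'categor', 'stori', 'articl', 'sourc', 'bookmark', 'imag',
    'post', 'trend', 'pm', 'by', '2015', '2016', '2017', 'http', r'\.com',
]

def is_excluded(bigram):
    """Drop a bigram if either of its tokens fully matches an excluded pattern."""
    return any(
        re.fullmatch(pattern, token)
        for token in bigram.split()
        for pattern in EXCLUDED_TOKENS
    )
```

For example, is_excluded('follow us') is True while is_excluded('hillari clinton') is False.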

Outline

  1. Determine reasonable sample size
  2. Determine reasonable significance test (for selecting phrases)
  3. Determine additional filtering steps

Quick Summary of Results

This method picks up a lot of boilerplate that the Media Cloud crawler left in the article text. Once we filter some of it out (which I think has to be done manually, but the removals are easy to justify), the results look to be about what we'd expect.

The results for the left are available here and the right are here.

The doc_count is the number of docs in the sampled foreground set that include the bigram. bg_count is the number of docs in the (not sampled) background set that include it. score is the significance score computed using the chi-squared test. If is_filtered_out is True, the phrase should probably be filtered out, but False doesn't necessarily mean it should be kept: I've only filtered manually until the top 50 non-filtered phrases look decent.
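As a toy illustration of how to consume those columns, here are a few rows copied from the chi-squared table further down; the real files (hyperpartisan_bigrams_right.csv / hyperpartisan_bigrams_left.csv, written in the last cell) have the same shape:

```python
import pandas as pd

# A few rows copied from the chi-squared results below.
df = pd.DataFrame(
    {
        'doc_count': [3728, 1525, 23228],
        'bg_count': [4133, 6, 175396],
        'score': [61472.18, 56087.28, 64832.57],
        'is_filtered_out': [False, True, False],
    },
    index=['illeg alien', 'newsmax right', 'presid trump'],
)

# True means "probably drop it"; False only means the phrase survived the
# manual pass over the top 50, not that it's guaranteed clean.
kept = df[~df['is_filtered_out']].sort_values('score', ascending=False)
```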

In [1]:
%matplotlib inline
In [2]:
import pandas as pd

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, A, Q

from find_ngrams import make_search

client = Elasticsearch(timeout=600)
INDEX = 'hyperpartisan_ngrams'

Let's see how many stories we have from each collection.

In [3]:
s = Search(using=client, index=INDEX).aggs.metric('ideologies', A('terms', field='ideology'))
resp = s.execute()
_ = pd.Series({a['key']: a['doc_count'] for a in resp.aggs.ideologies.buckets}).plot.bar()

Determining Sample Size

Let's see how the top terms vary with sample size. Once there's some stability, we know our sample size is large enough. I'll try 5k, 20k, 50k, 100k, and then we'll see.
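Stability here is judged by eye, but it could also be quantified, e.g. as the Jaccard overlap of the top-k phrase sets between two runs. A sketch (not something I actually use below):

```python
def topk_jaccard(run_a, run_b, k=50):
    """Overlap of the top-k phrases from two runs at different sample sizes;
    a value near 1.0 suggests the smaller sample is already large enough."""
    top_a, top_b = set(run_a[:k]), set(run_b[:k])
    return len(top_a & top_b) / len(top_a | top_b)
```

Applied to the frames built below, something like topk_jaccard(right_50k.index.tolist(), right_100k.index.tolist()) would do.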

In [4]:
search_config = {
    'shard_size': 5000,
    'max_docs_per_value': 200,
    'size': 500,
    'min_doc_count': 100,
    'excluded_tokens': [
        'twitter', 'tweet', 'facebook', 'reddit', 'youtub',
        'follow', 'join', 'comment', 'content', 'like', 'relat', 'transcript', 'share',
        'subscrib', 'categor', 'stori', 'articl', 'sourc', 'bookmark', 'imag',
        'post', 'trend', 'pm', 'by', '2015', '2016', '2017', 'http', r'\.com']
}

s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
right_5k = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
    .drop('key', axis='columns').sort_values('score', ascending=False)
right_5k.head(20)
Out[4]:
doc_count score bg_count
sorri data 214 169.093866 4
male mood 124 113.547267 2
son liberti 331 19.251315 84
america sorri 212 16.182389 41
right scoop 163 11.885169 33
jim hoft 362 6.458857 299
reviv america 212 3.522507 188
conserv tribun 103 3.068342 51
liberti media 325 2.536163 612
video jim 104 2.311101 69
data far 215 2.118335 321
swamp drain 158 1.339373 274
continu below 373 1.232769 1647
gender male 130 1.177805 211
christian new 125 1.137486 202
daili caller 523 1.115839 3554
gatewai pundit 213 0.981292 677
top right 263 0.597955 1679
keep read 226 0.527133 1407
right new 267 0.492398 2093
In [5]:
search_config = {
    'shard_size': 20000,
    'max_docs_per_value': 800,
    'size': 500,
    'min_doc_count': 100,
    'excluded_tokens': [
        'twitter', 'tweet', 'facebook', 'reddit', 'youtub',
        'follow', 'join', 'comment', 'content', 'like', 'relat', 'transcript', 'share',
        'subscrib', 'categor', 'stori', 'articl', 'sourc', 'bookmark', 'imag',
        'post', 'trend', 'pm', 'by', '2015', '2016', '2017', 'http', r'\.com']
}
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
right_20k = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
    .drop('key', axis='columns').sort_values('score', ascending=False)
right_20k.head(20)
Out[5]:
doc_count score bg_count
mja uncategor 287 76.034326 0
id 25027 249 57.232419 0
us ditty_news_tick 249 57.232419 0
ditty_news_tick id 249 57.232419 0
tfpp writer 213 41.879301 0
snip origin 213 41.879301 0
hoft jul 211 41.096507 0
robert gehl 208 39.936164 1
hoft aug 201 37.293317 0
christoph age 184 31.251609 0
elder patriot 166 25.436077 0
sorri data 325 24.373120 4
male mood 223 22.950906 2
bfh uncategor 152 21.326448 0
patriotupd patriotupd 144 19.140559 0
read www.rushlimbaugh.com 130 15.599577 0
sticker click 180 14.952887 2
gina cassini 120 13.291855 0
neighborhood smyrnaman 118 12.852465 0
cassini top 117 12.635540 0
In [6]:
search_config = {
    'shard_size': 50000,
    'max_docs_per_value': 1000,
    'size': 500,
    'min_doc_count': 100,
    'excluded_tokens': [
        'twitter', 'tweet', 'facebook', 'reddit', 'youtub',
        'follow', 'join', 'comment', 'content', 'like', 'relat', 'transcript', 'share',
        'subscrib', 'categor', 'stori', 'articl', 'sourc', 'bookmark', 'imag',
        'post', 'trend', 'pm', 'by', '2015', '2016', '2017', 'http', r'\.com']
}
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
right_50k = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
    .drop('key', axis='columns').sort_values('score', ascending=False)
right_50k.head(20)
Out[6]:
doc_count score bg_count
mja uncategor 1013 151.561733 0
newsmax right 1525 57.243271 6
robert gehl 598 52.815936 1
tfpp writer 584 50.371843 0
bfh uncategor 567 47.481853 0
snip origin 549 44.514915 1
ditty_news_tick id 544 43.707751 0
us ditty_news_tick 544 43.707751 0
id 25027 544 43.707751 0
gile clashdaili 439 28.463253 0
elder patriot 274 11.087675 0
sticker click 387 11.058938 2
hoft aug 268 10.607376 0
v saxena 255 9.603212 0
clashdaili publish 250 9.230288 0
below conservativetribune.com 246 8.937266 0
christoph age 244 8.792527 0
sorri data 460 7.811522 4
hoft jul 227 7.609957 0
hoft sep 221 7.212961 0
In [7]:
search_config = {
    'shard_size': 100000,
    'max_docs_per_value': 2000,
    'size': 500,
    'min_doc_count': 100,
    'excluded_tokens': [
        'twitter', 'tweet', 'facebook', 'reddit', 'youtub',
        'follow', 'join', 'comment', 'content', 'like', 'relat', 'transcript', 'share',
        'subscrib', 'categor', 'stori', 'articl', 'sourc', 'bookmark', 'imag',
        'post', 'trend', 'pm', 'by', '2015', '2016', '2017', 'http', r'\.com']
}
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
right_100k = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
    .drop('key', axis='columns').sort_values('score', ascending=False)
right_100k.head(20)
Out[7]:
doc_count score bg_count
mja uncategor 2374 208.100830 0
newsmax right 4014 99.149520 6
robert gehl 1388 71.135154 1
bfh uncategor 1323 64.628518 0
us ditty_news_tick 1217 54.687002 0
ditty_news_tick id 1217 54.687002 0
id 25027 1217 54.687002 0
tfpp writer 1066 41.957986 0
gile clashdaili 1013 37.889420 1
snip origin 695 17.834382 1
clashdaili publish 584 12.592377 0
v saxena 547 11.047244 0
elder patriot 433 6.922194 0
dr tar 430 6.826601 0
tar uncategor 428 6.763241 0
hoft aug 416 6.389287 0
hoft jul 400 5.907224 0
joshua caplan 561 5.809438 2
christoph age 376 5.219578 0
below conservativetribune.com 371 5.081673 0

Observations

  • The scores seem significantly different between 50k and 100k, but the terms themselves do not. I'll probably stick with 50k for now and try larger when I want more confidence.
  • The terms themselves are pretty garbage. I've looked into how a few of them show up, and they're all boilerplate stuff.
  • The bg_count is very close to zero for all of the top scores (which are garbage). Is it safe to assume that highly partisan terms should occur at some baseline level in other media sources? I think so; at what level remains to be seen.
  • I could still try tweaking the max_docs_per_value, which is the maximum number of docs to sample per media outlet. I'm doing that so we don't end up oversampling the larger media sources.

Comparing Significance Tests

There are a number of ways to calculate significant terms across two sets of text. Here are what Elasticsearch supports:

  • JLH score
  • Mutual information
  • Chi-squared
  • Google normalized distance
  • Percentage

JLH was created by Mark Harwood at Elasticsearch as a decent balance between precision and recall. Discussed here.

Chi-squared is the test used by Gentzkow and Shapiro, and by Martin and Yurukoglu.
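Elasticsearch's chi_square heuristic is the standard test statistic on the 2x2 table of (contains bigram vs. not) x (foreground vs. background). A back-of-the-envelope reimplementation (I haven't verified that it reproduces ES's internal score exactly, and it ignores the include_negatives handling):

```python
def chi_squared(fg_count, fg_total, bg_count, bg_total):
    """Pearson chi-squared for the 2x2 contingency table; with
    background_is_superset=False the two doc sets are disjoint."""
    a = fg_count              # foreground docs containing the bigram
    b = fg_total - fg_count   # foreground docs without it
    c = bg_count              # background docs containing it
    d = bg_total - bg_count   # background docs without it
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0
```

A bigram in 100 of 1,000 foreground docs but only 10 of 10,000 background docs scores 900, while one appearing at the same 10% rate in both sets scores 0.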

In [8]:
search_config = {
    'shard_size': 50000,
    'max_docs_per_value': 1000,
    'size': 500,
    'min_doc_count': 100,
    'excluded_tokens': [
        'twitter', 'tweet', 'facebook', 'reddit', 'youtub',
        'follow', 'join', 'comment', 'content', 'like', 'relat', 'transcript', 'share',
        'subscrib', 'categor', 'stori', 'articl', 'sourc', 'bookmark', 'imag',
        'post', 'trend', 'pm', 'by', '2015', '2016', '2017', 'http', '\.com']
}
In [9]:
search_config['mutual_information'] = {"include_negatives": False, "background_is_superset": False}
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
right_mi = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
    .drop('key', axis='columns').sort_values('score', ascending=False)
right_mi.head(20)
Out[9]:
doc_count score bg_count
hillari clinton 33242 0.002927 337165
presid trump 23228 0.002765 175396
donald trump 51943 0.002223 854060
daili caller 5875 0.002206 4434
h t 8230 0.001799 26104
fox new 15604 0.001777 121611
mainstream media 7247 0.001568 23359
obama administr 13171 0.001358 112595
unit state 34672 0.001314 589763
american peopl 10252 0.001297 70862
presid obama 14643 0.001258 147079
illeg alien 3728 0.001256 4133
white hous 25314 0.001156 389038
illeg immigr 7169 0.001154 36723
son liberti 2093 0.001076 150
free beacon 2511 0.001004 1430
washington examin 3169 0.000995 4343
jim hoft 2087 0.000972 469
liber media 3146 0.000944 4897
barack obama 17120 0.000941 235019

Observations

  • This is what the docs warned of. MI is biased towards high-frequency terms, and that's what we see.
  • It is fundamentally different from JLH. JLH would require an additional filtering step because a bunch of terms only show up in the foreground set, and those terms are mostly boilerplate.
In [10]:
del search_config['mutual_information']
search_config['chi_square'] = {"include_negatives": False, "background_is_superset": False}
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
right_chi = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
    .drop('key', axis='columns').sort_values('score', ascending=False)
right_chi.head(20)
Out[10]:
doc_count score bg_count
daili caller 5875 118756.307454 4434
son liberti 2093 71853.306729 150
presid trump 23228 64832.572793 175396
jim hoft 2087 62175.759342 469
illeg alien 3728 61472.176356 4133
h t 8230 61088.865823 26104
hillari clinton 33242 60308.581664 337165
free beacon 2511 57291.378080 1430
newsmax right 1525 56087.278559 6
mainstream media 7247 52954.184836 23359
liberti media 2068 49902.312375 1014
washington examin 3169 45804.123957 4343
gatewai pundit 1970 44581.911014 1145
keep read 2897 44197.434224 3649
liber media 3146 41723.497095 4897
fox new 15604 41381.573629 121611
washington free 1983 41030.996562 1421
continu below 2204 39016.501124 2155
donald trump 51943 37411.378476 854060
mja uncategor 1013 37409.175848 0

Observations

  • These look better than JLH (boilerplate) and MI (highly frequent across both sets).
  • There's a whole bunch of media source names that need to get filtered out. They could be filtered out a bit with bg_count, but most of it looks like it has to be manual.
  • There's still some boilerplate in here ("h t", "keep read", etc.).
In [11]:
if 'chi_square' in search_config: del search_config['chi_square']
search_config['gnd'] = {"background_is_superset": False}
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
right_gnd = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
    .drop('key', axis='columns').sort_values('score', ascending=False)
right_gnd.head(20)
Out[11]:
doc_count score bg_count
daili caller 5875 0.577159 4434
son liberti 2093 0.563917 150
jim hoft 2087 0.558610 469
newsmax right 1525 0.557633 6
free beacon 2511 0.553811 1430
illeg alien 3728 0.552810 4133
liberti media 2068 0.550455 1014
mja uncategor 1013 0.547490 0
gatewai pundit 1970 0.546700 1145
conserv tribun 1053 0.545430 90
h t 8231 0.544822 26104
washington free 1983 0.543470 1421
presid trump 17622 0.543120 105361
washington examin 3169 0.542416 4343
continu below 2048 0.542242 1647
keep read 2897 0.542004 3649
freedom daili 893 0.540390 109
mainstream media 7247 0.539388 23359
liber media 3146 0.538666 4897
right scoop 801 0.537987 97

Observations

  • These look largely the same as chi-squared.
  • doc_counts look lower. chi_square had things like "hillari clinton" and "fox new" that don't end up here.
  • Scores look a bit more interpretable as numbers.
In [12]:
if 'gnd' in search_config: del search_config['gnd']
search_config['percentage'] = {}
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
right_pct = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
    .drop('key', axis='columns').sort_values('score', ascending=False)
right_pct.head(20)
Out[12]:
doc_count score bg_count
newsmax right 1299 216.500000 6
hannah bleau 104 104.000000 1
sticker click 197 98.500000 2
doug gile 469 93.800000 5
male mood 159 79.500000 2
sorri data 225 56.250000 4
rush okai 165 55.000000 3
randi desoto 243 48.600000 5
paul mirengoff 538 41.384615 13
linkedin whatsapp 319 39.875000 8
gossip chatter 101 33.666667 3
a.f branco 195 32.500000 6
said snip 156 31.200000 5
t gatewai 143 28.600000 5
caller thank 114 28.500000 4
doug power 253 28.111111 9
b christoph 244 27.111111 9
right caller 134 26.800000 5
nonwhit invad 208 26.000000 8
rush oh 154 25.666667 6

Observations

  • These terms are garbage, which is to be expected because it's mostly terms that don't appear in the background docs.

More Filtering

I like the look of chi_square the most right now. A bunch of people use it, and the results look fairly decent. gnd looks a little better to me, but no one uses it, so that feels iffy.

Let's do some work filtering the phrases.

The easiest place to start is looking at the bg_count and filtering out ones that are too low (likely related to a single source, like "newsmax right") or too high (too common to make for a meaningful search, like "presid trump").

In [13]:
_ = right_chi['bg_count'].plot.hist(logy=True)
right_chi['bg_count'].describe()
Out[13]:
count       500.000000
mean      15110.806000
std       58079.380803
min           0.000000
25%          27.750000
50%         573.000000
75%        5668.750000
max      854060.000000
Name: bg_count, dtype: float64
In [14]:
right_chi.sort_values('bg_count').tail(50)
Out[14]:
doc_count score bg_count
america great 3356 6795.461530 31103
black live 4713 14999.112104 31901
plan parenthood 3600 7436.358900 32947
live matter 4903 15745.810299 32963
execut order 2882 3962.267807 33550
daili mail 3446 6367.237789 33822
trump presid 3026 4336.474616 34437
make america 3811 7322.496218 36519
illeg immigr 7169 32221.971113 36723
marco rubio 3145 4153.367545 37422
support trump 3165 4151.571575 37797
homeland secur 3373 4644.018689 39249
depart justic 3399 4199.048513 41901
terrorist attack 4560 7186.516016 49237
presid elect 4081 4213.821116 55062
clinton campaign 5016 7563.005988 55564
mr trump 4074 3896.825431 56995
presid unit 4893 6445.440510 58375
bill clinton 6604 13089.241741 62182
trump support 6776 13378.066801 63966
men women 4574 3836.273415 67936
american peopl 10252 31882.564692 70862
republican parti 4990 4370.882256 72723
ted cruz 6791 10028.268828 76212
democrat parti 8399 17533.105413 76602
foreign polici 5545 5151.849645 78711
former presid 6098 5990.633880 84377
trump administr 6233 5845.310624 88143
state depart 7475 9938.374872 88906
islam state 5813 4259.597000 91540
middl east 7410 8163.239368 97104
feder govern 7358 7979.009443 97173
berni sander 6434 4720.010923 101318
new report 7899 7574.179755 110694
obama administr 13171 30483.672860 112595
fox new 15604 41381.573629 121611
don know 8092 5330.534820 133251
nation secur 9015 7235.574141 136936
presidenti candid 8199 5184.700719 137195
secretari state 10118 9863.046827 140936
presid obama 14643 26395.247832 147079
law enforc 9479 7130.242178 148055
presid donald 10478 8198.627556 161078
york time 10365 7027.820774 169022
presid trump 23228 64832.572793 175396
barack obama 17120 17403.225857 235019
hillari clinton 33242 60308.581664 337165
white hous 25314 20351.647087 389038
unit state 34672 22136.712400 589763
donald trump 51943 37411.378476 854060
In [15]:
right_chi.sort_values('bg_count')[120:140]
Out[15]:
doc_count score bg_count
dc 0 122 3779.495388 22
pam kei 172 5561.632452 23
john sexton 185 5944.026283 26
t blaze 136 4145.122514 27
rush here 179 5696.592400 27
t breitbart 357 12172.190188 28
chri reynold 449 15554.293643 28
written doug 250 8251.476589 28
john hinderak 460 15924.553295 29
jazz shaw 379 12879.810094 31
freedom outpost 138 4084.769278 32
ago 0 158 4798.597352 32
weasel zipper 160 4870.263328 32
david rutz 204 6427.344703 33
antifa thug 144 4272.426900 33
daniel greenfield 206 6499.700420 33
rush know 196 6138.183763 33
world invas 260 8400.171364 35
rush now 192 5801.401155 40
steven hayward 455 15370.646763 40

Observations

  • On the top end, let's do the standard thing of removing anything outside 3 standard deviations. That's probably too permissive, but it's a start. There are high-scoring terms that sit 1 or 2 standard deviations above the mean, so I want to keep those for the time being.
  • On the low end, greater than or equal to 30 appearances in the background looks reasonable and is also a commonly selected cutoff. "antifa thug" and "world invas" look like phrases that should be kept, while there doesn't look to be much good below that.

I'll test out application of those filters below. I'll also manually filter out bigrams that I think are artifacts of the process rather than true signifiers of differing language. After poking through the data, it looks like those artifacts fall into three categories:

  1. Names of media sources, or common bigrams that include a name of a media source
  2. Names of authors that appear in many articles
  3. Pieces of templates that made it into the scraped text

I've pulled out bigrams from those three categories until the top 50 bigrams on the left and right looked fairly clean. I have not filtered beyond the top 50.

In [16]:
MIN_BG_DOC_COUNT = 30
MAX_STDDEV_FROM_MEAN = 3
In [17]:
media_source_names = [
    'daili caller', 'son liberti', 'free beacon', 'liberti media', 'washington examin', 'gatewai pundit',
    'fox new', 'washington free', 'conserv tribun', 'freedom daili', 'right scoop', 'daili wire', 'washington time',
    'caller new', 'western journal', 'nation review', 'daili signal', 'video video.foxnews.com', 'caller report',
    'breitbart report', 'beacon report', 'breitbart new', 'via breitbart', 'accord breitbart', 'told breitbart',
    'conserv fire', 'breitbart texa', 'american new', 'reviv america'
]
author_names = [
    'jim hoft', 'steven hayward', 'angri patriot', 'video jim', 'jazz shaw', 'ed morrissei', 'jack davi'
]
boilerplate = [
    'h t', 'keep read', 'continu below', 'also washington', 'america sorri', 'get bumper', 'gmt 4'
]
right_exclude = media_source_names + author_names + boilerplate
right_filter = (right_chi['bg_count'] >= MIN_BG_DOC_COUNT) & \
    (right_chi['bg_count'] <= right_chi['bg_count'].mean() + (right_chi['bg_count'].std() * MAX_STDDEV_FROM_MEAN)) & \
    (~right_chi.index.isin(right_exclude))
right_chi[right_filter].head(50)
Out[17]:
doc_count score bg_count
presid trump 23228 64832.572793 175396
illeg alien 3728 61472.176356 4133
mainstream media 7247 52954.184836 23359
liber media 3146 41723.497095 4897
illeg immigr 7169 32221.971113 36723
american peopl 10252 31882.564692 70862
obama administr 13171 30483.672860 112595
presid obama 14643 26395.247832 147079
hussein obama 1303 24345.432192 1151
polit correct 4326 21059.462139 20626
barack hussein 1133 20527.390034 1061
clinton foundat 3308 20350.719264 12669
left wing 4604 17706.292614 26846
democrat parti 8399 17533.105413 76602
anti trump 4631 17442.062805 27466
bizpac review 510 17060.471476 50
open border 2080 16535.128988 6112
christian new 630 15871.661221 272
live matter 4903 15745.810299 32963
gun control 4153 15278.987151 25094
judici watch 1271 15024.790894 2318
black live 4713 14999.112104 31901
justic warrior 935 14789.771153 1110
second amend 2987 14607.043875 14174
georg soro 1849 13763.053029 5830
anti american 1794 13666.217225 5519
new foundat 861 13396.288943 1050
islam terrorist 1701 13390.266954 5051
far left 2358 13387.250074 9751
trump support 6776 13378.066801 63966
bill clinton 6604 13089.241741 62182
leftist media 471 12810.221399 157
islam terror 2326 12450.876962 10163
hillari campaign 1053 12238.305776 1964
scott johnson 420 12068.076633 112
radic islam 2845 11781.234403 15589
anti gun 1014 11463.606839 1961
campu reform 446 10859.990351 213
free speech 3841 10770.393503 28541
sharia law 1431 10764.572427 4461
religion peac 682 10735.863764 816
big govern 1491 10334.264795 5063
muslim migrant 695 10246.416881 923
expos obama 372 10183.537099 121
colleg fix 385 10138.395953 144
ted cruz 6791 10028.268828 76212
state depart 7475 9938.374872 88906
secretari state 10118 9863.046827 140936
email scandal 1355 9401.363303 4596
fake new 3172 9215.589296 22959
In [18]:
search_config['chi_square'] = {"include_negatives": False, "background_is_superset": False}
s = make_search("ideology:left", "ideology:center OR ideology:right", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
left_chi = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
    .drop('key', axis='columns').sort_values('score', ascending=False)
In [19]:
media_source_names = [
    'york time', 'juan cole', 'mother jone', 'media matter', 'democraci now', 'told thinkprogress',
    'everydai femin', 'tomdispatch regular', 'mintpress new', 'common dream', 'wing watch'
]
author_names = [
    'ami goodman', 'kevin drum', 'ed kilgor', 'thank b'
]
boilerplate = [
    'save favorit', 'load player', 'click reus', 'reus option', 'file to', 'ad polici', 'via flickr'
]
left_exclude = media_source_names + author_names + boilerplate
left_filter = (left_chi['bg_count'] >= MIN_BG_DOC_COUNT) & \
    (left_chi['bg_count'] <= left_chi['bg_count'].mean() + (left_chi['bg_count'].std() * MAX_STDDEV_FROM_MEAN)) & \
    (~left_chi.index.isin(left_exclude))

left_chi[left_filter].head(50)
Out[19]:
doc_count score bg_count
right wing 15888 124009.704306 61426
anti choic 1696 48824.842351 901
peopl color 4903 42047.225940 17105
presid obama 19571 41507.233016 220132
trump administr 17240 41464.022834 178288
berni sander 14805 38414.732314 145169
regim chang 3368 37200.273392 8850
russia scandal 1612 36192.196577 1497
white supremacist 6599 36172.587633 35654
climat chang 11153 33412.095436 98548
republican parti 12118 32528.091202 115854
foreign polici 10699 31005.473092 96802
fossil fuel 4951 30824.845663 23776
social movement 1646 30658.236393 2119
work class 5770 28457.669652 34205
anti lgbt 1777 28262.345586 2908
civil right 9400 27739.224793 83878
trump campaign 12125 25620.490143 136003
marriag equal 2094 25416.641717 4908
far right 5397 24802.647556 33979
human right 10975 24607.958855 118389
anti gai 2387 23903.928446 7021
militari industri 1703 23799.611464 3332
white supremaci 3186 23534.376130 12952
american peopl 11006 22813.199148 124873
industri complex 2009 22388.521349 5221
religi right 1928 22038.277035 4861
white nationalist 3561 21457.159167 17620
black agenda 546 20458.158598 106
mass incarcer 1589 19942.330747 3572
war terror 2848 19750.349159 12322
health care 13625 19610.377770 192054
neo nazi 3231 19398.896234 16038
w bush 8144 19383.770291 84299
alt right 3310 19247.231908 16923
georg w 8022 19043.629276 83174
african american 9080 18984.956319 102321
bush administr 3432 18626.823673 18679
greater middl 685 18450.685945 431
conspiraci theori 4150 18359.923537 26957
trump regim 689 18099.419142 460
fuel industri 1440 17743.539042 3312
black peopl 4393 17451.791249 31164
al qaeda 4939 17439.813125 38448
wing media 1645 17435.702643 4533
so call 12409 17402.887007 177246
fox new 14554 17399.232726 226399
trump li 1339 17390.286608 2882
us militari 3817 17324.720667 24264
presid unit 7517 17307.284244 79506

Let's take one pass at these filtered lists with a bigger sample size to make sure nothing changes.

In [20]:
search_config.update({
    'shard_size': 100000,
    'max_docs_per_value': 3000,
    'chi_square': {"include_negatives": False, "background_is_superset": False}
})
s = make_search("ideology:right", "ideology:center OR ideology:left", **search_config)
resp = s.execute()
phrases = resp.aggregations.my_sample.partisan_ngrams.buckets
big_right_chi = pd.DataFrame.from_dict({p.key: p.to_dict() for p in phrases}, orient="index")\
    .drop('key', axis='columns').sort_values('score', ascending=False)

big_right_filter = (big_right_chi['bg_count'] >= MIN_BG_DOC_COUNT) & \
    (big_right_chi['bg_count'] <= big_right_chi['bg_count'].mean() + (big_right_chi['bg_count'].std() * MAX_STDDEV_FROM_MEAN)) & \
    (~big_right_chi.index.isin(right_exclude))
big_right_chi[big_right_filter].head(50)
Out[20]:
doc_count score bg_count
presid trump 47098 120045.657401 175396
illeg alien 7091 77660.085306 4133
mainstream media 12834 68527.042834 23359
liber media 5708 51628.993100 4897
illeg immigr 13486 48856.518637 36723
obama administr 24823 47779.672131 112595
american peopl 18297 43691.046060 70862
presid obama 26786 38294.251888 147079
left wing 8938 29004.274898 26846
polit correct 7670 28098.848772 20626
anti trump 8879 27856.033557 27466
clinton foundat 5770 25927.205858 12669
hussein obama 2214 25409.948279 1151
democrat parti 15115 24437.020037 76602
open border 3843 22897.742403 6112
live matter 8792 21800.502079 32963
trump support 12874 21442.177372 63966
black live 8463 20835.058379 31901
barack hussein 1837 20180.646060 1061
georg soro 3500 20083.297778 5830
gun control 7273 20015.786942 25094
last edit 1399 19621.679444 388
far left 4403 19613.502025 9751
new foundat 1773 19266.679665 1050
judici watch 2319 19165.368099 2318
anti american 3328 19165.059631 5519
bill clinton 12034 18825.942993 62182
second amend 5156 18429.501401 14174
radic islam 5379 18145.056734 15589
via daili 1912 18002.798007 1522
islam terror 4294 17923.532644 10163
justic warrior 1691 17536.679322 1110
islam terrorist 3030 17373.152181 5051
free speech 7277 16993.030128 28541
fake new 7561 16857.693438 30694
sharia law 2629 14852.661030 4461
state depart 13696 14342.255046 88906
bizpac review 821 14194.805064 50
hillari campaign 1809 14190.925433 1964
secretari state 18607 14166.932541 140936
sign newslett 3981 14047.353035 11070
big govern 2700 14005.815179 5063
trump administr 13459 13850.755516 88143
american citizen 5417 13705.412675 19969
leftist media 862 13199.785929 157
colleg fix 847 13121.826010 144
read full 4986 13011.623801 17936
muslim migrant 1297 12935.290151 923
anti gun 1697 12777.177440 1961
team twitchyteam 732 12776.514249 38

Observations

  • This looks pretty much the same.
  • Scores are about doubled, which makes sense because our sample size doubled.
  • There are some new messy terms that could be filtered out, but I'm not going to touch them. I'll just use the 50k sample.
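The rough doubling is what the statistic itself predicts: Pearson's chi-squared is N times a function of the cell proportions, so scaling every cell of the 2x2 table by k scales the score by k. A quick sanity check using a plain 2x2 chi-squared (an assumed stand-in for ES's internals, not its verified implementation):

```python
def chi2(a, b, c, d):
    # a/b: foreground docs with/without the bigram; c/d: same for background.
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

base = chi2(100, 900, 10, 9990)
doubled = chi2(200, 1800, 20, 19980)  # every cell doubled -> score doubles
assert abs(doubled - 2 * base) < 1e-9
```

In our runs only the foreground sample doubled while the background stayed fixed, so the observed ratios come in somewhat under a clean 2x.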
In [21]:
right_chi.assign(is_filtered_out=~right_filter).to_csv('hyperpartisan_bigrams_right.csv')
left_chi.assign(is_filtered_out=~left_filter).to_csv('hyperpartisan_bigrams_left.csv')