Breaking up Media Sources into Buckets

Updated Jan 31, 2019

The goal here is to be able to draw divisions between media sources so that we have left, center-left, center, center-right, and right buckets. I'm trying a few different methods here. For each, I'm outputting the distribution of the number of media sources within each bucket and then showing the top domains within each bucket.

Table of Contents

Changes

  • Fixed issue where absence of left or right sharers resulted in no bucket estimate
  • Added the section on political party followers
  • Added section on seeing how close we can get buckets to match PP&D

Two Pole Media Buckets

Instead of using the estimated ideology scores directly, which are difficult to partition, let's use a metric that is easier to partition. We'll try two: relative sharing by either half of the ideology score distribution and relative sharing by users that preferentially follow the two parties.

Two Groups by Halving the Barbera Ideology Range

First, we'll take the ideology score spectrum we've already created and separate it into two groups of users: those left of center and those right of center. In a separate notebook we've estimated center to be around 0.395.

In [1]:
%matplotlib inline
import gzip, pickle, collections, itertools, random

import pandas as pd
import plotly.offline as plotly
import plotly.graph_objs as go
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbs
import scipy.stats

sbs.set(style='white')
plotly.init_notebook_mode()
In [2]:
CENTER = 0.395

all_acct_ideos = pd.read_csv('data/cleaned_user_ideology_estimates_20180705.csv.gz', index_col=0)
acct_ids_in_sample = pd.read_pickle('data/all_samples_combined/user_ids.pkl')
all_acct_ideos['ideo'] = all_acct_ideos['normed_theta'] - CENTER
acct_ideos = all_acct_ideos.reindex(acct_ids_in_sample).dropna()
acct_ideos['pole'] = np.where(acct_ideos['ideo'] < 0, 'left', 'right')

Sample of account ideology data

In [3]:
display(acct_ideos.sample(5))
theta normed_theta ideo pole
id
33592757 0.400661 -0.320489 -0.715489 left
938088519964209158 -0.085667 0.145319 -0.249681 left
3677525128 0.452772 -0.370400 -0.765400 left
32400040 0.428739 -0.347382 -0.742382 left
819919248319508480 1.132305 -1.021260 -1.416260 left

Number of accounts in both poles

In [4]:
display(acct_ideos['pole'].value_counts())
left     23246
right    10549
Name: pole, dtype: int64
In [5]:
ax = acct_ideos['ideo'].plot.hist(bins=200, alpha=0.5, figsize=(12, 6))
ax.set_title(f'Account ideology distribution - center at {CENTER}')
_ = ax.vlines(0, 0, 700, color='red')
In [6]:
%%time
MIN_USERS = 30

site_to_accts = pd.read_pickle('data/all_samples_combined/subdomain_to_users.pkl')
site_to_accts = {k: v for k, v in site_to_accts.items() if len(v) >= MIN_USERS}
acct_to_pole = dict(acct_ideos['pole'].items())

max_length = max(len(accts) for accts in site_to_accts.values())
s1 = pd.DataFrame.from_dict({
    site: [acct_to_pole.get(u, np.nan) for u in accts] + [np.nan] * (max_length - len(accts))
    for site, accts in site_to_accts.items()})
CPU times: user 1min 13s, sys: 6.97 s, total: 1min 20s
Wall time: 1min 20s
In [7]:
%%time
site_share_counts_by_pole = s1.apply(lambda c: c.value_counts()).T
site_share_counts_by_pole = site_share_counts_by_pole.fillna(0)
site_share_counts_by_pole['num_sharers'] = site_share_counts_by_pole['left'] + site_share_counts_by_pole['right']
CPU times: user 1min 12s, sys: 385 ms, total: 1min 13s
Wall time: 1min 12s
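Most of the ~2.5 minutes above goes into building and reducing the padded DataFrame. A faster sketch of the same tally, assuming the same `site_to_accts` and `acct_to_pole` shapes (the toy data below is made up for illustration), counts poles per site directly:

```python
import collections

import numpy as np
import pandas as pd

# Hypothetical stand-ins for site_to_accts and acct_to_pole
site_to_accts = {'example.com': [1, 2, 3, 4], 'other.org': [2, 4, 5]}
acct_to_pole = {1: 'left', 2: 'left', 3: 'right', 4: 'right'}  # account 5 has no estimate

# Count left/right sharers per site without building a padded frame
counts = {
    site: collections.Counter(acct_to_pole[u] for u in accts if u in acct_to_pole)
    for site, accts in site_to_accts.items()
}
tally = (pd.DataFrame.from_dict(counts, orient='index')
           .reindex(columns=['left', 'right'])
           .fillna(0))
tally['num_sharers'] = tally['left'] + tally['right']
print(tally)
```

This produces the same left/right/num_sharers columns as `site_share_counts_by_pole` while skipping the `max_length` padding and the column-wise `value_counts` pass.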

Sample of media ideology estimates

In [8]:
source_ideo = pd.read_csv('media_source_ideologies_all_data.csv', index_col=0)
pct_shared = site_share_counts_by_pole / acct_ideos['pole'].value_counts()
site_sharing = site_share_counts_by_pole.assign(
    barbera_ideo=source_ideo['mean_sharer_ideo'],
    left_pct=pct_shared['left'],
    right_pct=pct_shared['right'])
site_sharing['left_pct/right_pct'] = site_sharing['left_pct'] / site_sharing['right_pct']

# Following quintile breaks on page 45 of "Partisanship, Propaganda, & Disinformation"
def ratio_to_group(ratio):
    if ratio <= 1.0/4.0:
        return 'right'
    if 1.0/4.0 < ratio <= 2.0/3.0:
        return 'center-right'
    if 2.0/3.0 < ratio <= 3.0/2.0:
        return 'center'
    if 3.0/2.0 < ratio <= 4.0/1.0:
        return 'center-left'
    if 4.0/1.0 < ratio:
        return 'left'
        
site_sharing['ideo_group'] = site_sharing.apply(lambda r: ratio_to_group(r['left_pct/right_pct']), axis=1)
site_sharing.sample(5)
Out[8]:
left right num_sharers barbera_ideo left_pct right_pct left_pct/right_pct ideo_group
en.vogue.fr 24.0 15.0 39.0 0.16000 0.001032 0.001422 0.726078 center
theprovince.com 86.0 38.0 124.0 0.05807 0.003700 0.003602 1.027018 center
at.cmt.com 24.0 19.0 43.0 0.41100 0.001032 0.001801 0.573219 center-right
relm.ag 25.0 9.0 34.0 0.13750 0.001075 0.000853 1.260551 center
baldersonforcongress.com 1.0 31.0 32.0 1.83700 0.000043 0.002939 0.014639 right
In [9]:
ax = site_sharing['ideo_group'].value_counts().reindex(['left', 'center-left', 'center', 'center-right', 'right'])\
    .plot.bar(color=['#0d3b6e', '#869db6', '#2a7526', '#d8919e', '#b1243e'])
_ = ax.set_title('Number of media sources per bucket')

Observations

  • It's kind of surprising how similarly sized the groups are.
  • This looks markedly different from Figure 17 on page 46 of P, P & D.
  • Center-right is still the smallest group.
  • Center is the largest, which is surprising.

Top ten sites by most sharers per bucket

In [10]:
site_sharing.groupby('ideo_group').apply(lambda g: g.sort_values('num_sharers', ascending=False).head(10))
Out[10]:
left right num_sharers barbera_ideo left_pct right_pct left_pct/right_pct ideo_group
ideo_group
center twitter.com 20315.0 9200.0 29515.0 0.037870 0.873914 0.872121 1.002056 center
youtube.com 14270.0 7201.0 21471.0 0.083800 0.613869 0.682624 0.899279 center
nytimes.com 13769.0 4849.0 18618.0 -0.085940 0.592317 0.459664 1.288586 center
facebook.com 12153.0 5731.0 17884.0 0.061900 0.522800 0.543274 0.962313 center
washingtonpost.com 13187.0 4685.0 17872.0 -0.075200 0.567280 0.444118 1.277319 center
cnn.com 12163.0 4378.0 16541.0 -0.066400 0.523230 0.415016 1.260747 center
thehill.com 10487.0 4476.0 14963.0 0.024350 0.451131 0.424306 1.063223 center
instagram.com 9676.0 4470.0 14146.0 0.054660 0.416244 0.423737 0.982316 center
politico.com 9436.0 3743.0 13179.0 -0.006554 0.405919 0.354820 1.144014 center
nbcnews.com 9237.0 3106.0 12343.0 -0.101200 0.397359 0.294435 1.349561 center
center-left huffingtonpost.com 9455.0 2475.0 11930.0 -0.172700 0.406737 0.234619 1.733602 center-left
npr.org 8804.0 2538.0 11342.0 -0.171900 0.378732 0.240592 1.574169 center-left
theatlantic.com 7715.0 2174.0 9889.0 -0.180400 0.331885 0.206086 1.610421 center-left
vox.com 7762.0 1657.0 9419.0 -0.225300 0.333907 0.157077 2.125760 center-left
buzzfeednews.com 7235.0 2045.0 9280.0 -0.175500 0.311236 0.193857 1.605493 center-left
thedailybeast.com 7131.0 2135.0 9266.0 -0.156600 0.306762 0.202389 1.515708 center-left
newyorker.com 7481.0 1569.0 9050.0 -0.238800 0.321819 0.148734 2.163714 center-left
slate.com 6804.0 1343.0 8147.0 -0.260700 0.292696 0.127311 2.299066 center-left
thinkprogress.org 6775.0 834.0 7609.0 -0.338100 0.291448 0.079060 3.686433 center-left
msnbc.com 6404.0 1039.0 7443.0 -0.313200 0.275488 0.098493 2.797041 center-left
center-right foxnews.com 4816.0 4938.0 9754.0 0.611300 0.207175 0.468101 0.442587 center-right
dailymail.co.uk 4680.0 3443.0 8123.0 0.369400 0.201325 0.326382 0.616839 center-right
nypost.com 4618.0 3502.0 8120.0 0.396000 0.198658 0.331975 0.598413 center-right
washingtonexaminer.com 3893.0 3734.0 7627.0 0.591300 0.167470 0.353967 0.473122 center-right
whitehouse.gov 3222.0 2959.0 6181.0 0.520500 0.138604 0.280501 0.494133 center-right
nationalreview.com 2540.0 3177.0 5717.0 0.745000 0.109266 0.301166 0.362810 center-right
washingtontimes.com 2249.0 3223.0 5472.0 0.822000 0.096748 0.305527 0.316659 center-right
insider.foxnews.com 1847.0 3251.0 5098.0 0.960000 0.079455 0.308181 0.257818 center-right
reddit.com 2373.0 1679.0 4052.0 0.349900 0.102082 0.159162 0.641372 center-right
realclearpolitics.com 1692.0 2062.0 3754.0 0.701000 0.072787 0.195469 0.372370 center-right
left motherjones.com 6487.0 704.0 7191.0 -0.371300 0.279059 0.066736 4.181521 left
actblue.com 6404.0 342.0 6746.0 -0.453100 0.275488 0.032420 8.497443 left
dailykos.com 5430.0 599.0 6029.0 -0.402800 0.233589 0.056783 4.113733 left
shareblue.com 5123.0 244.0 5367.0 -0.525000 0.220382 0.023130 9.527909 left
healthcare.gov 4234.0 232.0 4466.0 -0.530000 0.182139 0.021993 8.281823 left
politicususa.com 4123.0 261.0 4384.0 -0.540000 0.177364 0.024742 7.168625 left
actionnetwork.org 3524.0 219.0 3743.0 -0.566400 0.151596 0.020760 7.302219 left
teenvogue.com 3416.0 296.0 3712.0 -0.553000 0.146950 0.028060 5.237080 left
hillreporter.com 3178.0 182.0 3360.0 -0.623000 0.136712 0.017253 7.924020 left
iwillvote.com 3096.0 86.0 3182.0 -0.665000 0.133184 0.008152 16.336746 left
right breitbart.com 2087.0 3879.0 5966.0 0.962000 0.089779 0.367713 0.244155 right
dailycaller.com 2060.0 3780.0 5840.0 0.947800 0.088617 0.358328 0.247308 right
dailywire.com 819.0 3247.0 4066.0 1.302000 0.035232 0.307802 0.114463 right
thefederalist.com 1054.0 2954.0 4008.0 1.187500 0.045341 0.280027 0.161917 right
thegatewaypundit.com 629.0 3201.0 3830.0 1.387000 0.027058 0.303441 0.089172 right
townhall.com 806.0 2890.0 3696.0 1.291000 0.034673 0.273960 0.126561 right
zerohedge.com 880.0 2627.0 3507.0 1.243000 0.037856 0.249028 0.152015 right
freebeacon.com 810.0 2568.0 3378.0 1.260000 0.034845 0.243435 0.143137 right
theblaze.com 897.0 2387.0 3284.0 1.201000 0.038587 0.226277 0.170531 right
judicialwatch.org 482.0 2777.0 3259.0 1.464000 0.020735 0.263248 0.078765 right

Barbera ideology distribution per bucket

In [11]:
site_sharing.groupby('ideo_group')['barbera_ideo'].describe().reindex(['left', 'center-left', 'center', 'center-right', 'right'])
Out[11]:
count mean std min 25% 50% 75% max
ideo_group
left 3427.0 -0.586275 0.086410 -0.806000 -0.65200 -0.5957 -0.526400 -0.1805
center-left 3113.0 -0.285073 0.110645 -0.593300 -0.36820 -0.2837 -0.202000 0.0906
center 3453.0 0.078155 0.137530 -0.353800 -0.02563 0.0683 0.182500 0.5317
center-right 2202.0 0.552879 0.205550 -0.003464 0.39265 0.5320 0.698925 1.1630
right 3263.0 1.611344 0.346016 0.354000 1.38450 1.7580 1.870000 2.1640
In [12]:
source_ideo.join(
    site_sharing.rename(
        {'left': 'num_sharers_left', 'right': 'num_sharers_right', 'left_pct': 'pct_shared_left', 'right_pct': 'pct_shared_right'}, 
        axis=1).drop(['num_sharers', 'barbera_ideo'], axis=1)).to_csv('data/all_samples_combined/media_source_ideo_with_buckets.csv')

Two Groups by Political Party Followers

Instead of breaking users into two groups using Barbera ideology scores, we can use who they follow. Specifically, we can break users into two groups: those that predominantly follow Democratic politicians and those that predominantly follow Republican politicians. I've done this in another notebook.

Here's a quick summary of that notebook: There are about 1 million accounts that follow mostly Democratic politicians and about 5 million accounts that follow mostly Republican politicians. That's a big difference (evidence of right insularity?), especially given there are significantly more left-leaning accounts than right-leaning. I did one version where I did not consider the different group sizes and just looked at media sharing ratios by the straight count of people. That's ideo_group_by_count. I also did a version where I took the percentage of each group sharing each domain. That's ideo_group_by_pct.

I'm only showing results of ideo_group_by_count here. Somewhat surprisingly, the percentage results looked crazy. Everything was pushed far to the left, such that Breitbart ended up in the "center" bucket. Breitbart had ~5,000 R followers share it versus ~1,000 D followers, but there are ~5M R followers and ~1M D followers, so the percentages end up about the same. I haven't looked very far into this phenomenon yet.
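The arithmetic behind that Breitbart case, using the approximate counts quoted above, shows how similar percentages can mask a lopsided raw ratio:

```python
# Approximate counts from the text above (illustration only)
r_sharers, d_sharers = 5000, 1000          # followers of each party who shared the domain
r_total, d_total = 5_000_000, 1_000_000    # total accounts following each party

count_ratio = d_sharers / r_sharers                        # D/R by raw count
pct_ratio = (d_sharers / d_total) / (r_sharers / r_total)  # D/R by share-of-group
print(count_ratio, pct_ratio)  # 0.2 by count, but 1.0 by percentage
```

A 5:1 raw skew toward R sharers vanishes once each side is normalized by its group size, which is why the percentage version pushes sites like Breitbart toward the center.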

Sample of media ideology estimates

In [13]:
source_ideo_from_party_pole = pd.read_csv('data/media_source_ideo_with_party_pole_buckets.csv', index_col=0)
source_ideo_from_party_pole.sample(5)
Out[13]:
R D R_pct D_pct R+D R_pct+D_pct D/R+D D_pct/R_pct+D_pct ideo_group_by_count ideo_group_by_pct
www-03.ibm.com 4 10 8.024127e-07 0.000009 14 0.000010 0.714286 0.919336 center-left left
us13.list-manage.com 0 6 0.000000e+00 0.000005 6 0.000005 1.000000 1.000000 left left
7-eleven.com 1 3 2.006032e-07 0.000003 4 0.000003 0.750000 0.931864 center-left left
kget.com 22 11 4.413270e-06 0.000010 33 0.000014 0.333333 0.695069 center-right center-left
koin.com 90 183 1.805429e-05 0.000167 273 0.000185 0.670330 0.902626 center-left left
In [14]:
ax = source_ideo_from_party_pole['ideo_group_by_count'].value_counts().reindex(['left', 'center-left', 'center', 'center-right', 'right'])\
    .plot.bar(color=['#0d3b6e', '#869db6', '#2a7526', '#d8919e', '#b1243e'])
_ = ax.set_title('Number of media sources per bucket')

Top ten sites by most sharers per bucket

In [15]:
source_ideo_from_party_pole.groupby('ideo_group_by_count').apply(lambda g: g.sort_values('R+D', ascending=False).head(10))
Out[15]:
R D R_pct D_pct R+D R_pct+D_pct D/R+D D_pct/R_pct+D_pct ideo_group_by_count ideo_group_by_pct
ideo_group_by_count
center twitter.com 12780 10467 0.002564 0.009572 23247 0.012136 0.450252 0.788752 center center-left
youtube.com 9181 7579 0.001842 0.006931 16760 0.008773 0.452208 0.790065 center center-left
facebook.com 7082 6717 0.001421 0.006143 13799 0.007564 0.486774 0.812167 center left
nytimes.com 5457 7754 0.001095 0.007091 13211 0.008186 0.586935 0.866271 center left
washingtonpost.com 5233 7441 0.001050 0.006805 12674 0.007855 0.587107 0.866353 center left
cnn.com 5078 6837 0.001019 0.006253 11915 0.007271 0.573815 0.859905 center left
instagram.com 5770 5436 0.001157 0.004971 11206 0.006129 0.485097 0.811141 center left
thehill.com 4925 6058 0.000988 0.005540 10983 0.006528 0.551580 0.848660 center left
politico.com 4021 5580 0.000807 0.005103 9601 0.005910 0.581189 0.863507 center left
pscp.tv 5267 4223 0.001057 0.003862 9490 0.004919 0.444995 0.785187 center center-left
center-left nbcnews.com 3310 5430 0.000664 0.004966 8740 0.005630 0.621281 0.882058 center-left left
huffingtonpost.com 2475 5799 0.000496 0.005303 8274 0.005800 0.700870 0.914395 center-left left
theguardian.com 3042 5162 0.000610 0.004721 8204 0.005331 0.629205 0.885531 center-left left
npr.org 2491 5250 0.000500 0.004801 7741 0.005301 0.678207 0.905733 center-left left
medium.com 2611 4607 0.000524 0.004213 7218 0.004737 0.638265 0.889428 center-left left
latimes.com 2817 4399 0.000565 0.004023 7216 0.004588 0.609618 0.876833 center-left left
theatlantic.com 2168 4742 0.000435 0.004337 6910 0.004772 0.686252 0.908854 center-left left
time.com 2360 4473 0.000473 0.004091 6833 0.004564 0.654617 0.896272 center-left left
thedailybeast.com 2253 4458 0.000452 0.004077 6711 0.004529 0.664283 0.900205 center-left left
newsweek.com 2373 4310 0.000476 0.003942 6683 0.004418 0.644920 0.892242 center-left left
center-right foxnews.com 6353 2469 0.001274 0.002258 8822 0.003532 0.279869 0.639215 center-right center-left
dailymail.co.uk 4205 2785 0.000844 0.002547 6990 0.003390 0.398426 0.751204 center-right center-left
washingtonexaminer.com 4407 2292 0.000884 0.002096 6699 0.002980 0.342141 0.703350 center-right center-left
whitehouse.gov 3533 1925 0.000709 0.001760 5458 0.002469 0.352693 0.712969 center-right center-left
nationalreview.com 3662 1495 0.000735 0.001367 5157 0.002102 0.289897 0.650489 center-right center-left
washingtontimes.com 3783 1357 0.000759 0.001241 5140 0.002000 0.264008 0.620537 center-right center-left
ijr.com 2641 897 0.000530 0.000820 3538 0.001350 0.253533 0.607595 center-right center-left
realclearpolitics.com 2352 1092 0.000472 0.000999 3444 0.001470 0.317073 0.679138 center-right center-left
weeklystandard.com 1931 1138 0.000387 0.001041 3069 0.001428 0.370805 0.728753 center-right center-left
rt.com 1997 925 0.000401 0.000846 2922 0.001247 0.316564 0.678626 center-right center-left
left thinkprogress.org 723 4423 0.000145 0.004045 5146 0.004190 0.859503 0.965385 left left
msnbc.com 996 4103 0.000200 0.003752 5099 0.003952 0.804668 0.949444 left left
rawstory.com 944 4096 0.000189 0.003746 5040 0.003935 0.812698 0.951879 left left
motherjones.com 567 4267 0.000114 0.003902 4834 0.004016 0.882706 0.971678 left left
actblue.com 226 4286 0.000045 0.003920 4512 0.003965 0.949911 0.988566 left left
dailykos.com 507 3665 0.000102 0.003352 4172 0.003453 0.878476 0.970549 left left
thenation.com 723 3340 0.000145 0.003054 4063 0.003200 0.822053 0.954670 left left
aclu.org 454 3196 0.000091 0.002923 3650 0.003014 0.875616 0.969782 left left
shareblue.com 148 3467 0.000030 0.003171 3615 0.003200 0.959059 0.990723 left left
propublica.org 459 2975 0.000092 0.002721 3434 0.002813 0.866337 0.967265 left left
right breitbart.com 4973 1074 0.000998 0.000982 6047 0.001980 0.177609 0.496109 right center
dailycaller.com 4637 1151 0.000930 0.001053 5788 0.001983 0.198860 0.530869 right center
insider.foxnews.com 4194 929 0.000841 0.000850 5123 0.001691 0.181339 0.502442 right center
dailywire.com 4071 344 0.000817 0.000315 4415 0.001131 0.077916 0.278095 right center-right
thegatewaypundit.com 4066 256 0.000816 0.000234 4322 0.001050 0.059232 0.223018 right center-right
thefederalist.com 3552 570 0.000713 0.000521 4122 0.001234 0.138282 0.422490 right center
townhall.com 3479 413 0.000698 0.000378 3892 0.001076 0.106115 0.351151 right center-right
zerohedge.com 3197 436 0.000641 0.000399 3633 0.001040 0.120011 0.383373 right center-right
judicialwatch.org 3415 173 0.000685 0.000158 3588 0.000843 0.048216 0.187617 right right
truepundit.com 3293 194 0.000661 0.000177 3487 0.000838 0.055635 0.211714 right center-right

Observations

  • The top things in each bucket look reasonable but are closer to the other Barbera estimates than the PP&D estimates.
  • foxnews.com is still in the center-right
  • The distribution looks totally different than other methods.
  • I think for these methods, "far left", "left", "center", "right", and "far right" might be better labels, though "far" means different things.

Directly Breaking into Quintiles

We broke up the user ideology distribution, but how does that translate into breaking up the media source ideology distribution? We have a center from the user distribution: it's where self-described "liberals" and "conservatives" cross, around 0.395. Let's try something similar to what we did with the user distribution: pick the center and break things up somewhat evenly from there.

Now let's break the left side into 2.5 groups and the right side into 2.5 groups. That's the same as breaking each side into 5 quintiles, merging the outer four quintiles pairwise, and letting the innermost quintile of each side form the shared center bucket.

In [16]:
df = pd.read_csv('media_source_ideologies_all_data.csv', index_col=0)
media_ideos = df['mean_sharer_ideo']
In [17]:
CENTER = 0.395
media_ideos -= CENTER

# Quintile edges for each half of the (centered) distribution
_, left_bins = pd.qcut(media_ideos[media_ideos < 0], 5, retbins=True, labels=list(range(-5, 0)))
_, right_bins = pd.qcut(media_ideos[media_ideos > 0], 5, retbins=True, labels=list(range(1, 6)))
all_bins = list(left_bins) + list(right_bins)
# Keep every other edge: the outer quintiles merge pairwise, and the innermost
# quintile of each half forms the shared center bucket
indices_to_keep = [0, 2, 4, 7, 9, 11]
print('Bin edges:')
merged_bins = [all_bins[i] for i in indices_to_keep]
merged_bins
Bin edges:
Out[17]:
[-1.201,
 -0.7573000000000001,
 -0.3052,
 0.25758000000000003,
 1.264,
 1.7690000000000001]
In [18]:
ax = media_ideos.plot.hist(bins=300, alpha=0.5, figsize=(12, 6))
ax.set_title('Media source ideology distribution with bins')
#ax.vlines(0, 0, 200, color='red')
_ = ax.vlines(merged_bins, 0, 200, linestyles='dotted', alpha=0.5)
In [19]:
df['cut_ideo_bin'] = pd.cut(media_ideos, merged_bins, labels=['left', 'center-left', 'center', 'center-right', 'right'])
_ = df['cut_ideo_bin'].value_counts().plot.bar(color=['#0d3b6e', '#869db6', '#2a7526', '#d8919e', '#b1243e'])

Top ten sites by most sharers per bucket

In [20]:
df.groupby('cut_ideo_bin').apply(lambda g: g.sort_values('num_sharers', ascending=False).head(10)).loc[:,['num_sharers', 'mean_sharer_ideo']]
Out[20]:
num_sharers mean_sharer_ideo
cut_ideo_bin
left motherjones.com 7191 -0.766300
actblue.com 6746 -0.848100
dailykos.com 6029 -0.797800
shareblue.com 5367 -0.920000
aclu.org 5104 -0.828000
propublica.org 5074 -0.814700
mccain.senate.gov 4493 -0.860800
healthcare.gov 4466 -0.925000
politicususa.com 4384 -0.935000
mediamatters.org 4360 -0.790000
center-left twitter.com 29515 -0.357130
youtube.com 21471 -0.311200
nytimes.com 18618 -0.480940
facebook.com 17884 -0.333100
washingtonpost.com 17872 -0.470200
cnn.com 16541 -0.461400
thehill.com 14963 -0.370650
instagram.com 14146 -0.340340
politico.com 13179 -0.401554
nbcnews.com 12343 -0.496200
center pscp.tv 11653 -0.238100
wsj.com 11558 -0.293250
foxnews.com 9754 0.216300
cnbc.com 9474 -0.281900
google.com 8841 -0.276800
apple.news 8188 -0.300460
dailymail.co.uk 8123 -0.025600
nypost.com 8120 0.001000
amazon.com 7896 -0.281600
washingtonexaminer.com 7627 0.196300
center-right breitbart.com 5966 0.567000
dailycaller.com 5840 0.552800
nationalreview.com 5717 0.350000
washingtontimes.com 5472 0.427000
insider.foxnews.com 5098 0.565000
dailywire.com 4066 0.907000
thefederalist.com 4008 0.792500
thegatewaypundit.com 3830 0.992000
realclearpolitics.com 3754 0.306000
townhall.com 3696 0.896000
right waynedupree.com 2246 1.266000
bluntforcetruth.com 2101 1.291000
saraacarter.com 2101 1.330000
ilovemyfreedom.org 2046 1.296000
truthfeednews.com 1998 1.308000
100percentfedup.com 1980 1.314000
theconservativetreehouse.com 1941 1.311000
therightscoop.com 1921 1.285000
chicksonright.com 1870 1.320000
lifenews.com 1667 1.280000

The problem with this is that it captures nice separations in this data, but not nice separations in the world. CNN is -0.46, the Wall Street Journal is -0.29, Fox News is 0.22, Breitbart is 0.57, Gateway Pundit is 0.99, and Wayne Dupree is 1.27. The center seems too far right, and the scale doesn't seem linear in the way we usually think about it. That would make the rank order correct but the distances between sources wrong. If that were the case, we should still be able to draw lines in the distribution to get buckets that make sense.
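One way to sketch the "rank order correct, distances wrong" hunch: compare the scores quoted above against a hypothetical evenly spaced scale. Spearman correlation only sees rank order; Pearson also sees spacing.

```python
import scipy.stats

# Scores quoted above: CNN, WSJ, Fox News, Breitbart, Gateway Pundit, Wayne Dupree
barbera = [-0.46, -0.29, 0.22, 0.57, 0.99, 1.27]
evenly_spaced = [1, 2, 3, 4, 5, 6]  # hypothetical linear ordering

rho, _ = scipy.stats.spearmanr(barbera, evenly_spaced)  # rank agreement
r, _ = scipy.stats.pearsonr(barbera, evenly_spaced)     # linear agreement
print(rho, r)
```

Spearman comes out at 1 while Pearson falls just short of it: the ordering is consistent with a linear scale even though the spacing is not, which is exactly the situation where bucket boundaries can still be drawn sensibly.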

Bucketing Data from 2016 Election for PP&D Comparison

In progress

Minimizing Difference from PP&D Buckets

I want to gauge how far off we are from the buckets in Partisanship, Propaganda, and Disinformation. To do that, I'm going to use the positions of the sites within the old buckets to draw bucket boundaries against the new distribution, searching for the boundaries that minimize difference from the old buckets. The function I'm minimizing is the Jaccard distance between the flattened boolean site-by-bucket matrices of the two schemes, where each row is a site and each column is a bucket. It's not a principled method of forming buckets, but it can give us an upper bound on the consistency between Barbera and PP&D that we could expect from any method that just chops the distribution into buckets.
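A minimal sketch of that distance on toy assignments (three buckets and four sites, all made up for illustration): one-hot encode each site's bucket under both schemes, flatten, and take the Jaccard distance over the resulting boolean vectors.

```python
import pandas as pd
from scipy.spatial import distance

buckets = ['left', 'center', 'right']
# Hypothetical bucket assignments for four sites under two schemes
ours = pd.Categorical(['left', 'center', 'right', 'center'], categories=buckets)
ppd = pd.Categorical(['left', 'right', 'right', 'center'], categories=buckets)

# Boolean site-by-bucket matrices, flattened into long vectors
a = pd.get_dummies(ours).to_numpy().ravel()
b = pd.get_dummies(ppd).to_numpy().ravel()

# Disagreeing positions / positions where either vector is nonzero
print(distance.jaccard(a, b))  # 2 mismatches over 5 active positions -> 0.4
```

The one disagreeing site contributes two mismatched one-hot positions, so the distance drops toward 0 as the two bucketings converge. `bins_to_jaccard_dist` below does the same thing with five buckets over the media sources shared by both datasets.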

The question is this: Where can we draw bucket boundaries on the distribution graph such that the Barbera buckets and the PP&D buckets agree most? I'll start with a baseline set of bucket edges that I just pick visually.

In [21]:
def bin_guess_to_all_edges(edges):
    return np.sort(np.concatenate((edges, np.array([-1, 2.25]))))

from sklearn.metrics import confusion_matrix, classification_report
def draw_confusion_matrix(bins, df, title='Confusion matrix'):
    bins = bin_guess_to_all_edges(bins)
    barbera = pd.cut(df['mean_sharer_ideo'], bins=bins, labels=[1, 2, 3, 4, 5])
    ppd = df['partition']
    classifications = pd.DataFrame({'barbera': barbera, 'ppd': ppd})
    cm = pd.DataFrame(confusion_matrix(classifications['ppd'], classifications['barbera']), index=range(1, 6), columns=range(1, 6))
    ax = sbs.heatmap(cm, annot=True, cmap='Blues', fmt='d')
    ax.set_xlabel('Barbera bins')
    ax.set_ylabel('PPD bins')
    ax.set_title(title)
    
    print('Accuracy metrics if we consider PP&D as "true" buckets')
    print(classification_report(classifications['ppd'], classifications['barbera']))
    display(ax)

def draw_dist(bins=None, title='Media source ideology distribution'):
    ax = df['mean_sharer_ideo'].plot.hist(bins=200, alpha=0.4, density=True, figsize=(12, 6))
    joined['mean_sharer_ideo'].plot.hist(bins=200, ax=ax, alpha=0.3, density=True)
    ax.set_title(title)
    ax.legend(['Media in Barbera', 'Media in Barbera and PPD'])
    if bins is not None:
        ax.vlines(bin_guess_to_all_edges(bins), 0, ax.get_ylim()[1], linestyles='dotted', alpha=0.5)

from scipy.spatial import distance
def bins_to_jaccard_dist(bins, df):
    bins = bin_guess_to_all_edges(bins)
    num_sites = df.shape[0]
    num_buckets = 5
    if len(set(bins)) < num_buckets + 1:
        return 1
    barbera_vect = pd.get_dummies(pd.cut(df['mean_sharer_ideo'], bins=bins)).values.reshape((num_sites * num_buckets, 1))
    ppd_vect = pd.get_dummies(df['partition']).values.reshape((num_sites * num_buckets, 1))
    return distance.jaccard(barbera_vect, ppd_vect)

def bins_to_pct_dist(bins, df):
    bins = bin_guess_to_all_edges(bins)
    num_buckets = 5
    if len(set(bins)) < num_buckets + 1:
        return 1
    barbera = pd.cut(df['mean_sharer_ideo'], bins=bins, labels=[1, 2, 3, 4, 5])
    ppd = df['partition']
    return (ppd != barbera).sum() / ppd.count()

def bins_to_KL_dist(bins, df):
    bins = bin_guess_to_all_edges(bins)
    num_buckets = 5
    if len(set(bins)) < num_buckets + 1:
        return 1
    barbera = pd.cut(df['mean_sharer_ideo'], bins=bins, labels=[1, 2, 3, 4, 5])
    return scipy.stats.entropy(barbera.value_counts().sort_index().values, df['partition'].value_counts().sort_index().values)

from scipy.optimize import differential_evolution
def optimize(df):
    #bounds = Bounds(-1, 2.5)
    #x0 = np.array(BASELINE_BIN_EDGES)
    #res = minimize(bins_to_dist, x0, args=(df,), bounds=bounds)
    #bh_res = basinhopping(bins_to_pct_dist, x0, niter=50000, minimizer_kwargs={"args": joined, "bounds": bounds})
    bounds = [(-1, 2.25)] * 4
    return differential_evolution(bins_to_KL_dist, bounds=bounds, args=(df,))
In [22]:
BASELINE_BIN_EDGES = np.array([-0.5, -0.25, 0.5, 1.5])
In [23]:
print(f'Baseline bucket edges: {bin_guess_to_all_edges(BASELINE_BIN_EDGES)}')
Baseline bucket edges: [-1.   -0.5  -0.25  0.5   1.5   2.25]
In [24]:
ppd_scores = pd.read_csv('data/benkler_ppd_media_source_ideo.csv', index_col='subdomain') 
ppd_scores = ppd_scores[~ppd_scores.index.duplicated()] # pajamasmedia.com, townhall.com, and univision.com are duplicated
In [25]:
df = pd.read_csv('media_source_ideologies_all_data.csv', index_col=0)
joined = ppd_scores.join(df['mean_sharer_ideo'], how='inner')

Sample of PP&D partitions and Barbera estimates

In [26]:
joined.sample(5)
Out[26]:
media_id ppd_ideo partition mean_sharer_ideo
cincinnati.com 26590 -0.637972 1 0.04828
hollywoodlife.com 24621 0.166600 3 -0.19180
burlingtonfreepress.com 367169 -0.699498 1 -0.13490
billboard.com 19194 -0.386920 2 -0.06122
salon.com 1757 -0.789768 1 -0.34280
In [27]:
draw_dist(BASELINE_BIN_EDGES, 'Media source ideology distribution - baseline buckets')
In [28]:
draw_confusion_matrix(BASELINE_BIN_EDGES, joined, 'Bucket agreement between PP&D and Barbera w/ baseline edges')
Accuracy metrics if we consider PP&D as "true" buckets
              precision    recall  f1-score   support

           1       0.90      0.41      0.56       139
           2       0.20      0.13      0.16       112
           3       0.22      0.82      0.35        71
           4       0.22      0.29      0.25        78
           5       0.97      0.67      0.79       332

   micro avg       0.51      0.51      0.51       732
   macro avg       0.50      0.46      0.42       732
weighted avg       0.69      0.51      0.55       732


Optimized bucket edges

In [29]:
optimized_buckets = optimize(joined)
bin_guess_to_all_edges(optimized_buckets.x)
Out[29]:
array([-1.        , -0.24658085,  0.00459128,  0.15693727,  0.50163665,
        2.25      ])
In [30]:
draw_dist(optimized_buckets.x, 'Media source ideology distribution - optimized buckets')
In [31]:
draw_confusion_matrix(optimized_buckets.x, joined)
Accuracy metrics if we consider PP&D as "true" buckets
              precision    recall  f1-score   support

           1       0.74      0.74      0.74       139
           2       0.43      0.43      0.43       112
           3       0.35      0.35      0.35        71
           4       0.22      0.22      0.22        78
           5       0.87      0.87      0.87       332

   micro avg       0.66      0.66      0.66       732
   macro avg       0.52      0.52      0.52       732
weighted avg       0.66      0.66      0.66       732


Top ten sites by most sharers per bucket

In [32]:
df['ideo_group_optim'] = pd.cut(df['mean_sharer_ideo'], bin_guess_to_all_edges(optimized_buckets.x),
       labels=['left', 'center-left', 'center', 'center-right', 'right'])
df.groupby('ideo_group_optim').apply(lambda g: g.sort_values('num_sharers', ascending=False).head(10))
Out[32]:
mean_sharer_ideo stddev_sharer_ideo num_sharers num_uniq_urls num_sharers_in_ideo_bin_-10 num_sharers_in_ideo_bin_-9 num_sharers_in_ideo_bin_-8 num_sharers_in_ideo_bin_-7 num_sharers_in_ideo_bin_-6 num_sharers_in_ideo_bin_-5 ... num_sharers_in_ideo_bin_2 num_sharers_in_ideo_bin_3 num_sharers_in_ideo_bin_4 num_sharers_in_ideo_bin_5 num_sharers_in_ideo_bin_6 num_sharers_in_ideo_bin_7 num_sharers_in_ideo_bin_8 num_sharers_in_ideo_bin_9 num_sharers_in_ideo_bin_10 ideo_group_optim
ideo_group_optim
left slate.com -0.260700 0.9150 8147 22817 942 972 845 737 654 550 ... 169 134 100 105 109 128 155 210 238 left
thinkprogress.org -0.338100 0.7764 7609 13155 1019 967 884 752 665 543 ... 144 96 60 70 63 64 84 122 138 left
msnbc.com -0.313200 0.8677 7443 39052 1021 968 819 673 604 498 ... 148 102 75 81 82 82 127 174 176 left
rawstory.com -0.326200 0.8706 7320 80854 1031 958 844 682 593 499 ... 118 83 60 69 75 77 111 173 203 left
huffpost.com -0.284400 0.9287 7239 28359 836 860 738 616 556 469 ... 164 131 120 112 95 115 139 183 229 left
motherjones.com -0.371300 0.7266 7191 8664 972 950 835 717 630 526 ... 130 95 67 62 41 54 61 88 110 left
nymag.com -0.272700 0.9880 7113 9195 872 817 714 637 574 459 ... 135 105 86 95 93 117 165 238 283 left
actblue.com -0.453100 0.5390 6746 6448 1020 966 856 730 606 501 ... 104 54 46 27 18 11 23 33 32 left
vanityfair.com -0.270800 0.9854 6057 7498 736 715 613 549 474 408 ... 112 102 73 86 90 98 149 200 229 left
dailykos.com -0.402800 0.7510 6029 55557 810 820 710 607 519 443 ... 107 72 45 42 44 37 57 90 108 left
center-left nytimes.com -0.085940 1.0330 18618 94057 1801 1755 1626 1326 1209 1111 ... 533 529 461 433 469 472 602 623 755 center-left
washingtonpost.com -0.075200 1.0430 17872 126927 1758 1686 1580 1284 1164 1017 ... 492 475 434 431 459 463 595 616 746 center-left
cnn.com -0.066400 1.0480 16541 88968 1633 1594 1440 1173 1100 951 ... 442 465 398 400 433 429 534 594 703 center-left
politico.com -0.006554 1.1010 13179 29238 1259 1264 1140 956 867 721 ... 329 331 290 323 331 368 509 590 687 center-left
nbcnews.com -0.101200 1.0480 12343 32262 1285 1238 1115 929 837 726 ... 309 299 261 276 287 302 407 457 520 center-left
huffingtonpost.com -0.172700 0.9844 11930 52675 1326 1286 1213 931 876 755 ... 267 269 208 200 213 232 309 354 432 center-left
theguardian.com -0.128900 1.0310 11520 115514 1186 1172 1045 880 808 672 ... 296 287 243 243 237 257 341 413 477 center-left
npr.org -0.171900 0.9946 11342 33344 1149 1145 1052 884 795 699 ... 290 284 227 209 224 232 302 363 419 center-left
apnews.com -0.017210 1.1020 10147 36514 1000 986 878 769 675 557 ... 240 251 227 236 243 277 379 457 537 center-left
medium.com -0.108300 1.0370 9982 28868 995 982 903 779 699 601 ... 247 242 200 221 211 229 318 370 413 center-left
center twitter.com 0.037870 1.0610 29515 9486346 2464 2368 2234 1906 1756 1619 ... 949 1006 956 917 1023 986 1078 1094 1244 center
youtube.com 0.083800 1.0900 21471 845936 1712 1697 1596 1307 1239 1144 ... 713 760 709 705 773 757 885 901 1035 center
facebook.com 0.061900 1.0830 17884 1031199 1452 1476 1410 1146 1085 970 ... 557 602 549 583 592 591 731 738 814 center
thehill.com 0.024350 1.1060 14963 73617 1403 1376 1271 1022 951 835 ... 397 395 377 398 440 459 580 659 790 center
instagram.com 0.054660 1.0830 14146 421828 1221 1175 1117 959 836 763 ... 431 466 433 429 470 472 577 576 639 center
abcnews.go.com 0.004593 1.0940 12309 21510 1185 1122 1046 884 771 693 ... 319 342 305 311 342 366 473 523 617 center
pscp.tv 0.156900 1.1670 11653 87564 972 952 879 703 676 588 ... 325 338 352 363 416 436 563 646 780 center
usatoday.com 0.038240 1.1140 11652 38416 1091 1055 952 831 722 630 ... 300 324 281 316 343 355 466 529 636 center
wsj.com 0.101750 1.1360 11558 35655 1003 1001 877 762 666 611 ... 317 322 316 336 383 403 518 589 673 center
cbsnews.com 0.023830 1.1120 10916 28845 1048 1024 883 779 705 595 ... 299 286 255 278 308 326 429 503 589 center
center-right dailymail.co.uk 0.369400 1.2360 8123 52638 627 596 588 484 439 361 ... 196 234 240 267 332 370 513 597 705 center-right
nypost.com 0.396000 1.2360 8120 31812 625 616 538 487 429 355 ... 202 223 250 284 331 400 504 603 714 center-right
mediaite.com 0.236700 1.2540 5764 11395 530 469 437 394 339 301 ... 107 115 137 150 191 217 340 425 528 center-right
telegraph.co.uk 0.197800 1.1990 5169 14025 422 438 386 350 313 252 ... 139 138 134 143 171 192 262 328 393 center-right
marketwatch.com 0.225500 1.2120 4698 10830 370 379 358 320 279 238 ... 102 127 99 140 157 193 257 324 363 center-right
justice.gov 0.277300 1.2550 4302 5263 350 352 311 281 249 219 ... 94 89 95 127 146 146 238 340 415 center-right
eventbrite.com 0.177200 1.1300 4197 10077 270 293 264 283 258 240 ... 143 127 124 142 134 137 180 223 276 center-right
vimeo.com 0.162800 1.1470 4106 10375 300 319 304 268 258 210 ... 121 138 123 123 113 133 178 237 272 center-right
espn.com 0.164900 1.1500 4099 13489 353 327 278 285 227 197 ... 104 120 102 121 170 159 206 231 238 center-right
reddit.com 0.349900 1.2430 4052 48061 268 317 248 237 205 179 ... 99 109 95 122 149 143 234 320 411 center-right
right foxnews.com 0.611300 1.2130 9754 65002 541 574 501 474 418 368 ... 295 341 375 438 511 580 721 785 909 right
washingtonexaminer.com 0.591300 1.2520 7627 40275 486 488 448 419 362 294 ... 184 226 250 286 362 429 562 670 775 right
whitehouse.gov 0.520500 1.2360 6181 3090 381 388 362 333 311 261 ... 159 179 210 263 296 320 421 530 587 right
breitbart.com 0.962000 1.1930 5966 75526 230 187 205 189 196 152 ... 169 207 256 306 359 442 584 706 856 right
dailycaller.com 0.947800 1.2000 5840 41847 229 209 209 199 185 152 ... 155 214 248 307 357 433 569 683 819 right
nationalreview.com 0.745000 1.2480 5717 10479 310 298 282 268 237 205 ... 142 178 201 256 312 362 482 570 680 right
washingtontimes.com 0.822000 1.2400 5472 27781 256 264 263 225 192 179 ... 147 178 212 256 283 358 493 598 705 right
insider.foxnews.com 0.960000 1.2110 5098 13358 202 185 176 182 163 157 ... 130 166 210 248 302 363 514 599 726 right
dailywire.com 1.302000 1.0250 4066 15773 62 65 50 75 77 49 ... 99 155 189 253 316 378 504 615 743 right
thefederalist.com 1.187500 1.1330 4008 18348 111 104 97 96 94 77 ... 91 135 167 204 288 339 458 588 688 right

50 rows × 25 columns
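The table above lists the ten most-shared domains within each bucket. A minimal sketch of how such a view can be produced with pandas (the `media` DataFrame and its values below are a toy stand-in, not the real dataset; `top_per_bucket` is a name I've introduced for illustration):

```python
import pandas as pd

# Toy stand-in for the media-source table above (hypothetical rows/values)
media = pd.DataFrame({
    'domain': ['slate.com', 'msnbc.com', 'nytimes.com', 'cnn.com', 'foxnews.com'],
    'num_sharers': [8147, 7443, 18618, 16541, 9754],
    'ideo_group_optim': ['left', 'left', 'center-left', 'center-left', 'right'],
}).set_index('domain')

# Rank all sources by sharer count, then keep the top N within each bucket.
# groupby(...).head(N) preserves the sorted order within each group.
top_per_bucket = (media.sort_values('num_sharers', ascending=False)
                       .groupby('ideo_group_optim', sort=False)
                       .head(2))
```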

In [34]:
ax = df['ideo_group_optim'].value_counts().reindex(['left', 'center-left', 'center', 'center-right', 'right'])\
    .plot.bar(color=['#0d3b6e', '#869db6', '#2a7526', '#d8919e', '#b1243e'])
_ = ax.set_title('Number of media sources per bucket')

Observations

  • This pushes a lot of sites to the edges.
  • The center groups end up being really small.
  • We can't expect more than ~66% consistency with PP&D buckets if we just draw lines on the Barbera distribution.
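One way to quantify that consistency figure is simple label agreement over the domains both schemes cover. A sketch, assuming two hypothetical label Series (the domains and labels here are made up, and `consistency` is just the fraction of shared domains where the two bucketings agree):

```python
import pandas as pd

# Hypothetical bucket labels: one from the Barbera-based partition, one from PP&D
barbera = pd.Series(['left', 'center', 'right', 'right'],
                    index=['a.com', 'b.com', 'c.com', 'd.com'])
ppd = pd.Series(['left', 'center-left', 'right', 'center-right'],
                index=['a.com', 'b.com', 'c.com', 'd.com'])

# Compare only domains present in both labelings
shared = barbera.index.intersection(ppd.index)
consistency = (barbera[shared] == ppd[shared]).mean()  # fraction of exact matches
```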