Testing Media Source Ideology Estimates Sensitivity to Samples

Updated Dec. 21, 2018

We previously estimated media source ideologies using a random sample of 30k Twitter users. To test how stable our ideology estimates are, we grabbed the Twitter histories of another random sample of 30k Twitter users (from our set with ideology estimates). Here are details about the first and second samples:

Users with estimated ideologies:           2,926,841
Users after removing duplicate ideologies: 1,699,669

First Sample
------------
Users in sample:             30,000
Total tweets:            45,316,914
Tweets w/ URLs:          19,563,609
Retweets w/ URLs:        13,008,418
Original tweets w/ URLs:  6,555,488
Users w/ (re)tweets w/ URLs: 17,171

Second Sample
-------------
Users in sample:             30,000
Total tweets:            54,263,754
Tweets w/ URLs:          23,432,066
Retweets w/ URLs:        15,805,171
Original tweets w/ URLs:  7,626,895
Users w/ (re)tweets w/ URLs: 17,210

Summary of Results

Sites with estimates in both samples: 6,191
Range of site estimates across both samples: -0.85 to 2.16
Mean site score difference between samples: 0.124
Correlation between scores: 0.977
Correlation between ranks:  0.958
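
For reference, these summary correlations can be computed directly with pandas. This is a sketch on synthetic data, not the real site table; the actual analysis uses the s1_mean_sharer_ideo and s2_mean_sharer_ideo columns built in the Details section.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the site-level table: a "true" score per site
# plus independent noise for the second sample's estimate.
rng = np.random.default_rng(0)
s1_scores = rng.normal(0.4, 0.8, size=1000)
s2_scores = s1_scores + rng.normal(0.0, 0.15, size=1000)
df = pd.DataFrame({'s1_mean_sharer_ideo': s1_scores,
                   's2_mean_sharer_ideo': s2_scores})

# Pearson correlation of the scores, Spearman correlation of the ranks,
# and the mean absolute score difference between samples.
pearson = df['s1_mean_sharer_ideo'].corr(df['s2_mean_sharer_ideo'])
spearman = df['s1_mean_sharer_ideo'].corr(df['s2_mean_sharer_ideo'],
                                          method='spearman')
mean_abs_diff = (df['s1_mean_sharer_ideo']
                 - df['s2_mean_sharer_ideo']).abs().mean()
```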

Sites that disagree the most:

In [189]:
df.sort_values('abs_score_diff', ascending=False).head(10).loc[:,['s1_mean_sharer_ideo', 's2_mean_sharer_ideo', 'abs_score_diff', 's1_num_sharers', 's2_num_sharers']]
Out[189]:
site                  s1_mean_sharer_ideo  s2_mean_sharer_ideo  abs_score_diff  s1_num_sharers  s2_num_sharers
entrepreneursoft.com            -0.490723             0.767578        1.257812              21              26
businesslive.co.za              -0.081421             1.033203        1.114258              50             100
krdo.com                         1.282227             0.276123        1.005859              25              35
naplesnews.com                  -0.033417             0.970703        1.003906             125             333
wctv.tv                          0.191162             1.179688        0.988281              54             161
dailybulletin.com                0.671387            -0.293213        0.964844              35              90
wsbt.com                        -0.040741             0.876953        0.917480              27              36
multichannel.com                 1.081055             0.197144        0.883789              29              33
nvsos.gov                        0.478027            -0.374023        0.852051              35              77
theinformation.com               0.421875            -0.424316        0.846191              26              38

Let's set an arbitrary cutoff at 0.3 and say any site that has a score difference above that is worth looking into. Here are the sites shared by the most users with score differences above 0.3.

In [193]:
df.query('abs_score_diff >= 0.3').sort_values('s2_num_sharers', ascending=False).head(10).loc[:,['s1_mean_sharer_ideo', 's2_mean_sharer_ideo', 'abs_score_diff', 's1_num_sharers', 's2_num_sharers']]
Out[193]:
site                 s1_mean_sharer_ideo  s2_mean_sharer_ideo  abs_score_diff  s1_num_sharers  s2_num_sharers
floridapolitics.com            -0.272949             0.249756        0.522461             309             512
naplesnews.com                 -0.033417             0.970703        1.003906             125             333
secretservice.gov               1.549805             1.241211        0.308594             248             300
collins.senate.gov             -0.467041             0.145264        0.612305             201             293
dos.myflorida.com              -0.500488             0.039429        0.540039             110             273
plus.google.com                 0.143921             0.485596        0.341797             234             249
caller.com                      0.360596             0.003405        0.357178             187             229
fox8.com                        0.955566             0.648926        0.306641             178             196
votetexas.gov                  -0.384277            -0.006847        0.377441             114             195
ch7.io                          1.507812             0.803223        0.704590              98             184

After investigating the differences between these sites and sites with better estimates, I couldn't find an easy way to trim out the questionable scores without also removing a lot of good ones. I think the best estimates we're going to get will come from simply combining the two samples.
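
Combining the samples is mechanically simple given the per-site sharer-ideology vectors (a sketch with made-up values; the real version would concatenate s1[site].dropna() and s2[site].dropna() for each site, which works because the two user samples are disjoint):

```python
import numpy as np
import pandas as pd

# Hypothetical NaN-padded sharer-ideology columns for one site, shaped
# like the s1/s2 DataFrames loaded in the Details section.
s1_site = pd.Series([0.1, 0.5, -0.2, np.nan])
s2_site = pd.Series([0.9, 1.1, np.nan, np.nan])

# Pooled estimate: stack both samples' sharers and take the mean.
combined = pd.concat([s1_site.dropna(), s2_site.dropna()],
                     ignore_index=True)
pooled_mean = combined.mean()  # mean over all 5 sharers
```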

Details

In [1]:
%matplotlib inline
import gzip, pickle, collections, itertools

import pandas as pd
import plotly.offline as plotly
import plotly.graph_objs as go
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbs

plotly.init_notebook_mode()

Let's load in all the data.

In [2]:
%%time
user_ideos = pd.read_csv('data/cleaned_user_ideology_estimates_20180705.csv.gz', index_col=0)
uid_to_ideo = dict(user_ideos['normed_theta'].items())
MIN_USERS = 30

s1 = pd.read_pickle('data/sample1/subdomain_to_users.pkl')
s1 = {k: v for k, v in s1.items() if len(v) >= MIN_USERS}
max_length = max(len(v) for k,v in s1.items())
s1 = pd.DataFrame.from_dict({k:[uid_to_ideo.get(u, None) for u in v] + ([np.nan] * (max_length - len(v))) for k,v in s1.items()}, dtype='float16')
subdomain_to_num_urls_s1 = pd.Series({k: len(v) for k,v in pd.read_pickle('data/sample1/subdomain_to_urls.pkl').items()})

s2 = pd.read_pickle('data/sample2/subdomain_to_users.pkl')
s2 = {k: v for k, v in s2.items() if len(v) >= MIN_USERS}
max_length = max(len(v) for k,v in s2.items())
s2 = pd.DataFrame.from_dict({k:[uid_to_ideo.get(u, None) for u in v] + ([np.nan] * (max_length - len(v))) for k,v in s2.items()}, dtype='float16')
subdomain_to_num_urls_s2 = pd.Series({k: len(v) for k,v in pd.read_pickle('data/sample2/subdomain_to_urls.pkl').items()})
/berkman/home/jclark/miniconda3/lib/python3.7/site-packages/numpy/lib/arraysetops.py:522: FutureWarning:

elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison

CPU times: user 1min 3s, sys: 2.39 s, total: 1min 6s
Wall time: 1min 4s
In [3]:
%%time
df = pd.DataFrame({
    's1_mean_sharer_ideo': s1.mean(), 
    's1_stddev_sharer_ideo': s1.std(),
    's1_num_sharers': s1.count(),
    's1_num_uniq_urls': subdomain_to_num_urls_s1,
    
    's2_mean_sharer_ideo': s2.mean(), 
    's2_stddev_sharer_ideo': s2.std(),
    's2_num_sharers': s2.count(),
    's2_num_uniq_urls': subdomain_to_num_urls_s2,
}, index=s1.columns.intersection(s2.columns))
display(df.sample(5))
site              s1_mean_sharer_ideo  s1_stddev_sharer_ideo  s1_num_sharers  s1_num_uniq_urls  s2_mean_sharer_ideo  s2_stddev_sharer_ideo  s2_num_sharers  s2_num_uniq_urls
verne.elpais.com             0.349609               0.973145              25               173            -0.160034               0.733887              21               185
saraacarter.com              1.721680               0.727051             669               375             1.688477               0.750488             725               518
wsoctv.com                   0.634766               1.332031              91              4867             0.581055               1.342773             101               187
generalassemb.ly            -0.373047               0.768555              28                70            -0.373047               0.668945              35                72
stuff.co.nz                 -0.097839               1.038086             138               717             0.077026               1.136719             146               231
CPU times: user 27.4 s, sys: 308 ms, total: 27.7 s
Wall time: 23.8 s
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 6191 entries, twitter.com to readthememo.org
Data columns (total 8 columns):
s1_mean_sharer_ideo      6191 non-null float16
s1_stddev_sharer_ideo    6191 non-null float16
s1_num_sharers           6191 non-null int64
s1_num_uniq_urls         6191 non-null int64
s2_mean_sharer_ideo      6191 non-null float16
s2_stddev_sharer_ideo    6191 non-null float16
s2_num_sharers           6191 non-null int64
s2_num_uniq_urls         6191 non-null int64
dtypes: float16(4), int64(4)
memory usage: 290.2+ KB

Sample Comparisons

Now that we have estimates from both samples, let's compare them. I'm including some transformations of the data to make relationships more linear and easier to eyeball.

In [87]:
df['s1_ideo_rank'] = df['s1_mean_sharer_ideo'].rank()
df['s2_ideo_rank'] = df['s2_mean_sharer_ideo'].rank()
df['abs_rank_diff'] = (df['s1_ideo_rank'] - df['s2_ideo_rank']).abs()
df['abs_score_diff'] = (df['s1_mean_sharer_ideo'] - df['s2_mean_sharer_ideo']).abs()
df['s1_num_sharers_logged'] = df['s1_num_sharers'].apply(np.log)
df['s2_num_sharers_logged'] = df['s2_num_sharers'].apply(np.log)
df['s1_num_uniq_urls_logged'] = df['s1_num_uniq_urls'].apply(np.log)
df['s2_num_uniq_urls_logged'] = df['s2_num_uniq_urls'].apply(np.log)
df['abs_score_diff_sqrt'] = (df['abs_score_diff']).apply(np.sqrt)
df['abs_rank_diff_logged'] = (df['abs_rank_diff'] + 1).apply(np.log)
#df = df.drop('twitter.com')
In [88]:
f, ax = plt.subplots(figsize=(15, 10))
_ = sbs.heatmap(df.corr(), annot=True, fmt=".3f", linewidths=.5, ax=ax, vmin=-1, vmax=1, center=0)

Observations

  • Correlation between the scores (Pearson) is good at 0.977 (s1_mean_sharer_ideo vs. s2_mean_sharer_ideo).
  • Correlation between the ranks (Spearman) is still good, but lower, at 0.958 (s1_ideo_rank vs. s2_ideo_rank).
  • There's a lot of correlation here, but for now I'm just looking at how the score and rank differences correlate with the other features, to see what might be contributing to discrepancies between samples.
  • abs_score_diff_sqrt is correlated with within-sample standard deviation, which makes sense.
  • It's inversely correlated with the number of sharers, which also makes sense.
  • It's slightly inversely correlated with the number of unique URLs, but that's partially dependent on the number of sharers, so it's expected.
  • The raw score difference (abs_score_diff) is also inversely correlated with the number of sharers.
  • I don't fully grasp why the rank and score differences correlate so differently with the other features; I think I need more graphs for that.
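
On that last point, a toy example (pandas only, nothing from the real data) shows one reason score and rank differences behave differently: in the dense middle of the distribution, a tiny score shift produces a large rank shift.

```python
import pandas as pd

# Seven sites: most scores packed near zero, plus two extremes.
scores = pd.Series([-1.0, -0.02, -0.01, 0.0, 0.01, 0.02, 1.0])
shifted = scores.copy()
shifted.iloc[3] = 0.03  # nudge the middle site by only 0.03

# The 0.03 score change moves the site past two neighbors, so its
# rank jumps by 2 while its score barely moves at all.
rank_diff = (scores.rank() - shifted.rank()).abs()
score_diff = (scores - shifted).abs()
```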
In [79]:
_ = df['s1_mean_sharer_ideo'].plot.hist(bins=200, alpha=0.4)
_ = df['s2_mean_sharer_ideo'].plot.hist(bins=200, alpha=0.4)

This looks good. Let's graph every pair of variables.

In [90]:
_ = sbs.pairplot(df, vars=['s1_mean_sharer_ideo', 's2_mean_sharer_ideo', 's1_ideo_rank', 's2_ideo_rank',  
                           's2_num_sharers_logged', 's2_stddev_sharer_ideo', 's2_num_uniq_urls_logged',
                           'abs_score_diff_sqrt', 'abs_rank_diff_logged'],
                 plot_kws={'alpha': 0.1}, diag_kind='kde')

Observations

  • The mean-mean graph and the rank-rank graph look decent. Most of the discrepancies are sites in the middle, which can also be seen in the mean_sharer_ideo vs. abs_score_diff_sqrt graphs.
  • The graph of mean sharer ideology vs. standard deviation of sharer ideologies looks like a fermata symbol. The curve makes sense because the ideology distribution is bimodal: the more evenly a site's audience mixes the two modes, the higher the standard deviation. The dot inside the fermata is a cluster of Spanish-language sites whose users are actually estimated to be center-right, hence the lower standard deviation.
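
The bimodality story is easy to check with a toy simulation (made-up mode locations, not fit to the real data): audiences drawn purely from one mode have a low standard deviation, while an even mix has a near-center mean and a much higher standard deviation.

```python
import numpy as np

rng = np.random.default_rng(42)

def site_stats(frac_right, n=5000):
    """Mean and std dev of a simulated audience drawing frac_right
    of its sharers from the right mode and the rest from the left."""
    n_right = int(n * frac_right)
    left = rng.normal(-1.0, 0.3, size=n - n_right)   # left mode
    right = rng.normal(1.5, 0.3, size=n_right)       # right mode
    sharers = np.concatenate([left, right])
    return sharers.mean(), sharers.std()

pure_left = site_stats(0.0)
mixed = site_stats(0.5)
pure_right = site_stats(1.0)
```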

The range of scores for media sources across both samples is -0.85 to 2.16, so a difference of 0.3 is roughly a tenth of the full scale, close to the largest we'd tolerate while still trusting our scores. What are the biggest sites with a score difference greater than 0.3?

In [102]:
df.query('abs_score_diff >= 0.3').sort_values(by='s1_num_sharers', ascending=False).head(30)
Out[102]:
s1_mean_sharer_ideo s1_stddev_sharer_ideo s1_num_sharers s1_num_uniq_urls s2_mean_sharer_ideo s2_stddev_sharer_ideo s2_num_sharers s2_num_uniq_urls s1_ideo_rank s2_ideo_rank abs_rank_diff abs_score_diff s1_num_sharers_logged s2_num_sharers_logged abs_rank_diff_logged abs_score_diff_sqrt s1_num_uniq_urls_logged s2_num_uniq_urls_logged
floridapolitics.com -0.272949 0.922363 309 825 0.249756 1.278320 512 873 2123.0 3834.0 1711.0 0.522461 5.733341 6.238325 7.445418 0.722656 6.715383 6.771936
secretservice.gov 1.549805 0.979492 248 112 1.241211 1.128906 300 163 5265.5 5076.0 189.5 0.308594 5.513429 5.703782 5.249652 0.555664 4.718499 5.093750
plus.google.com 0.143921 1.039062 234 1407 0.485596 1.128906 249 2088 3629.5 4383.0 753.5 0.341797 5.455321 5.517453 6.626055 0.584473 7.249215 7.643962
collins.senate.gov -0.467041 0.645020 201 52 0.145264 1.209961 293 46 1368.0 3533.0 2165.0 0.612305 5.303305 5.680173 7.680637 0.782715 3.951244 3.828641
caller.com 0.360596 1.336914 187 133 0.003405 1.172852 229 141 4174.5 3014.0 1160.5 0.357178 5.231109 5.433722 7.057468 0.597656 4.890349 4.948760
fox8.com 0.955566 1.246094 178 193 0.648926 1.308594 196 311 4913.0 4614.5 298.5 0.306641 5.181784 5.278115 5.702114 0.553711 5.262690 5.739793
brainyquote.com 0.104675 1.146484 147 313 0.487549 1.274414 136 495 3501.0 4387.5 886.5 0.382812 4.990433 4.912655 6.788409 0.618652 5.746203 6.204558
gpo.gov 0.833496 1.276367 147 114 0.526855 1.286133 183 154 4814.5 4450.5 364.0 0.306641 4.990433 5.209486 5.899897 0.553711 4.736198 5.036953
walmart.com 1.189453 1.250000 138 187 0.800293 1.301758 144 123 5042.5 4789.5 253.0 0.389160 4.927254 4.969813 5.537334 0.624023 5.231109 4.812184
travel.state.gov 0.850586 1.224609 129 63 0.538086 1.229492 142 79 4825.0 4470.5 354.5 0.312500 4.859812 4.955827 5.873525 0.559082 4.143135 4.369448
duluthnewstribune.com -0.327637 0.834473 129 156 -0.017868 1.083008 171 201 1904.5 2945.0 1040.5 0.309814 4.859812 5.141664 6.948417 0.556641 5.049856 5.303305
naplesnews.com -0.033417 1.117188 125 351 0.970703 1.246094 333 133 3004.0 4910.0 1906.0 1.003906 4.828314 5.808142 7.553287 1.001953 5.860786 4.890349
floridadisaster.org 0.728516 1.250977 124 43 1.047852 1.133789 181 45 4720.5 4960.0 239.5 0.319336 4.820282 5.198497 5.482720 0.564941 3.761200 3.806662
jewishjournal.com 0.689941 1.336914 119 69 0.339844 1.280273 142 106 4689.5 4063.5 626.0 0.350098 4.779123 4.955827 6.440947 0.591797 4.234107 4.663439
wmal.com 1.597656 0.937012 117 203 0.997070 1.282227 116 180 5294.0 4921.0 373.0 0.600586 4.762174 4.753590 5.924256 0.774902 5.313206 5.192957
kob.com 1.033203 1.214844 117 107 0.530762 1.264648 101 135 4960.0 4458.0 502.0 0.502441 4.762174 4.615121 6.220590 0.708984 4.672829 4.905275
votetexas.gov -0.384277 0.884766 114 22 -0.006847 1.173828 195 34 1688.0 2978.0 1290.0 0.377441 4.736198 5.273000 7.163172 0.614258 3.091042 3.526361
dos.myflorida.com -0.500488 0.771973 110 24 0.039429 1.219727 273 47 1219.0 3165.0 1946.0 0.540039 4.700480 5.609472 7.574045 0.734863 3.178054 3.850148
thelocal.fr 0.910156 1.282227 107 137 0.348877 1.312500 160 116 4876.0 4099.5 776.5 0.561523 4.672829 5.075174 6.656084 0.749512 4.919981 4.753590
wesh.com 0.082703 1.130859 107 295 0.421631 1.239258 109 212 3421.0 4260.5 839.5 0.338867 4.672829 4.691348 6.733997 0.582031 5.686975 5.356586
tpusa.com 1.963867 0.526367 104 19 1.638672 0.787109 180 23 6118.0 5317.0 801.0 0.325195 4.644391 5.192957 6.687109 0.570312 2.944439 3.135494
wbrc.com 0.604004 1.350586 104 107 0.231201 1.251953 117 151 4585.0 3789.0 796.0 0.372803 4.644391 4.762174 6.680855 0.610352 4.672829 5.017280
journals.lww.com -0.088440 0.992188 103 255 -0.396240 0.723145 104 120 2815.0 1542.5 1272.5 0.307861 4.634729 4.644391 7.149524 0.554688 5.541264 4.787492
theblast.com 0.694824 1.357422 103 48 1.164062 1.272461 162 72 4692.0 5035.0 343.0 0.469238 4.634729 5.087596 5.840642 0.685059 3.871201 4.276666
houstonpress.com -0.208130 0.993164 102 106 0.205322 1.260742 134 100 2370.0 3707.5 1337.5 0.413574 4.624973 4.897840 7.199305 0.643066 4.663439 4.605170
journalnow.com -0.075500 1.073242 102 106 0.256592 1.203125 120 125 2864.0 3854.0 990.0 0.332031 4.624973 4.787492 6.898715 0.576172 4.663439 4.828314
ch7.io 1.507812 0.864258 98 198 0.803223 1.315430 184 1330 5222.0 4793.0 429.0 0.704590 4.584967 5.214936 6.063785 0.839355 5.288267 7.192934
reverbnation.com 0.323975 1.213867 97 350 0.698730 1.282227 132 287 4100.0 4686.0 586.0 0.374756 4.574711 4.882802 6.375025 0.612305 5.857933 5.659482
desantis.house.gov 0.936035 1.329102 95 20 1.243164 1.254883 80 17 4901.0 5077.5 176.5 0.307129 4.553877 4.382027 5.178971 0.554199 2.995732 2.833213
fox43.com 1.002930 1.182617 91 120 0.666992 1.332031 106 99 4940.0 4638.0 302.0 0.335938 4.510860 4.663439 5.713733 0.579590 4.787492 4.595120

Observations

  • There's a bunch of Florida stuff in here. (floridapolitics.com, naplesnews.com, floridadisaster.org, dos.myflorida.com, wesh.com)
  • There's a bunch of federal gov't stuff in here.
  • Fox 8 is Cleveland
  • WMAL is DC AM radio
  • KOB is New Mexico
  • WBRC is Alabama
  • Fox 43 is Pennsylvania
  • A few Texas sites (caller.com, votetexas.gov)
  • So a lot of swing states are represented here, though not all of these local outlets are in swing states.

Let's see what the underlying user ideology distributions look like for each sample for these sites.

In [175]:
fig, axes = plt.subplots(5, 6, figsize=(20, 12), sharex=True)
subdomains = df.query('abs_score_diff >= 0.3').sort_values(by='s1_num_sharers', ascending=False).head(30).index.values
for i, s in enumerate(subdomains):
    ax = axes[i // 6][i % 6]
    ax.set_title('{} - {:.2f}'.format(s, df.loc[s, 'abs_score_diff']))
    s1[s].plot.hist(bins=30, alpha=0.4, density=True, ax=ax)
    s2[s].plot.hist(bins=30, alpha=0.4, density=True, ax=ax)

Observations

  • Some of these, like floridapolitics.com, naplesnews.com, and ch7.io, look really different between samples.
  • If the underlying distributions are this different between samples, I'm not sure what we can do. Just crazy bad samples?

What happens if I bootstrap within each sample for a site to see what confidence intervals for the mean might look like?

In [196]:
subdomain = 'floridapolitics.com'
_ = pd.Series([s1[subdomain].dropna().sample(frac=1, replace=True).mean() for _ in range(1000)]).plot.hist(bins=50, alpha=0.4)
_ = pd.Series([s2[subdomain].dropna().sample(frac=1, replace=True).mean() for _ in range(1000)]).plot.hist(bins=50, alpha=0.4)

Observations

  • Those are nowhere close.
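
The same check can be made numeric with percentile bootstrap confidence intervals. This sketch uses synthetic sharer vectors matching floridapolitics.com's per-sample means and standard deviations from the table above; the real version would pass s1[subdomain].dropna() and s2[subdomain].dropna().

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(values, n_boot=1000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean."""
    means = [rng.choice(values, size=len(values), replace=True).mean()
             for _ in range(n_boot)]
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

# Synthetic stand-ins for one site's sharer ideologies in each sample.
s1_vals = rng.normal(-0.27, 0.92, size=309)
s2_vals = rng.normal(0.25, 1.28, size=512)

lo1, hi1 = bootstrap_ci(s1_vals)
lo2, hi2 = bootstrap_ci(s2_vals)
overlap = (lo1 <= hi2) and (lo2 <= hi1)  # do the 95% CIs overlap?
```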

I don't know what to do with this, so here's an interactive graph.

In [198]:
def interactive_scatter(data, x, y):
    scatter1 = go.Scatter(
        x=data[x],
        y=data[y],
        mode='markers',
        text=data.index,
        hoverinfo='text',
        marker=dict(
            #color=site_ideo['mean_ideology'].loc[dom_ideo_normed_hist.index],
            #cmin=-1, cmax=1,
            #size=plotted_df['num_uniq_users_sample1'],
            #sizemode='area',
            #sizemin=2,
            #sizeref=30,
            opacity=0.1
        )
    )

    layout1 = go.Layout(
        #title ='Domain Similarity by Distribution of Audience Ideology',
        hovermode='closest',
        #xaxis = dict(visible = False),
        #yaxis = dict(visible = False),
        width=800,
        height=600
    )
    plotly.iplot(go.Figure(data=[scatter1], layout=layout1))
In [199]:
interactive_scatter(df, 's1_mean_sharer_ideo', 's2_mean_sharer_ideo')