Updated Dec. 21, 2018
We previously estimated media source ideologies using a random sample of 30k Twitter users. To test how stable our ideology estimates are, we grabbed the Twitter histories of another random sample of 30k Twitter users (from our set with ideology estimates). Here are details about the first and second samples:
Users with estimated ideologies: 2,926,841
Users after removing duplicate ideologies: 1,699,669
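The sampling step itself isn't shown here; below is a minimal sketch of how a second 30k draw might work, assuming the deduplicated estimates are indexed by user ID. The `draw_sample` helper and its arguments are hypothetical, as is the assumption that the second sample excludes users already drawn in the first.

```python
import pandas as pd

def draw_sample(user_ideos, exclude_uids, n=30_000, seed=2018):
    """Draw n users from the ideology-estimate set, skipping already-sampled IDs.

    Hypothetical helper: the real sampling code isn't shown in this notebook,
    and excluding first-sample users is an assumption.
    """
    pool = user_ideos.loc[~user_ideos.index.isin(exclude_uids)]
    return pool.sample(n=n, random_state=seed)
```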
First Sample
------------
Users in sample: 30,000
Total tweets: 45,316,914
Tweets w/ URLs: 19,563,609
Retweets w/ URLs: 13,008,418
Original tweets w/ URLs: 6,555,488
Users w/ (re)tweets w/ URLs: 17,171
Second Sample
-------------
Users in sample: 30,000
Total tweets: 54,263,754
Tweets w/ URLs: 23,432,066
Retweets w/ URLs: 15,805,171
Original tweets w/ URLs: 7,626,895
Users w/ (re)tweets w/ URLs: 17,210
Sites with estimates in both samples: 6,191
Range of site estimates across both samples: -0.85 to 2.16
Mean site score difference between samples: 0.124
Correlation between scores: 0.977
Correlation between ranks: 0.958
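These summary numbers can be recomputed from the per-site columns built later in the notebook; a sketch, where `agreement_stats` is a hypothetical helper but the column names match the merged frame below:

```python
import pandas as pd

def agreement_stats(df):
    """Summarize agreement between the two samples' per-site scores."""
    s1 = df['s1_mean_sharer_ideo']
    s2 = df['s2_mean_sharer_ideo']
    return {
        'mean_abs_diff': (s1 - s2).abs().mean(),
        'score_corr': s1.corr(s2),                    # Pearson, on raw scores
        'rank_corr': s1.corr(s2, method='spearman'),  # correlation of ranks
    }
```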
Sites that disagree the most:
(df.sort_values('abs_score_diff', ascending=False)
   .head(10)
   .loc[:, ['s1_mean_sharer_ideo', 's2_mean_sharer_ideo', 'abs_score_diff',
            's1_num_sharers', 's2_num_sharers']])
Let's set an arbitrary cutoff at 0.3 and say any site that has a score difference above that is worth looking into. Here are the sites shared by the most users with score differences above 0.3.
(df.query('abs_score_diff >= 0.3')
   .sort_values('s2_num_sharers', ascending=False)
   .head(10)
   .loc[:, ['s1_mean_sharer_ideo', 's2_mean_sharer_ideo', 'abs_score_diff',
            's1_num_sharers', 's2_num_sharers']])
After investigating the differences between these sites and sites with better estimates, I couldn't find an easy way to trim out the questionable scores without also removing a lot of good ones. I think the best estimates we're going to get will come from simply combining the two samples.
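One way to combine the samples is a sharer-count-weighted average of the two per-sample means, which is equivalent to averaging over the pooled sharers. A sketch, assuming the merged frame's column names (the `pooled_site_scores` helper is hypothetical):

```python
import pandas as pd

def pooled_site_scores(df):
    """Combine per-sample means into one score per site, weighting each
    sample's mean by its sharer count (equivalent to pooling the sharers)."""
    n1, n2 = df['s1_num_sharers'], df['s2_num_sharers']
    m1, m2 = df['s1_mean_sharer_ideo'], df['s2_mean_sharer_ideo']
    return (n1 * m1 + n2 * m2) / (n1 + n2)
```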
%matplotlib inline
import gzip, pickle, collections, itertools
import pandas as pd
import plotly.offline as plotly
import plotly.graph_objs as go
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbs
plotly.init_notebook_mode()
Let's load in all the data.
%%time
user_ideos = pd.read_csv('data/cleaned_user_ideology_estimates_20180705.csv.gz', index_col=0)
uid_to_ideo = user_ideos['normed_theta'].to_dict()
MIN_USERS = 30

def load_sample(sample_dir):
    """Load one sample: per-subdomain sharer ideologies plus unique-URL counts."""
    users = pd.read_pickle(sample_dir + '/subdomain_to_users.pkl')
    users = {k: v for k, v in users.items() if len(v) >= MIN_USERS}
    max_length = max(len(v) for v in users.values())
    # Pad each subdomain's sharer-ideology list with NaN to a common length.
    ideos = pd.DataFrame.from_dict(
        {k: [uid_to_ideo.get(u) for u in v] + [np.nan] * (max_length - len(v))
         for k, v in users.items()},
        dtype='float16')
    num_urls = pd.Series({k: len(v) for k, v in
                          pd.read_pickle(sample_dir + '/subdomain_to_urls.pkl').items()})
    return ideos, num_urls

s1, subdomain_to_num_urls_s1 = load_sample('data/sample1')
s2, subdomain_to_num_urls_s2 = load_sample('data/sample2')
%%time
df = pd.DataFrame({
's1_mean_sharer_ideo': s1.mean(),
's1_stddev_sharer_ideo': s1.std(),
's1_num_sharers': s1.count(),
's1_num_uniq_urls': subdomain_to_num_urls_s1,
's2_mean_sharer_ideo': s2.mean(),
's2_stddev_sharer_ideo': s2.std(),
's2_num_sharers': s2.count(),
's2_num_uniq_urls': subdomain_to_num_urls_s2,
}, index=s1.columns.intersection(s2.columns))
display(df.sample(5))
df.info()
Now we have estimates from both samples side by side, so let's compare them. I'm including some transformations of the data to make relationships more linear and easier to eyeball.
df['s1_ideo_rank'] = df['s1_mean_sharer_ideo'].rank()
df['s2_ideo_rank'] = df['s2_mean_sharer_ideo'].rank()
df['abs_rank_diff'] = (df['s1_ideo_rank'] - df['s2_ideo_rank']).abs()
df['abs_score_diff'] = (df['s1_mean_sharer_ideo'] - df['s2_mean_sharer_ideo']).abs()
df['s1_num_sharers_logged'] = df['s1_num_sharers'].apply(np.log)
df['s2_num_sharers_logged'] = df['s2_num_sharers'].apply(np.log)
df['s1_num_uniq_urls_logged'] = df['s1_num_uniq_urls'].apply(np.log)
df['s2_num_uniq_urls_logged'] = df['s2_num_uniq_urls'].apply(np.log)
df['abs_score_diff_sqrt'] = (df['abs_score_diff']).apply(np.sqrt)
df['abs_rank_diff_logged'] = (df['abs_rank_diff'] + 1).apply(np.log)
#df = df.drop('twitter.com')
f, ax = plt.subplots(figsize=(15, 10))
_ = sbs.heatmap(df.corr(), annot=True, fmt=".3f", linewidths=.5, ax=ax, vmin=-1, vmax=1, center=0)
Observations
- Scores track closely across samples (s1_mean_sharer_ideo vs. s2_mean_sharer_ideo)
- Ranks track closely too (s1_ideo_rank vs. s2_ideo_rank)
- abs_score_diff_sqrt is correlated with within-sample standard deviation, which makes sense

_ = df['s1_mean_sharer_ideo'].plot.hist(bins=200, alpha=0.4)
_ = df['s2_mean_sharer_ideo'].plot.hist(bins=200, alpha=0.4)
This looks good. Let's graph every pair of variables.
_ = sbs.pairplot(df, vars=['s1_mean_sharer_ideo', 's2_mean_sharer_ideo', 's1_ideo_rank', 's2_ideo_rank',
's2_num_sharers_logged', 's2_stddev_sharer_ideo', 's2_num_uniq_urls_logged',
'abs_score_diff_sqrt', 'abs_rank_diff_logged'],
plot_kws={'alpha': 0.1}, diag_kind='kde')
Observations
The range of scores for media sources is -0.84 to 2.0, so a difference of 0.3 is close to the largest we can tolerate and still trust our scores. What are the biggest sites with a score difference greater than 0.3?
df.query('abs_score_diff >= 0.3').sort_values(by='s1_num_sharers', ascending=False).head(30)
Observations
- floridapolitics.com, naplesnews.com, floridadisaster.org, dos.myflorida.com, wesh.com

Let's see what the underlying user ideology distributions look like for each sample for these sites.
fig, axes = plt.subplots(5, 6, figsize=(20, 12), sharex=True)
subdomains = df.query('abs_score_diff >= 0.3').sort_values(by='s1_num_sharers', ascending=False).head(30).index.values
for i, s in enumerate(subdomains):
    ax = axes[i // 6][i % 6]
    ax.set_title('{} - {:.2f}'.format(s, df.loc[s, 'abs_score_diff']))
    s1[s].plot.hist(bins=30, alpha=0.4, density=True, ax=ax)
    s2[s].plot.hist(bins=30, alpha=0.4, density=True, ax=ax)
Observations
- floridapolitics.com, naplesnews.com, ch7.io

What happens if I bootstrap within each sample for a site to see what confidence intervals for the mean might look like?
subdomain = 'floridapolitics.com'
_ = pd.Series([s1[subdomain].dropna().sample(frac=1, replace=True).mean() for _ in range(1000)]).plot.hist(bins=50, alpha=0.4)
_ = pd.Series([s2[subdomain].dropna().sample(frac=1, replace=True).mean() for _ in range(1000)]).plot.hist(bins=50, alpha=0.4)
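The resampled means above can also be turned into an explicit percentile interval; a sketch, where `bootstrap_ci` is a hypothetical helper independent of the plotting code:

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean (NaNs dropped).

    Resample with replacement n_boot times, then take the empirical
    alpha/2 and 1 - alpha/2 quantiles of the resampled means.
    """
    vals = np.asarray(values, dtype=float)
    vals = vals[~np.isnan(vals)]
    rng = np.random.default_rng(seed)
    means = rng.choice(vals, size=(n_boot, len(vals)), replace=True).mean(axis=1)
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

Non-overlapping intervals for `bootstrap_ci(s1[subdomain])` and `bootstrap_ci(s2[subdomain])` would suggest the shift between samples is more than resampling noise.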
Observations
I don't know what to do with this, so here's an interactive graph.
def interactive_scatter(data, x, y):
    scatter1 = go.Scatter(
        x=data[x],
        y=data[y],
        mode='markers',
        text=data.index,
        hoverinfo='text',
        marker=dict(
            #color=site_ideo['mean_ideology'].loc[dom_ideo_normed_hist.index],
            #cmin=-1, cmax=1,
            #size=plotted_df['num_uniq_users_sample1'],
            #sizemode='area',
            #sizemin=2,
            #sizeref=30,
            opacity=0.1
        )
    )
    layout1 = go.Layout(
        #title='Domain Similarity by Distribution of Audience Ideology',
        hovermode='closest',
        #xaxis=dict(visible=False),
        #yaxis=dict(visible=False),
        width=800,
        height=600
    )
    plotly.iplot(go.Figure(data=[scatter1], layout=layout1))

interactive_scatter(df, 's1_mean_sharer_ideo', 's2_mean_sharer_ideo')