Updated Jan 31, 2019
The goal here is to draw divisions between media sources so that we end up with left, center-left, center, center-right, and right buckets. I'm using a few different methods here. For each, I'm outputting a distribution of the number of media sources within each bucket and then showing the top domains within each bucket.
Changes
Instead of using the estimated ideology scores directly, which are difficult to partition, let's use a metric that is easier to partition. We'll try two: relative sharing by either half of the ideology score distribution and relative sharing by users that preferentially follow the two parties.
First, we'll take the ideology score spectrum we've already created and separate it into two groups of users: those left of center and those right of center. In a separate notebook we've estimated center to be around 0.395.
%matplotlib inline
import gzip, pickle, collections, itertools, random
import pandas as pd
import plotly.offline as plotly
import plotly.graph_objs as go
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbs
import scipy.stats
sbs.set(style='white')
plotly.init_notebook_mode()
CENTER = 0.395
all_acct_ideos = pd.read_csv('data/cleaned_user_ideology_estimates_20180705.csv.gz', index_col=0)
acct_ids_in_sample = pd.read_pickle('data/all_samples_combined/user_ids.pkl')
all_acct_ideos['ideo'] = all_acct_ideos['normed_theta'] - CENTER
acct_ideos = all_acct_ideos.reindex(acct_ids_in_sample).dropna()
acct_ideos['pole'] = np.where(acct_ideos['ideo'] < 0, 'left', 'right')
Sample of account ideology data
display(acct_ideos.sample(5))
Number of accounts in both poles
display(acct_ideos['pole'].value_counts())
ax = acct_ideos['ideo'].plot.hist(bins=200, alpha=0.5, figsize=(12, 6))
ax.set_title(f'Account ideology distribution - center at {CENTER}')
_ = ax.vlines(0, 0, 700, color='red')
%%time
MIN_USERS = 30
site_to_accts = pd.read_pickle('data/all_samples_combined/subdomain_to_users.pkl')
site_to_accts = {k: v for k, v in site_to_accts.items() if len(v) >= MIN_USERS}
acct_to_pole = acct_ideos['pole'].to_dict()
# Series of unequal lengths are NaN-padded automatically, so no manual padding is needed
s1 = pd.DataFrame({k: pd.Series([acct_to_pole.get(u, np.nan) for u in v]) for k, v in site_to_accts.items()})
%%time
site_share_counts_by_pole = s1.apply(lambda c: c.value_counts()).T
site_share_counts_by_pole = site_share_counts_by_pole.fillna(0)
site_share_counts_by_pole['num_sharers'] = site_share_counts_by_pole['left'] + site_share_counts_by_pole['right']
Sample of media ideology estimates
source_ideo = pd.read_csv('media_source_ideologies_all_data.csv', index_col=0)
pct_shared = site_share_counts_by_pole / acct_ideos['pole'].value_counts()
site_sharing = site_share_counts_by_pole.assign(
barbera_ideo=source_ideo['mean_sharer_ideo'],
left_pct=pct_shared['left'],
right_pct=pct_shared['right'])
site_sharing['left_pct/right_pct'] = site_sharing['left_pct'] / site_sharing['right_pct']
# Following the quintile breaks on page 45 of "Partisanship, Propaganda, & Disinformation"
def ratio_to_group(ratio):
if ratio <= 1.0/4.0:
return 'right'
if 1.0/4.0 < ratio <= 2.0/3.0:
return 'center-right'
if 2.0/3.0 < ratio <= 3.0/2.0:
return 'center'
if 3.0/2.0 < ratio <= 4.0/1.0:
return 'center-left'
if 4.0/1.0 < ratio:
return 'left'
site_sharing['ideo_group'] = site_sharing.apply(lambda r: ratio_to_group(r['left_pct/right_pct']), axis=1)
site_sharing.sample(5)
ax = site_sharing['ideo_group'].value_counts().reindex(['left', 'center-left', 'center', 'center-right', 'right'])\
.plot.bar(color=['#0d3b6e', '#869db6', '#2a7526', '#d8919e', '#b1243e'])
_ = ax.set_title('Number of media sources per bucket')
Observations
Top ten sites by most sharers per bucket
site_sharing.groupby('ideo_group').apply(lambda g: g.sort_values('num_sharers', ascending=False).head(10))
Barbera ideology distribution per bucket
site_sharing.groupby('ideo_group')['barbera_ideo'].describe().reindex(['left', 'center-left', 'center', 'center-right', 'right'])
source_ideo.join(
site_sharing.rename(
{'left': 'num_sharers_left', 'right': 'num_sharers_right', 'left_pct': 'pct_shared_left', 'right_pct': 'pct_shared_right'},
axis=1).drop(['num_sharers', 'barbera_ideo'], axis=1)).to_csv('data/all_samples_combined/media_source_ideo_with_buckets.csv')
Instead of breaking users into two groups using Barbera ideology scores, we can use who they follow. Specifically, we can break users into two groups: those that predominantly follow Democratic politicians and those that predominantly follow Republican politicians. I've done this in another notebook.
Here's a quick summary of that notebook: There are about 1 million accounts that follow mostly Democratic politicians and about 5 million accounts that follow mostly Republican politicians. That's a big difference (evidence of right insularity?), especially given there are significantly more left-leaning accounts than right-leaning. I did one version where I did not consider the different group sizes and just looked at media sharing ratios by the straight count of people. That's ideo_group_by_count. I also did a version where I took the percentage of each group sharing each domain. That's ideo_group_by_pct.
I'm only showing results of ideo_group_by_count here. Somewhat surprisingly, the percentage results looked crazy: everything was pushed far to the left, such that Breitbart ended up in the "center" bucket. Breitbart had ~5,000 R followers share it versus ~1,000 D followers, but there are ~5M R followers and ~1M D followers, so the percentages end up the same. I haven't looked very far into this phenomenon yet.
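Since that computation lives in another notebook, here's a minimal sketch of the two ratio variants, using hypothetical column names `D` and `R` for per-domain sharer counts (the names and the second domain are illustrative, not the other notebook's actual schema; the Breitbart numbers are the approximate ones quoted above):

```python
import pandas as pd

# Hypothetical per-domain sharer counts: accounts that predominantly follow
# Democratic (D) vs. Republican (R) politicians
counts = pd.DataFrame(
    {'D': [1000, 4000], 'R': [5000, 500]},
    index=['breitbart.com', 'example-left.com'])

N_D, N_R = 1_000_000, 5_000_000  # approximate group sizes from the text

# Count-based ratio ignores group size ...
counts['ratio_by_count'] = counts['D'] / counts['R']
# ... while the percentage-based ratio normalizes by it
counts['ratio_by_pct'] = (counts['D'] / N_D) / (counts['R'] / N_R)
```

For Breitbart this gives 1000/5000 = 0.2 by count, but (0.1%)/(0.1%) = 1.0 by percentage, which is exactly how the normalization pushes it into the "center" bucket under the ratio cutoffs used earlier.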
Sample of media ideology estimates
source_ideo_from_party_pole = pd.read_csv('data/media_source_ideo_with_party_pole_buckets.csv', index_col=0)
source_ideo_from_party_pole.sample(5)
ax = source_ideo_from_party_pole['ideo_group_by_count'].value_counts().reindex(['left', 'center-left', 'center', 'center-right', 'right'])\
.plot.bar(color=['#0d3b6e', '#869db6', '#2a7526', '#d8919e', '#b1243e'])
_ = ax.set_title('Number of media sources per bucket')
Top ten sites by most sharers per bucket
source_ideo_from_party_pole.groupby('ideo_group_by_count').apply(lambda g: g.sort_values('R+D', ascending=False).head(10))
Observations
foxnews.com is still in the center-right

We broke up the user ideology distribution, but how does that translate into breaking up the media source ideology distribution? We have a center from the user distribution: it's where self-described "liberals" and "conservatives" cross, at around 0.395. Let's try something similar to what we did with the user distribution: pick the center and break things up somewhat evenly from there.
Now let's break the left side into 2.5 groups and the right side into 2.5 groups. That's the same as breaking each side into 5 quintiles, merging the outer four quintiles pairwise, and combining the two innermost quintiles (the top left quintile and the bottom right quintile) into a single center bucket.
df = pd.read_csv('media_source_ideologies_all_data.csv', index_col=0)
media_ideos = df['mean_sharer_ideo']
CENTER = 0.395
media_ideos -= CENTER
_, left_bins = pd.qcut(media_ideos[media_ideos < 0], 5, retbins=True, labels=[i for i in range(-5, 0)])
_, right_bins = pd.qcut(media_ideos[media_ideos > 0], 5, retbins=True, labels=[i for i in range(1, 6)])
all_bins = list(left_bins)
all_bins.extend(list(right_bins))
indices_to_keep = [0, 2, 4, 7, 9, 11]
print('Bin edges:')
merged_bins = [all_bins[i] for i in indices_to_keep]
merged_bins
ax = media_ideos.plot.hist(bins=300, alpha=0.5, figsize=(12, 6))
ax.set_title('Media source ideology distribution with bins')
#ax.vlines(0, 0, 200, color='red')
_ = ax.vlines(merged_bins, 0, 200, linestyles='dotted', alpha=0.5)
df['cut_ideo_bin'] = pd.cut(media_ideos, merged_bins, labels=['left', 'center-left', 'center', 'center-right', 'right'])
_ = df['cut_ideo_bin'].value_counts().plot.bar(color=['#0d3b6e', '#869db6', '#2a7526', '#d8919e', '#b1243e'])
Top ten sites by most sharers per bucket
df.groupby('cut_ideo_bin').apply(lambda g: g.sort_values('num_sharers', ascending=False).head(10)).loc[:,['num_sharers', 'mean_sharer_ideo']]
The problem with this is that it captures nice separations in this data, but not nice separations in the world. CNN is at -0.46, the Wall Street Journal at -0.29, Fox News at 0.22, Breitbart at 0.57, Gateway Pundit at 0.99, and Wayne Dupree at 1.27. The center seems too far right, and the scale doesn't seem linear in the way we usually think about it. That would make the rank order correct but the distances between sources wrong. If that were the case, we should still be able to draw lines in the distribution to get buckets that make sense.
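As a sanity check on the rank-order claim, here's a toy illustration using the scores quoted above (the `sinh` distortion is hypothetical, chosen only because it is monotone and nonlinear):

```python
import numpy as np
from scipy.stats import spearmanr

# Scores quoted above: CNN, WSJ, Fox News, Breitbart, Gateway Pundit, Wayne Dupree
scores = np.array([-0.46, -0.29, 0.22, 0.57, 0.99, 1.27])

# A hypothetical nonlinear but monotone distortion of the scale
distorted = np.sinh(2 * scores)

# Rank order survives any monotone rescaling, so bucket boundaries drawn on
# the distorted distribution still separate the same sets of sites
rho, _ = spearmanr(scores, distorted)
assert abs(rho - 1.0) < 1e-9

# But distances are not preserved: the Gateway Pundit -> Wayne Dupree gap is
# stretched far more than the CNN -> WSJ gap
stretch_right = (distorted[5] - distorted[4]) / (scores[5] - scores[4])
stretch_left = (distorted[1] - distorted[0]) / (scores[1] - scores[0])
assert stretch_right > stretch_left
```

This is why cutpoints can still be meaningful even if the scale itself isn't linearly interpretable.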
In progress
I want to gauge how far off we are from the buckets in Partisanship, Propaganda, and Disinformation. To do that, I'll use the positions of the sites within the old buckets to draw bucket boundaries against the new distribution, searching for the boundaries that minimize the difference from the old buckets. The function I'll minimize is the Jaccard distance between the concatenated columns of a boolean matrix in which each row is a site and each column is a bucket. It's not a principled method of forming buckets, but it gives us a bound on the consistency between Barbera and PP&D we could expect from any method that just chops the distribution into buckets.
The question is this: where can we draw bucket boundaries on the distribution graph such that the Barbera buckets and the PP&D buckets agree most? I'll start with a baseline set of bucket edges that I pick visually.
def bin_guess_to_all_edges(edges):
return np.sort(np.concatenate((edges, np.array([-1, 2.25]))))
from sklearn.metrics import confusion_matrix, classification_report
def draw_confusion_matrix(bins, df, title='Confusion matrix'):
bins = bin_guess_to_all_edges(bins)
barbera = pd.cut(df['mean_sharer_ideo'], bins=bins, labels=[1, 2, 3, 4, 5])
ppd = df['partition']
classifications = pd.DataFrame({'barbera': barbera, 'ppd': ppd})
cm = pd.DataFrame(confusion_matrix(classifications['ppd'], classifications['barbera']), index=range(1, 6), columns=range(1, 6))
ax = sbs.heatmap(cm, annot=True, cmap='Blues', fmt='d')
ax.set_xlabel('Barbera bins')
ax.set_ylabel('PPD bins')
ax.set_title(title)
print('Accuracy metrics if we consider PP&D as "true" buckets')
print(classification_report(classifications['ppd'], classifications['barbera']))
display(ax)
def draw_dist(bins=None, title='Media source ideology distribution'):
ax = df['mean_sharer_ideo'].plot.hist(bins=200, alpha=0.4, density=True, figsize=(12, 6))
joined['mean_sharer_ideo'].plot.hist(bins=200, ax=ax, alpha=0.3, density=True)
ax.set_title(title)
ax.legend(['Media in Barbera', 'Media in Barbera and PPD'])
if bins is not None:
ax.vlines(bin_guess_to_all_edges(bins), 0, ax.get_ylim()[1], linestyles='dotted', alpha=0.5)
from scipy.spatial import distance
def bins_to_jaccard_dist(bins, df):
bins = bin_guess_to_all_edges(bins)
num_sites = df.shape[0]
num_buckets = 5
if len(set(bins)) < num_buckets + 1:
return 1
barbera_vect = pd.get_dummies(pd.cut(df['mean_sharer_ideo'], bins=bins)).values.reshape((num_sites * num_buckets, 1))
ppd_vect = pd.get_dummies(df['partition']).values.reshape((num_sites * num_buckets, 1))
return distance.jaccard(barbera_vect, ppd_vect)
def bins_to_pct_dist(bins, df):
bins = bin_guess_to_all_edges(bins)
num_buckets = 5
if len(set(bins)) < num_buckets + 1:
return 1
barbera = pd.cut(df['mean_sharer_ideo'], bins=bins, labels=[1, 2, 3, 4, 5])
ppd = df['partition']
return (ppd != barbera).sum() / ppd.count()
def bins_to_KL_dist(bins, df):
bins = bin_guess_to_all_edges(bins)
num_buckets = 5
if len(set(bins)) < num_buckets + 1:
return 1
barbera = pd.cut(df['mean_sharer_ideo'], bins=bins, labels=[1, 2, 3, 4, 5])
return scipy.stats.entropy(barbera.value_counts().sort_index().values, df['partition'].value_counts().sort_index().values)
from scipy.optimize import differential_evolution
def optimize(df):
#bounds = Bounds(-1, 2.5)
#x0 = np.array(BASELINE_BIN_EDGES)
#res = minimize(bins_to_dist, x0, args=(df,), bounds=bounds)
#bh_res = basinhopping(bins_to_pct_dist, x0, niter=50000, minimizer_kwargs={"args": joined, "bounds": bounds})
bounds = [(-1, 2.25)] * 4
return differential_evolution(bins_to_KL_dist, bounds=bounds, args=(df,))
BASELINE_BIN_EDGES = np.array([-0.5, -0.25, 0.5, 1.5])
print(f'Baseline bucket edges: {bin_guess_to_all_edges(BASELINE_BIN_EDGES)}')
ppd_scores = pd.read_csv('data/benkler_ppd_media_source_ideo.csv', index_col='subdomain')
ppd_scores = ppd_scores[~ppd_scores.index.duplicated()] # pajamasmedia.com, townhall.com, and univision.com are duplicated
df = pd.read_csv('media_source_ideologies_all_data.csv', index_col=0)
joined = ppd_scores.join(df['mean_sharer_ideo'], how='inner')
Sample of PP&D partitions and Barbera estimates
joined.sample(5)
draw_dist(BASELINE_BIN_EDGES, 'Media source ideology distribution - baseline buckets')
draw_confusion_matrix(BASELINE_BIN_EDGES, joined, 'Bucket agreement between PP&D and Barbera w/ baseline edges')
Optimized bucket edges
optimized_buckets = optimize(joined)
bin_guess_to_all_edges(optimized_buckets.x)
draw_dist(optimized_buckets.x, 'Media source ideology distribution - optimized buckets')
draw_confusion_matrix(optimized_buckets.x, joined)
Top ten sites by most sharers per bucket
df['ideo_group_optim'] = pd.cut(df['mean_sharer_ideo'], bin_guess_to_all_edges(optimized_buckets.x),
labels=['left', 'center-left', 'center', 'center-right', 'right'])
df.groupby('ideo_group_optim').apply(lambda g: g.sort_values('num_sharers', ascending=False).head(10))
ax = df['ideo_group_optim'].value_counts().reindex(['left', 'center-left', 'center', 'center-right', 'right'])\
.plot.bar(color=['#0d3b6e', '#869db6', '#2a7526', '#d8919e', '#b1243e'])
_ = ax.set_title('Number of media sources per bucket')
Observations