Estimating Media Source Ideology from Shares

Updated Feb 11, 2019

We've aggregated in a bunch of ways in the past: account-level, story-level, domain-level. What happens if we skip all of that? Let's just look at the ratio of shares from the left and the right. We're still using the Barberá account ideology estimates and the center we estimated (0.395); every share comes from an account on either the left or the right of that center. First we'll look at raw shares, and then at each domain's percentage of each side's total shares, which gives the left and the right equal weight.
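A minimal sketch of the left/right split (the account names and `normed_theta` values here are made up, not from the data):

```python
CENTER = 0.395

# Hypothetical Barberá-style ideology estimates for three accounts
thetas = {'acct1': -0.8, 'acct2': 0.2, 'acct3': 1.3}

# Anything left of the estimated center is 'left', otherwise 'right'
poles = {a: ('left' if t - CENTER < 0 else 'right') for a, t in thetas.items()}
print(poles)  # acct2's raw estimate is positive, but it's still left of the 0.395 center
```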

I'm only going to look at shares from accounts that share the given domain 5000 times or fewer.
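As a toy illustration of the cap (assuming, as the per-share loop below implies, that shares are stored as one list entry per share): once an account's per-domain total exceeds the threshold, all of its shares of that domain are dropped.

```python
import collections

MAX_SHARES = 3  # toy threshold; the notebook uses 5000

# Hypothetical data: one list entry per share of a single domain
acct_shares = ['a', 'a', 'a', 'a', 'b', 'c']  # 'a' shared it 4 times
counts = collections.Counter(acct_shares)
kept = [acct for acct in acct_shares if counts[acct] <= MAX_SHARES]
print(kept)  # → ['b', 'c']: 'a' is over the cap, so every one of its shares goes
```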

In [30]:
%matplotlib inline
import collections

import ujson, tqdm
import pandas as pd
import numpy as np
In [9]:
CENTER = 0.395

all_acct_ideos = pd.read_csv('data/cleaned_user_ideology_estimates_20180705.csv.gz', index_col=0)
acct_ids_in_sample = pd.read_pickle('data/all_samples_combined/user_ids.pkl')
all_acct_ideos['ideo'] = all_acct_ideos['normed_theta'] - CENTER
acct_ideos = all_acct_ideos.reindex(acct_ids_in_sample).dropna()
acct_ideos['pole'] = np.where(acct_ideos['ideo'] < 0, 'left', 'right')
acct_to_ideo = dict(acct_ideos['ideo'].items())
acct_to_pole = dict(acct_ideos['pole'].items())
In [10]:
_ = acct_ideos['ideo'].plot.hist(bins=300)
In [11]:
_ = acct_ideos['pole'].value_counts().plot.bar()

Raw, Unweighted Shares

Let's look at the ratio of left/right shares, dropping all shares from any account that shared a single domain more than 5000 times.

In [71]:
subdomain_to_pole_shares = collections.defaultdict(collections.Counter)
subdomain_to_acct_shares = pd.read_pickle('data/all_samples_combined/subdomain_to_user_shares.pkl')
subdomain_to_num_acct_shares = {}
In [96]:
MAX_SHARES_FROM_SINGLE_ACCOUNT = 5000
for subdomain, acct_shares in tqdm.tqdm(subdomain_to_acct_shares.items()):
    # Per-account share counts for this subdomain
    num_shares = collections.Counter(acct_shares)
    subdomain_to_num_acct_shares[subdomain] = num_shares
    for acct in acct_shares:
        # Skip shares from accounts over the per-domain cap
        if num_shares[acct] > MAX_SHARES_FROM_SINGLE_ACCOUNT:
            continue
        # Skip accounts we have no ideology estimate for
        pole = acct_to_pole.get(acct)
        if pole is not None:
            subdomain_to_pole_shares[subdomain][pole] += 1
100%|██████████| 457135/457135 [03:08<00:00, 2426.94it/s] 
In [110]:
raw = pd.DataFrame.from_dict(subdomain_to_pole_shares, orient='index').fillna(0)
raw = raw.rename({'left': 'left_shares', 'right': 'right_shares'}, axis=1)
raw['total_shares'] = raw['left_shares'] + raw['right_shares']
raw['pct_of_shares_from_left'] = raw['left_shares'] / raw['total_shares']
raw.sample(5)
Out[110]:
right_shares left_shares total_shares pct_of_shares_from_left
okeymor57.tumblr.com 0.0 9.0 9.0 1.000000
portcanaveralwebcam.com 0.0 6.0 6.0 1.000000
chp.edu 0.0 3.0 3.0 1.000000
adrindia.org 0.0 6.0 6.0 1.000000
ldeo.columbia.edu 6.0 30.0 36.0 0.833333
In [112]:
_ = raw['pct_of_shares_from_left'].plot.hist(bins=200)
In [113]:
news_media_domains = [
    'english.alarabiya.net', 'aljazeera.com', 'americanthinker.com', 'bbc.com',
    'bbc.co.uk', 'bloomberg.com', 'bostonglobe.com', 'breitbart.com',
    'buzzfeed.com', 'cbc.ca', 'cbsnews.com', 'chicagotribune.com', 'cnbc.com',
    'cnn.com', 'csmonitor.com', 'dailycaller.com', 'dailykos.com',
    'dailymail.co.uk', 'economist.com', 'forbes.com', 'foreignpolicy.com',
    'fortune.com', 'insider.foxnews.com', 'nation.foxnews.com', 'foxnews.com', 'haaretz.com', 'hindustantimes.com',
    'huffingtonpost.com', 'huffpost.com', 'independent.co.uk', 'infowars.com',
    'latimes.com', 'miamiherald.com', 'motherjones.com', 'msnbc.com',
    'nationalreview.com', 'nbcnews.com', 'newsweek.com', 'newyorker.com',
    'npr.org', 'nydailynews.com', 'nypost.com', 'nytimes.com', 'pbs.org',
    'politico.com', 'propublica.org', 'realclearpolitics.com','reuters.com',
    'rollcall.com', 'rt.com', 'salon.com', 'news.sky.com', 'slate.com',
    'sputniknews.com', 'theatlantic.com', 'theguardian.com', 'thehill.com',
    'time.com', 'usatoday.com', 'vox.com', 'washingtonpost.com',
    'washingtontimes.com', 'weeklystandard.com', 'westernjournal.com', 'wsj.com',
    'zerohedge.com',
]
non_news_domains = [
    'aclu.org', 'change.org', 'cosmopolitan.com', 'facebook.com', 'google.com',
    'harvard.edu', 'hbr.org', 'mit.edu', 'patreon.com', 'politifact.com',
    'reddit.com', 'reuters.com', 'twitter.com', 'wikileaks.org', 'youtube.com',
]
#domains = news_media_domains + non_news_domains
# Plot 0.5 - pct_of_shares_from_left: positive bars lean right, negative bars lean left
_ = (-1 * raw.loc[news_media_domains, 'pct_of_shares_from_left'] + 0.5).sort_values(ascending=False).plot.barh(figsize=(10, 20), legend=False)

Observations

  • The center isn't the center.
  • There's a big jump around 0.
  • Politico right of CNN?

Percentage of Total Shares

Let's look at the same left/right split, but first divide each domain's left (right) share count by the total number of left (right) shares across all domains, so the two sides get equal weight.

In [102]:
num_left_shares = raw['left_shares'].sum()
num_right_shares = raw['right_shares'].sum()
num_total_shares = num_left_shares + num_right_shares
raw['pct_of_left_shares'] = raw['left_shares'] / num_left_shares
raw['pct_of_right_shares'] = raw['right_shares'] / num_right_shares
raw['pct_of_shares_from_left_pct'] = raw['pct_of_left_shares'] / (raw['pct_of_left_shares'] + raw['pct_of_right_shares'])
raw.sample(5)
Out[102]:
right_shares left_shares total_shares pct_of_shares_from_left variance skewness pct_of_left_shares pct_of_right_shares pct_of_shares_from_left_pct
gspp.berkeley.edu 0.0 12.0 12.0 1.000000 0.000000 -inf 1.057231e-07 0.000000e+00 1.000000
ww7.liberalmountain.com 0.0 30.0 30.0 1.000000 0.000000 -inf 2.643077e-07 0.000000e+00 1.000000
johnackerman.blogspot.com 3.0 81.0 84.0 0.964286 2.892857 -0.545949 7.136308e-07 4.552871e-08 0.940027
radioairplayblog.blogspot.com 0.0 3.0 3.0 1.000000 0.000000 -inf 2.643077e-08 0.000000e+00 1.000000
hotnewsus.rayselcuk.xyz 0.0 6.0 6.0 1.000000 0.000000 -inf 5.286154e-08 0.000000e+00 1.000000
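A small made-up example of why the per-side normalization matters (the numbers are invented, not from the data): when left accounts generate far more total shares than right accounts, a domain with a 50/50 raw split is actually disproportionately popular on the right.

```python
import pandas as pd

# Hypothetical corpus: left accounts produce 900 total shares, right accounts 100
df = pd.DataFrame(
    {'left_shares': [50, 850], 'right_shares': [50, 50]},
    index=['example.com', 'everything-else'],
)
df['pct_of_shares_from_left'] = df['left_shares'] / (df['left_shares'] + df['right_shares'])

# Normalize each side by its own corpus-wide total before comparing
pct_left = df['left_shares'] / df['left_shares'].sum()
pct_right = df['right_shares'] / df['right_shares'].sum()
df['pct_of_shares_from_left_pct'] = pct_left / (pct_left + pct_right)

# example.com looks centrist by raw split (0.5), but its shares are
# 50/900 of all left activity vs 50/100 of all right activity → 0.1
print(df.loc['example.com', ['pct_of_shares_from_left', 'pct_of_shares_from_left_pct']])
```

Same equal-weighting idea as above: each domain is scored by its slice of each pole's own traffic, not by raw counts.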
In [103]:
_ = (-1 * raw.loc[news_media_domains, 'pct_of_shares_from_left_pct'] + 0.5).sort_values(ascending=False).plot.barh(figsize=(10, 20), legend=False)

Observations

  • Normalization budges the center over, but that's about it.
In [104]:
raw.loc['news.sky.com']
Out[104]:
right_shares                   24570.000000
left_shares                    41448.000000
total_shares                   66018.000000
pct_of_shares_from_left            0.627829
variance                       15425.752976
skewness                          -0.002058
pct_of_left_shares                 0.000365
pct_of_right_shares                0.000373
pct_of_shares_from_left_pct        0.494775
Name: news.sky.com, dtype: float64
In [109]:
raw.loc['rt.com']
Out[109]:
right_shares                    90090.000000
left_shares                     57927.000000
total_shares                   148017.000000
pct_of_shares_from_left             0.391354
variance                        35257.054460
skewness                            0.001157
pct_of_left_shares                  0.000510
pct_of_right_shares                 0.001367
pct_of_shares_from_left_pct         0.271814
Name: rt.com, dtype: float64
In [114]:
raw.to_csv('data/all_samples_combined/subdomain_ideo_est_just_shares.csv')