Updated Jun 13, 2019
How are the discrete ideology scores of media sources changing over time?
%matplotlib inline
import collections, math, sys
import pandas as pd
import tqdm
sys.path.append("/berkman/home/jclark/mc/projects/ideology_from_followers/cleaned_up/data")
from collapsible_domains import COLLAPSIBLE_DOMAINS
MONTHS = [
'2016-01', '2016-02', '2016-03', '2016-04', '2016-05', '2016-06', '2016-07',
'2016-08', '2016-09', '2016-10', '2016-11', '2016-12', '2017-01', '2017-02',
'2017-03', '2017-04', '2017-05', '2017-06', '2017-07', '2017-08', '2017-09',
'2017-10', '2017-11', '2017-12', '2018-01', '2018-02', '2018-03', '2018-04',
'2018-05', '2018-06', '2018-07', '2018-08', #'2018-09', '2018-10', '2018-11', '2018-12'
]
MONTHS_PER_PERIOD = 4
NUM_PERIODS = math.ceil(len(MONTHS) / MONTHS_PER_PERIOD)  # true division, else ceil is a no-op
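As a quick sanity check on the constants above, here is a small standalone sketch of how the 32 months slice into 4-month periods (the month list is rebuilt programmatically rather than copied):

```python
import math

# Rebuild the same 32-month list as above: all of 2016-2017 plus Jan-Aug 2018.
MONTHS = [f'{y}-{m:02d}' for y in (2016, 2017) for m in range(1, 13)] + \
         [f'2018-{m:02d}' for m in range(1, 9)]
MONTHS_PER_PERIOD = 4
NUM_PERIODS = math.ceil(len(MONTHS) / MONTHS_PER_PERIOD)

# The slicing used in the loops below: period p covers months [p*4, (p+1)*4).
periods = [MONTHS[p * MONTHS_PER_PERIOD:(p + 1) * MONTHS_PER_PERIOD]
           for p in range(NUM_PERIODS)]
print(NUM_PERIODS, periods[0], periods[-1])
```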
We created a panel of accounts in the create_panel notebook. Load them in.
with open('data/panel_accounts.txt') as f:
    panel_accounts = set(int(line.strip()) for line in f)
panel_ideos = pd.read_pickle('data/panel_ideos.pkl')
Divide the panel into two groups: left and right of center. We've determined the center to be about 0.444 in a separate notebook.
CENTER = 0.444
panel_groups = pd.DataFrame({
    'ideo': panel_ideos,
    'group': panel_ideos.apply(lambda i: 'left' if i <= CENTER else 'right')}, index=panel_ideos.index)
_ = panel_groups.groupby('group').count().plot.bar(title='Number of accounts in each ideo group')
This is the total number of accounts in each group, but we're not normalizing by number of accounts, we're normalizing by number of shares within each group. That data changes month to month, so we'll be normalizing differently each month.
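A toy example of that share-based normalization, with hypothetical numbers: the right group's counts get scaled by the ratio of the two groups' total shares, so both groups contribute equally to each domain's score.

```python
import pandas as pd

# Hypothetical per-group share totals for one 4-month period.
group_sizes = pd.Series({'left': 1200, 'right': 800})
size_correction_factor = group_sizes.max() / group_sizes.min()  # 1200 / 800 = 1.5

# A domain shared by 40 left and 20 right panel members.
left_shares, right_shares = 40, 20
right_corrected = right_shares * size_correction_factor
total_shares = left_shares + right_corrected
print(size_correction_factor, right_corrected, total_shares)
```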
Let's look at the shares of each subdomain from our account panel. We have the data monthly, but we're going to collapse it into every four months so the data is less sparse.
MIN_SHARES = 5
periodic_subdomain_group_shares = []
for period in tqdm.tqdm_notebook(range(NUM_PERIODS)):
    # Pool four months of share data so counts are less sparse.
    subdomain_to_acct_shares = collections.defaultdict(set)
    for month in MONTHS[period * MONTHS_PER_PERIOD:(period + 1) * MONTHS_PER_PERIOD]:
        month_acct_shares = pd.read_pickle(f'../ideology_from_followers/data/historical_work/split_monthly/stats/{month}/subdomain_to_user_shares.pkl')
        for domain, acct_ids in month_acct_shares.items():
            panel_shares = set(acct for acct in acct_ids if acct in panel_accounts)
            subdomain_to_acct_shares[domain] |= panel_shares
    # Count how many panel members from each group shared each subdomain,
    # dropping subdomains shared by fewer than MIN_SHARES panel members.
    subdomain_to_group_shares = {}
    for domain, panel_shares in subdomain_to_acct_shares.items():
        if len(panel_shares) < MIN_SHARES:
            continue
        subdomain_to_group_shares[domain] = collections.Counter(panel_groups.reindex(list(panel_shares))['group'].values)
    subdomain_to_group_shares = pd.DataFrame.from_dict(subdomain_to_group_shares, orient='index').fillna(0)
    periodic_subdomain_group_shares.append(subdomain_to_group_shares)
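With made-up account ids and domains, the per-domain counting step above works like this: look up each sharer's group in the panel and tally the groups into a Counter, then stack the Counters into a DataFrame.

```python
import collections
import pandas as pd

# Hypothetical panel: account id -> ideology group.
panel_groups = pd.DataFrame({'group': ['left', 'left', 'right']},
                            index=[101, 102, 103])

# Accounts that shared two hypothetical domains.
subdomain_to_acct_shares = {'example-news.com': {101, 102, 103},
                            'example-blog.com': {103}}

# Tally sharers' groups per domain; missing groups become 0 after fillna.
subdomain_to_group_shares = {
    domain: collections.Counter(panel_groups.reindex(list(accts))['group'].values)
    for domain, accts in subdomain_to_acct_shares.items()}
counts = pd.DataFrame.from_dict(subdomain_to_group_shares, orient='index').fillna(0)
print(counts)
```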
Now that we've counted the number of panel members from each group that have shared each subdomain, we need to collapse redundant subdomains and then normalize the number of shares across groups. Then we break the media sites into five groups based on their left-right mixture.
periodic_subdomain_group_shares_normed = []
for group_shares in tqdm.tqdm_notebook(periodic_subdomain_group_shares, smoothing=0):
    # Merge redundant subdomains (e.g. mirrors and aliases) into their canonical domain.
    for collapsible_domain, new_domain in COLLAPSIBLE_DOMAINS.items():
        if collapsible_domain in group_shares.index:
            if new_domain not in group_shares.index:
                group_shares = group_shares.append(pd.Series({'left': 0, 'right': 0}, name=new_domain))
            group_shares.loc[new_domain] += group_shares.loc[collapsible_domain]
            group_shares = group_shares.drop(collapsible_domain)
    # Scale the smaller group's counts up so both groups contribute equal totals.
    group_sizes = group_shares.sum().reindex(['left', 'right'])
    size_correction_factor = group_sizes.max() / group_sizes.min()
    group_shares = group_shares.fillna(0).drop('twitter.com')
    group_shares['right_corrected'] = group_shares['right'] * size_correction_factor
    group_shares['total_shares'] = group_shares['left'] + group_shares['right_corrected']
    # Mixture runs from -1 (shared only by the left group) to +1 (only by the right group).
    group_shares['left_to_right_mixture'] = 2 * (group_shares['right_corrected'] / group_shares['total_shares']) - 1
    BREAKPOINTS = [-1.01, -0.6, -0.2, 0.2, 0.6, 1.01]
    group_shares['ideo_group'] = pd.cut(group_shares['left_to_right_mixture'], bins=BREAKPOINTS,
                                        labels=['left', 'center-left', 'center', 'center-right', 'right'])
    periodic_subdomain_group_shares_normed.append({
        'group_sizes': group_sizes,
        'size_correction_factor': size_correction_factor,
        'subdomain_ideos': group_shares,
    })
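To see how the mixture score and the five-way binning behave, here is a small standalone example with made-up corrected counts for three hypothetical domains:

```python
import pandas as pd

# Hypothetical corrected share counts for three domains.
df = pd.DataFrame({'left': [90.0, 50.0, 5.0],
                   'right_corrected': [10.0, 50.0, 95.0]},
                  index=['left-site.com', 'mixed-site.com', 'right-site.com'])
df['total_shares'] = df['left'] + df['right_corrected']
# -1 means shared only by the left group, +1 only by the right group.
df['left_to_right_mixture'] = 2 * (df['right_corrected'] / df['total_shares']) - 1

# The same breakpoints as the analysis: five equal-ish bands from left to right.
BREAKPOINTS = [-1.01, -0.6, -0.2, 0.2, 0.6, 1.01]
df['ideo_group'] = pd.cut(df['left_to_right_mixture'], bins=BREAKPOINTS,
                          labels=['left', 'center-left', 'center', 'center-right', 'right'])
print(df[['left_to_right_mixture', 'ideo_group']])
```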
How many domains do we have per 4-month period?
ax = pd.Series([period['subdomain_ideos'].shape[0] for period in periodic_subdomain_group_shares_normed]).plot.bar(
    title='Number of subdomains per 4-month period')
Let's pull out a simple dataset of left-to-right sharing mixture over time for each subdomain.
subdomain_ideo_over_time = {}
for i, period in enumerate(periodic_subdomain_group_shares_normed):
    start_month = MONTHS[i * MONTHS_PER_PERIOD]
    subdomain_ideo_over_time[start_month] = period['subdomain_ideos']['left_to_right_mixture']
subdomain_ideo_over_time = pd.DataFrame.from_dict(subdomain_ideo_over_time, orient='index')
What do the histories of the top 10 subdomains (by total panel shares) look like over time?
sites = periodic_subdomain_group_shares_normed[-1]['subdomain_ideos'].sort_values('total_shares', ascending=False).head(10).index.values
_ = subdomain_ideo_over_time.loc[:, sites].plot.line(figsize=(12, 6),
    title='Audience Ideo of Top 10 Sites (by # of shares in the final period)')
Observations
- foxnews.com is right, huffpost.com is left
- thehill.com and cnn.com and maybe washingtonpost.com move left

That was the top 10. Let's look at 11-20.
sites = periodic_subdomain_group_shares_normed[-1]['subdomain_ideos'].sort_values('total_shares', ascending=False).iloc[10:20].index.values
_ = subdomain_ideo_over_time.loc[:, sites].plot.line(figsize=(12, 6),
    title='Audience Ideo of Sites 11-20 (by # of shares in Q3 2018)')
Observations
- pscp.tv starts center and moves further right
- wsj.com stays center-right
- npr.org and theguardian.com are left
- nbcnews.com, politico.com

Let's see how well these trends hold up by looking at the breakdown of media-source ideology groups per 4-month period.
periodic_counts = {}
for i, period in enumerate(periodic_subdomain_group_shares_normed):
    start_month = MONTHS[i * MONTHS_PER_PERIOD]
    subdomain_ideos = period['subdomain_ideos']
    periodic_counts[start_month] = subdomain_ideos['ideo_group'].value_counts().reindex(['left', 'center-left', 'center', 'center-right', 'right'])
periodic_counts = pd.DataFrame.from_dict(periodic_counts, orient='index')
display((periodic_counts.T / periodic_counts.T.sum()).T.head())
ax = (periodic_counts.T / periodic_counts.T.sum()).T.plot.line(
    color=['#0d3b6e', '#869db6', '#2a7526', '#d8919e', '#b1243e'],
    figsize=(10, 6),
    title="Each ideo group's share of media sources per 4-month period")
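The `(df.T / df.T.sum()).T` idiom used above row-normalizes the counts: each month's row is divided by its own total. A minimal check with made-up counts:

```python
import pandas as pd

# Hypothetical ideo-group counts for two periods.
counts = pd.DataFrame({'left': [2, 1], 'right': [2, 3]},
                      index=['2016-01', '2016-05'])
# Transpose, divide each column by its sum, transpose back: rows now sum to 1.
shares = (counts.T / counts.T.sum()).T
print(shares)
```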
Observations
- center-left starts largest but loses a lot to the left up to the inauguration, then rebounds a bit
- center stays pretty stable
- right gains a little bit
- center-right starts smallest and gets smaller

TODO
subdomain_ideo_over_time.T.to_csv('data/subdomain_discrete_ideo_over_time.csv', index_label='subdomain')
for i, period in enumerate(periodic_subdomain_group_shares_normed):
    start_month = MONTHS[i * MONTHS_PER_PERIOD]
    period['subdomain_ideos'].to_csv(f'data/subdomain_discrete_ideo_details/{start_month}.csv', index_label='subdomain')