Discrete Scores over Time

Updated Jun 13, 2019

How are the discrete ideology scores of media sources changing over time?

In [1]:
%matplotlib inline

import collections, math, sys

import pandas as pd
import tqdm
In [2]:
sys.path.append("/berkman/home/jclark/mc/projects/ideology_from_followers/cleaned_up/data")
from collapsible_domains import COLLAPSIBLE_DOMAINS
In [3]:
MONTHS = [
    '2016-01', '2016-02', '2016-03', '2016-04', '2016-05', '2016-06', '2016-07',
    '2016-08', '2016-09', '2016-10', '2016-11', '2016-12', '2017-01', '2017-02',
    '2017-03', '2017-04', '2017-05', '2017-06', '2017-07', '2017-08', '2017-09',
    '2017-10', '2017-11', '2017-12', '2018-01', '2018-02', '2018-03', '2018-04',
    '2018-05', '2018-06', '2018-07', '2018-08', #'2018-09', '2018-10', '2018-11', '2018-12'
]
MONTHS_PER_PERIOD = 4
NUM_PERIODS = math.ceil(len(MONTHS) / MONTHS_PER_PERIOD)
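The period arithmetic can be sketched with a shortened month list (lowercase names here are illustrative, not the notebook's globals). Note that `ceil` needs true division, not `//`, so a trailing partial period is still counted:

```python
import math

# A minimal sketch: each period covers months_per_period consecutive
# months, and the trailing partial period is kept as its own period.
months = ['2016-01', '2016-02', '2016-03', '2016-04', '2016-05', '2016-06']
months_per_period = 4
num_periods = math.ceil(len(months) / months_per_period)  # 2

periods = [months[p * months_per_period:(p + 1) * months_per_period]
           for p in range(num_periods)]
# periods[0] has four months, periods[1] the remaining two
```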

Load Panel

We created a panel of accounts in the create_panel notebook. Load them in.

In [4]:
with open('data/panel_accounts.txt') as f:
    panel_accounts = {int(line.strip()) for line in f}

panel_ideos = pd.read_pickle('data/panel_ideos.pkl')

Group Panel into Ideo Groups

Divide the panel into two groups: left and right of center. We've determined the center to be about 0.444 in a separate notebook.

In [5]:
CENTER = 0.444
panel_groups = pd.DataFrame({
    'ideo': panel_ideos,
    'group': panel_ideos.apply(lambda i: 'left' if i <= CENTER else 'right')}, index=panel_ideos.index)
_ = panel_groups.groupby('group').count().plot.bar(title='Number of accounts in each ideo group')

This is the total number of accounts in each group. We won't normalize by the number of accounts, though; we'll normalize by the number of shares within each group. That count changes month to month, so the normalization differs each month.
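As a toy sketch of that normalization (hypothetical counts; in our data the left group produces more shares overall, so the right-group counts get scaled up to carry equal weight):

```python
import pandas as pd

# Hypothetical per-period counts: rows are subdomains, columns are the
# number of panel accounts in each group that shared the subdomain.
group_shares = pd.DataFrame({'left': [30, 10], 'right': [5, 15]},
                            index=['siteA.com', 'siteB.com'])

group_sizes = group_shares.sum()                # left: 40, right: 20
factor = group_sizes.max() / group_sizes.min()  # 2.0

group_shares['right_corrected'] = group_shares['right'] * factor
group_shares['total_shares'] = group_shares['left'] + group_shares['right_corrected']
# -1 means shared only by the left group, +1 only by the right group
group_shares['left_to_right_mixture'] = (
    2 * group_shares['right_corrected'] / group_shares['total_shares'] - 1)
# siteA.com -> -0.5, siteB.com -> 0.5
```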

Pull Out Panel Shares

Let's look at the shares of each subdomain from our account panel. We have the data monthly, but we'll collapse it into four-month periods so the data is less sparse.

In [6]:
MIN_SHARES = 5

periodic_subdomain_group_shares = []
for period in tqdm.tqdm_notebook(range(NUM_PERIODS)):
    subdomain_to_acct_shares = collections.defaultdict(set)
    for month in MONTHS[period * MONTHS_PER_PERIOD:period * MONTHS_PER_PERIOD + MONTHS_PER_PERIOD]:
        month_acct_shares = pd.read_pickle(f'../ideology_from_followers/data/historical_work/split_monthly/stats/{month}/subdomain_to_user_shares.pkl')
        for domain, acct_ids in month_acct_shares.items():
            panel_shares = {acct for acct in acct_ids if acct in panel_accounts}
            subdomain_to_acct_shares[domain] |= panel_shares
    subdomain_to_group_shares = {}
    for domain, panel_shares in subdomain_to_acct_shares.items():
        if len(panel_shares) < MIN_SHARES: continue
        subdomain_to_group_shares[domain] = collections.Counter(panel_groups.reindex(panel_shares)['group'].values)                       
    subdomain_to_group_shares = pd.DataFrame.from_dict(subdomain_to_group_shares, orient='index').fillna(0)
    periodic_subdomain_group_shares.append(subdomain_to_group_shares)

Now that we've counted the number of panel members from each group that have shared each subdomain, we need to collapse redundant subdomains and then normalize the number of shares across groups. Then we break the media sites into five groups based on their left-right mixture.
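The five-way split can be sketched with hypothetical mixture values:

```python
import pandas as pd

# Hypothetical left_to_right_mixture values for five subdomains.
mixtures = pd.Series([-0.9, -0.4, 0.0, 0.4, 0.9],
                     index=['a.com', 'b.com', 'c.com', 'd.com', 'e.com'])
# Breakpoints nudged past +/-1 so the endpoints land inside the outer bins.
BREAKPOINTS = [-1.01, -0.6, -0.2, 0.2, 0.6, 1.01]
LABELS = ['left', 'center-left', 'center', 'center-right', 'right']
groups = pd.cut(mixtures, bins=BREAKPOINTS, labels=LABELS)
# a.com -> left, b.com -> center-left, ..., e.com -> right
```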

In [7]:
periodic_subdomain_group_shares_normed = []
for group_shares in tqdm.tqdm_notebook(periodic_subdomain_group_shares, smoothing=0):
    for collapsible_domain, new_domain in COLLAPSIBLE_DOMAINS.items():
        if collapsible_domain in group_shares.index:
            if new_domain not in group_shares.index:
                # DataFrame.append is deprecated in newer pandas; create the row directly
                group_shares.loc[new_domain] = 0
            group_shares.loc[new_domain] += group_shares.loc[collapsible_domain]
            group_shares = group_shares.drop(collapsible_domain)

    group_sizes = group_shares.sum().reindex(['left', 'right'])
    size_correction_factor = group_sizes.max() / group_sizes.min()
    
    group_shares = group_shares.fillna(0).drop('twitter.com')
    group_shares['right_corrected'] = group_shares['right'] * size_correction_factor
    group_shares['total_shares'] = group_shares['left'] + group_shares['right_corrected']
    group_shares['left_to_right_mixture'] = 2 * (group_shares['right_corrected'] / group_shares['total_shares']) - 1
 
    BREAKPOINTS = [-1.01, -0.6, -0.2, 0.2, 0.6, 1.01]

    group_shares['ideo_group'] = pd.cut(group_shares['left_to_right_mixture'], bins=BREAKPOINTS, labels=['left', 'center-left', 'center', 'center-right', 'right'])

    periodic_subdomain_group_shares_normed.append({
        'group_sizes': group_sizes,
        'size_correction_factor': size_correction_factor,
        'subdomain_ideos': group_shares
    })

Look at the Data

How many domains do we have per 4-month period?

In [8]:
ax = pd.Series([period['subdomain_ideos'].shape[0] for period in periodic_subdomain_group_shares_normed]).plot.bar(
        title='Number of subdomains per 4-month period')

Let's pull out a simple dataset of left-to-right sharing mixture over time for each subdomain.

In [9]:
subdomain_ideo_over_time = {}
for i, period in enumerate(periodic_subdomain_group_shares_normed):
    start_month = MONTHS[i * MONTHS_PER_PERIOD]
    subdomain_ideo_over_time[start_month] = period['subdomain_ideos']['left_to_right_mixture']
subdomain_ideo_over_time = pd.DataFrame.from_dict(subdomain_ideo_over_time, orient='index')

What do the histories of the top 10 subdomains (by total panel shares) look like over time?

In [10]:
sites = periodic_subdomain_group_shares_normed[-1]['subdomain_ideos'].sort_values('total_shares', ascending=False).head(10).index.values
_ = subdomain_ideo_over_time.loc[:,sites].plot.line(figsize=(12, 6),
                                                    title='Audience Ideo of Top 10 Sites (by # of shares in the final 4-month period)')

Observations

  • foxnews.com is right, huffpost.com is left
  • thehill.com and cnn.com and maybe washingtonpost.com move left
  • Everything else looks pretty flat

That was the top 10. Let's look at 11-20.

In [11]:
sites = periodic_subdomain_group_shares_normed[-1]['subdomain_ideos'].sort_values('total_shares', ascending=False).iloc[10:20].index.values
_ = subdomain_ideo_over_time.loc[:,sites].plot.line(figsize=(12, 6),
                                                   title='Audience Ideo of Sites 11-20 (by # of shares in the final 4-month period)')

Observations

  • pscp.tv starts center and moves further right
  • wsj.com stays center-right
  • npr.org and theguardian.com are left
  • Most of the other sites, which are mainstream news outlets, start center and move left, most notably nbcnews.com and politico.com

Let's see how well these trends hold up by looking at the breakdown of media-source ideology groups per 4-month period.

In [12]:
periodic_counts = {}
for i, period in enumerate(periodic_subdomain_group_shares_normed):
    start_month = MONTHS[i * MONTHS_PER_PERIOD]
    subdomain_ideos = period['subdomain_ideos']
    periodic_counts[start_month] = subdomain_ideos['ideo_group'].value_counts().reindex(['left', 'center-left', 'center', 'center-right', 'right'])
periodic_counts = pd.DataFrame.from_dict(periodic_counts, orient='index')
In [13]:
display((periodic_counts.T / periodic_counts.T.sum()).T.head())
ax = (periodic_counts.T / periodic_counts.T.sum()).T.plot.line(
    color=['#0d3b6e', '#869db6', '#2a7526', '#d8919e', '#b1243e'],
    figsize=(10, 6),
    title="Each ideo group's share of media sources per 4-month period")
         left      center-left  center    center-right  right
2016-01  0.196691  0.267647     0.217132  0.150147      0.168382
2016-05  0.210290  0.253631     0.220334  0.150803      0.164941
2016-09  0.227901  0.233293     0.210207  0.150289      0.178311
2017-01  0.241329  0.214491     0.216523  0.147572      0.180085
2017-05  0.231195  0.223493     0.216764  0.148871      0.179677

Observations

  • center-left starts largest but loses a lot to the left up to the inauguration and then rebounds a bit
  • center stays pretty stable
  • right gains a little bit
  • center-right starts smallest and gets smaller

How do the ideologies of media sources compare with the partisan retweet scores for the same time period?

TODO

Export Data

In [14]:
subdomain_ideo_over_time.T.to_csv('data/subdomain_discrete_ideo_over_time.csv', index_label='subdomain')
In [15]:
for i, period in enumerate(periodic_subdomain_group_shares_normed):
    start_month = MONTHS[i * MONTHS_PER_PERIOD]
    period['subdomain_ideos'].to_csv(f'data/subdomain_discrete_ideo_details/{start_month}.csv', index_label='subdomain')