Media Source Partisanship as Measured by Congressional Tweets

I'm looking at the sharing patterns of various media sources by congressional Twitter accounts. The goal is to estimate source partisanship from these sharing patterns.

Summary

I pulled Twitter histories for 541 Twitter accounts for Congressfolk from the 115th Congress. I manually matched up the Twitter handles with their corresponding rows in the dw_nominate data. The resulting spreadsheet is here.

The tweet dataset contains ~22K unique domains (after unshortening). Approximately 14K of those are shared by only a single Congressperson each, so partisanship probably shouldn't be estimated for them. I'm not sure how many Congresspeople sharing a domain is enough to estimate partisanship, but I'm currently using a cutoff of two or more.

Below is a selection of websites and their partisanship estimates. Their order from left to right generally matches intuition. I'm as confident in the absolute numbers (which would include the location of the center) as I am in my knowledge of statistics, which is to say, not at all. I think there are some statistical biases and things I haven't normalized for that could move the absolute numbers around, but the relative positions would likely undergo only minor shuffling.

We could also estimate the partisanship of nontraditional outlets (youtube.com, facebook.com, instagram.com, etc.), but for me that just highlights the question of how much we're learning about media sources versus how much we're learning about Congressional tweeting. (Though Congressional link-sharing behavior seems like an interesting, out-of-scope topic.)

I've put a CSV of the full results here.

There's a lot of content below the fold. A lot of it is me trying to figure out how to do data analysis, but some of it might be interesting. I'd suggest just skimming the sections of graphs, most of which have a set of observations at the bottom.

In [69]:
_ = dsnf.loc[sites_of_interest].sort_values('mean_dw', ascending=False)['mean_dw'] \
    .plot.barh(figsize=(16, 8), legend=False, title="Weighted mean dw_nominate score for selected domains")

Details

In [3]:
%matplotlib inline
In [4]:
import json, concurrent.futures, urllib.parse, collections, subprocess, itertools, logging, operator, math

import tqdm, requests, tldextract, scipy.stats, joypy, seaborn, matplotlib.colors
import pandas as pd
from matplotlib import cm
seaborn.set() 

from source_partisanship_by_congressional_tweets import *

Congressional Twitter Accounts

First, let's import the manually created spreadsheet that combines this list of Twitter accounts from the 115th Congress with dw_nominate scores.

We need to filter out accounts for groups of people (e.g. @WaysandMeansGOP) as their membership can change over time and we don't have clear dw_nominate scores for them. We can also filter out accounts that don't have a name. We'll also lowercase all Twitter handles everywhere so we don't have to worry about case-sensitive matching.

In [5]:
congress_twitter = pd.read_csv('congress_twitter_to_bioguide.csv')
congress_twitter = congress_twitter[(congress_twitter['is_group'] != 'T') & (congress_twitter['bioname'].notna())]
congress_twitter['twitter'] = [t.lower() for t in congress_twitter['twitter']]
congress_twitter.sample(5)
Out[5]:
twitter bioname bioguide_id born died nominate_dim1 nominate_dim2 is_group
495 repchrisstewart STEWART, Chris S001192 1960.0 NaN 0.521 0.121 NaN
551 repwilson WILSON, Frederica W000808 1942.0 NaN -0.475 -0.007 NaN
347 repmcsally McSALLY, Martha M001197 1966.0 NaN 0.337 -0.008 NaN
309 daveloebsack LOEBSACK, Dave L000565 1952.0 NaN -0.277 -0.045 NaN
187 sengillibrand GILLIBRAND, Kirsten G000555 1966.0 NaN -0.381 -0.033 NaN

DW_NOMINATE

Let's get familiar with the dw_nominate distribution.

In [6]:
congress_twitter['nominate_dim1'].describe()
Out[6]:
count    539.000000
mean       0.086571
std        0.455764
min       -0.776000
25%       -0.375000
50%        0.260000
75%        0.503000
max        0.990000
Name: nominate_dim1, dtype: float64
In [7]:
_ = congress_twitter['nominate_dim1'].plot.hist(
    bins=100, title="Distribution of dw_nominate scores for 115th congress",
    xlim=(-1, 1))
In [8]:
print('D:',(congress_twitter['nominate_dim1'] < 0).sum(),
      'Mean:', congress_twitter[congress_twitter['nominate_dim1'] < 0]['nominate_dim1'].mean(),
      'R:', (congress_twitter['nominate_dim1'] > 0).sum(),
      'Mean:', congress_twitter[congress_twitter['nominate_dim1'] > 0]['nominate_dim1'].mean())
D: 249 Mean: -0.38128915662650603 R: 290 Mean: 0.4882862068965518

Observations

  • There are more Republicans than Democrats.
  • The left is more concentrated than the right.
  • Ain't no centrists.
  • The right is further from the center than the left, partly pulled by far-right outliers.
  • The right-most (0.99) is much further from the center than the left-most (-0.776).
  • We'll never have a partisanship score distribution that covers the whole range [-1, 1] (without stretching) because this distribution bounds our score distribution.
  • Because there are more Rs than Ds, do we need to worry about normalizing by the varying party representation? I think ideally we'd want to sample from both ends of the spectrum equally. Let's imagine the partisan distribution of links to NYT. We'd guess more Ds than Rs to link, but still some Rs. If there are more Rs than Ds in our dataset, NYT will get more R links than it would in an equal world, and so we'd estimate it to the right of where it would be in world with an equally divided Congress. The effect of an unequal balance of power in Congress is probably small, but we probably still want to normalize for it. I mull this over more in the "I don't know statistics" section at the bottom.
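One simple way to normalize for the party imbalance (not something this analysis applies; all numbers below are made up) would be to weight each sharer inversely by their party's size in the dataset, so each party contributes equally to a domain's mean:

```python
import pandas as pd

# Hypothetical sharers of one domain: 2 Democrats, 3 Republicans (toy scores).
sharers = pd.DataFrame({
    'twitter': ['d1', 'd2', 'r1', 'r2', 'r3'],
    'nominate_dim1': [-0.40, -0.35, 0.45, 0.50, 0.55],
})
party = sharers['nominate_dim1'].lt(0).map({True: 'D', False: 'R'})
# Each sharer gets weight 1 / (size of their party among the sharers).
weights = 1.0 / party.map(party.value_counts())
balanced_mean = (sharers['nominate_dim1'] * weights).sum() / weights.sum()
# The plain mean (0.15) sits to the right of the balanced mean (0.0625)
# because Republicans outnumber Democrats in this toy set.
```

This is a sketch of the idea only; in practice you'd want to weight by party size in all of Congress, not just among a domain's sharers.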

Congressional Tweets

Let's put together a map of Twitter account -> shared URLs.

In [9]:
account_to_urls = collections.defaultdict(list)
pbar = tqdm.tqdm_notebook(total=line_count('tweets.txt'), desc='Reading tweets', smoothing=0)
for tweet in get_tweets('tweets.txt'):
    pbar.update()
    screen_name = tweet['user']['screen_name'].lower()
    if screen_name not in congress_twitter['twitter'].values: continue
    if 'urls' not in tweet: continue
    for url in tweet['urls']:
        url_to_test = url['url']
        if 'expanded_url' in url:
            url_to_test = url['expanded_url']
        account_to_urls[screen_name].append(url_to_test)
pbar.close()

Now, for every URL that was included in a tweet, we need to unshorten it so we can look at actual domains. We're assuming there's a one-to-one correspondence between domains and media sources. We need to make sure we preserve the relationship between the URL in the tweet and the unshortened URL so we can get back to individual tweets given an unshortened URL.

I've already done this, so I'll just load it.
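For reference, here's roughly how the unshortening could be done (a sketch only: the shortener list is illustrative, the skip-the-network shortcut is a simplification, and errors fall back to the original URL so the mapping stays total):

```python
import urllib.parse
import urllib.request

KNOWN_SHORTENERS = {'t.co', 'bit.ly', 'ow.ly', 'goo.gl', 'tinyurl.com'}  # illustrative

def needs_unshortening(url):
    # Only hit the network for hosts we believe are shorteners.
    return urllib.parse.urlparse(url).netloc.lower() in KNOWN_SHORTENERS

def unshorten(url, timeout=10):
    # urlopen follows HTTP redirects by default; geturl() is the final URL.
    if not needs_unshortening(url):
        return url
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.geturl()
    except Exception:
        return url  # keep the original on failure
```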

In [10]:
with open('unshortened_urls.json') as f:
    unshortened_urls = json.load(f)

Now we'll extract the domain from every unshortened URL, build the map from Twitter account -> shared domains, and build a reversed map from domain -> Twitter accounts sharing that domain.

In [11]:
tweet_urls_to_domains = {}
for tweet_url, final_url in unshortened_urls.items():
    tweet_urls_to_domains[tweet_url] = tldextract.extract(final_url).registered_domain

account_to_domains = collections.defaultdict(list)
for account, urls in account_to_urls.items():
    account_to_domains[account] += [tweet_urls_to_domains[url] for url in urls]

domain_to_sharers = collections.defaultdict(set)
for account, domains in account_to_domains.items():
    for domain in domains:
        domain_to_sharers[domain].add(account)        

Let's use the above to calculate the sharing stats for the shared domains.

In [12]:
domain_to_sharer_nominate = collections.defaultdict(list)
for account, domains in account_to_domains.items():
    dw_nominate = congress_twitter[congress_twitter['twitter'] == account]['nominate_dim1'].values[0]
    for domain in domains:
        domain_to_sharer_nominate[domain].append(dw_nominate)

domain_to_sharer_nominate_s = {k : pd.Series(v) for k, v in domain_to_sharer_nominate.items()}
domain_stats = pd.DataFrame({k: {
    'mean_dw': v.mean(),
    'num_shares': v.count(),
    'num_sharers': len(domain_to_sharers[k]),
    'stddev_dw': v.std()} for k, v in domain_to_sharer_nominate_s.items()}).T

Media Source DW_NOMINATE scores

Let's take a look at the stats we just generated.

In [13]:
domain_stats.describe()
Out[13]:
mean_dw num_sharers num_shares stddev_dw
count 22234.000000 22234.000000 22234.000000 10872.000000
mean -0.021121 4.519969 37.158361 0.184741
std 0.416474 20.820726 1262.244844 0.193610
min -0.776000 1.000000 1.000000 0.000000
25% -0.384000 1.000000 1.000000 0.003182
50% -0.164111 1.000000 1.000000 0.107443
75% 0.400000 2.000000 4.000000 0.351885
max 0.990000 533.000000 126462.000000 1.037326

Observations

  • There were ~22K unique shared domains.
  • The mean of the mean_dw scores over all media sources is reasonably close to zero, but the min and max are the same as dw_nominate's, which means certain domains were shared only by the Congressfolk at the extremes.
  • num_sharers and num_shares are very power lawy. It's mostly 1. We'll probably want to filter.
  • The mean of the standard deviations of the partisanship score for each media source seems reasonable at 0.18.
In [14]:
_ = domain_stats['mean_dw'].plot.hist(bins=100, xlim=(-1, 1))

Observations

  • This looks a lot like the dw_nominate distribution. On one hand, this could make sense because the media landscape shared by Congress would probably look like Congress itself. On the other hand, we used dw_nominate to generate these scores in the first place, so in cases where only a single Congressperson shared a domain, there's a one-to-one correspondence. That suggests filtering out domains by number of sharers.
  • It's important to remember that the y-axis is the number of unique media sources with that partisanship score. We'd expect more people to share more media sources, but there's probably a number to be found in here indicating the relative concentration of the media landscape on the right and left. Possibly another angle on other MC findings.
  • The center is beefier than the dw_nominate distribution. That suggests the presence of a large number of sites that are shared by both sides close to equally.

Let's look at the top and bottom of the site list when sorted by mean_dw (our partisanship score).

In [15]:
domain_stats.sort_values('mean_dw').head()
Out[15]:
mean_dw num_sharers num_shares stddev_dw
photojoiner.net -0.776 1.0 1.0 NaN
seattlechannel.org -0.776 1.0 4.0 0.0
aplus.com -0.776 1.0 1.0 NaN
codepink.org -0.776 1.0 1.0 NaN
westseattleblog.com -0.776 1.0 1.0 NaN
In [16]:
domain_stats.sort_values('mean_dw', ascending=False).head()
Out[16]:
mean_dw num_sharers num_shares stddev_dw
thefranklinnewspost.com 0.99 1.0 2.0 0.0
martinsvilledaily.com 0.99 1.0 1.0 NaN
farmvilleherald.com 0.99 1.0 1.0 NaN
tomtomfest.com 0.99 1.0 1.0 NaN
uvapolitics.com 0.99 1.0 1.0 NaN

As we noticed above, the ends of the distribution are dominated by sites with few shares or sharers. We already know that most sites are shared by a single person, which means we're bringing in the dw_nominate distribution directly, so let's see what the distribution looks like if we filter those sites out (and compare it to the distribution above).

In [17]:
domain_stats['num_sharers'].value_counts().head()
Out[17]:
1.0    13973
2.0     3029
3.0     1440
4.0      822
5.0      513
Name: num_sharers, dtype: int64
In [18]:
dsf = domain_stats[domain_stats['num_sharers'] > 1] # dsf == domain stats filtered
In [19]:
dsf.describe()
Out[19]:
mean_dw num_sharers num_shares stddev_dw
count 8261.000000 8261.000000 8261.000000 8261.000000
mean 0.009095 10.473793 97.090788 0.243131
std 0.362244 33.323020 2069.454665 0.187445
min -0.687000 2.000000 2.000000 0.000000
25% -0.333500 2.000000 3.000000 0.076258
50% -0.024333 3.000000 6.000000 0.194503
75% 0.346000 6.000000 19.000000 0.395850
max 0.877692 533.000000 126462.000000 1.037326
In [20]:
dsf.describe() - domain_stats.describe()
Out[20]:
mean_dw num_sharers num_shares stddev_dw
count -13973.000000 -13973.000000 -13973.000000 -2611.000000
mean 0.030216 5.953823 59.932427 0.058390
std -0.054230 12.502295 807.209821 -0.006165
min 0.089000 1.000000 1.000000 0.000000
25% 0.050500 1.000000 2.000000 0.073076
50% 0.139778 2.000000 5.000000 0.087060
75% -0.054000 4.000000 15.000000 0.043965
max -0.112308 0.000000 0.000000 0.000000
In [21]:
_ = pd.DataFrame({
    'all sites': domain_stats['mean_dw'],
    'sites with >1 sharers': domain_stats[domain_stats['num_sharers'] > 1]['mean_dw']
}).plot.hist(bins=100, sharex=True, figsize=(12, 8), subplots=True, xlim=(-1, 1))

Observations

  • The distribution is way cleaner now. We see that in the graph and the reduced standard deviation. A bunch of the spikes must have been due to single Congresspeople sharing a lot of media sources no one else was linking to.
  • The center is proportionally much larger now. We'd expect that because for a site to be in the center it had to be shared by more than one person (there is no dw_nominate center).
  • Everything got shifted to the right a little bit (except for the far right, which moved a little left). This can be seen in the differences between the two means and quartiles. They're mostly positive.
  • Since the distribution shifted right after filtering, the filtered-out single-sharer domains must have been shared disproportionately by Congresspeople on the left.
  • I think it makes more sense to work with this data, so I'll do that going forward.

Let's get to the graph where we see how our intuition of partisanship for certain sites matches the output of this.

In [22]:
sites_of_interest = ['nytimes.com', 'foxnews.com', 'cnn.com', 
                     'wsj.com', 'npr.org', 'huffingtonpost.com', 'infowars.com',
                     'dailycaller.com', 'washingtonpost.com', 'msnbc.com',
                     'latimes.com', 'thehill.com',
                     'breitbart.com', 'washingtontimes.com', 'politico.com',
                     'nationalreview.com', 'drudgereport.com', 'nbcnews.com',
                     'theguardian.com', 'salon.com', 'vox.com',
                     'usatoday.com',  'rollcall.com', 'bloomberg.com',
                     'cbsnews.com', 'cnbc.com',
                     'forbes.com', 'time.com', 'theatlantic.com']
_ = dsf.loc[sites_of_interest].sort_values('mean_dw', ascending=False)['mean_dw'].plot.barh(figsize=(16, 8), xlim=(-1, 1), legend=False, 
                                                                  title="Mean dw_nominate score per domain")

Observations

  • The ordering from left to right looks pretty reasonable to me.
  • I'm not sure how meaningful the raw numbers are. Is thehill.com as far to the right as nbcnews.com is to the left? Is forbes.com twice as conservative as cnbc.com?
  • At least for this sample of sites, this isn't symmetric about zero - there are more sites on the right.
  • Drudge seems like an outlier.

Let's look at these sites in more detail.

In [23]:
_ = dsf.loc[sites_of_interest].sort_values('mean_dw').plot(figsize=(16, 10), kind='bar', 
                                                       subplots=True, legend=False, 
                                                       title='Detailed stats per domain')

Observations

  • Drudge and Infowars have way fewer shares and sharers than the others. Makes sense that Drudge is an outlier.
  • The number of sharers drops off toward the edges. The optimist in me wants to read that as centrist sites are shared by more congressfolk, but that's an old trap - we've defined "centrist" as sites shared by lots of people.
  • Number of sharers looks a lot more evenly distributed than number of shares. Perhaps there are some folk sharing those sites a lot? Should look at sites by average shares per sharer. I'm not sure that's useful as a metric in its own right as it tells us more about Congresspeople than media sources, but it might be useful for normalization somewhere.
  • The above raises the possibility that some sites get pushed more or less partisan just because a really active user tweets them a lot. We should normalize by tweet volume: instead of weighting the dw_nominate scores by the number of tweets per account, weight by the percentage of each account's tweets that share the domain.

DW_NOMINATE Weighting Schemes

Our goal here is to make sure that accounts that tweet a lot don't have an outsized effect on the partisanship score of a source - that is, we're normalizing by number of tweets per account.

It could also be the case that a Congressperson tweets about a whole bunch of different sites while another person tweets only links to a single site. How should that affect partisanship score? I dunno.

Let's take a simple example and see what intuition says. Our data set consists of two senators: Ed Markey, with a nominate score of -0.502, and Marco Rubio, with a nominate score of 0.585. If Markey tweets example.com links 30 times out of 1000 total tweets, and Rubio tweets example.com links 60 times out of 10000, what should we say the partisanship score of example.com is?

Intuitively, Markey tweets about example.com relatively more often, so I'd expect it to lean left. How far left? I dunno.

In [24]:
markey_dw, markey_domain_num, markey_total_num = -0.502, 30.0, 1000.0
rubio_dw, rubio_domain_num, rubio_total_num = 0.585, 60.0, 10000.0

markey_domain_share_in_account = markey_domain_num / markey_total_num
markey_share_total_vol = markey_total_num / (markey_total_num + rubio_total_num)

rubio_domain_share_in_account = rubio_domain_num / rubio_total_num
rubio_share_total_vol = rubio_total_num / (markey_total_num + rubio_total_num)
print('Markey shares: {:.2%} in account, {:.2%} of total'.format(markey_domain_share_in_account, markey_share_total_vol))
print('Rubio shares: {:.2%} in account, {:.2%} of total'.format(rubio_domain_share_in_account, rubio_share_total_vol))
Markey shares: 3.00% in account, 9.09% of total
Rubio shares: 0.60% in account, 90.91% of total

Below are a bunch of different weighting schemes. I'm not good enough at this yet to know which one properly normalizes by account volume. It should be fairly simple, and my hunch is domain_share_in_account is correct, but I'm too fried right now to think the whole thing through without confusing myself. Instead, I'll just make a bunch of graphs and see what looks right.

In [25]:
weighting_schemes = {
    'boolean': {
        'm': 1.0,
        'r': 1.0},
    'num_tweets': {
        'm': markey_domain_num,
        'r': rubio_domain_num},
    'domain_share_in_account': {
        'm': markey_domain_share_in_account,
        'r': rubio_domain_share_in_account},
    'inverse_share_total_vol': {
        'm': 1 / markey_share_total_vol,
        'r': 1 / rubio_share_total_vol},
    'num_tweets_over_share_total_vol': {
        'm': markey_domain_num / markey_share_total_vol,
        'r': rubio_domain_num / rubio_share_total_vol},
}
for name, weights in weighting_schemes.items():
    score = (markey_dw * weights['m'] + rubio_dw * weights['r']) / (weights['m'] + weights['r'])
    print('{:33} | score: {:6.3f} | markey weight: {:7.3f} | rubio weight: {:7.3f}'.format(name, score, weights['m'], weights['r']))
boolean                           | score:  0.041 | markey weight:   1.000 | rubio weight:   1.000
num_tweets                        | score:  0.223 | markey weight:  30.000 | rubio weight:  60.000
domain_share_in_account           | score: -0.321 | markey weight:   0.030 | rubio weight:   0.006
inverse_share_total_vol           | score: -0.403 | markey weight:  11.000 | rubio weight:   1.100
num_tweets_over_share_total_vol   | score: -0.321 | markey weight: 330.000 | rubio weight:  66.000

I'm hitting a wall trying to interpret this. Let's run them all against all the data and see what things look like.

In [26]:
total_vol = sum([len(urls) for urls in account_to_urls.values()])
account_to_share_total_vol = {k: len(u) / total_vol for k, u in account_to_urls.items()}
account_to_domain_counts = {k: collections.Counter(d) for k, d in account_to_domains.items()}
account_to_domain_pct = {k: {d: dc[d] / sum(dc.values()) for d in dc} for k, dc in account_to_domain_counts.items()}

domain_to_boolean_weighted_dw = collections.defaultdict(list)
domain_to_num_tweets_weighted_dw = collections.defaultdict(list)
domain_to_domain_share_in_account_weighted_dw = collections.defaultdict(list)
domain_to_inverse_share_total_vol_weighted_dw = collections.defaultdict(list)
domain_to_num_tweets_over_share_total_vol_weighted_dw = collections.defaultdict(list)

weight_sums = {}

for account, domains in account_to_domains.items():
    dw_nominate = congress_twitter[congress_twitter['twitter'] == account]['nominate_dim1'].values[0]
    for domain in set(domains):
        if domain in weight_sums:
            domain_weight_sums = weight_sums[domain]
        else:
            domain_weight_sums = {   
                'boolean': 0,
                'num_tweets': 0,
                'domain_share_in_account': 0,
                'inverse_share_total_vol': 0,
                'num_tweets_over_share_total_vol': 0,
            }
        boolean_weight = 1
        domain_to_boolean_weighted_dw[domain].append(boolean_weight * dw_nominate)
        domain_weight_sums['boolean'] += boolean_weight
        
        num_tweets_weight = account_to_domain_counts[account][domain]
        domain_to_num_tweets_weighted_dw[domain].append(num_tweets_weight * dw_nominate)
        domain_weight_sums['num_tweets'] += num_tweets_weight
        
        domain_share_in_account_weight = account_to_domain_pct[account][domain]
        domain_to_domain_share_in_account_weighted_dw[domain].append(domain_share_in_account_weight * dw_nominate)
        domain_weight_sums['domain_share_in_account'] += domain_share_in_account_weight
        
        inverse_share_total_vol_weight = 1.0 / account_to_share_total_vol[account]
        domain_to_inverse_share_total_vol_weighted_dw[domain].append(inverse_share_total_vol_weight * dw_nominate)
        domain_weight_sums['inverse_share_total_vol'] += inverse_share_total_vol_weight
        
        num_tweets_over_share_total_vol_weight = num_tweets_weight * inverse_share_total_vol_weight
        domain_to_num_tweets_over_share_total_vol_weighted_dw[domain].append(num_tweets_over_share_total_vol_weight * dw_nominate)
        domain_weight_sums['num_tweets_over_share_total_vol'] += num_tweets_over_share_total_vol_weight
        
        weight_sums[domain] = domain_weight_sums
In [27]:
weight_lists = {
    'boolean': domain_to_boolean_weighted_dw,
    'num_tweets': domain_to_num_tweets_weighted_dw,
    'domain_share_in_account': domain_to_domain_share_in_account_weighted_dw,
    'inverse_share_total_vol': domain_to_inverse_share_total_vol_weighted_dw,
    'num_tweets_over_share_total_vol': domain_to_num_tweets_over_share_total_vol_weighted_dw
}
weight_plot_data = {}
for name, weights in weight_lists.items():
    as_series = {k: pd.Series(v) for k, v in weights.items()}
    weighted_domain_stats = pd.DataFrame({k: {
        'mean_dw': v.sum() / weight_sums[k][name]}
        for k, v in as_series.items()}).T
    weight_plot_data[name] = weighted_domain_stats['mean_dw']
weight_plot_data = pd.DataFrame(weight_plot_data)
In [28]:
_ = weight_plot_data.loc[sites_of_interest].sort_values('domain_share_in_account').plot.bar(subplots=True, figsize=(16, 16), width=0.6)
In [29]:
_ = weight_plot_data.loc[sites_of_interest].sort_values('domain_share_in_account', ascending=False).plot.barh(figsize=(8, 16), width=0.7, xlim=(-1, 1))
In [30]:
_ = weight_plot_data.plot.hist(figsize=(16, 8), bins=100, alpha=0.3, xlim=(-1, 1))

Observations

  • Boolean pushes sites towards the center. That makes sense because both sides might link to sites, but not in equal proportion.
  • Domain share in account and number of tweets / account share of total volume are equivalent. Algebraically, num_tweets / share_total_vol = (num_tweets / account_total) × total_volume, which is just domain_share_in_account scaled by a constant, and constant factors cancel out of a weighted mean.
  • I'm going to go with domain share in account. It's easier to calculate than its equivalent and looks the best. I think it's properly normalizing by account volumes without excluding data we want in there.
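A quick numerical check of that equivalence, using the Markey/Rubio numbers from above: num_tweets / share_total_vol is domain_share_in_account scaled by the (constant) total tweet volume, and scaling every weight by the same constant leaves a weighted mean unchanged.

```python
scores  = [-0.502, 0.585]            # Markey and Rubio dw_nominate
weights = [30 / 1000, 60 / 10000]    # domain_share_in_account
T = 1000 + 10000                     # total tweet volume across both accounts
scaled  = [w * T for w in weights]   # == num_tweets / share_total_vol (330, 66)

def wmean(scores, weights):
    # Standard weighted mean; invariant under scaling all weights by a constant.
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

assert abs(wmean(scores, weights) - wmean(scores, scaled)) < 1e-12
# Both give ~-0.321, matching the table above.
```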

Normalized Partisanship Scores

Now that I have a normalization scheme that looks sane, let's recreate the graphs at the top just to make sure everything looks right.

I wasn't sure how to calculate the standard error of a weighted mean. Turns out, there's no agreement, so I'm going to exclude it for now. Error bars would be nice though. I think I should be doing this with the graphs above too, because it's also a weighted mean.
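One way to get error bars without settling the formula question is a bootstrap: resample a domain's (score, weight) pairs with replacement and take the spread of the recomputed weighted means. A minimal sketch on made-up pairs (`n_boot` and the seed are arbitrary):

```python
import random

def weighted_mean(pairs):
    return sum(s * w for s, w in pairs) / sum(w for _, w in pairs)

def bootstrap_se(pairs, n_boot=2000, seed=0):
    # Spread of the weighted mean across resampled sets of sharers.
    rng = random.Random(seed)
    means = [weighted_mean([rng.choice(pairs) for _ in pairs])
             for _ in range(n_boot)]
    mu = sum(means) / n_boot
    return (sum((m - mu) ** 2 for m in means) / (n_boot - 1)) ** 0.5

# Hypothetical (dw_nominate, domain_share_in_account) pairs for one domain.
pairs = [(-0.50, 0.030), (-0.30, 0.010), (0.10, 0.020),
         (0.40, 0.005), (0.60, 0.010)]
se = bootstrap_se(pairs)
```

With only a handful of sharers the bootstrap is noisy too, which is another argument for a minimum-sharers cutoff.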

In [31]:
domain_to_nominate_scores = collections.defaultdict(list)
domain_sums = collections.defaultdict(int)
for account, domain_pcts in account_to_domain_pct.items():
    dw_nominate = congress_twitter[congress_twitter['twitter'] == account]['nominate_dim1'].values[0]
    for domain, share in domain_pcts.items():
        domain_to_nominate_scores[domain].append(share * dw_nominate)
        domain_sums[domain] += share
    
domain_to_nominate_scores_series = {k : pd.Series(v) for k, v in domain_to_nominate_scores.items()}
domain_stats_normed = pd.DataFrame({domain: {
    'mean_dw': scores.sum() / domain_sums[domain],
    'num_shares': sum([counts[domain] for counts in account_to_domain_counts.values()]),
    'num_sharers': len(domain_to_sharers[domain])} for domain, scores in domain_to_nominate_scores_series.items()}).T
dsn = domain_stats_normed
dsnf = domain_stats_normed[domain_stats_normed['num_sharers'] > 1]
In [32]:
dsn.describe()
Out[32]:
mean_dw num_sharers num_shares
count 22234.000000 22234.000000 22234.000000
mean -0.016296 4.519969 37.158361
std 0.419987 20.820726 1262.244844
min -0.776000 1.000000 1.000000
25% -0.384000 1.000000 1.000000
50% -0.171758 1.000000 1.000000
75% 0.404000 2.000000 4.000000
max 0.990000 533.000000 126462.000000
In [33]:
_ = dsnf.loc[sites_of_interest].sort_values('mean_dw', ascending=False)['mean_dw'] \
    .plot.barh(figsize=(16, 8), legend=False, xlim=(-1, 1),
               title="Weighted mean dw_nominate score for a selection of domains")
In [34]:
_ = dsnf.loc[sites_of_interest].sort_values('mean_dw').plot(figsize=(16, 10), kind='bar', 
                                                       subplots=True, legend=False, 
                                                       title='Detailed stats per domain')
In [35]:
_ = dsnf['mean_dw'].plot.hist(bins=100, xlim=(-1, 1))

Observations

  • This looks better. Drudge is still probably too partisan, but that's more a matter of filtering out domains with few share(r)s than of normalization.
  • Partisanship on the right for these sites doesn't look as linear as it does on the left. That likely has partly to do with the sites I picked, but it could also be the missing center-right.

Scoring by Simple Democrat & Republican Share Counts (2016 U.S. Election Method)

Instead of incorporating the actual dw_nominate scores, let's just count the Democrats and Republicans sharing each domain. Each Republican sharer contributes a one, each Democratic sharer a negative one, and we average those scores per domain.

In [36]:
def is_democrat(handle):
    return (congress_twitter[congress_twitter['twitter'] == handle]['nominate_dim1'] < 0).values[0]
def is_republican(handle):
    return (congress_twitter[congress_twitter['twitter'] == handle]['nominate_dim1'] > 0).values[0]

unweighted_domain_scores = collections.defaultdict(list)
for domain, sharers in domain_to_sharers.items():
    for handle in sharers:
        if is_democrat(handle):
            unweighted_domain_scores[domain].append(-1)
        elif is_republican(handle):
            unweighted_domain_scores[domain].append(1)
        else:
            print(handle)

unweighted_domain_scores_mean = {domain: {'mean': pd.Series(scores).mean(), 'num_sharers': len(scores)} for domain, scores in unweighted_domain_scores.items()}
uds = pd.DataFrame(unweighted_domain_scores_mean).T # uds = unweighted domain scores

Let's look at the same set of sites as before with these new scores and compare to the normalized scores:

In [37]:
_ = uds['mean'].loc[sites_of_interest].sort_values(ascending=False).plot.barh(figsize=(6, 8), xlim=(-1, 1))

Observations

  • Still looks like a reasonable sorting. Let's compare with the weighted method.
In [38]:
unweighted_weighted_compared = pd.DataFrame({'unweighted': uds['mean'], 'weighted': dsnf['mean_dw']})
_ = unweighted_weighted_compared.loc[sites_of_interest].sort_values('weighted', ascending=False) \
    .plot.barh(figsize=(12, 12), xlim=(-1, 1),
               title="Unweighted mean boolean score for a selection of domains")

Observations

  • Big difference. Unweighted tends to give a broader range of scores, which would make sense because it's no longer constrained by the dw_nominate distribution.
  • For most sites, the unweighted method gives the more extreme score. There are exceptions around the middle, and a few other big exceptions. nytimes.com moves much much closer to the center. It's probably worth trying to understand that change. My guess is nytimes.com is shared by close to an equal number of congressfolk on the left and right, but the left shares it much more often.
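That guess about nytimes.com can be checked with toy numbers: equal D and R sharer counts give an unweighted score of exactly zero, but if the Democrats devote a larger share of their tweets to the domain, the share-weighted score leans left. (All numbers below are hypothetical.)

```python
# Hypothetical: 10 sharers per party, Democrats share the domain more heavily.
d_scores, r_scores = [-0.4] * 10, [0.4] * 10
d_share, r_share = 0.05, 0.01  # domain_share_in_account per sharer

unweighted = (len(r_scores) - len(d_scores)) / (len(d_scores) + len(r_scores))
weighted = (sum(s * d_share for s in d_scores) + sum(s * r_share for s in r_scores)) \
           / (d_share * len(d_scores) + r_share * len(r_scores))
# unweighted == 0.0; weighted ~= -0.267, pulled left by the heavier D sharing.
```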

All the sites are listed below along with the percentage difference between weighted and unweighted methods:

In [39]:
((unweighted_weighted_compared['unweighted'] - unweighted_weighted_compared['weighted']) / 
 unweighted_weighted_compared['unweighted']).abs().loc[sites_of_interest].sort_values(ascending=False) * 100
Out[39]:
bloomberg.com          972.338956
cbsnews.com            953.091960
cnn.com                739.650380
nytimes.com            661.726647
cnbc.com               445.581085
washingtonpost.com     333.372406
rollcall.com           303.812153
wsj.com                196.572398
nbcnews.com            177.842253
forbes.com              87.122392
usatoday.com            72.758986
latimes.com             62.934483
vox.com                 47.989827
salon.com               44.962980
infowars.com            37.859532
drudgereport.com        37.040359
nationalreview.com      32.678437
theguardian.com         32.248136
dailycaller.com         30.854168
msnbc.com               30.675797
thehill.com             29.668952
breitbart.com           29.189658
theatlantic.com         25.963372
npr.org                 18.893002
huffingtonpost.com      14.828828
foxnews.com             13.842145
politico.com            13.613409
washingtontimes.com     10.098449
time.com                 2.716829
dtype: float64

Let's see what the distribution looks like:

In [40]:
_ = uds['mean'].plot.hist(bins=100, xlim=(-1, 1))

It must be dominated by sites with a single sharer. Let's filter to greater than one.

In [41]:
_ = uds[uds['num_sharers'] > 1]['mean'].plot.hist(bins=49, xlim=(-1, 1))

This is what I was worried about in the "I don't know statistics" section. There's a bias here that has a harmonic effect. We get a peak in the middle where there are two sharers, peaks further out from the center where there are three sharers, etc.
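The spikes are arithmetic artifacts: with k sharers each scored ±1, a domain's mean can only land on the k+1 values -1, -1 + 2/k, ..., 1, so each sharer count lays down its own comb of peaks. Enumerating them:

```python
from fractions import Fraction

def possible_means(k):
    # d Democrats (each -1) and k - d Republicans (each +1) average to (k - 2d) / k.
    return sorted({Fraction(k - 2 * d, k) for d in range(k + 1)})

# Two sharers give an interior peak only at 0; three sharers give peaks at +-1/3,
# matching the peaks observed in the histogram above.
assert possible_means(2) == [Fraction(-1), Fraction(0), Fraction(1)]
assert possible_means(3) == [Fraction(-1), Fraction(-1, 3), Fraction(1, 3), Fraction(1)]
```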

Let's see what it looks like as we increase the number of sharers.

In [42]:
ud = {}
for i, shares in enumerate(list(range(0, 6)) + list(range(10, 490, 20))):
    ud['{}. gt {} sharers'.format(chr(i + 65), shares)] = uds[uds['num_sharers'] > shares]['mean']
In [43]:
_ = joypy.joyplot(pd.DataFrame(ud), ylim="own", overlap=0, bins=49, hist=True, figsize=(6, 18), 
                  x_range=(-1, 1), grid="y", linewidth=0,
                  title="Dist. of media source partisanship as # of congressional sharers increases")

Observations

  • It's pretty easy to see the harmonic bias up to about >10 sharers.
  • After that, it looks like all the biases get cancelled out.
  • Duh, this is just a problem with a small sample size. When we're considering each congressperson an observation, we need to have a reasonable number of observations to say something about the partisanship of a site.
  • For the future, let's say >=30 observations (congressional sharers) is our cutoff. 30 looks like an oft-used rule-of-thumb. What does filtering to sites with >= 30 sharers look like?
In [44]:
min_sharers = 30
udsf = uds[uds['num_sharers'] >= min_sharers]
print('{} sites with >= {} sharers'.format((uds['num_sharers'] >= min_sharers).sum(), min_sharers))
474 sites with >= 30 sharers
In [45]:
_ = udsf['mean'].plot.hist(bins=50, xlim=(-1, 1))

Observations

  • Three big peaks: far left, far right, and center.
  • Fairly strong representation throughout the spectrum except for the center-right. That suggests there are sites that are shared mostly by the left but sometimes by the right, but almost no sites that are shared mostly by the right but sometimes by the left. Huh, interpreting it that way makes the asymmetry sound more attributable to the left rather than the right. How might we disentangle the two (sharing patterns of the left vs. partisan distribution of the right)? Well, non-sharing based partisanship metrics (like the text analysis) should help. Perhaps there's something else we could do with geocoded tweets and county voting records?

Let's compare the distribution for this method with a similarly filtered distribution for the weighted method.

In [46]:
_ = pd.DataFrame({'unweighted': udsf['mean'], 
                  'weighted': dsn[dsn['num_sharers'] >= min_sharers]['mean_dw']}).plot.hist(bins=80, xlim=(-1, 1), subplots=True)

Observations

  • Again, it's clear that the weighted distribution is stuck within the boundaries of the dw_nominate distribution.
  • The shapes aren't the same. Unweighted has big peaks on the far edges while weighted does not. I guess this makes sense. When we're not weighting by dw_nominate scores, sites shared only by one party will get pushed far to the edges. When we are weighting, sites that are shared by only one party are more likely to end up near the mean of that party's dw_nominate score. That actually seems more informative than the unweighted. Two sites that are each shared by 30 Democrats and zero Republicans should not necessarily be of equal partisanship.
  • This comparison isn't entirely apples-to-apples, because the weighted version treats every tweet as an observation (so maybe 30 tweets is the right minimum sample?) while the unweighted version treats every person as an observation.
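A toy contrast of the two scoring methods discussed above (assuming the party-count method encodes D = -1 and R = +1, which matches its [-1, 1] range; the numbers are made up):

```python
import pandas as pd

# Hypothetical site shared by four Democrats and zero Republicans.
sharers = pd.DataFrame({
    'party':  [-1, -1, -1, -1],          # party-count encoding: D=-1, R=+1
    'dw_nom': [-0.6, -0.5, -0.2, -0.1],  # made-up dw_nominate scores
})

party_count_score = sharers['party'].mean()  # -1.0: pinned to the far edge
weighted_score = sharers['dw_nom'].mean()    # about -0.35: near the sharers' mean
print(party_count_score, weighted_score)
```

This is the point of the second bullet: two all-Democrat sites get identical party-count scores, but the weighted method can still distinguish them.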

Comparing partisanship scores across methods

Let's compare partisanship scores between the election study retweet method and the Congressional tweets dw_nominate method. We should also compare the election retweet method and the simpler D & R party count method.

First, let's load in the partisanship scores from the election study.

In [47]:
election_scores = pd.read_csv('election_retweeter_polarization_media_scores.csv')
election_scores['domain'] = election_scores['url'].apply(lambda u: tldextract.extract(u).registered_domain)
election_scores.set_index('domain', inplace=True)
election_scores.index.value_counts().head(10)
Out[47]:
wordpress.com      17
blogspot.com       16
cbslocal.com        8
google.com          4
trendolizer.com     3
iheart.com          3
wikipedia.org       2
townhall.com        2
feedsportal.com     2
fbi.gov             2
Name: domain, dtype: int64

We're going to have to join the two datasets, so I pulled the domains out of the URLs. One issue that can be seen above is that some domains contain multiple "media sources" and so should be counted separately, while others should not. (For wordpress.com and blogspot.com it makes sense to treat each subdomain as a separate source; for cbslocal.com, that's less obvious.) We'll have to deal with this properly, but for now I'll just exclude all duplicate domains.
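One possible shape for the "proper" fix, sketched with only the stdlib (the notebook uses tldextract; this crudely approximates it, and the MULTI_SOURCE_HOSTS list is my guess, not something established above):

```python
from urllib.parse import urlparse

# Hosts where each subdomain is plausibly a distinct media source.
MULTI_SOURCE_HOSTS = {'wordpress.com', 'blogspot.com'}

def media_source_key(url):
    host = urlparse(url).netloc.lower()
    parts = host.split('.')
    # Crude: assumes a two-part suffix like .com (tldextract handles the rest).
    registered = '.'.join(parts[-2:])
    if registered in MULTI_SOURCE_HOSTS and len(parts) > 2:
        return host            # keep the subdomain as its own source
    return registered          # collapse www. and friends

print(media_source_key('http://someblog.wordpress.com/post/1'))  # someblog.wordpress.com
print(media_source_key('http://www.cbslocal.com/news'))          # cbslocal.com
```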

Let's look at the election retweet method vs the congressional dw_nominate first.

In [48]:
retweet_dwnom = election_scores.join(dsnf)
retweet_dwnom = retweet_dwnom.rename({
    'score': 'election_retweet',
    'mean_dw': 'congress_dwnom'}, axis='columns')
retweet_dwnom = retweet_dwnom[~retweet_dwnom.index.duplicated(keep=False)].dropna()
print('{} sites in common'.format(retweet_dwnom.shape[0]))
482 sites in common
In [49]:
_ = seaborn.jointplot('election_retweet', 'congress_dwnom', retweet_dwnom, alpha=0.5, 
                      xlim=(-1.1, 1.1), ylim=(-1.1, 1.1))

Observations

  • They're not entirely uncorrelated, so that's good - 0.69 seems reasonable, though it could be better.
  • A whole bunch of stuff the election retweet method sticks to the far right gets smeared out by the congress_dwnom method. A bunch of disagreement there.
  • Generally more disagreement on the right than the left, and it looks like the disagreement increases as you move right.
  • This isn't about the y=x line because the two methods end up with different ranges.
  • There are two big agreeing bunches in the corners and some solid agreement in center and center-left. That's a good sign.

Let's look at the outliers.

In [70]:
retweet_dwnom['diff'] = (retweet_dwnom['election_retweet'] - retweet_dwnom['congress_dwnom']).abs()
retweet_dwnom.to_csv('media_partisanship_scores-election_retweet_vs_congress_tweet_dwnomin.csv')
retweet_dwnom.sort_values('diff', ascending=False).head(20)[['election_retweet', 'congress_dwnom', 'diff', 'url', 'num_sharers', 'num_shares']]
Out[70]:
election_retweet congress_dwnom diff url num_sharers num_shares
donaldjtrump.com 0.928614 -0.570944 1.499558 https://donaldjtrump.com/ 2.0 2.0
hindustantimes.com 1.000000 -0.347834 1.347834 http://www.hindustantimes.com/ 3.0 3.0
ca.gov 1.000000 -0.313076 1.313076 http://registertovote.ca.gov#spider 54.0 1134.0
sltrib.com -0.814865 0.449039 1.263904 http://www.sltrib.com/ 29.0 281.0
periscope.tv 0.914249 -0.347919 1.262167 https://periscope.tv/ 9.0 11.0
eagleforum.org 1.000000 -0.234288 1.234288 http://www.eagleforum.org 2.0 2.0
appspot.com 1.000000 -0.223487 1.223487 http://trumpedamerica.appspot.com/ 3.0 3.0
myflorida.com 1.000000 -0.210832 1.210832 http://www.myflorida.com/#spider 8.0 20.0
bostonmagazine.com 0.787208 -0.419597 1.206805 http://www.bostonmagazine.com/index.html 6.0 21.0
kentucky.com -0.671592 0.519181 1.190773 http://www.kentucky.com/ 20.0 321.0
valuewalk.com 1.000000 -0.166818 1.166818 http://www.valuewalk.com/#spider 4.0 4.0
dilbert.com 0.931719 -0.229408 1.161128 http://dilbert.com/ 2.0 2.0
delawareonline.com 0.911739 -0.246475 1.158214 http://www.delawareonline.com/apps/pbcs.dll/fr... 15.0 352.0
cincinnati.com -0.637972 0.508730 1.146702 http://cincinnati.com 34.0 337.0
today.com 1.000000 -0.143518 1.143518 http://www.today.com#spider 110.0 176.0
poughkeepsiejournal.com 1.000000 -0.110912 1.110912 http://www.poughkeepsiejournal.com/ 7.0 142.0
reuters.tv 1.000000 -0.092698 1.092698 http://reuters.tv/ 4.0 6.0
eventbrite.com 0.909983 -0.178288 1.088272 http://www.eventbrite.com#spider 177.0 1105.0
thesun.co.uk 1.000000 -0.068186 1.068186 http://www.thesun.co.uk/sol/homepage/ 5.0 5.0
gwu.edu 0.682768 -0.384245 1.067013 https://mediarelations.gwu.edu/ 14.0 15.0

Observations

  • As we saw, a bunch of the biggest outliers are those with an election retweet score of 1.
  • A good deal of the outliers have a low number of sharers. We're only filtering to sharers > 1 for this method, but I think we should be filtering better. Tweets could count as observations here, so maybe >= 20-30 tweets?
  • I quickly looked at some of the tweets behind a couple of these. eagleforum.org is a pro-life site that was tweeted once by an R ("Americans Lose While Immigrants Gain") and once by a D ("A bizarrely sourced & false set of arguments..."). Same deal with donaldjtrump.com.
  • A bunch of geography-scoped sources here, mostly in the US: Hindustan, California, Salt Lake City, Florida, Boston, Kentucky, Delaware, Cincinnati, Poughkeepsie. Not sure why. Maybe they aren't disproportionately represented here - there are just a lot of them?
  • I think the congress dw_nominate scores look closer to intuition, but only if we exclude sites with a small number of shares. It's hard to judge how reliable the election retweet scores are without more information about shares. That's probably in the replication data somewhere, but I haven't looked for it. I'm guessing most of the 1.0 scores have few tweets to back those scores up.
  • It's important to remember that different processes underlie the different scores - public sharing vs. party-elite sharing.
  • Data is available here

Let's do a quick comparison with the simple party count method since it's a similar scoring method.

In [51]:
retweet_party_count = election_scores.join(udsf)
retweet_party_count = retweet_party_count.rename({
    'score': 'election_retweet',
    'mean': 'congress_party_count'}, axis='columns')
retweet_party_count = retweet_party_count[~retweet_party_count.index.duplicated(keep=False)].dropna()
print('{} sites in common'.format(retweet_party_count.shape[0]))
193 sites in common
In [52]:
_ = seaborn.jointplot('election_retweet', 'congress_party_count', retweet_party_count, alpha=0.5)

Observations

  • There are a lot fewer sites to compare here (193). We're filtering out a lot more because they don't have enough sharers.
  • Similar correlation to the last comparison: 0.68. Reasonable.
  • Again, most of the outliers are below the y=x line, which suggests the party count method is mostly to the left of the election retweet method.
  • There is the conspicuous hole where a center-right would be. I think that's a combination of the thinner representation and greater disagreement as you move to the right.

And the outliers.

In [53]:
retweet_party_count['diff'] = (retweet_party_count['election_retweet'] - retweet_party_count['congress_party_count']).abs()
retweet_party_count.sort_values('diff', ascending=False).head(20)[['election_retweet', 'congress_party_count', 'diff', 'url', 'num_sharers']]
Out[53]:
election_retweet congress_party_count diff url num_sharers
ca.gov 1.000000 -0.777778 1.777778 http://registertovote.ca.gov#spider 54.0
twibbon.com 0.824377 -0.627907 1.452284 https://twibbon.com#spider 43.0
eventbrite.com 0.909983 -0.435028 1.345012 http://www.eventbrite.com#spider 177.0
usa.gov 0.871692 -0.328767 1.200459 http://www.usa.gov 146.0
today.com 1.000000 -0.200000 1.200000 http://www.today.com#spider 110.0
sfchronicle.com 0.330925 -0.853659 1.184583 http://sfchronicle.com/ 41.0
theintercept.com 0.190301 -0.850000 1.040301 https://theintercept.com/ 40.0
cincinnati.com -0.637972 0.294118 0.932090 http://cincinnati.com 34.0
politifact.com -0.812190 0.117241 0.929431 http://www.politifact.com 145.0
constantcontact.com 0.598264 -0.293103 0.891368 http://constantcontact.com/ 116.0
gizmodo.com 0.236883 -0.647059 0.883942 http://gizmodo.com 34.0
cnet.com 0.511517 -0.371429 0.882946 http://cnet.com 35.0
people-press.org -0.681453 0.200000 0.881453 http://www.people-press.org 30.0
ap.org -0.389126 0.441441 0.830568 http://ap.org 111.0
change.org 0.303457 -0.500000 0.803457 http://criminaljustice.change.org/ 44.0
linkedin.com 0.290099 -0.511111 0.801210 http://cn.linkedin.com/ 45.0
factcheck.org -0.637972 0.135135 0.773107 http://factcheck.org/ 37.0
prnewswire.com 0.473631 -0.290323 0.763953 http://www.prnewswire.com 62.0
dropbox.com 1.000000 0.244444 0.755556 http://www.dropbox.com 45.0
observer.com 0.480024 -0.243243 0.723268 http://www.observer.com/politics 37.0

Observations

  • Some of the same sites as above (ca.gov, eventbrite.com, today.com...). Suggests the two methods based on the Congressional tweets agree with each other on those sites.
  • Some surprising ones in here. theintercept.com and change.org do not seem right of center. Kind of hard to imagine how the differing sharing patterns of public Twitter vs. congressional Twitter can account for those differences. Libertarianism and trolling?
  • Fact checking sites - politifact.com and factcheck.org. Easier to imagine how the differing sharing patterns can account for these. That could be an interesting result. Party elites might share close to equally, while regular Twitter users on the left might share a lot more than the right.
  • Differing time periods of data collection might account for some of these differences as well. I'm looking at full Twitter histories (instead of limited to same period as election study). Sites like usa.gov might get very different shares depending on administration.

Recreating Figure 5 from 2016 U.S. Election Study

I thought it might be neat to recreate the figure depicting Twitter shares of the top 250 media outlets across the political spectrum using this data.

In [54]:
num_bins = 20
bins = [(i - num_bins / 2) / (num_bins / 2) for i in range(num_bins + 1)]
colors = ['#0d3b6e', '#0d3b6e', '#0d3b6e', '#0d3b6e', 
          '#869db6', '#869db6', '#869db6', '#869db6',
          '#2a7526', '#2a7526', '#2a7526', '#2a7526',
          '#d8919e', '#d8919e', '#d8919e', '#d8919e',
          '#b1243e', '#b1243e', '#b1243e', '#b1243e']
ranked_dsnf = dsnf.assign(shares_rank=dsnf['num_shares'].rank(ascending=False))
top_ranked_dsnf = ranked_dsnf[ranked_dsnf['shares_rank'] <= 250]
_ = top_ranked_dsnf.groupby(pd.cut(top_ranked_dsnf['mean_dw'], bins)).sum().plot.bar(y='num_shares', color=colors)

This doesn't look anything like the figure from the paper. Twitter share counts look a lot more power-law like here. Let's look at the top shares.

In [55]:
_ = top_ranked_dsnf['num_shares'].sort_values(ascending=False).head(40).plot.bar(figsize=(16, 4))

Those top 5 aren't really outlets that we're traditionally looking at, so let's filter them out.

In [56]:
top_ranked_dsnf = ranked_dsnf[ranked_dsnf['shares_rank'].between(6, 250)]
_ = top_ranked_dsnf.groupby(pd.cut(top_ranked_dsnf['mean_dw'], bins)).sum().plot.bar(y='num_shares', color=colors)

I think there are two things happening here. One, Congress might be less willing than the general public to link to fringe stuff. And two, sites that are shared by more Congressfolk will have more shares, but they're also more likely to fall in the center: more sharers means a higher dw_nominate variance among them, and the partisanship score is just the average over everyone who shared the site.

Well, what does partisanship vs. the number of sharers look like then? Once we get up above the number of members in a single party, how quickly does it move toward the center?

In [57]:
d = {}
for i, shares in enumerate(list(range(0, 6)) + list(range(10, 490, 20))):
    d['{}. gt {} sharers'.format(chr(i + 65), shares)] = dsn[dsn['num_sharers'] > shares]['mean_dw']
In [58]:
_ = joypy.joyplot(pd.DataFrame(d), ylim="own", bins=40, overlap=1, figsize=(12, 9),
              alpha=0.8, x_range=(-1, 1), linewidth=0.3, bw_method=0.05, 
              title="Dist. of media source partisanship as # of congressional sharers increases")

This is a neato Rob graph. I guess it's traditionally called a joyplot (from the Joy Division album cover) or a ridgeline graph.

Observations

  • The edges fall in as we see more sharers, which is expected.
  • The "lifespan" of sources depends on two factors: the number of Congressfolk near that area on the political spectrum, and their propensity to tweet.
  • Sources don't fall out symmetrically. The ranges near 0.25 and 0.4 fall away the fastest. They're basically unrepresented after 130 sharers. According to the dw_nominate distribution, there are actually some folks in that area. I'd like to say this is a nice confirmation of the absent center-right point in the election paper, but I don't know if it's that, the relative absence of center-right Congressfolk, or some statistical quirk of the model that biases it against that range of numbers.
  • Some partisan sources are shared by a large number of Congressfolk. The one on the left that makes it to >450 is nytimes.com and the one on the right that makes it to >430 is wsj.com.
  • The sites that make it to the arbitrary end of the graph are twitter.com, youtube.com, facebook.com, thehill.com, house.gov, washingtonpost.com and c-span.org.

Future Work

  • Validation?
  • Some subdomains matter and some don't. Subdomains on Wordpress vs. "www", for example. Need to handle that.
  • Each Congress is an uneven representation of the political spectrum. That unevenness biases our partisanship estimation in favor of the party in power. We need to normalize for that.
  • I stumbled upon this while investigating the #pjnet hashtag: http://www.patriotjournalist.com/usHouse.php?src=Home http://www.patriotjournalist.com/usSenate.php?src=Home It's a weird righty website, but those lists look more complete than the c-span one.
  • Decide how to deal with time better. I've pulled down the Twitter accounts of most of the 115th Congress, but I've pulled full Twitter histories (not limited to the 115th Congress). Should I limit the Twitter histories or try to look at other Congresses?
  • There are a lot of interesting questions to be asked and answered about Congressional sharing (and by extension, attention) with this data. Do Democrats share from a broader swath of sites? Do Republicans really use Instagram more?
  • Reason through the Twitter volume normalization better.
  • Learn statistics.
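The party-balance bullet above could look something like this sketch: weight each sharer by the inverse size of their party's delegation so both parties contribute equally to a site's mean. The inverse-size weighting scheme is my assumption, and the toy frame borrows the notebook's column names.

```python
import pandas as pd

# Hypothetical site shared by three Democrats and one Republican.
sharers = pd.DataFrame({
    'party':         ['D', 'D', 'D', 'R'],
    'nominate_dim1': [-0.5, -0.4, -0.3, 0.5],
})

# Weight each sharer by 1 / (size of their party's delegation in the sample).
party_sizes = sharers['party'].value_counts()
weights = 1.0 / sharers['party'].map(party_sizes)

raw_mean = sharers['nominate_dim1'].mean()                       # -0.175
balanced_mean = (sharers['nominate_dim1'] * weights).sum() / weights.sum()
print(raw_mean, balanced_mean)  # balancing pulls the score toward the R side
```

With three Ds and one R, the raw mean leans left just because Ds outnumber Rs; the balanced mean (0.05) averages the D mean (-0.4) and the R score (0.5) equally.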

Odds and Ends

Do some partisanship bins have more media sources than others? This kind of analysis is getting too far into ouroboros territory, but I want to see it anyway.

In [59]:
num_bins = 20
bins = [(i - num_bins / 2) / (num_bins / 2) for i in range(num_bins + 1)]
congress_partisanship_bins = congress_twitter.groupby(
    pd.cut(congress_twitter['nominate_dim1'], bins))['nominate_dim1'].count()
media_partisanship_bins = dsnf.groupby(pd.cut(dsnf['mean_dw'], bins))['mean_dw'].count()
media_source_per_congress_by_partisanship = media_partisanship_bins / congress_partisanship_bins
media_source_per_congress_by_partisanship.loc[media_source_per_congress_by_partisanship.isna() | (media_source_per_congress_by_partisanship == float('inf'))] = 1
_ = pd.DataFrame([congress_partisanship_bins, 
                  media_partisanship_bins, 
                  media_source_per_congress_by_partisanship.apply(math.log)]).T.plot.bar(subplots=True)

Observations

  • The center is way overrepresented. The ratio from 0 to 0.1 is infinite (because there are no Congresspeople in that range).
  • The left looks slightly better represented than the right.

Is worrying about the dw_nominate distribution leaking into the media distribution a real concern? I dunno, but I can graph them on top of each other.

In [60]:
_ = pd.DataFrame({'dw_nomin': congress_twitter['nominate_dim1'], 'media': dsnf['mean_dw']}) \
    .plot.hist(bins=100, sharex=True, normed=True, figsize=(12, 8), alpha=0.5, 
              title="Dist. of media partisanship vs. dist. of dw_nominate scores")

Observations

  • Do we expect the media landscape to have a stronger center than the political landscape?
  • Do we expect that stronger center to be made up for by a smaller right (instead of a smaller left, or both equally)?

Same as the joyplot, just a bunch of histograms instead.

In [61]:
_ = pd.DataFrame(d).plot.hist(bins=50, figsize=(10, 20), sharex=True, subplots=True, xlim=(-1, 1))
In [62]:
_ = dsnf.plot.scatter('mean_dw', 'num_sharers', xlim=(-1, 1))
In [63]:
ranked_dsnf = dsnf.assign(sharers_rank=dsnf['num_sharers'].rank(ascending=False))
top_ranked_dsnf = ranked_dsnf[ranked_dsnf['sharers_rank'] <= 250]
top_ranked_dsnf.groupby(pd.cut(top_ranked_dsnf['mean_dw'], bins)).sum().plot.bar(y='num_sharers', color=colors)
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd87fd32ba8>
In [64]:
dsnf.sort_values('num_sharers', ascending=False).head(20)
Out[64]:
mean_dw num_sharers num_shares
twitter.com -0.008799 533.0 94883.0
youtube.com 0.164139 533.0 58497.0
facebook.com 0.085908 524.0 34196.0
thehill.com 0.055500 514.0 11915.0
house.gov 0.151885 503.0 126462.0
washingtonpost.com -0.084014 500.0 15994.0
c-span.org 0.067164 485.0 8468.0
politico.com 0.016577 469.0 7391.0
nytimes.com -0.280897 461.0 13803.0
cnn.com -0.069679 459.0 5367.0
usatoday.com 0.053040 456.0 3921.0
wsj.com 0.344537 439.0 7052.0
instagram.com 0.220431 433.0 14659.0
rollcall.com 0.049126 411.0 1964.0
nbcnews.com -0.120801 391.0 2041.0
senate.gov 0.156875 388.0 66829.0
go.com 0.028291 385.0 2073.0
reuters.com 0.143960 371.0 2220.0
bloomberg.com 0.047153 370.0 2167.0
foxnews.com 0.497671 366.0 5031.0
In [65]:
dsnf.sort_values('num_shares', ascending=False).head(10)
Out[65]:
mean_dw num_sharers num_shares
house.gov 0.151885 503.0 126462.0
twitter.com -0.008799 533.0 94883.0
senate.gov 0.156875 388.0 66829.0
youtube.com 0.164139 533.0 58497.0
facebook.com 0.085908 524.0 34196.0
washingtonpost.com -0.084014 500.0 15994.0
instagram.com 0.220431 433.0 14659.0
nytimes.com -0.280897 461.0 13803.0
thehill.com 0.055500 514.0 11915.0
c-span.org 0.067164 485.0 8468.0
In [66]:
dsn.join(uds.drop('num_sharers', axis='columns'), how='outer') \
.rename({'mean_dw': 'congress_dwnom', 'mean': 'congress_party_count'}, axis='columns') \
.to_csv('media_partisanship_from_congressional_tweets.csv')

Which accounts are sharing a particular domain, and how often?

In [67]:
domain = 'reddit.com'
domain_to_sharers[domain]
domain_urls = set([k for k, v in tweet_urls_to_domains.items() if v == domain])
{k: len(set(v).intersection(domain_urls)) for k, v in account_to_urls.items() if len(set(v).intersection(domain_urls)) > 0}
Out[67]:
{'corybooker': 22,
 'danarohrabacher': 1,
 'darrellissa': 31,
 'jahimes': 1,
 'jaredpolis': 6,
 'jerrymoran': 5,
 'keithellison': 5,
 'louiseslaughter': 3,
 'randpaul': 5,
 'repadamschiff': 1,
 'repannaeshoo': 2,
 'repbillfoster': 1,
 'repblumenauer': 2,
 'repcardenas': 2,
 'reperikpaulsen': 1,
 'repesty': 1,
 'rephankjohnson': 1,
 'rephuffman': 4,
 'repjimcooper': 1,
 'replowenthal': 1,
 'repmarkpocan': 1,
 'repmcgovern': 4,
 'reppaultonko': 2,
 'repperlmutter': 2,
 'repricklarsen': 1,
 'repzoelofgren': 3,
 'rokhanna': 1,
 'ronwyden': 8,
 'senatorbaldwin': 1,
 'senblumenthal': 1,
 'sengillibrand': 1,
 'senjohnmccain': 1,
 'sensanders': 7,
 'sethmoulton': 2,
 'usrepmikedoyle': 2}

I don't know statistics

We're averaging points from a bimodal distribution. We can weight them in different ways, but the underlying distribution is limiting the output distribution. If we were to randomly sample n points (where n is chosen from some power distribution) from the bimodal distribution (to represent congresspeople sharing stuff) and draw a histogram of their averages, I think we'd end up with hotspots. The center is one obvious hotspot. But the hotspots are periodic with varying strengths and some fundamental frequency. For example, let's say two congresspeople share a media source. There are three possibilities: DD, DR, RR. DD is going to be near the mean for D, RR is going to be near the mean for R, and DR is going to be near the center. If only two congresspeople share that media source, it's near impossible for our estimate of its partisanship to not be one of those three options. It might be center-right in "reality", but the distribution of the observable in conjunction with our model rules that out entirely. Ah crap, I think that is just the nature of statistics? If my sample size is small, my estimator is bad. Grr, but this is still an issue if we have more Rs than Ds. We need to write out our assumptions about the distribution of partisanship of media sources. If we want to assume a uniform distribution, shouldn't we need to correct for an observable distribution that's not uniform?
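The DD/DR/RR argument can be made exact with a quick enumeration. Assuming each sharer is D or R with equal probability and placeholder party means of -0.4 and +0.5, a site with k sharers can only take k + 1 distinct mean values, with binomial weights:

```python
from math import comb

def attainable_means(k, d=-0.4, r=0.5):
    # j = number of R sharers out of k; maps each attainable site mean
    # to its probability under a 50/50 party coin flip per sharer.
    return {(j * r + (k - j) * d) / k: comb(k, j) / 2**k for j in range(k + 1)}

# k=2 gives exactly three options: DD near d, DR near the center, RR near r,
# with probabilities 1/4, 1/2, 1/4. No center-right value is attainable.
print(attainable_means(2))
```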

Let's think about this more statistically. There is a population of media sources. Each media source has this latent variable that is its "partisanship". That hidden variable need not be stable for each media source, but we'll assume it's slow moving. We're trying to measure that hidden variable by looking at observable variables.

We know the dimensions of the latent variable space ([-1, 1]). Let's say we're trying to estimate the distribution of this latent space. That's a fundamentally different question than estimating the partisanship of a single source. If we're trying to estimate the distribution...

Maybe a different approach. Let's take a single congressperson and the top 1M websites. Our prior is that each of those websites has an equal probability of getting shared by the given congressperson. One tweet from that congressperson is an observation, and it updates our prior. It was a choice by the congressperson to tweet that one domain and not others. How do choice models work?

The above would give us an estimate of the probability of each congressperson tweeting each of the top million sites. We now have 541 distributions. How could those be aggregated in a way that makes sense? For each website, our prior is that it has equal probability of falling anywhere on the partisanship distribution. For every congressperson, we have our expectation of how likely they are to tweet the given domain. That's an observation. We take their dw_nominate score and update the website prior. That's basically what we're doing already. How do error bars work? For some congresspeople, we know more than we know about others because they tweet more.
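A minimal sketch of the per-congressperson updating described above, using a Beta-Bernoulli model (my choice of model; the text leaves it open). Each tweet is one Bernoulli observation per site: it links the site or it doesn't. More tweets means a tighter posterior on that person's propensity to share the site, which is one way "error bars" could enter.

```python
from math import sqrt

def posterior(shares_of_site, total_tweets, alpha=1.0, beta=1.0):
    # Beta(alpha, beta) prior on the probability that this person's next
    # tweet links the site; conjugate update from their tweet history.
    a = alpha + shares_of_site
    b = beta + (total_tweets - shares_of_site)
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd

print(posterior(3, 100))     # 3 shares in 100 tweets
print(posterior(3, 10_000))  # same count, far more tweets: smaller mean and sd
```

The second congressperson's three shares are much weaker evidence of affinity for the site, and the posterior reflects that.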