I'm looking at the sharing patterns of various media sources by congressional Twitter accounts. The goal is to estimate source partisanship from these sharing patterns.
I pulled Twitter histories for 541 Twitter accounts for Congressfolk from the 115th Congress. I manually matched up the Twitter handles with their corresponding rows in the dw_nominate data. The resulting spreadsheet is here.
The tweet dataset contains ~22K unique domains (after unshortening). Approximately 14K of those are shared by only a single Congressperson, so partisanship probably shouldn't be estimated for them. I'm not sure how many Congresspeople sharing a domain is enough to estimate partisanship, but I'm currently using a cutoff of two or more.
Below is a selection of websites and their partisanship estimates. Their order from left to right generally matches intuition. I'm as confident in the absolute numbers (which would include the location of the center) as I am in my knowledge of statistics, which is to say, not at all. I think there are some statistical biases and things I haven't normalized for that could move the absolute numbers around, but the relative positions would likely undergo only minor shuffling.
We could also estimate the partisanship of nontraditional outlets (youtube.com, facebook.com, instagram.com, etc.), but for me that just highlights the question of how much we're learning about media sources versus how much we're learning about Congressional tweeting. (Though Congressional link sharing behavior seems like an interesting, out-of-scope topic.)
I've put a CSV of the full results here.
There's a lot of content below the fold. A lot of it is me trying to figure out how to do data analysis, but some of it might be interesting. I'd suggest just skimming the sections of graphs, most of which have a set of observations at the bottom.
_ = dsnf.loc[sites_of_interest].sort_values('mean_dw', ascending=False)['mean_dw'] \
.plot.barh(figsize=(16, 8), legend=False, title="Weighted mean dw_nominate score for selected domains")
%matplotlib inline
import json, concurrent.futures, urllib.parse, collections, subprocess, itertools, logging, operator, math
import tqdm, requests, tldextract, scipy.stats, joypy, seaborn, matplotlib.colors
import pandas as pd
from matplotlib import cm
seaborn.set()
from source_partisanship_by_congressional_tweets import *
First, let's import the manually created spreadsheet that combines this list of Twitter accounts from the 115th Congress with dw_nominate scores.
We need to filter out accounts for groups of people (e.g. @WaysandMeansGOP) as their membership can change over time and we don't have clear dw_nominate scores for them. We can also filter out accounts that don't have a name. We'll also lowercase all Twitter handles everywhere so we don't have to worry about case-sensitive matching.
congress_twitter = pd.read_csv('congress_twitter_to_bioguide.csv')
congress_twitter = congress_twitter[(congress_twitter['is_group'] != 'T') & (congress_twitter['bioname'].notna())]
congress_twitter['twitter'] = [t.lower() for t in congress_twitter['twitter']]
congress_twitter.sample(5)
Let's get familiar with the dw_nominate distribution.
congress_twitter['nominate_dim1'].describe()
_ = congress_twitter['nominate_dim1'].plot.hist(
bins=100, title="Distribution of dw_nominate scores for 115th congress",
xlim=(-1, 1))
print('D:',(congress_twitter['nominate_dim1'] < 0).sum(),
'Mean:', congress_twitter[congress_twitter['nominate_dim1'] < 0]['nominate_dim1'].mean(),
'R:', (congress_twitter['nominate_dim1'] > 0).sum(),
'Mean:', congress_twitter[congress_twitter['nominate_dim1'] > 0]['nominate_dim1'].mean())
Observations
Let's put together a map of Twitter account -> shared URLs.
account_to_urls = collections.defaultdict(list)
pbar = tqdm.tqdm_notebook(total=line_count('tweets.txt'), desc='Reading tweets', smoothing=0)
for tweet in get_tweets('tweets.txt'):
pbar.update()
screen_name = tweet['user']['screen_name'].lower()
if screen_name not in congress_twitter['twitter'].values: continue
if 'urls' not in tweet: continue
for url in tweet['urls']:
url_to_test = url['url']
if 'expanded_url' in url:
url_to_test = url['expanded_url']
account_to_urls[screen_name].append(url_to_test)
pbar.close()
Now, for every URL that was included in a tweet, we need to unshorten it so we can look at actual domains. We're assuming there's a one-to-one correspondence between domains and media sources. We need to make sure we preserve the relationship between the URL in the tweet and the unshortened URL so we can get back to individual tweets given an unshortened URL.
I've already done this, so I'll just load it.
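For reference, a minimal sketch of how that unshortening could be done with just the standard library (the actual run produced the saved unshortened_urls.json; the helper name and error handling here are illustrative, and the all_tweet_urls variable in the usage comment is hypothetical):

```python
import urllib.error
import urllib.request

def unshorten(url, timeout=10):
    """Follow redirects and return the final URL.

    Falls back to the original URL on network errors or malformed inputs,
    so a failed lookup doesn't drop the tweet from the dataset.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.geturl()
    except (urllib.error.URLError, ValueError):
        return url

# unshortened_urls = {u: unshorten(u) for u in all_tweet_urls}
```

In practice you'd want to run this concurrently (e.g. with concurrent.futures) since it's dominated by network latency.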
with open('unshortened_urls.json') as f:
unshortened_urls = json.load(f)
Now we'll extract the domain from every unshortened URL, build the map from Twitter account -> shared domains, and build a reversed map from domain -> Twitter accounts sharing that domain.
tweet_urls_to_domains = {}
for tweet_url, final_url in unshortened_urls.items():
tweet_urls_to_domains[tweet_url] = tldextract.extract(final_url).registered_domain
account_to_domains = collections.defaultdict(list)
for account, urls in account_to_urls.items():
account_to_domains[account] += [tweet_urls_to_domains[url] for url in urls]
domain_to_sharers = collections.defaultdict(set)
for account, domains in account_to_domains.items():
for domain in domains:
domain_to_sharers[domain].add(account)
Let's use the above to calculate the sharing stats for the shared domains.
domain_to_sharer_nominate = collections.defaultdict(list)
for account, domains in account_to_domains.items():
dw_nominate = congress_twitter[congress_twitter['twitter'] == account]['nominate_dim1'].values[0]
for domain in domains:
domain_to_sharer_nominate[domain].append(dw_nominate)
domain_to_sharer_nominate_s = {k : pd.Series(v) for k, v in domain_to_sharer_nominate.items()}
domain_stats = pd.DataFrame({k: {
'mean_dw': v.mean(),
'num_shares': v.count(),
'num_sharers': len(domain_to_sharers[k]),
'stddev_dw': v.std()} for k, v in domain_to_sharer_nominate_s.items()}).T
Let's take a look at the stats we just generated.
domain_stats.describe()
Observations
- The mean_dw score over all media sources is reasonably close to zero, but the min and max are the same as dw_nominate, which means the extrema Congressfolk were the only ones to share certain domains.
- num_sharers and num_shares are very power-lawy. It's mostly 1. We'll probably want to filter.
_ = domain_stats['mean_dw'].plot.hist(bins=100, xlim=(-1, 1))
Observations
- The distribution looks a lot like the dw_nominate distribution. On one hand, this could make sense because the media landscape shared by Congress would probably look like Congress itself. On the other hand, we used dw_nominate to generate these scores in the first place, so in cases where only a single Congressperson shared a domain, there's a one-to-one correspondence. That suggests filtering out domains by number of sharers.
- There's a peak near zero that isn't in the dw_nominate distribution. That suggests the presence of a large number of sites that are shared by both sides close to equally.

Let's look at the top and bottom of the site list when sorted by mean_dw (our partisanship score).
domain_stats.sort_values('mean_dw').head()
domain_stats.sort_values('mean_dw', ascending=False).head()
Like we noticed above, the ends of the distribution are dominated by sites that have few shares or sharers. We already know that most sites are shared by a single person, which means we're bringing in the dw_nominate distribution directly, so let's see what the distribution looks like if we filter those sites out (and compare it to the distribution above).
domain_stats['num_sharers'].value_counts().head()
dsf = domain_stats[domain_stats['num_sharers'] > 1] # dsf == domain stats filtered
dsf.describe()
dsf.describe() - domain_stats.describe()
_ = pd.DataFrame({
'all sites': domain_stats['mean_dw'],
'sites with >1 sharers': domain_stats[domain_stats['num_sharers'] > 1]['mean_dw']
}).plot.hist(bins=100, sharex=True, figsize=(12, 8), subplots=True, xlim=(-1, 1))
Observations
- Filtering out single-sharer sites removes the extreme peaks and concentrates the distribution around the center (the dw_nominate center).

Let's get to the graph where we see how our intuition of partisanship for certain sites matches the output of this.
sites_of_interest = ['nytimes.com', 'foxnews.com', 'cnn.com',
'wsj.com', 'npr.org', 'huffingtonpost.com', 'infowars.com',
'dailycaller.com', 'washingtonpost.com', 'msnbc.com',
'latimes.com', 'thehill.com',
'breitbart.com', 'washingtontimes.com', 'politico.com',
'nationalreview.com', 'drudgereport.com', 'nbcnews.com',
'theguardian.com', 'salon.com', 'vox.com',
'usatoday.com', 'rollcall.com', 'bloomberg.com',
'cbsnews.com', 'cnbc.com',
'forbes.com', 'time.com', 'theatlantic.com']
_ = dsf.loc[sites_of_interest].sort_values('mean_dw', ascending=False)['mean_dw'].plot.barh(figsize=(16, 8), xlim=(-1, 1), legend=False,
title="Mean dw_nominate score per domain")
Observations
- Is thehill.com as far to the right as nbcnews.com is to the left? Is forbes.com twice as conservative as cnbc.com?

Let's look at these sites in more detail.
_ = dsf.loc[sites_of_interest].sort_values('mean_dw').plot(figsize=(16, 10), kind='bar',
subplots=True, legend=False,
title='Detailed stats per domain')
Observations
Our goal here is to make sure that accounts that tweet a lot don't have an outsized effect on the partisanship score of a source - that is, we're normalizing by number of tweets per account.
It could also be the case that a Congressperson tweets about a whole bunch of different sites while another person tweets only links to a single site. How should that affect partisanship score? I dunno.
Let's take a simple example and see what intuition says. Our data set consists of two senators: Ed Markey, with a nominate score of -0.502, and Marco Rubio, with a nominate score of 0.585. If Markey tweets example.com links 30 times out of 1000 total tweets, and Rubio tweets example.com links 60 times out of 10000, what should we say the partisanship score of example.com is?
Intuitively, Markey tweets about example.com relatively more often, so I'd expect it to lean left. How far left? I dunno.
markey_dw, markey_domain_num, markey_total_num = -0.502, 30.0, 1000.0
rubio_dw, rubio_domain_num, rubio_total_num = 0.585, 60.0, 10000.0
markey_domain_share_in_account = markey_domain_num / markey_total_num
markey_share_total_vol = markey_total_num / (markey_total_num + rubio_total_num)
rubio_domain_share_in_account = rubio_domain_num / rubio_total_num
rubio_share_total_vol = rubio_total_num / (markey_total_num + rubio_total_num)
print('Markey shares: {:.2%} in account, {:.2%} of total'.format(markey_domain_share_in_account, markey_share_total_vol))
print('Rubio shares: {:.2%} in account, {:.2%} of total'.format(rubio_domain_share_in_account, rubio_share_total_vol))
Below are a bunch of different weighting schemes. I'm not good enough at this yet to know which one properly normalizes by account volume. It should be fairly simple, and my hunch is domain_share_in_account is correct, but I'm too fried right now to think the whole thing through without confusing myself. Instead, I'll just make a bunch of graphs and see what looks right.
weighting_schemes = {
'boolean': {
'm': 1.0,
'r': 1.0},
'num_tweets': {
'm': markey_domain_num,
'r': rubio_domain_num},
'domain_share_in_account': {
'm': markey_domain_share_in_account,
'r': rubio_domain_share_in_account},
'inverse_share_total_vol': {
'm': 1 / markey_share_total_vol,
'r': 1 / rubio_share_total_vol},
'num_tweets_over_share_total_vol': {
'm': markey_domain_num / markey_share_total_vol,
'r': rubio_domain_num / rubio_share_total_vol},
}
for name, weights in weighting_schemes.items():
score = (markey_dw * weights['m'] + rubio_dw * weights['r']) / (weights['m'] + weights['r'])
print('{:33} | score: {:6.3f} | markey weight: {:7.3f} | rubio weight: {:7.3f}'.format(name, score, weights['m'], weights['r']))
I'm hitting a wall trying to interpret this. Let's run them all against all the data and see what things look like.
total_vol = sum([len(urls) for urls in account_to_urls.values()])
account_to_share_total_vol = {k: len(u) / total_vol for k, u in account_to_urls.items()}
account_to_domain_counts = {k: collections.Counter(d) for k, d in account_to_domains.items()}
account_to_domain_pct = {k: {d: dc[d] / sum(dc.values()) for d in dc} for k, dc in account_to_domain_counts.items()}
domain_to_boolean_weighted_dw = collections.defaultdict(list)
domain_to_num_tweets_weighted_dw = collections.defaultdict(list)
domain_to_domain_share_in_account_weighted_dw = collections.defaultdict(list)
domain_to_inverse_share_total_vol_weighted_dw = collections.defaultdict(list)
domain_to_num_tweets_over_share_total_vol_weighted_dw = collections.defaultdict(list)
weight_sums = {}
for account, domains in account_to_domains.items():
dw_nominate = congress_twitter[congress_twitter['twitter'] == account]['nominate_dim1'].values[0]
for domain in set(domains):
if domain in weight_sums:
domain_weight_sums = weight_sums[domain]
else:
domain_weight_sums = {
'boolean': 0,
'num_tweets': 0,
'domain_share_in_account': 0,
'inverse_share_total_vol': 0,
'num_tweets_over_share_total_vol': 0,
}
boolean_weight = 1
domain_to_boolean_weighted_dw[domain].append(boolean_weight * dw_nominate)
domain_weight_sums['boolean'] += boolean_weight
num_tweets_weight = account_to_domain_counts[account][domain]
domain_to_num_tweets_weighted_dw[domain].append(num_tweets_weight * dw_nominate)
domain_weight_sums['num_tweets'] += num_tweets_weight
domain_share_in_account_weight = account_to_domain_pct[account][domain]
domain_to_domain_share_in_account_weighted_dw[domain].append(domain_share_in_account_weight * dw_nominate)
domain_weight_sums['domain_share_in_account'] += domain_share_in_account_weight
inverse_share_total_vol_weight = 1.0 / account_to_share_total_vol[account]
domain_to_inverse_share_total_vol_weighted_dw[domain].append(inverse_share_total_vol_weight * dw_nominate)
domain_weight_sums['inverse_share_total_vol'] += inverse_share_total_vol_weight
num_tweets_over_share_total_vol_weight = num_tweets_weight * inverse_share_total_vol_weight
domain_to_num_tweets_over_share_total_vol_weighted_dw[domain].append(num_tweets_over_share_total_vol_weight * dw_nominate)
domain_weight_sums['num_tweets_over_share_total_vol'] += num_tweets_over_share_total_vol_weight
weight_sums[domain] = domain_weight_sums
weight_lists = {
'boolean': domain_to_boolean_weighted_dw,
'num_tweets': domain_to_num_tweets_weighted_dw,
'domain_share_in_account': domain_to_domain_share_in_account_weighted_dw,
'inverse_share_total_vol': domain_to_inverse_share_total_vol_weighted_dw,
'num_tweets_over_share_total_vol': domain_to_num_tweets_over_share_total_vol_weighted_dw
}
weight_plot_data = {}
for name, weights in weight_lists.items():
as_series = {k: pd.Series(v) for k, v in weights.items()}
weighted_domain_stats = pd.DataFrame({k: {
'mean_dw': v.sum() / weight_sums[k][name]}
for k, v in as_series.items()}).T
weight_plot_data[name] = weighted_domain_stats['mean_dw']
weight_plot_data = pd.DataFrame(weight_plot_data)
_ = weight_plot_data.loc[sites_of_interest].sort_values('domain_share_in_account').plot.bar(subplots=True, figsize=(16, 16), width=0.6)
_ = weight_plot_data.loc[sites_of_interest].sort_values('domain_share_in_account', ascending=False).plot.barh(figsize=(8, 16), width=0.7, xlim=(-1, 1))
_ = weight_plot_data.plot.hist(figsize=(16, 8), bins=100, alpha=0.3, xlim=(-1, 1))
Observations
Now that I have a normalization scheme that looks sane, let's recreate the graphs at the top just to make sure everything looks right.
I wasn't sure how to calculate the standard error of a weighted mean. Turns out, there's no agreement, so I'm going to exclude it for now. Error bars would be nice though. I think I should be doing this with the graphs above too, because it's also a weighted mean.
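Just to make one of the candidate formulas concrete: a common heuristic treats the weighted mean as if it came from an effective sample size of n_eff = (Σw)² / Σw². This is only one of the competing estimators, not the settled answer:

```python
import math

def weighted_mean_se(values, weights):
    """Approximate the standard error of a weighted mean using the
    effective-sample-size heuristic n_eff = (sum w)^2 / sum(w^2).
    One of several proposed estimators, not a settled answer."""
    wsum = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / wsum
    # Weighted variance of the observations around the weighted mean.
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / wsum
    n_eff = wsum ** 2 / sum(w * w for w in weights)
    return math.sqrt(var / n_eff)
```

With equal weights this reduces to the usual (population) standard deviation divided by sqrt(n).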
domain_to_nominate_scores = collections.defaultdict(list)
domain_sums = collections.defaultdict(int)
for account, domain_pcts in account_to_domain_pct.items():
dw_nominate = congress_twitter[congress_twitter['twitter'] == account]['nominate_dim1'].values[0]
for domain, share in domain_pcts.items():
domain_to_nominate_scores[domain].append(share * dw_nominate)
domain_sums[domain] += share
domain_to_nominate_scores_series = {k : pd.Series(v) for k, v in domain_to_nominate_scores.items()}
domain_stats_normed = pd.DataFrame({domain: {
'mean_dw': scores.sum() / domain_sums[domain],
'num_shares': sum([counts[domain] for counts in account_to_domain_counts.values()]),
'num_sharers': len(domain_to_sharers[domain])} for domain, scores in domain_to_nominate_scores_series.items()}).T
dsn = domain_stats_normed
dsnf = domain_stats_normed[domain_stats_normed['num_sharers'] > 1]
dsn.describe()
_ = dsnf.loc[sites_of_interest].sort_values('mean_dw', ascending=False)['mean_dw'] \
.plot.barh(figsize=(16, 8), legend=False, xlim=(-1, 1),
title="Weighted mean dw_nominate score for a selection of domains")
_ = dsnf.loc[sites_of_interest].sort_values('mean_dw').plot(figsize=(16, 10), kind='bar',
subplots=True, legend=False,
title='Detailed stats per domain')
_ = dsnf['mean_dw'].plot.hist(bins=100, xlim=(-1, 1))
Observations
Instead of incorporating the actual dw_nominate scores, let's instead just count Democrats and Republicans sharing each domain. For each domain we'll give it a one if shared by a Republican, a negative one if shared by a Democrat, and we'll average over those scores.
def is_democrat(handle):
return (congress_twitter[congress_twitter['twitter'] == handle]['nominate_dim1'] < 0).values[0]
def is_republican(handle):
return (congress_twitter[congress_twitter['twitter'] == handle]['nominate_dim1'] > 0).values[0]
unweighted_domain_scores = collections.defaultdict(list)
for domain, sharers in domain_to_sharers.items():
for handle in sharers:
if is_democrat(handle):
unweighted_domain_scores[domain].append(-1)
elif is_republican(handle):
unweighted_domain_scores[domain].append(1)
else:
print(handle)
unweighted_domain_scores_mean = {domain: {'mean': pd.Series(scores).mean(), 'num_sharers': len(scores)} for domain, scores in unweighted_domain_scores.items()}
uds = pd.DataFrame(unweighted_domain_scores_mean).T # uds = unweighted domain scores
Let's look at the same set of sites as before with these new scores and compare to the normalized scores:
_ = uds['mean'].loc[sites_of_interest].sort_values(ascending=False).plot.barh(figsize=(6, 8), xlim=(-1, 1))
Observations
unweighted_weighted_compared = pd.DataFrame({'unweighted': uds['mean'], 'weighted': dsnf['mean_dw']})
_ = unweighted_weighted_compared.loc[sites_of_interest].sort_values('weighted', ascending=False) \
.plot.barh(figsize=(12, 12), xlim=(-1, 1),
title="Unweighted mean boolean score for a selection of domains")
Observations
- The unweighted scores stretch further toward the extremes than the weighted ones, since they aren't anchored to the dw_nominate distribution.
- nytimes.com moves much, much closer to the center. It's probably worth trying to understand that change. My guess is nytimes.com is shared by close to an equal number of Congressfolk on the left and right, but the left shares it much more often.

All the sites are listed below along with the percentage difference between weighted and unweighted methods:
((unweighted_weighted_compared['unweighted'] - unweighted_weighted_compared['weighted']) /
unweighted_weighted_compared['unweighted']).abs().loc[sites_of_interest].sort_values(ascending=False) * 100
Let's see what the distribution looks like:
_ = uds['mean'].plot.hist(bins=100, xlim=(-1, 1))
It must be dominated by sites with a single sharer. Let's filter to greater than one.
_ = uds[uds['num_sharers'] > 1]['mean'].plot.hist(bins=49, xlim=(-1, 1))
This is what I was worried about in the "I don't know statistics" section. There's a bias here that has a harmonic effect. We get a peak in the middle where there are two sharers, peaks further out from the center where there are three sharers, etc.
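That effect is easy to simulate with a crude model where every sharer contributes +1 (R) or -1 (D) with equal probability (the site count, probability, and function name here are made up for illustration):

```python
import random

def simulated_score_values(n_sharers, n_sites=10000, seed=0):
    """Simulate unweighted scores: each of n_sites sites is shared by
    n_sharers congresspeople, each contributing -1 (D) or +1 (R) at random."""
    rng = random.Random(seed)
    return [sum(rng.choice((-1, 1)) for _ in range(n_sharers)) / n_sharers
            for _ in range(n_sites)]

# A site with n sharers can only take n + 1 distinct scores, so histograms
# of small-n sites show spikes at exactly those values: -1, 0, 1 for n = 2.
two_sharer_scores = set(simulated_score_values(2))
```

As n_sharers grows, the reachable scores pack together and the spikes smear out, which is the pattern the joyplot below should show.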
Let's see what it looks like as we increase the number of sharers.
ud = {}
for i, shares in enumerate(list(range(0, 6)) + list(range(10, 490, 20))):
ud['{}. gt {} sharers'.format(chr(i + 65), shares)] = uds[uds['num_sharers'] > shares]['mean']
_ = joypy.joyplot(pd.DataFrame(ud), ylim="own", overlap=0, bins=49, hist=True, figsize=(6, 18),
x_range=(-1, 1), grid="y", linewidth=0,
title="Dist. of media source partisanship as # of congressional sharers increases")
Observations
min_sharers = 30
udsf = uds[uds['num_sharers'] >= min_sharers]
print('{} sites with >= {} sharers'.format((uds['num_sharers'] >= min_sharers).sum(), min_sharers))
_ = udsf['mean'].plot.hist(bins=50, xlim=(-1, 1))
Observations
Let's compare the distribution for this method with a similarly filtered distribution for the weighted method.
_ = pd.DataFrame({'unweighted': udsf['mean'],
'weighted': dsn[dsn['num_sharers'] >= min_sharers]['mean_dw']}).plot.hist(bins=80, xlim=(-1, 1), subplots=True)
Observations
- The weighted distribution looks more like the dw_nominate distribution.
- Because the unweighted method ignores the actual dw_nominate scores, sites shared only by one party will get pushed far to the edges. When we are weighting, sites that are shared by only one party are more likely to end up near the mean of that party's dw_nominate score. That actually seems more informative than the unweighted method. Two sites that are each shared by 30 Democrats and zero Republicans should not necessarily be of equal partisanship.

Let's compare partisanship scores between the election study retweet method and the Congressional tweets dw_nominate method. We should also compare the election retweet method and the simpler D & R party count method.
First, let's load in the partisanship scores from the election study.
election_scores = pd.read_csv('election_retweeter_polarization_media_scores.csv')
election_scores['domain'] = election_scores['url'].apply(lambda u: tldextract.extract(u).registered_domain)
election_scores.set_index('domain', inplace=True)
election_scores.index.value_counts().head(10)
We're going to have to join on URL, so I pulled out the domains. One issue that can be seen above is that some domains have multiple "media sources" within them, and so should be counted separately, while others should not. (For wordpress.com and blogspot.com it makes sense to treat each subdomain as a separate source. For cbslocal.com, that's less obvious.) We'll have to deal with this properly, but for now I'll just exclude all duplicate domains.
Let's look at the election retweet method vs the congressional dw_nominate first.
retweet_dwnom = election_scores.join(dsnf)
retweet_dwnom = retweet_dwnom.rename({
'score': 'election_retweet',
'mean_dw': 'congress_dwnom'}, axis='columns')
retweet_dwnom = retweet_dwnom[~retweet_dwnom.index.duplicated(keep=False)].dropna()
print('{} sites in common'.format(retweet_dwnom.shape[0]))
_ = seaborn.jointplot('election_retweet', 'congress_dwnom', retweet_dwnom, alpha=0.5,
xlim=(-1.1, 1.1), ylim=(-1.1, 1.1))
Observations
- The correlation of 0.69 seems reasonable, though it could be better.

Let's look at the outliers.
retweet_dwnom['diff'] = (retweet_dwnom['election_retweet'] - retweet_dwnom['congress_dwnom']).abs()
retweet_dwnom.to_csv('media_partisanship_scores-election_retweet_vs_congress_tweet_dwnomin.csv')
retweet_dwnom.sort_values('diff', ascending=False).head(20)[['election_retweet', 'congress_dwnom', 'diff', 'url', 'num_sharers', 'num_shares']]
Observations
- eagleforum.org is a pro-life site that was tweeted once by an R ("Americans Lose While Immigrants Gain") and once by a D ("A bizarrely sourced & false set of arguments..."). Same deal with donaldjtrump.com.

Let's do a quick comparison with the simple party count method since it's a similar scoring method.
retweet_party_count = election_scores.join(udsf)
retweet_party_count = retweet_party_count.rename({
'score': 'election_retweet',
'mean': 'congress_party_count'}, axis='columns')
retweet_party_count = retweet_party_count[~retweet_party_count.index.duplicated(keep=False)].dropna()
print('{} sites in common'.format(retweet_party_count.shape[0]))
_ = seaborn.jointplot('election_retweet', 'congress_party_count', retweet_party_count, alpha=0.5)
Observations
And the outliers.
retweet_party_count['diff'] = (retweet_party_count['election_retweet'] - retweet_party_count['congress_party_count']).abs()
retweet_party_count.sort_values('diff', ascending=False).head(20)[['election_retweet', 'congress_party_count', 'diff', 'url', 'num_sharers']]
Observations
- Several of the outliers are apolitical sites (ca.gov, eventbrite.com, today.com...). Suggests the two methods based on the Congressional tweets agree with each other on those sites.
- theintercept.com and change.org do not seem right of center. Kind of hard to imagine how the differing sharing patterns of public Twitter vs. congressional Twitter can account for those differences. Libertarianism and trolling?
- politifact.com and factcheck.org. Easier to imagine how the differing sharing patterns can account for these. That could be an interesting result. Party elites might share close to equally, while regular Twitter users on the left might share a lot more than the right.

I thought it might be neat to recreate the figure depicting Twitter shares of the top 250 media outlets across the political spectrum using this data.
num_bins = 20
bins = [(i - num_bins / 2) / (num_bins / 2) for i in range(num_bins + 1)]
colors = ['#0d3b6e', '#0d3b6e', '#0d3b6e', '#0d3b6e',
'#869db6', '#869db6', '#869db6', '#869db6',
'#2a7526', '#2a7526', '#2a7526', '#2a7526',
'#d8919e', '#d8919e', '#d8919e', '#d8919e',
'#b1243e', '#b1243e', '#b1243e', '#b1243e']
ranked_dsnf = dsnf.assign(shares_rank=dsnf['num_shares'].rank(ascending=False))
top_ranked_dsnf = ranked_dsnf[ranked_dsnf['shares_rank'] <= 250]
_ = top_ranked_dsnf.groupby(pd.cut(top_ranked_dsnf['mean_dw'], bins)).sum().plot.bar(y='num_shares', color=colors)
This doesn't look anything like the figure from the paper. Twitter share counts look a lot more power-law like here. Let's look at the top shares.
_ = top_ranked_dsnf['num_shares'].sort_values(ascending=False).head(40).plot.bar(figsize=(16, 4))
Those top 5 aren't really outlets that we're traditionally looking at, so let's filter them out.
top_ranked_dsnf = ranked_dsnf[ranked_dsnf['shares_rank'].between(6, 250)]
_ = top_ranked_dsnf.groupby(pd.cut(top_ranked_dsnf['mean_dw'], bins)).sum().plot.bar(y='num_shares', color=colors)
I think there are two things happening here. One, Congress might be less willing to link to crazy fringe stuff than the general public. And two, sites that are shared by more Congressfolk will have more shares, but they're also more likely to fall in the center because more sharers means the sharers will have a higher dw_nominate variance, and we're giving a partisanship score based on the average of all the folks that shared it.
Well, what does partisanship vs. the number of sharers look like then? Once we get up above the number of members in a single party, how quickly does it move toward the center?
d = {}
for i, shares in enumerate(list(range(0, 6)) + list(range(10, 490, 20))):
d['{}. gt {} sharers'.format(chr(i + 65), shares)] = dsn[dsn['num_sharers'] > shares]['mean_dw']
_ = joypy.joyplot(pd.DataFrame(d), ylim="own", bins=40, overlap=1, figsize=(12, 9),
alpha=0.8, x_range=(-1, 1), linewidth=0.3, bw_method=0.05,
title="Dist. of media source partisanship as # of congressional sharers increases")
This is a neato Rob graph. I guess it's traditionally called a joyplot (from the Joy Division album cover) or a ridgeline graph.
Observations
- The center-right stays sparse even though, looking at the dw_nominate distribution, there are actually some folks in that area. I'd like to say this is a nice confirmation of the absent center-right point in the election paper, but I don't know if it's that or just the relative absence of center-right Congressfolk or some statistical quirk of the model that biases it against that range of numbers.
- The site on the left that makes it to >430 sharers is nytimes.com, and the one on the right that makes it to >430 is wsj.com.
- The domains that survive the largest sharer cutoffs are twitter.com, youtube.com, facebook.com, thehill.com, house.gov, washingtonpost.com and c-span.org.

Do some partisanship bins have more media sources than others? This kind of analysis is getting too far into ouroboros territory, but I want to see it anyway.
num_bins = 20
bins = [(i - num_bins / 2) / (num_bins / 2) for i in range(num_bins + 1)]
congress_partisanship_bins = congress_twitter.groupby(
pd.cut(congress_twitter['nominate_dim1'], bins))['nominate_dim1'].count()
media_partisanship_bins = dsnf.groupby(pd.cut(dsnf['mean_dw'], bins))['mean_dw'].count()
media_source_per_congress_by_partisanship = media_partisanship_bins / congress_partisanship_bins
media_source_per_congress_by_partisanship.loc[media_source_per_congress_by_partisanship.isna() | (media_source_per_congress_by_partisanship == float('inf'))] = 1
_ = pd.DataFrame([congress_partisanship_bins,
media_partisanship_bins,
media_source_per_congress_by_partisanship.apply(math.log)]).T.plot.bar(subplots=True)
Observations
Is worrying about the dw_nominate distribution leaking into the media distribution a real concern? I dunno, but I can graph them on top of each other.
_ = pd.DataFrame({'dw_nomin': congress_twitter['nominate_dim1'], 'media': dsnf['mean_dw']}) \
.plot.hist(bins=100, sharex=True, density=True, figsize=(12, 8), alpha=0.5,
title="Dist. of media partisanship vs. dist. of dw_nominate scores")
Observations
Same as the joyplot, just a bunch of histograms instead.
_ = pd.DataFrame(d).plot.hist(bins=50, figsize=(10, 20), sharex=True, subplots=True, xlim=(-1, 1))
_ = dsnf.plot.scatter('mean_dw', 'num_sharers', xlim=(-1, 1))
ranked_dsnf = dsnf.assign(sharers_rank=dsnf['num_sharers'].rank(ascending=False))
top_ranked_dsnf = ranked_dsnf[ranked_dsnf['sharers_rank'] <= 250]
top_ranked_dsnf.groupby(pd.cut(top_ranked_dsnf['mean_dw'], bins)).sum().plot.bar(y='num_sharers', color=colors)
dsnf.sort_values('num_sharers', ascending=False).head(20)
dsnf.sort_values('num_shares', ascending=False).head(10)
dsn.join(uds.drop('num_sharers', axis='columns'), how='outer') \
.rename({'mean_dw': 'congress_dwnom', 'mean': 'congress_party_count'}, axis='columns') \
.to_csv('media_partisanship_from_congressional_tweets.csv')
Which accounts are sharing a particular domain, and how often?
domain = 'reddit.com'
domain_to_sharers[domain]
domain_urls = set([k for k, v in tweet_urls_to_domains.items() if v == domain])
{k: len(set(v).intersection(domain_urls)) for k, v in account_to_urls.items() if len(set(v).intersection(domain_urls)) > 0}
We're averaging points from a bimodal distribution. We can weight them in different ways, but the underlying distribution limits the output distribution. If we were to randomly sample n points (where n is chosen from some power distribution) from the bimodal distribution (to represent congresspeople sharing stuff) and draw a histogram of their averages, I think we'd end up with hotspots. The center is one obvious hotspot, but the hotspots are periodic with varying strengths at some fundamental frequency.

For example, let's say two congresspeople share a media source. There are three possibilities: DD, DR, RR. DD is going to be near the mean for D, RR is going to be near the mean for R, and DR is going to be near the center. If only two congresspeople share that media source, it's near impossible for our estimate of its partisanship to be anything but one of those three options. It might be center-right in "reality", but the distribution of the observable in conjunction with our model rules that out entirely.

Ah crap, I think that is just the nature of statistics? If my sample size is small, my estimator is bad. Grr, but this is still an issue if we have more Rs than Ds. We need to write out our assumptions about the distribution of partisanship of media sources. If we want to assume a uniform distribution, shouldn't we need to correct for an observable distribution that's not uniform?
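The small-sample case can be enumerated directly. Assuming illustrative party means of -0.45 (D) and 0.48 (R) (made-up round numbers, not the actual values computed above), each additional sharer adds only one more reachable score:

```python
def reachable_scores(n_sharers, d_mean=-0.45, r_mean=0.48):
    """All scores a site shared by n_sharers people can take if every
    sharer contributes exactly their party's mean (illustrative means).
    With k Democrats and n - k Republicans the mean is fully determined,
    so there are only n + 1 possible outcomes."""
    return sorted({(k * d_mean + (n_sharers - k) * r_mean) / n_sharers
                   for k in range(n_sharers + 1)})
```

With two sharers the only options are DD, DR, and RR, so a genuinely center-right site can't land anywhere near its true score until the sharer count grows.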
Let's think about this more statistically. There is a population of media sources. Each media source has this latent variable that is its "partisanship". That hidden variable need not be stable for each media source, but we'll assume it's slow moving. We're trying to measure that hidden variable by looking at observable variables.
We know the dimensions of the latent variable space ([-1, 1]). Let's say we're trying to estimate the distribution of this latent space. That's a fundamentally different question than estimating the partisanship of a single source. If we're trying to estimate the distribution...
Maybe a different approach. Let's take a single congressperson and the top 1M websites. Our prior is that each of those websites has an equal probability of getting shared by the given congressperson. One tweet from that congressperson is an observation, and it updates our prior. It was a choice by the congressperson to tweet that one domain and not others. How do choice models work?
The above would give us an estimate of the probability of each congressperson tweeting each of the top million sites. We now have 541 distributions. How could those be aggregated in a way that makes sense? For each website, our prior is that it has equal probability of falling anywhere on the partisanship distribution. For every congressperson, we have our expectation of how likely they are to tweet the given domain. That's an observation. We take their dw_nominate score and update the website prior. That's basically what we're doing already. How do error bars work? For some congresspeople, we know more than we know about others because they tweet more.