The goal of this notebook is to estimate the ideology of various media sources using the approach from Pablo Barberá's 2015 paper, "Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data". That paper estimates the political ideology ("ideal point") of Twitter users by looking at who follows them (for "elites") and who they follow (for the rest of us).
Barberá has some code for estimating user ideologies around the 2016 U.S. Presidential election. I ran a slightly modified version of that code, which resulted in ideology estimates for around 54 million politically engaged Twitter users.
I picked a random sample of 10,000 users from that set and fetched their full Twitter histories. I filtered that set of tweets down to those containing URLs, then munged each URL down to its registered domain. That lets us group tweets by domain.
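The munging step can be sketched as follows. Later in the notebook I use tldextract for this; here's a stdlib-only approximation just for illustration — it handles only a few hardcoded two-part public suffixes, so it's not the real implementation:

```python
# Rough sketch of the URL "munging" step: reduce each tweeted URL to a
# registered domain. The real work uses tldextract; this stdlib-only
# approximation covers only a few two-part public suffixes by design.
from urllib.parse import urlparse

TWO_PART_SUFFIXES = {'co.uk', 'com.au', 'co.jp'}  # deliberately incomplete

def to_registered_domain(url):
    host = urlparse(url).hostname or ''
    parts = host.split('.')
    # 'www.bbc.co.uk' -> 'bbc.co.uk'; 'www.nytimes.com' -> 'nytimes.com'
    if len(parts) >= 3 and '.'.join(parts[-2:]) in TWO_PART_SUFFIXES:
        return '.'.join(parts[-3:])
    return '.'.join(parts[-2:]) if len(parts) >= 2 else host

print(to_registered_domain('https://www.nytimes.com/2016/11/09/politics.html'))
print(to_registered_domain('http://www.bbc.co.uk/news'))
```

tldextract does this properly against the full Public Suffix List, which is why it's the better choice for real data.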
For each domain, I gathered together the unique users that tweeted that domain. I excluded domains with fewer than 30 unique tweeters from further analysis. For the 647 domains remaining, I took the mean of the estimated political ideologies of the users that tweeted them out.
Here's a graph of the mean audience ideologies of a handful of known news media sites:
_ = domain_ideo_means.sort_values(ascending=False).plot.barh(figsize=(10, 18), legend=False)
It looks pretty good to me. Below, I do a bit more validation work, comparing it to other estimators we've played with, as well as to mediabiasfactcheck.com's bias scores.
There are a couple of ways we can use Barberá's estimates.
Let's start off by using shares by accounts with estimated ideologies. To do that, we'll first sample 10,000 Twitter users from our set and collect their Twitter histories.
import pandas as pd
est_ideo = pd.read_csv('ideology-estimates-20160101.csv', index_col=0)
est_ideo.describe()
Let's see what the distributions look like.
%matplotlib inline
import seaborn as sb
sb.set()
_ = sb.distplot(est_ideo['theta'], kde=False)
Observations
g = sb.distplot(est_ideo['pol.follow'], kde=False)
_ = g.set(yscale="log", xscale='log')
Observations
Now we'll actually sample 10k accounts.
# I won't run this again, to keep from overwriting my original sample.
# with open('sampled_account_ids_10k.txt', 'w') as f:
#     for aid in est_ideo.sample(10000)['id'].values:
#         f.write(str(aid) + "\n")
I've run a few scripts outside of this notebook:
grep '"urls": \[{"'
I'll read all of those tweets into a data structure that maps each domain to the ideologies of the unique users that shared that domain. I'll only consider domains shared by at least 30 unique users.
import collections
import json

MAX_COUNT = None  # set to a small number to test on a subset of tweets
MIN_USERS = 30
count = 0
domain_to_users = collections.defaultdict(set)
domain_to_ideo = collections.defaultdict(list)
with open('10k_sampled_accounts_tweets_with_augmented_urls.txt') as f:
    for line in f:
        if MAX_COUNT and count >= MAX_COUNT:
            break
        tweet = json.loads(line)
        user_id = int(tweet['user']['id'])
        try:
            user_ideo = est_ideo.at[user_id, 'theta']
        except KeyError:
            continue  # no ideology estimate for this user
        for url in tweet['urls']:
            domain = url['domain']
            if user_id in domain_to_users[domain]:
                continue  # count each user only once per domain
            domain_to_users[domain].add(user_id)
            domain_to_ideo[domain].append(user_ideo)
        count += 1
dom_ideo = pd.DataFrame.from_dict(
    {k: v for k, v in domain_to_ideo.items() if len(v) >= MIN_USERS},
    orient='index').T
Let's see what that data looks like.
display(dom_ideo.shape)
dom_ideo.count().sort_values(ascending=False).head(20)
OK, so we've got 647 domains. I should remember that we can get ideology estimates for more domains by increasing the number of users that we sample. We're sampling 10k of 54M right now.
Twitter has the most unique user shares at 2,520, which makes sense. It drops off pretty quickly, as we'd expect with a power law.
Let's plot all the distributions on top of each other to see what we're working with.
_ = dom_ideo.plot(kind='kde', figsize=(8,8), legend=False, alpha=0.1, color='#7777dd')
Observations
We've got a distribution of audience ideologies for every domain; now we need a way to reduce each distribution to a single score. The obvious choice is the mean, so let's start with that.
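Since the mean is sensitive to a long tail of extreme users, it's worth keeping a robust alternative like the median in mind (in pandas, `dom_ideo.median()` alongside `dom_ideo.mean()`). A self-contained toy example with synthetic data — not the real `dom_ideo` — shows how a small far-right tail drags the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
# a mostly-left toy audience with a small far-right tail
audience = np.concatenate([rng.normal(-0.5, 0.3, 95),
                           rng.normal(3.0, 0.2, 5)])
# the tail pulls the mean to the right of the median
print(f"mean={audience.mean():.2f} median={np.median(audience):.2f}")
```

For domains with roughly symmetric audiences the two scores should agree; large gaps between them would flag domains whose score is driven by a fringe of extreme sharers.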
Let's pick a subset of sites that we know a little bit about and see what their mean scores look like.
news_media_domains = [
'alarabiya.net', 'aljazeera.com', 'americanthinker.com', 'bbc.com',
'bbc.co.uk', 'bloomberg.com', 'bostonglobe.com', 'breitbart.com',
'buzzfeed.com', 'cbc.ca', 'cbsnews.com', 'chicagotribune.com', 'cnbc.com',
'cnn.com', 'csmonitor.com', 'dailycaller.com', 'dailykos.com',
'dailymail.co.uk', 'economist.com', 'forbes.com', 'foreignpolicy.com',
'fortune.com', 'foxnews.com', 'haaretz.com', 'hindustantimes.com',
'huffingtonpost.com', 'huffpost.com', 'independent.co.uk', 'infowars.com',
'latimes.com', 'miamiherald.com', 'motherjones.com', 'msnbc.com',
'nationalreview.com', 'nbcnews.com', 'newsweek.com', 'newyorker.com',
'npr.org', 'nydailynews.com', 'nypost.com', 'nytimes.com', 'pbs.org',
'politico.com', 'propublica.org', 'realclearpolitics.com','reuters.com',
'rollcall.com', 'rt.com', 'salon.com', 'sky.com', 'slate.com',
'sputniknews.com', 'theatlantic.com', 'theguardian.com', 'thehill.com',
'time.com', 'usatoday.com', 'vox.com', 'washingtonpost.com',
'washingtontimes.com', 'weeklystandard.com', 'westernjournal.com', 'wsj.com',
'zerohedge.com',
]
non_news_domains = [
'aclu.org', 'change.org', 'cosmopolitan.com', 'facebook.com', 'google.com',
'harvard.edu', 'hbr.org', 'mit.edu', 'patreon.com', 'politifact.com',
'reddit.com', 'reuters.com', 'twitter.com', 'wikileaks.org', 'youtube.com',
]
domains = news_media_domains + non_news_domains
domain_ideo_means = dom_ideo.loc[:,news_media_domains].mean()
_ = domain_ideo_means.sort_values(ascending=False).plot.barh(figsize=(10, 18), legend=False)
Observations
The means look so convincing, it's probably worth reminding ourselves how wide the audience distributions are.
_ = dom_ideo.loc[:,news_media_domains].T.reindex(domain_ideo_means.sort_values(ascending=False).index) \
.T.plot.box(figsize=(10, 18), vert=False)
I made a couple more graphs of the same stuff that are pretty and somewhat elucidating, so here they are.
import matplotlib.colors as mcol
pol_cm = mcol.LinearSegmentedColormap.from_list("Pol",["#3771f3", "#b147cc", "#d62222"])
_ = dom_ideo.T.loc[news_media_domains,].reindex(domain_ideo_means.sort_values().index)\
.T.plot(subplots=True, kind='kde', layout=(10,8), figsize=(16,16), sharex=True, sharey=True, colormap=pol_cm)
_ = dom_ideo.T.loc[news_media_domains,].reindex(domain_ideo_means.sort_values().index)\
.T.plot(kind='kde', figsize=(8,8), legend=False, colormap=pol_cm, alpha=0.5)
import joypy
_ = joypy.joyplot(dom_ideo.T.loc[news_media_domains,].reindex(domain_ideo_means.sort_values().index).T,
overlap=0.6, figsize=(8, 12), alpha=0.95, x_range=(-2.3, 5), linewidth=0.5, bw_method=0.15,
colormap=pol_cm)
Let's compare to the 2016 election Trump/HRC retweet ideology estimation, and the ideology estimator from Congressional tweets I was working on.
import tldextract
retweeter_ideo = pd.read_csv('election_retweeter_polarization_media_scores.csv')
retweeter_ideo['domain'] = retweeter_ideo['url'].apply(lambda u: tldextract.extract(u).registered_domain)
retweeter_ideo.set_index('domain', inplace=True)
# Remove duplicates. See discussion in "Media Source Partisanship as Measured by Congressional Tweets"
retweeter_ideo = retweeter_ideo[~retweeter_ideo.index.duplicated(keep=False)].dropna()
retweeter_ideo.head(10)
audience_ideo = dom_ideo.mean()  # per-domain mean audience ideology
audience_ideo.name = 'ideo_by_mean_audience_ideo'
congress_tweet_ideo = pd.read_csv('media_partisanship_from_congressional_tweets.csv', index_col=0)
joined_ideo_est = retweeter_ideo.join(congress_tweet_ideo).join(audience_ideo).dropna()
joined_ideo_est = joined_ideo_est.rename({
'score': 'ideo_by_retweet',
'congress_dwnom': 'ideo_by_congress_tweet',
'score_by_followers': 'ideo_by_mean_audience_ideo'}, axis='columns')
joined_ideo_est.head()
sb.set(style="ticks", color_codes=True)
_ = sb.pairplot(joined_ideo_est, vars=['ideo_by_retweet', 'ideo_by_congress_tweet', 'ideo_by_mean_audience_ideo'])
print("Num sites:", joined_ideo_est.shape[0])
joined_ideo_est.loc[:,['ideo_by_retweet', 'ideo_by_congress_tweet', 'ideo_by_mean_audience_ideo']].corr()
Observations
I don't like the fact that I'm not validating against external datasets. I've requested access to the Facebook estimates from 2015, but until that comes through, let's look at Media Bias/Fact Check data. I couldn't find an official source for it, but I did find someone who scraped the site and put up their ratings here. Let's add that into the mix. The data I have gives a text tag for each domain ("left", "right", etc.), so I turned those into points on a scale from -1 to 1.
mbfc = pd.read_csv('domain_information.csv', index_col=1)
mbfc_ideo = mbfc['mediabiasfactcheck'].dropna().map({
    'left': -1.0,
    'left_center': -0.5,
    'least_biased': 0.0,
    'right_center': 0.5,
    'right': 1.0}).dropna().sort_values()
with_mbfc_ideo_est = joined_ideo_est.join(mbfc_ideo)\
.rename({'mediabiasfactcheck': 'ideo_by_mbfc'}, axis='columns').dropna()
_ = sb.pairplot(with_mbfc_ideo_est,
vars=['ideo_by_retweet', 'ideo_by_congress_tweet', 'ideo_by_mean_audience_ideo', 'ideo_by_mbfc'])
sb.set()
print("Num sites:", with_mbfc_ideo_est.shape[0])
with_mbfc_ideo_est\
.loc[:,['ideo_by_retweet', 'ideo_by_congress_tweet', 'ideo_by_mean_audience_ideo', 'ideo_by_mbfc']].corr()
Observations
The Facebook estimates mentioned above have since come through, so let's add them to the comparison too.
facebook_ideo = pd.read_csv('facebook_ideology_estimates.csv', index_col=0,
converters={'domain': lambda d: tldextract.extract(d).registered_domain})
with_facebook_ideo_est = joined_ideo_est.join(facebook_ideo)\
.rename({'avg_align': 'ideo_by_facebook'}, axis='columns').dropna()
with_facebook_ideo_est = with_facebook_ideo_est[~with_facebook_ideo_est.index.duplicated(keep=False)].dropna()
print("Num sites:", with_facebook_ideo_est.shape[0])
with_facebook_ideo_est\
.loc[:,['ideo_by_retweet', 'ideo_by_congress_tweet', 'ideo_by_mean_audience_ideo', 'ideo_by_facebook']].corr()
sb.set(style='ticks')
_ = sb.pairplot(with_facebook_ideo_est,
vars=['ideo_by_retweet', 'ideo_by_congress_tweet', 'ideo_by_mean_audience_ideo', 'ideo_by_facebook'])
sb.set()
Observations
Let's dig into the outliers.
import plotly.offline as plotly
import plotly.graph_objs as go
plotly.init_notebook_mode()
scatter1 = go.Scattergl(
    y=joined_ideo_est['ideo_by_retweet'],
    x=joined_ideo_est['ideo_by_mean_audience_ideo'],
    mode='markers',
    text=joined_ideo_est.index,
    marker=dict(
        # color=joined_ideo_est.index.isin(news_media_domains) * 1
    )
)
layout1 = go.Layout(
    title='Comparison of Ideology Score Metrics',
    hovermode='closest',
    xaxis=dict(title='Ideology by Mean Audience Ideology'),
    yaxis=dict(title='Ideology by Trump/HRC Retweet'),
)
scatter2 = go.Scattergl(
    y=joined_ideo_est['ideo_by_congress_tweet'],
    x=joined_ideo_est['ideo_by_mean_audience_ideo'],
    mode='markers',
    text=joined_ideo_est.index,
    marker=dict(
        # color=joined_ideo_est.index.isin(news_media_domains) * 1
    )
)
layout2 = go.Layout(
    title='Comparison of Ideology Score Metrics',
    hovermode='closest',
    xaxis=dict(title='Ideology by Mean Audience Ideology'),
    yaxis=dict(title='Ideology by Congressional Tweets'),
)
scatter3 = go.Scattergl(
    y=with_mbfc_ideo_est['ideo_by_mbfc'],
    x=with_mbfc_ideo_est['ideo_by_mean_audience_ideo'],
    mode='markers',
    text=with_mbfc_ideo_est.index,
    marker=dict(
        # color=with_mbfc_ideo_est.index.isin(news_media_domains) * 1
    )
)
layout3 = go.Layout(
    title='Comparison of Ideology Score Metrics',
    hovermode='closest',
    xaxis=dict(title='Ideology by Mean Audience Ideology'),
    yaxis=dict(title='Ideology by Media Bias/Fact Check'),
)
scatter4 = go.Scattergl(
    y=with_facebook_ideo_est['ideo_by_facebook'],
    x=with_facebook_ideo_est['ideo_by_mean_audience_ideo'],
    mode='markers',
    text=with_facebook_ideo_est.index,
    marker=dict(
        # color=with_facebook_ideo_est.index.isin(news_media_domains) * 1
    )
)
layout4 = go.Layout(
    title='Comparison of Ideology Score Metrics',
    hovermode='closest',
    xaxis=dict(title='Ideology by Mean Audience Ideology'),
    yaxis=dict(title='Ideology by Facebook'),
)
plotly.iplot(go.Figure(data=[scatter1], layout=layout1))
Observations
plotly.iplot(go.Figure(data=[scatter2], layout=layout2))
Observations
liveleak.com, imgur.com, wikileaks.org, donaldjtrump.com, sun-sentinel.com, torontosun.com, thefreethoughtproject.com are outliers.
plotly.iplot(go.Figure(data=[scatter3], layout=layout3))
Observations
observer.com, torontosun.com, timesofisrael.com, mediaite.com are the big outliers.
plotly.iplot(go.Figure(data=[scatter4], layout=layout4))
Observations
This might be an interesting complement to the other work I was doing, using Congressional sharing on Twitter and DW-NOMINATE to estimate source ideology. Say we bucket Congressional Twitter users and regular Twitter users by ideology: how does the sharing of a given domain vary between Congress and regular folks within those buckets? Which domain-sharing patterns look the same between Congress and citizens, and which look different? Are those patterns growing more alike or more different over time? Across the whole ideology spectrum? Who's shifting?
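The bucketing idea could be sketched like this in pandas. Everything here is synthetic — the data and the column names (`group`, `theta`, `domain`) are hypothetical, just to illustrate the shape of the comparison:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# synthetic shares: who (by group and ideology) shared which domain
shares = pd.DataFrame({
    'group': rng.choice(['congress', 'public'], 1000),
    'theta': rng.normal(0, 1, 1000),  # sharer ideology estimate
    'domain': rng.choice(['nytimes.com', 'foxnews.com'], 1000),
})
# bucket sharers by ideology, then compare each group's share mix per bucket
shares['bucket'] = pd.cut(
    shares['theta'], bins=[-np.inf, -1, 0, 1, np.inf],
    labels=['far left', 'center left', 'center right', 'far right'])
rates = (shares.groupby(['group', 'bucket'], observed=True)['domain']
               .value_counts(normalize=True)
               .unstack(level=-1))
print(rates)
```

Each row is a (group, ideology bucket) pair and each cell is the fraction of that bucket's shares going to a domain; rows where `congress` and `public` diverge would be the interesting ones.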
scatter3d = go.Scatter3d(
    x=joined_ideo_est['ideo_by_congress_tweet'],
    y=joined_ideo_est['ideo_by_mean_audience_ideo'],
    z=joined_ideo_est['ideo_by_retweet'],
    mode='markers',
    text=joined_ideo_est.index,
    marker=dict(
        size=6
    )
)
layout3d = go.Layout(
    title='Comparison of Ideology Score Metrics',
)
plotly.iplot(go.Figure(data=[scatter3d], layout=layout3d))
For reference, here's some R output from running Barberá's ideal-point estimation code mentioned at the top, showing where individual elite accounts land on the first dimension (phi1):
> # who is on the extremes
> head(users[order(users$phi1),])
twitter name gender party phi1 phi2 phi3
547 senkamalaharris Kamala D. Harris F Democrat -0.9964626 -0.4306486 -1.61173590
308 repjoekennedy Joseph P. Kennedy III M Democrat -0.9884050 0.2418400 -0.21613413
363 repmaxinewaters Maxine Waters F Democrat -0.9550397 -0.4858503 -1.80167847
160 repadamschiff Adam B. Schiff M Democrat -0.9473548 -0.1116719 -1.42363724
315 repjohnlewis John Lewis M Democrat -0.9416533 -0.4777571 -1.64266921
535 senfeinstein Dianne Feinstein F Democrat -0.9304100 0.3827524 0.00955195
> tail(users[order(users$phi1),])
twitter name gender party phi1 phi2 phi3
601 warrendavidson Warren Davidson M Republican 2.317007 0.6290465 -0.56455723
102 judgecarter John R. Carter M Republican 2.317354 -1.0996292 0.02720010
326 repkenmarchant Kenny Marchant M Republican 2.319707 -1.1509091 0.17487176
182 repbillflores Bill Flores M Republican 2.322245 -1.1054425 0.06811868
348 replouiegohmert Louie Gohmert M Republican 2.390454 2.6556838 -1.65378962
362 repmattgaetz Matt Gaetz M Republican 2.455506 3.3088650 -2.02016381
> # primary candidates
> users <- users[order(users$phi1),]
> users[users$type=="Primary Candidate",c("screen_name", "phi1")]
screen_name phi1
11 BernieSanders -0.6135061
79 HillaryClinton -0.5506683
121 MartinOMalley -0.4106683
157 realDonaldTrump 0.1744603
111 LincolnChafee 0.1777569
154 RandPaul 0.2044847
101 JohnKasich 0.3302630
115 marcorubio 0.3618459
96 JimWebbUSA 0.6485842
89 JebBush 0.6557951
67 GovChristie 0.8206196
71 GovMikeHuckabee 0.8545968
73 GrahamBlog 1.0938885
580 tedcruz 1.1070568
25 CarlyFiorina 1.1892254
68 GovernorPataki 1.2859452
156 RealBenCarson 1.3563719
70 gov_gilmore 1.4005577
479 RickSantorum 1.5283071
491 ScottWalker 1.6191919
69 GovernorPerry 1.6686204
17 BobbyJindal 1.7090644
>
> # screen_name phi1
> # 548 SenSanders -0.92013210
> # 129 MartinOMalley -0.90287394
> # 117 LincolnChafee -0.73712461
> # 83 HillaryClinton -0.60077731
> # 102 JimWebbUSA -0.04502052
> # 70 gov_gilmore 0.40719497
> # 71 GovChristie 0.59047439
> # 76 GrahamBlog 0.69157606
> # 108 JohnKasich 0.77414571
> # 95 JebBush 0.79967822
> # 72 GovernorPataki 0.82375241
> # 169 realDonaldTrump 0.88911528
> # 472 RickSantorum 1.16065941
> # 123 marcorubio 1.23982284
> # 25 CarlyFiorina 1.24576870
> # 74 GovMikeHuckabee 1.29905012
> # 17 BobbyJindal 1.32647736
> # 73 GovernorPerry 1.34255394
> # 484 ScottWalker 1.43519830
> # 164 RandPaul 1.44800182
> # 569 tedcruz 1.68322519
> # 168 RealBenCarson 1.70647329
>
> # others
> users[users$type=="Media Outlets",c("screen_name", "phi1")]
screen_name phi1
573 StephenAtHome -0.69375065
130 MotherJones -0.66313351
583 TheDailyShow -0.65338527
44 dailykos -0.59726026
585 thinkprogress -0.58887694
131 MSNBC -0.48343146
56 edshow -0.47690322
570 Slate -0.43370275
136 NewYorker -0.41769496
140 nprnews -0.40501285
143 nytimes -0.33360566
4 ajam -0.31918381
83 HuffPostPol -0.31757079
76 GuardianUS -0.30789882
134 NewsHour -0.30566222
602 washingtonpost -0.27665626
33 CNN -0.26960885
1 ABC -0.20768687
9 BBCWorld -0.20749533
133 NBCNews -0.20449353
151 politico -0.18948322
28 CBSNews -0.05804963
591 USATODAY 0.01182303
55 EconUS 0.07932249
23 BuzzFeedPol 0.07964434
604 WSJ 0.09065961
605 YahooNews 0.25306298
15 Bloomberg 0.38091176
58 FoxNews 0.83977682
54 DRUDGE_REPORT 1.49262192
21 BreitbartNews 1.60689951
582 theblaze 1.69644715
487 rushlimbaugh 2.01593693
> users[users$type=="Journalists",c("screen_name", "phi1")]
screen_name phi1
114 maddow -0.6821680
126 MHarrisPerry -0.6371607
6 andersoncooper -0.4633423
75 GStephanopoulos -0.2108086
125 megynkelly 0.8412498
492 seanhannity 0.8799218
7 AnnCoulter 1.1974140
64 glennbeck 1.4937796
145 oreillyfactor 1.7845476
110 limbaugh 2.0388439
> users[users$type=="Other Politicians",c("screen_name", "phi1")]
screen_name phi1
98 JoeBiden -0.7130537
5 algore -0.5471131
584 TheDemocrats -0.4957447
13 BillClinton -0.4820982
495 SenateDems -0.4268991
80 HouseDemocrats -0.3663539
48 dccc -0.3168939
152 POTUS 0.2617762
60 GeorgeHWBush 0.5469928
65 GOP 0.9547741
81 HouseGOP 1.0075366
135 newtgingrich 1.2309574
490 SarahPalinUSA 1.2785546
105 KarlRove 1.4705073