Media Source Ideology Estimates from Party Followers

One problem with the user ideology estimates is that, once we derive media source estimates from them, it is hard to know where to split the distribution into quintiles. To remedy that, Rob suggested looking at users who follow a lopsided ratio of Rs to Ds, which creates a two-pole system. For example, all users whose followed lawmakers are >=75% Democrats make up our blue pole. For any given media source, we then look at what percentage of each pole shared it.
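
As a rough illustration of the two-pole idea, the sketch below builds both poles from toy follow counts and computes the share of each pole that shared one source. The user IDs, the source name `example-news.com`, the helper `pole_shares`, and the mirrored red-pole cutoff (<=25% Democrats) are assumptions made for the example, not part of the actual pipeline.

```python
import pandas as pd

# Hypothetical toy data: how many Democratic and Republican lawmakers each user follows.
follows = pd.DataFrame(
    {"Democrat": [9, 8, 1, 0, 3], "Republican": [1, 0, 9, 5, 3]},
    index=["u1", "u2", "u3", "u4", "u5"],
)
POLE_THRESHOLD = 0.75  # illustrative cutoff, taken from the example in the text

# Fraction of each user's followed lawmakers who are Democrats.
d_share = follows["Democrat"] / follows.sum(axis=1)
blue_pole = set(d_share[d_share >= POLE_THRESHOLD].index)
red_pole = set(d_share[d_share <= 1 - POLE_THRESHOLD].index)  # assumed symmetric cutoff

# Hypothetical sharing data: which users shared each media source.
sharers = {"example-news.com": {"u1", "u3", "u4", "u5"}}

def pole_shares(source):
    """Return the fraction of each pole that shared the given source."""
    users = sharers[source]
    return {
        "blue": len(users & blue_pole) / len(blue_pole),
        "red": len(users & red_pole) / len(red_pole),
    }

shares = pole_shares("example-news.com")
print(shares)
```

Note that `u5`, who follows both parties equally, belongs to neither pole and so does not affect either share.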

Potential Issues

This doesn't seem great because it throws away a lot of data about people in the middle. If the problem is where to draw lines in the distribution, going back this far and discarding data about the center doesn't seem like a good approach. Some media sources could be shared almost entirely by people on the center-right (like the Spanish-language news sites). Those users can be center-right in two ways: either they exclusively follow the few center-right politicians that exist, or they follow a mix of left and right in something like a 1:2 ratio. If they follow only center-right politicians, they land in the R camp along with the alt-right and everyone else. If they follow a mix, they might fall short of our threshold and get excluded entirely.
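
To make the concern concrete, here is a minimal sketch of pole assignment, assuming the 75% cutoff from the example above is applied symmetrically. `assign_pole` is a hypothetical helper and the follow counts are invented.

```python
def assign_pole(n_dem, n_rep, threshold=0.75):
    """Return 'D', 'R', or None for a user following n_dem Democrats and n_rep Republicans."""
    d_share = n_dem / (n_dem + n_rep)
    if d_share >= threshold:
        return "D"
    if d_share <= 1 - threshold:
        return "R"
    return None  # between the poles: dropped from the analysis

# A center-right user who follows a 1:2 mix of left and right.
mixed_follower = assign_pole(n_dem=2, n_rep=4)
# A center-right user who follows only (moderate) Republican politicians.
moderate_only_follower = assign_pole(n_dem=0, n_rep=3)
# A hard-right user who follows only Republican politicians.
hard_right_follower = assign_pole(n_dem=0, n_rep=12)

print(mixed_follower, moderate_only_follower, hard_right_follower)
```

Under these assumptions the mixed follower is excluded, while the moderate-only follower is indistinguishable from the hard-right one.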

In [1]:
%matplotlib inline
import gzip, pickle, collections, itertools, random, os, glob, re
import pandas as pd
import plotly.offline as plotly
import plotly.graph_objs as go
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbs
import scipy.io, tqdm

plotly.init_notebook_mode()

Generating Convenient Data Structures

This section just gets all the data into an easy-to-use format. Skip down to the next section for results.

In [ ]:
data_filename = 'data/follower_to_elites_dict.pkl'
if os.path.isfile(data_filename):
    follower_to_elites = pd.read_pickle(data_filename)
else:
    follower_to_elites = collections.defaultdict(list)
    for follower_list_file in tqdm.tqdm(glob.glob('data/follower_lists_20180707/*.txt'), smoothing=0):
        elite_acct = re.match(r'data/follower_lists_20180707/(.*)\.txt', follower_list_file).groups()[0]
        with open(follower_list_file) as f:
            for line in f:
                follower = int(line.strip())
                follower_to_elites[follower].append(elite_acct)
    pd.to_pickle(follower_to_elites, data_filename)
In [ ]:
followers = list(follower_to_elites.keys())
print(len(followers))
for acct in tqdm.tqdm(followers):
    if len(follower_to_elites[acct]) < 3:
        del follower_to_elites[acct]
print(len(follower_to_elites))
In [ ]:
elite_accts = []
for follower_list_file in tqdm.tqdm(glob.glob('data/follower_lists_20180707/*.txt'), smoothing=0):
    elite_accts.append(re.match(r'data/follower_lists_20180707/(.*)\.txt', follower_list_file).groups()[0])
In [ ]:
data_filename = 'data/elite_to_followers_dict.pkl'
if os.path.isfile(data_filename):
    elite_to_followers = pd.read_pickle(data_filename)
else:
    elite_to_followers = collections.defaultdict(list)
    for follower_list_file in tqdm.tqdm(glob.glob('data/follower_lists_20180707/*.txt'), smoothing=0):
        elite_acct = re.match(r'data/follower_lists_20180707/(.*)\.txt', follower_list_file).groups()[0]
        with open(follower_list_file) as f:
            for line in f:
                elite_to_followers[elite_acct].append(int(line.strip()))
    pd.to_pickle(elite_to_followers, data_filename)
In [ ]:
pol_data = pd.read_csv('data/politician_data-20180705.csv', index_col=1)
pol_data.sample(5)
In [ ]:
pol_data_by_sn = pol_data.set_index('screen_name')
elite_acct_to_party = {acct: pol_data_by_sn.at[acct, 'party'] for acct in elite_accts if acct in pol_data_by_sn.index}
In [ ]:
follower_to_party_counts = {}
for acct, elites in tqdm.tqdm(follower_to_elites.items()):
    follower_to_party_counts[acct] = collections.Counter(map(lambda a: elite_acct_to_party.get(a, None), elites))
In [ ]:
pd.to_pickle(follower_to_party_counts, 'data/follower_to_elite_party_counts.pkl')
In [ ]:
follower_to_party_counts = pd.read_pickle('data/follower_to_elite_party_counts.pkl')
parties = ['Democrat', 'Independent', None, 'Republican']
In [ ]:
parties = set()
for party_counts in follower_to_party_counts.values():
    parties.update(party_counts.keys())
In [ ]:
parties
In [ ]:
top = list(follower_to_party_counts.keys())[0:10]
In [ ]:
df = pd.DataFrame.from_dict(follower_to_party_counts, orient='index')
In [ ]:
df
In [ ]:
df.to_pickle('data/follower_to_elite_party_counts_df.pkl')
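
For readers skipping this section, the following toy sketch shows the shape of the data it produces: follower lists per elite account are inverted into a follower-to-elites mapping, then tallied by party into a DataFrame. The account names and IDs are invented, and the sketch labels elites with no party on record as `"Unknown"` (the notebook itself uses `None`) purely to keep the column labels simple.

```python
import collections
import pandas as pd

# Hypothetical inputs: elite account -> follower IDs, and elite account -> party.
elite_to_followers = {
    "SenatorA": [101, 102, 103],
    "RepB": [101, 103],
    "GovC": [102, 103],
}
elite_acct_to_party = {"SenatorA": "Democrat", "RepB": "Republican"}  # GovC has no party on record

# Invert the mapping: follower ID -> list of elites followed.
follower_to_elites = collections.defaultdict(list)
for elite, followers in elite_to_followers.items():
    for follower in followers:
        follower_to_elites[follower].append(elite)

# Tally the parties of the elites each follower follows.
follower_to_party_counts = {
    follower: collections.Counter(elite_acct_to_party.get(e, "Unknown") for e in elites)
    for follower, elites in follower_to_elites.items()
}

# One row per follower, one column per party; missing tallies become 0.
df = pd.DataFrame.from_dict(follower_to_party_counts, orient="index").fillna(0)
print(df)
```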

Looking at the Data

Here I'm going to look at the party loyalists and how those loyalists can be used to compute ideologies for sites.

Sample of account data

In [5]:
accts = pd.read_pickle('data/follower_to_elite_party_counts_df.pkl').fillna(0)
accts.sample(5)
Out[5]:
                    Democrat  None  Republican  Independent
2936463781               0.0   6.0         0.0          0.0
3103601237               0.0  13.0         0.0          0.0
40162979                 0.0  12.0         7.0          0.0
3184901724               0.0   3.0         0.0          0.0
852554777640685568       0.0   3.0         0.0          0.0
In [6]:
accts = accts.drop([None, 'Independent'], axis=1)

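# Total number of partisan (Democrat + Republican) lawmakers each account follows.
accts['Sum'] = accts.sum(axis=1)
MIN_FOLLOWED = 3
# .copy() avoids pandas' SettingWithCopyWarning when columns are added below.
accts = accts[accts['Sum'] >= MIN_FOLLOWED].copy()

# Fraction of followed partisan lawmakers who are Democrats (0 = all R, 1 = all D).
accts['D/Sum'] = accts['Democrat'] / accts['Sum']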
accts.sample(10)
Out[6]:
                    Democrat  Republican  Sum  D/Sum
2718593657               0.0         3.0  3.0   0.00
889550761759322121       1.0         3.0  4.0   0.25
321462479                6.0         2.0  8.0   0.75
830173915586625537       0.0         3.0  3.0   0.00
863052921079988228       0.0         4.0  4.0   0.00
836461035041275904       0.0         3.0  3.0   0.00
2683590737               0.0         3.0  3.0   0.00
865682138                0.0         5.0  5.0   0.00
97423028                 0.0         3.0  3.0   0.00
838078731151368192       1.0         4.0  5.0   0.20
In [7]:
_ = accts['D/Sum'].plot.hist(bins=30)
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8113ed7f60>
In [8]:
R_LOYALIST_THRESHOLD = 0.2
D_LOYALIST_THRESHOLD = 0.8
accts['R_Follower'] = (accts['D/Sum'] <= R_LOYALIST_THRESHOLD)
accts['D_Follower'] = (accts['D/Sum'] >= D_LOYALIST_THRESHOLD)
accts.loc[accts['R_Follower'], 'Party_Followed'] = 'R'
accts.loc[accts['D_Follower'], 'Party_Followed'] = 'D'
accts.sample(10)
Out[8]:
                    Democrat  Republican   Sum     D/Sum  R_Follower  D_Follower  Party_Followed
932233383672995840       1.0         2.0   3.0  0.333333       False       False             NaN
906641143361925120       0.0         4.0   4.0  0.000000        True       False               R
504659844                0.0         3.0   3.0  0.000000        True       False               R
754697492193763330       0.0         4.0   4.0  0.000000        True       False               R
71590701                 6.0         7.0  13.0  0.461538       False       False             NaN
1521886098               1.0         4.0   5.0  0.200000        True       False               R
973970361065836547       2.0         1.0   3.0  0.666667       False       False             NaN
984117029509763074       3.0         1.0   4.0  0.750000       False       False             NaN
17962608                 3.0         0.0   3.0  1.000000       False        True               D
16089748                 3.0         0.0   3.0  1.000000       False        True               D
In [10]:
accts['Party_Followed'].value_counts()
Out[10]:
R    4984966
D    1093470
Name: Party_Followed, dtype: int64

Observations

  • There are about 4.6 times as many accounts that ~exclusively follow Republican lawmakers as accounts that ~exclusively follow Democratic lawmakers (4,984,966 vs. 1,093,470).
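
Because the poles differ so much in size, raw share counts and pole-normalized shares can tell different stories about the same site. The sketch below works through one row of the results table further down (breitbart.com, R=4973, D=1074) using the pole sizes from Out[10]; the function names are illustrative, not part of the notebook.

```python
# Pole sizes reported by value_counts() in Out[10].
N_R, N_D = 4984966, 1093470

def d_share_raw(r, d):
    """Democratic-pole share of a site's sharers, using raw counts."""
    return d / (r + d)

def d_share_normalized(r, d):
    """Same ratio after dividing each count by the size of its pole."""
    r_frac, d_frac = r / N_R, d / N_D
    return d_frac / (r_frac + d_frac)

# breitbart.com: shared by 4973 R-pole accounts and 1074 D-pole accounts.
raw = d_share_raw(4973, 1074)
normalized = d_share_normalized(4973, 1074)
print(round(raw, 6), round(normalized, 6))
```

By raw counts the site sits at about 0.18, while after normalization it sits near 0.50, because a similar fraction of each pole shared it.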
In [11]:
subdomain_to_users = pd.read_pickle('data/all_samples_combined/subdomain_to_users.pkl')
In [13]:
if os.path.isfile('data/subdomain_to_party_followers.pkl'):
    subdomain_to_party_followers = pd.read_pickle('data/subdomain_to_party_followers.pkl')
else:
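    # Tally each subdomain's sharers by the pole they belong to ('R' or 'D');
    # sharers in accts who are in neither pole are tallied under NaN.
    subdomain_to_party_followers = {
        s: collections.Counter(accts.at[sharer, 'Party_Followed'] for sharer in sharers if sharer in accts.index)
        for s, sharers in subdomain_to_users.items()}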
    pd.to_pickle(subdomain_to_party_followers, 'data/subdomain_to_party_followers.pkl')
In [14]:
MIN_USERS = 5
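# Keep subdomains shared by more than MIN_USERS accounts. len() of a Counter only
# counts distinct keys, so sum the tallies instead.
subdomains_with_data = [s for s, u in subdomain_to_party_followers.items() if sum(u.values()) > MIN_USERS]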
len(subdomains_with_data)
Out[14]:
30378
In [23]:
site_ideos = pd.DataFrame.from_dict({k:
                              {p:subdomain_to_party_followers[k][p] for p in ['R', 'D']} 
                              for k in subdomains_with_data}, orient='index')
site_ideos.sample(5)
Out[23]:
                                        R   D
yeswenative.com                         0  15
factcheck.afp.com                       2  15
nbcnewsdigitaljobs.com                  0   4
eurointelligence.com                    7   5
europeslamsitsgates.foreignpolicy.com   3   9
In [24]:
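# Divide each count by the size of its pole, so R_pct and D_pct hold the fraction
# (not the percentage) of each pole that shared the site.
site_ideos = site_ideos.join(site_ideos / accts['Party_Followed'].value_counts(), rsuffix='_pct')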

site_ideos['R+D'] = site_ideos['R'] + site_ideos['D']
site_ideos['R_pct+D_pct'] = site_ideos['R_pct'] + site_ideos['D_pct']

site_ideos['D/R+D'] = site_ideos['D'] / site_ideos['R+D']
site_ideos['D_pct/R_pct+D_pct'] = site_ideos['D_pct'] / site_ideos['R_pct+D_pct']

_ = site_ideos['D/R+D'].plot.hist(bins=300)
In [18]:
_ = site_ideos['D_pct/R_pct+D_pct'].plot.hist(bins=300)
In [25]:
BUCKET_BREAKS = [0, 0.2, 0.4, 0.6, 0.8, 1]
BUCKET_LABELS = ['right', 'center-right', 'center', 'center-left', 'left']
site_ideos['ideo_group_by_count'] = pd.cut(site_ideos['D/R+D'], BUCKET_BREAKS, labels=BUCKET_LABELS)
site_ideos['ideo_group_by_pct'] = pd.cut(site_ideos['D_pct/R_pct+D_pct'], BUCKET_BREAKS, labels=BUCKET_LABELS)
site_ideos.sample(5)
Out[25]:
                       R   D         R_pct     D_pct  R+D  R_pct+D_pct     D/R+D  D_pct/R_pct+D_pct  ideo_group_by_count  ideo_group_by_pct
ukranews.com           5   2  1.003016e-06  0.000002    7     0.000003  0.285714           0.645835  center-right         center-left
theviralpatriots.com  90   0  1.805429e-05  0.000000   90     0.000018  0.000000           0.000000  NaN                  NaN
datpiff.com            7  11  1.404222e-06  0.000010   18     0.000011  0.611111           0.877510  center-left          left
columbusceo.com        3   2  6.018095e-07  0.000002    5     0.000002  0.400000           0.752428  center-right         center-left
gearjunkie.com        13  27  2.607841e-06  0.000025   40     0.000027  0.675000           0.904474  center-left          left
In [33]:
ax = site_ideos['ideo_group_by_count'].value_counts().reindex(reversed(BUCKET_LABELS))\
    .plot.bar(color=['#0d3b6e', '#869db6', '#2a7526', '#d8919e', '#b1243e'])
_ = ax.set_title('Site Ideology by Party Follower Shares (Raw Counts)')
In [35]:
ax = site_ideos['ideo_group_by_pct'].value_counts().reindex(reversed(BUCKET_LABELS))\
    .plot.bar(color=['#0d3b6e', '#869db6', '#2a7526', '#d8919e', '#b1243e'])
_ = ax.set_title('Site Ideology by Party Follower Shares (%)')
In [36]:
site_ideos.groupby('ideo_group_by_count').apply(lambda g: g.sort_values('R+D', ascending=False).head(10))
Out[36]:
ideo_group_by_count                              R      D     R_pct     D_pct    R+D  R_pct+D_pct     D/R+D  D_pct/R_pct+D_pct  ideo_group_by_count  ideo_group_by_pct
right                breitbart.com            4973   1074  0.000998  0.000982   6047     0.001980  0.177609           0.496109  right                center
                     dailycaller.com          4637   1151  0.000930  0.001053   5788     0.001983  0.198860           0.530869  right                center
                     insider.foxnews.com      4194    929  0.000841  0.000850   5123     0.001691  0.181339           0.502442  right                center
                     dailywire.com            4071    344  0.000817  0.000315   4415     0.001131  0.077916           0.278095  right                center-right
                     thegatewaypundit.com     4066    256  0.000816  0.000234   4322     0.001050  0.059232           0.223018  right                center-right
                     thefederalist.com        3552    570  0.000713  0.000521   4122     0.001234  0.138282           0.422490  right                center
                     townhall.com             3479    413  0.000698  0.000378   3892     0.001076  0.106115           0.351151  right                center-right
                     zerohedge.com            3197    436  0.000641  0.000399   3633     0.001040  0.120011           0.383373  right                center-right
                     judicialwatch.org        3415    173  0.000685  0.000158   3588     0.000843  0.048216           0.187617  right                right
                     truepundit.com           3293    194  0.000661  0.000177   3487     0.000838  0.055635           0.211714  right                center-right
center-right         foxnews.com              6353   2469  0.001274  0.002258   8822     0.003532  0.279869           0.639215  center-right         center-left
                     dailymail.co.uk          4205   2785  0.000844  0.002547   6990     0.003390  0.398426           0.751204  center-right         center-left
                     washingtonexaminer.com   4407   2292  0.000884  0.002096   6699     0.002980  0.342141           0.703350  center-right         center-left
                     whitehouse.gov           3533   1925  0.000709  0.001760   5458     0.002469  0.352693           0.712969  center-right         center-left
                     nationalreview.com       3662   1495  0.000735  0.001367   5157     0.002102  0.289897           0.650489  center-right         center-left
                     washingtontimes.com      3783   1357  0.000759  0.001241   5140     0.002000  0.264008           0.620537  center-right         center-left
                     ijr.com                  2641    897  0.000530  0.000820   3538     0.001350  0.253533           0.607595  center-right         center-left
                     realclearpolitics.com    2352   1092  0.000472  0.000999   3444     0.001470  0.317073           0.679138  center-right         center-left
                     weeklystandard.com       1931   1138  0.000387  0.001041   3069     0.001428  0.370805           0.728753  center-right         center-left
                     rt.com                   1997    925  0.000401  0.000846   2922     0.001247  0.316564           0.678626  center-right         center-left
center               twitter.com             12780  10467  0.002564  0.009572  23247     0.012136  0.450252           0.788752  center               center-left
                     youtube.com              9181   7579  0.001842  0.006931  16760     0.008773  0.452208           0.790065  center               center-left
                     facebook.com             7082   6717  0.001421  0.006143  13799     0.007564  0.486774           0.812167  center               left
                     nytimes.com              5457   7754  0.001095  0.007091  13211     0.008186  0.586935           0.866271  center               left
                     washingtonpost.com       5233   7441  0.001050  0.006805  12674     0.007855  0.587107           0.866353  center               left
                     cnn.com                  5078   6837  0.001019  0.006253  11915     0.007271  0.573815           0.859905  center               left
                     instagram.com            5770   5436  0.001157  0.004971  11206     0.006129  0.485097           0.811141  center               left
                     thehill.com              4925   6058  0.000988  0.005540  10983     0.006528  0.551580           0.848660  center               left
                     politico.com             4021   5580  0.000807  0.005103   9601     0.005910  0.581189           0.863507  center               left
                     pscp.tv                  5267   4223  0.001057  0.003862   9490     0.004919  0.444995           0.785187  center               center-left
center-left          nbcnews.com              3310   5430  0.000664  0.004966   8740     0.005630  0.621281           0.882058  center-left          left
                     huffingtonpost.com       2475   5799  0.000496  0.005303   8274     0.005800  0.700870           0.914395  center-left          left
                     theguardian.com          3042   5162  0.000610  0.004721   8204     0.005331  0.629205           0.885531  center-left          left
                     npr.org                  2491   5250  0.000500  0.004801   7741     0.005301  0.678207           0.905733  center-left          left
                     medium.com               2611   4607  0.000524  0.004213   7218     0.004737  0.638265           0.889428  center-left          left
                     latimes.com              2817   4399  0.000565  0.004023   7216     0.004588  0.609618           0.876833  center-left          left
                     theatlantic.com          2168   4742  0.000435  0.004337   6910     0.004772  0.686252           0.908854  center-left          left
                     time.com                 2360   4473  0.000473  0.004091   6833     0.004564  0.654617           0.896272  center-left          left
                     thedailybeast.com        2253   4458  0.000452  0.004077   6711     0.004529  0.664283           0.900205  center-left          left
                     newsweek.com             2373   4310  0.000476  0.003942   6683     0.004418  0.644920           0.892242  center-left          left
left                 thinkprogress.org         723   4423  0.000145  0.004045   5146     0.004190  0.859503           0.965385  left                 left
                     msnbc.com                 996   4103  0.000200  0.003752   5099     0.003952  0.804668           0.949444  left                 left
                     rawstory.com              944   4096  0.000189  0.003746   5040     0.003935  0.812698           0.951879  left                 left
                     motherjones.com           567   4267  0.000114  0.003902   4834     0.004016  0.882706           0.971678  left                 left
                     actblue.com               226   4286  0.000045  0.003920   4512     0.003965  0.949911           0.988566  left                 left
                     dailykos.com              507   3665  0.000102  0.003352   4172     0.003453  0.878476           0.970549  left                 left
                     thenation.com             723   3340  0.000145  0.003054   4063     0.003200  0.822053           0.954670  left                 left
                     aclu.org                  454   3196  0.000091  0.002923   3650     0.003014  0.875616           0.969782  left                 left
                     shareblue.com             148   3467  0.000030  0.003171   3615     0.003200  0.959059           0.990723  left                 left
                     propublica.org            459   2975  0.000092  0.002721   3434     0.002813  0.866337           0.967265  left                 left
In [37]:
site_ideos.groupby('ideo_group_by_pct').apply(lambda g: g.sort_values('R+D', ascending=False).head(10))
Out[37]:
ideo_group_by_pct                                R      D     R_pct     D_pct    R+D  R_pct+D_pct     D/R+D  D_pct/R_pct+D_pct  ideo_group_by_count  ideo_group_by_pct
right                judicialwatch.org        3415    173  0.000685  0.000158   3588     0.000843  0.048216           0.187617  right                right
                     westernjournal.com       3139    112  0.000630  0.000102   3251     0.000732  0.034451           0.139904  right                right
                     conservativereview.com   2856    135  0.000573  0.000123   2991     0.000696  0.045135           0.177288  right                right
                     hannity.com              2806     84  0.000563  0.000077   2890     0.000640  0.029066           0.120085  right                right
                     waynedupree.com          2521     63  0.000506  0.000058   2584     0.000563  0.024381           0.102274  right                right
                     twitchy.com              2425    101  0.000486  0.000092   2526     0.000579  0.039984           0.159575  right                right
                     lifezette.com            2384    120  0.000478  0.000110   2504     0.000588  0.047923           0.186643  right                right
                     bizpacreview.com         2358    103  0.000473  0.000094   2461     0.000567  0.041853           0.166066  right                right
                     saraacarter.com          2355     26  0.000472  0.000024   2381     0.000496  0.010920           0.047919  right                right
                     bluntforcetruth.com      2357     22  0.000473  0.000020   2379     0.000493  0.009248           0.040815  right                right
center-right         dailywire.com            4071    344  0.000817  0.000315   4415     0.001131  0.077916           0.278095  right                center-right
                     thegatewaypundit.com     4066    256  0.000816  0.000234   4322     0.001050  0.059232           0.223018  right                center-right
                     townhall.com             3479    413  0.000698  0.000378   3892     0.001076  0.106115           0.351151  right                center-right
                     zerohedge.com            3197    436  0.000641  0.000399   3633     0.001040  0.120011           0.383373  right                center-right
                     truepundit.com           3293    194  0.000661  0.000177   3487     0.000838  0.055635           0.211714  right                center-right
                     infowars.com             2700    165  0.000542  0.000151   2865     0.000693  0.057592           0.217892  right                center-right
                     dailysignal.com          2648    210  0.000531  0.000192   2858     0.000723  0.073478           0.265538  right                center-right
                     pjmedia.com              2475    148  0.000496  0.000135   2623     0.000632  0.056424           0.214213  right                center-right
                     cnsnews.com              2424    196  0.000486  0.000179   2620     0.000666  0.074809           0.269337  right                center-right
                     wnd.com                  2061    142  0.000413  0.000130   2203     0.000543  0.064458           0.239022  right                center-right
center               breitbart.com            4973   1074  0.000998  0.000982   6047     0.001980  0.177609           0.496109  right                center
                     dailycaller.com          4637   1151  0.000930  0.001053   5788     0.001983  0.198860           0.530869  right                center
                     insider.foxnews.com      4194    929  0.000841  0.000850   5123     0.001691  0.181339           0.502442  right                center
                     thefederalist.com        3552    570  0.000713  0.000521   4122     0.001234  0.138282           0.422490  right                center
                     freebeacon.com           3001    448  0.000602  0.000410   3449     0.001012  0.129893           0.404961  right                center
                     theblaze.com             2921    519  0.000586  0.000475   3440     0.001061  0.150872           0.447517  right                center
                     video.foxnews.com        2876    492  0.000577  0.000450   3368     0.001027  0.146081           0.438167  right                center
                     foxbusiness.com          2579    417  0.000517  0.000381   2996     0.000899  0.139186           0.424336  right                center
                     redstate.com             2081    560  0.000417  0.000512   2641     0.000930  0.212041           0.550924  center-right         center
                     newsmax.com              1966    607  0.000394  0.000555   2573     0.000949  0.235911           0.584638  center-right         center
center-left          twitter.com             12780  10467  0.002564  0.009572  23247     0.012136  0.450252           0.788752  center               center-left
                     youtube.com              9181   7579  0.001842  0.006931  16760     0.008773  0.452208           0.790065  center               center-left
                     pscp.tv                  5267   4223  0.001057  0.003862   9490     0.004919  0.444995           0.785187  center               center-left
                     foxnews.com              6353   2469  0.001274  0.002258   8822     0.003532  0.279869           0.639215  center-right         center-left
                     dailymail.co.uk          4205   2785  0.000844  0.002547   6990     0.003390  0.398426           0.751204  center-right         center-left
                     nypost.com               4152   2777  0.000833  0.002540   6929     0.003373  0.400779           0.753032  center               center-left
                     washingtonexaminer.com   4407   2292  0.000884  0.002096   6699     0.002980  0.342141           0.703350  center-right         center-left
                     whitehouse.gov           3533   1925  0.000709  0.001760   5458     0.002469  0.352693           0.712969  center-right         center-left
                     nationalreview.com       3662   1495  0.000735  0.001367   5157     0.002102  0.289897           0.650489  center-right         center-left
                     washingtontimes.com      3783   1357  0.000759  0.001241   5140     0.002000  0.264008           0.620537  center-right         center-left
left                 facebook.com             7082   6717  0.001421  0.006143  13799     0.007564  0.486774           0.812167  center               left
                     nytimes.com              5457   7754  0.001095  0.007091  13211     0.008186  0.586935           0.866271  center               left
                     washingtonpost.com       5233   7441  0.001050  0.006805  12674     0.007855  0.587107           0.866353  center               left
                     cnn.com                  5078   6837  0.001019  0.006253  11915     0.007271  0.573815           0.859905  center               left
                     instagram.com            5770   5436  0.001157  0.004971  11206     0.006129  0.485097           0.811141  center               left
                     thehill.com              4925   6058  0.000988  0.005540  10983     0.006528  0.551580           0.848660  center               left
                     politico.com             4021   5580  0.000807  0.005103   9601     0.005910  0.581189           0.863507  center               left
                     abcnews.go.com           4084   4985  0.000819  0.004559   9069     0.005378  0.549675           0.847668  center               left
                     nbcnews.com              3310   5430  0.000664  0.004966   8740     0.005630  0.621281           0.882058  center-left          left
                     wsj.com                  4291   4393  0.000861  0.004017   8684     0.004878  0.505873           0.823547  center               left
In [39]:
site_ideos.to_csv('data/media_source_ideo_with_party_pole_buckets.csv')
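
One detail of the bucketing step is worth checking. With its default settings, `pd.cut` builds intervals that are open on the left, so a ratio of exactly 0 falls outside the first bucket and comes back as NaN (as happens to theviralpatriots.com in Out[25]). The sketch below, on made-up ratios, shows the default behavior next to `include_lowest=True`, which is one possible way to keep such sites in the 'right' bucket; whether that is the desired treatment is a judgment call.

```python
import pandas as pd

BUCKET_BREAKS = [0, 0.2, 0.4, 0.6, 0.8, 1]
BUCKET_LABELS = ['right', 'center-right', 'center', 'center-left', 'left']

# Illustrative ratios, including values that sit exactly on bucket edges.
ratios = pd.Series([0.0, 0.2, 0.4, 0.400779, 1.0])

# Default: intervals are (0, 0.2], (0.2, 0.4], ..., so 0.0 is not in any bucket.
default_cut = pd.cut(ratios, BUCKET_BREAKS, labels=BUCKET_LABELS)

# include_lowest=True makes the first interval include its left edge.
inclusive_cut = pd.cut(ratios, BUCKET_BREAKS, labels=BUCKET_LABELS, include_lowest=True)

print(pd.DataFrame({'ratio': ratios, 'default': default_cut, 'inclusive': inclusive_cut}))
```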