In [1]:
%matplotlib inline
import gzip, pickle, collections

import umap
import pandas as pd
import plotly.offline as plotly
import plotly.graph_objs as go
import numpy as np
import matplotlib.pyplot as plt

plotly.init_notebook_mode()

Media Source Ideology from Twitter Audience Ideology

Updated Oct 29, 2018

Here is a cleaned-up version of media source ideology estimation from Barbera (2015). Specifically, I have made the following changes:

  • User ideology estimates are based solely on Congress and Executive branch Twitter accounts.
  • We're now using a larger sample of Twitter users - 30,000 - which gives us estimates for a larger number of sites.
  • We're now only considering domain sharing that has occurred since Jan 1, 2017.
  • We're considering URLs that were shared as part of retweets separately from those shared as original tweets.

I think we should be calling this an estimate of a media source's "content alignment" to a set of users rather than the source's estimated ideology. From Bakshy (2015): "Alignment is not a measure of media slant; rather, it captures differences in the kind of content shared among a set of partisans, which can include topic matter, framing, and slant."

Here's a breakdown of our dataset:

Users w/ est. ideology:   2,926,841
Users in sample:             30,000
Total tweets:            45,316,914
Tweets w/ URLs:          19,563,609
Retweets w/ URLs:        13,008,418
Original tweets w/ URLs:  6,555,488
Users w/ (re)tweets w/ URLs: 17,171

Summary of Results

The results look pretty similar to last time and make a good deal of sense. We could still benefit from more data. I clustered sites based on the distributions of their users' ideologies, and the result looks interesting and useful. The data can be downloaded here.

Data

Let's see what the new ideology estimates look like.

In [2]:
est_ideo = pd.read_csv('../data/ideology-estimates-20180705.csv.gz', index_col=0)
# z-score theta and flip the sign so that larger normed_theta values lean further right
est_ideo['normed_theta'] = (est_ideo['theta'] - est_ideo['theta'].mean()) / est_ideo['theta'].std() * -1
display(est_ideo.describe())
_ = est_ideo['normed_theta'].plot.hist(bins=30)
/home/jclark/miniconda3/envs/mediacloud/lib/python3.6/site-packages/numpy/lib/arraysetops.py:522: FutureWarning:

elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison

theta normed_theta
count 2.926841e+06 2.926841e+06
mean 4.750500e-02 9.850546e-17
std 1.059081e+00 1.000000e+00
min -3.611931e+00 -1.379603e+00
25% -6.338784e-01 -7.715532e-01
50% 4.506754e-01 -3.806796e-01
75% 8.646420e-01 6.433726e-01
max 1.508615e+00 3.455295e+00

Observations

  • The left is a tightly clustered chunk, while the right is spread over a much larger ideology range.
  • The small number of far-right users is a consequence of the right's wide variance - the "far right" starts a lot further out than the "far left" would. Does trying to make the ideology range symmetrical make any sense here?

Let's get the rest of the data together.

In [3]:
datasets = {}
with open('../all_sub.pkl', 'rb') as f:
    datasets['subdomain_to_users_all'] = pickle.load(f)
with open('../orig_sub.pkl', 'rb') as f:
    datasets['subdomain_to_users_orig'] = pickle.load(f)
with open('../rt_sub.pkl', 'rb') as f:
    datasets['subdomain_to_users_rt'] = pickle.load(f)
with open('../all_reg.pkl', 'rb') as f:
    datasets['domain_to_users_all'] = pickle.load(f)
with open('../orig_reg.pkl', 'rb') as f:
    datasets['domain_to_users_orig'] = pickle.load(f)
with open('../rt_reg.pkl', 'rb') as f:
    datasets['domain_to_users_rt'] = pickle.load(f)
In [4]:
def create_dataframe(domain_to_users):
    MIN_USERS = 30
    domain_to_ideo = collections.defaultdict(list)
    for domain, uids in domain_to_users.items():
        for uid in uids:
            try:
                user_ideo = est_ideo.at[uid, 'normed_theta']
            except KeyError:
                continue
            domain_to_ideo[domain].append(user_ideo)

    dom_ideo = pd.DataFrame.from_dict({k:v for k,v in domain_to_ideo.items() if len(v) >= MIN_USERS}, orient='index').T
    dom_ideo_means = pd.DataFrame(dom_ideo.mean(), columns=['mean_ideology'])
    domain_to_num_users = pd.DataFrame.from_dict(
        {d: len(u) for d, u in domain_to_users.items()}, 
        orient='index', columns=['num_uniq_users'])
    
    BINS = 5
    RANGE = (dom_ideo.min().min(), dom_ideo.max().max())

    dom_ideo_hist = pd.DataFrame.from_dict(
        {d: np.histogram(ideos.values, bins=BINS, range=RANGE, density=False)[0]
         for d, ideos in dom_ideo.T.iterrows()}, orient='index')
    dom_ideo_hist.columns = [
        'users_left', 'users_center_left', 'users_center', 'users_center_right', 'users_right']
    dom_ideo_normed_hist = pd.DataFrame.from_dict(
        {d: np.histogram(ideos.values, bins=BINS, range=RANGE, density=True)[0]
         for d, ideos in dom_ideo.T.iterrows()}, orient='index')
    dom_ideo_normed_hist.columns = [
        'normed_left', 'normed_center_left', 'normed_center', 'normed_center_right', 'normed_right']
    
    BINS = 20
    dom_ideo_hist_big = pd.DataFrame.from_dict(
        {d: np.histogram(ideos.values, bins=BINS, range=RANGE, density=False)[0]
         for d, ideos in dom_ideo.T.iterrows()}, orient='index')
    dom_ideo_hist_big.columns = [f'bin_{i+1}' for i in range(BINS)]
    dom_ideo_normed_hist_big = pd.DataFrame.from_dict(
        {d: np.histogram(ideos.values, bins=BINS, range=RANGE, density=True)[0]
         for d, ideos in dom_ideo.T.iterrows()}, orient='index')
    dom_ideo_normed_hist_big.columns = [f'normed_bin_{i+1}' for i in range(BINS)]

    df = dom_ideo_means.join(domain_to_num_users)\
                       .join(dom_ideo_hist)\
                       .join(dom_ideo_normed_hist)\
                       .join(dom_ideo_hist_big)\
                       .join(dom_ideo_normed_hist_big)\
                       .sort_values(by='num_uniq_users', ascending=False)
    return df
In [5]:
dataframes = {}
for name, dataset in datasets.items():
    dataframes[name] = create_dataframe(dataset)
/home/jclark/miniconda3/envs/mediacloud/lib/python3.6/site-packages/numpy/lib/histograms.py:746: RuntimeWarning:

invalid value encountered in greater_equal

/home/jclark/miniconda3/envs/mediacloud/lib/python3.6/site-packages/numpy/lib/histograms.py:747: RuntimeWarning:

invalid value encountered in less_equal

In [6]:
for name, df in dataframes.items():
    df.to_csv(f'{name}_content_alignment_estimates_20180705.csv', index_label='domain')
    display(f'{name}: {df.shape[0]} sites')
'subdomain_to_users_all: 6569 sites'
'subdomain_to_users_orig: 1600 sites'
'subdomain_to_users_rt: 5736 sites'
'domain_to_users_all: 6020 sites'
'domain_to_users_orig: 1586 sites'
'domain_to_users_rt: 5273 sites'

Analysis

I'm going to do all the analysis on the dataset with the most sites: tweets and retweets grouped by subdomain.

In [7]:
site_ideo = dataframes['subdomain_to_users_all']
_ = site_ideo['mean_ideology'].plot.hist(bins=30)

Observations

  • Clearly bimodal
  • No center-right
  • Does not look quite the same as the ideology distribution.
In [8]:
news_media_domains = [
    'english.alarabiya.net', 'aljazeera.com', 'americanthinker.com', 'bbc.com',
    'bbc.co.uk', 'bloomberg.com', 'bostonglobe.com', 'breitbart.com',
    'buzzfeed.com', 'cbc.ca', 'cbsnews.com', 'chicagotribune.com', 'cnbc.com',
    'cnn.com', 'csmonitor.com', 'dailycaller.com', 'dailykos.com',
    'dailymail.co.uk', 'economist.com', 'forbes.com', 'foreignpolicy.com',
    'fortune.com', 'foxnews.com', 'haaretz.com', 'hindustantimes.com',
    'huffingtonpost.com', 'huffpost.com', 'independent.co.uk', 'infowars.com',
    'latimes.com', 'miamiherald.com', 'motherjones.com', 'msnbc.com',
    'nationalreview.com', 'nbcnews.com', 'newsweek.com', 'newyorker.com',
    'npr.org', 'nydailynews.com', 'nypost.com', 'nytimes.com', 'pbs.org',
    'politico.com', 'propublica.org', 'realclearpolitics.com','reuters.com',
    'rollcall.com', 'rt.com', 'salon.com', 'news.sky.com', 'slate.com',
    'sputniknews.com', 'theatlantic.com', 'theguardian.com', 'thehill.com',
    'time.com', 'usatoday.com', 'vox.com', 'washingtonpost.com',
    'washingtontimes.com', 'weeklystandard.com', 'westernjournal.com', 'wsj.com',
    'zerohedge.com',
]
non_news_domains = [
    'aclu.org', 'change.org', 'cosmopolitan.com', 'facebook.com', 'google.com',
    'harvard.edu', 'hbr.org', 'mit.edu', 'patreon.com', 'politifact.com',
    'reddit.com', 'reuters.com', 'twitter.com', 'wikileaks.org', 'youtube.com',
]
domains = news_media_domains + non_news_domains
_ = site_ideo.loc[news_media_domains,'mean_ideology'].sort_values(ascending=False).plot.barh(figsize=(10, 20), legend=False)

Observations

  • Looks pretty decent.
  • Looks pretty similar to last time.
  • HuffPo isn't as far left as I'd imagine.

So far, we've reduced the big set of numbers we have for each site (the ideology of each user who shared it) to a single number (the mean). I'd like to better characterize the audience of these sites. To do that, I'm going to create a histogrammed version of the distribution for each site. That way, each site has an equivalent vector representing its users' ideologies regardless of the number of users who shared it.

Once I have those histograms, I'm going to give them to a dimensionality reduction algorithm so I can stick the sites on a scatter plot.
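
The conversion can be sketched on toy data (the two site audiences below are made up): with a shared `range` and `density=True`, `np.histogram` returns an equal-length vector for every site, independent of how many users shared it.

```python
import numpy as np

# Made-up audiences: one small left-leaning site, one large right-leaning one.
rng = np.random.default_rng(0)
small_site = rng.normal(-1.0, 0.3, size=40)     # 40 users
large_site = rng.normal(0.8, 0.6, size=5000)    # 5,000 users

# A shared range plus density=True yields a fixed-length vector per site,
# regardless of audience size, so the vectors are directly comparable.
BINS, RANGE = 5, (-2.5, 2.5)
vec_small = np.histogram(small_site, bins=BINS, range=RANGE, density=True)[0]
vec_large = np.histogram(large_site, bins=BINS, range=RANGE, density=True)[0]
assert vec_small.shape == vec_large.shape == (5,)
```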

In [9]:
dom_ideo_normed_hist = site_ideo.loc[:,['normed_left', 'normed_center_left', 'normed_center', 'normed_center_right', 'normed_right']]
ideo_hist_embedded = umap.UMAP(random_state=1, n_neighbors=50, min_dist=0.05).fit_transform(dom_ideo_normed_hist)
In [10]:
scatter1 = go.Scattergl(
    y=ideo_hist_embedded[:, 0],
    x=ideo_hist_embedded[:, 1],
    mode='markers',
    text=dom_ideo_normed_hist.index,
    hoverinfo='text',
    marker=dict(
        color=site_ideo['mean_ideology'].loc[dom_ideo_normed_hist.index],
        cmin=-1, cmax=1,
        size=site_ideo['num_uniq_users'].loc[dom_ideo_normed_hist.index],
        sizemode='area',
        sizemin=2,
        sizeref=30
    )
)

layout1 = go.Layout(
    title ='Domain Similarity by Distribution of Audience Ideology',
    hovermode = 'closest',
    #xaxis = dict(visible = False),
    #yaxis = dict(visible = False),
    width = 800,
    height = 800
)
plotly.iplot(go.Figure(data=[scatter1], layout=layout1))

On this plot, each point represents a media source. Points that are close together have similar audience ideology distributions, with distance corresponding to dissimilarity. Media sources are colored by mean audience ideology and sized by the number of unique Twitter users who shared the domain. You can zoom in by clicking and dragging a box around the area you're interested in, and zoom back out by double-clicking. The actual coordinates are meaningless - they're just useful when referring to locations.

Observations

  • This is colored by the mean, and the mean is clearly important here - though it need not have been. Different distributions can produce roughly the same mean. We see a little of that with some color mixing near the middle.
  • We've got a red end and a blue end, and it tapers toward both ends. That makes sense as there are fewer ways for the distributions to look towards the far ends of the spectrum.
  • There's a red island, but not much of a blue island (though I've seen a smaller blue island appear with certain sets of parameters).
  • The fact that there's an island suggests there are ideology distributions that do not occur. We can think about it more generally: the more dispersed a cluster of the same color is, the more ways there are to exist as a site with a given mean audience ideology. The reverse holds true: the thinner an area is, the fewer sites there are that look like that (from an audience ideology perspective). This is yet another way of depicting the absence of a center right.
  • Most of the popular sites are near the middle.
  • There's a weird disconnected edge surrounding the far left, suggesting distributions that are similar to each other but different from the rest of the far left; I need to look into this.
  • There's a tiny red island off to the top right, and a tiny orange island tucked near the inside of the curve. Almost all of the sites in these islands are in Spanish. That suggests there's something about the audience ideology that's common within each cluster and distinct from the rest of the media. Does (some) Spanish media operate by a different set of rules? Are there two ecosystems? What is the behavior of the users that take part in both ecosystems?
  • There are big gaps throughout the center. Why?

I want to know what the actual ideology distributions look like for different parts of this map, so I'll pull out three sites each from a number of different sections.

In [11]:
sampled_domains = [
    'digbysblog.blogspot.com', 'lucyforcongress.com', 'curvemag.com', # blue tip near (-6.1, -11.0)
    'patagonia.com', 'thedailydemocracy.org', 'factsdomatter.com', # isolated blue edge (-6.3, -6.9)
    'mediamatters.org', 'theroot.com', 'rawstory.com', # popular blue sites near (-0.4, -5.0)
    'politico.com', 'apnews.com', 'businessinsider.com', # popular light blue sites near (4.1, -0.7)
    'foxnews.com', 'rt.com', 'washingtonexaminer.com', # popular gray sites near (4.0, 4.3)
    '20minutos.es', 'antena3.com', 'infobae.com', # small blue-gray island near (2.6, 3.6)
    'lapatilla.com', 'larepublica.pe', 'larazon.es', # small orange-gray island near (6.2, 7.2)
    'dailycaller.com', 'ijr.com', 'navy.mil', # right before red breaks off (3.4, 6.0)
    'canoe.com', 'nraila.org', 'gop.com', # right after red breaks off (2.5, 7.8)
    'terrencekwilliams.com', 'creepingsharia.wordpress.com', 'borderwallbricks.com', # middle of red island near (0.2, 11.0)
    'ignet.gov', 'readthememo.org', 'kennedyforutah.com', # red tip near (-2.2, 11.5)
]
_ = dom_ideo_normed_hist.loc[sampled_domains,:].T.plot(subplots=True, kind='bar', layout=(11,3), figsize=(16,20), sharex=True)

Each row in this grid shows three samples from a different part of the overall map. The subdomain label is above each graph, which is not obvious; looking at the legend in each graph helps.

Observations

  • The samples in each row look roughly the same, so this map is doing something useful.
  • Things look pretty much how I'd expect, except that the rightmost bucket never really fills in. This is a function of the binning: there are very few users on the far right, so when the range is broken into equal-width buckets, there are few users to fill that bucket.
  • The Spanish sites are the only ones with a very prominent center. The two islands are different from each other because the orange one leans more left.
  • It's hard to see a clear difference between the two sides of the channel dividing the red island from the mainland. The side on the mainland has more mass in the center bucket. Is that lack of a center audience the most prominent trait that makes the red island look like a categorically different group?
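
The binning effect can be illustrated on a made-up sample that loosely mimics the tight-left / dispersed-right shape seen earlier: with equal-width buckets spanning the full range, the rightmost bucket covers a long, sparse tail and ends up nearly empty.

```python
import numpy as np

# Illustrative sample only: a tight left mode plus a widely dispersed
# right mode, loosely mimicking the normed_theta distribution above.
rng = np.random.default_rng(1)
ideo = np.concatenate([
    rng.normal(-0.8, 0.2, size=6000),   # tightly clustered left
    rng.normal(1.2, 0.9, size=4000),    # dispersed right
])

counts, edges = np.histogram(ideo, bins=5, range=(ideo.min(), ideo.max()))
# The rightmost equal-width bucket spans a wide, sparsely populated
# tail, so it captures only a tiny fraction of users.
print(counts, counts[-1] / counts.sum())
```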

Future Work

  • Incorporate Rob's insularity metrics
  • Different binning
  • Better distance metric
  • Spanish media?
  • What is the biggest factor separating the right from the rest? Is it enough to say their mean is far more right? Is there a significant gap in center-aligned audience between the two sides of the channel?
  • Model the ideology space given the umap coordinates to determine what ideology distributions are not present.

Co-tweeting Matrix

Distance between audience ideologies is one way to lay out a media source graph. There's a more intuitive way that this team has used in the past: if two sites are shared by the same person, they should be closer together than two sites that are not. I create that graph below.
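
The construction of the co-tweet counts isn't shown in this notebook (they're loaded from a pickle below), but the idea can be sketched roughly as follows - the users and domains here are entirely hypothetical, and sorting each pair fixes the key ordering, unlike the loaded dict:

```python
import collections
import itertools

# Hypothetical sketch only: the real cotweet_dict is built elsewhere.
user_to_domains = {
    'user_a': {'nytimes.com', 'vox.com', 'cnn.com'},
    'user_b': {'nytimes.com', 'cnn.com'},
    'user_c': {'foxnews.com', 'breitbart.com'},
}

cotweet_counts = collections.Counter()
for domains in user_to_domains.values():
    # Count every unordered pair of domains shared by the same user;
    # sorting fixes the ordering so each pair lands under one key.
    for pair in itertools.combinations(sorted(domains), 2):
        cotweet_counts[pair] += 1

# nytimes.com and cnn.com were co-shared by two users.
```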

In [12]:
def plot(embedded, df):
    scatter1 = go.Scattergl(
        y=embedded[:,0],
        x=embedded[:,1],
        mode='markers',
        text=df.index,
        hoverinfo='text',
        marker=dict(
            color=site_ideo.reindex(df.index)['mean_ideology'],
            cmin=-1, cmax=1,
            size=site_ideo.reindex(df.index)['num_uniq_users'],
            sizemode='area',
            sizemin=2,
            sizeref=30
        )
    )

    layout1 = go.Layout(
        title ='Site Similarity by Co-tweeting',
        hovermode = 'closest',
        #xaxis = dict(visible=False),
        #yaxis = dict(visible=False),
        height=800,
        width=800
    )
    plotly.iplot(go.Figure(data=[scatter1], layout=layout1))
In [13]:
with open('../sharing_patterns/cotweet_dict_5000.pkl', 'rb') as f:
    cotweet_dict = pickle.load(f)
cotweets = pd.DataFrame.from_dict({d: pd.Series(e) for d, e in cotweet_dict.items()}, dtype='int64')
# Order of the two domains was not consistent, so add the two triangles together to get full counts
cotweets = cotweets + cotweets.T
cotweets.head(5)
Out[13]:
100percentfedup.com 10best.com 10news.com 10tv.com 11081920.com 11alive.com 12news.com 12up.com 13abc.com ... zeldin.house.gov zembla.bnnvara.nl zenith.news zerohedge.com zestynews.com zillow.com zing.11081920.com zocalopublicsquare.org zoom.us zurl.co
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
100percentfedup.com NaN NaN 5.0 NaN 52.0 NaN 70.0 65.0 15.0 NaN ... NaN NaN NaN 764.0 NaN NaN NaN NaN NaN 20.0
10best.com NaN 5.0 NaN 6.0 5.0 NaN NaN 9.0 NaN NaN ... 2.0 NaN NaN 10.0 NaN NaN NaN NaN NaN NaN
10news.com NaN NaN 6.0 NaN NaN NaN 36.0 NaN 11.0 NaN ... NaN 7.0 NaN NaN 8.0 NaN NaN 9.0 11.0 NaN
10tv.com NaN 52.0 5.0 NaN NaN 8.0 44.0 NaN 11.0 18.0 ... NaN 9.0 8.0 84.0 NaN NaN NaN 6.0 12.0 5.0

5 rows × 5000 columns
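
A toy version of the symmetrization step above. One pandas caveat worth flagging: plain `+` propagates NaN, so a count recorded under only one ordering becomes NaN in the sum; `DataFrame.add` with `fill_value=0` would keep it.

```python
import numpy as np
import pandas as pd

# Toy counts: pair (a, b) was recorded under both orderings,
# pair (b, c) under only one.
m = pd.DataFrame(
    [[np.nan, 2.0,    np.nan],
     [3.0,    np.nan, 4.0],
     [np.nan, np.nan, np.nan]],
    index=list('abc'), columns=list('abc'))

sym = m + m.T                        # (a, b) -> 2 + 3 = 5 in both cells
sym_safe = m.add(m.T, fill_value=0)  # also keeps the one-sided (b, c) count
```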

In [14]:
exclude = ['twitter.com', 'cards.twitter.com', '']
wo_twitter = cotweets.drop(index=exclude, columns=exclude).fillna(0)
distances = 1 - wo_twitter / wo_twitter.sum()
distances = (distances - distances.mean()) / distances.std()
display(distances.head(5))
%time embedded = umap.UMAP(metric='precomputed', random_state=1, min_dist=0.05).fit_transform(distances)
100percentfedup.com 10best.com 10news.com 10tv.com 11081920.com 11alive.com 12news.com 12up.com 13abc.com 13newsnow.com ... zeldin.house.gov zembla.bnnvara.nl zenith.news zerohedge.com zestynews.com zillow.com zing.11081920.com zocalopublicsquare.org zoom.us zurl.co
100percentfedup.com 0.371729 -0.391799 0.363134 -1.767860 0.455409 -0.889940 -1.627263 -1.277052 0.445447 0.601921 ... 0.323484 0.430422 0.375237 -4.427219 0.195958 0.517488 0.415739 0.521791 0.414770 -2.305952
10best.com 0.322785 0.415816 -0.003127 0.412996 0.455409 0.659386 0.247118 0.534584 0.445447 0.601921 ... 0.146135 0.430422 0.375237 0.431438 0.195958 0.517488 0.415739 0.521791 0.414770 0.591746
10news.com 0.371729 -0.553322 0.363134 0.645002 0.455409 -0.137410 0.548358 -0.793949 0.445447 -0.818112 ... 0.323484 -0.172096 0.375237 0.495877 -1.175471 0.517488 0.415739 -0.489566 -0.483877 0.591746
10tv.com -0.137291 -0.391799 0.363134 0.645002 -0.143986 -0.314476 0.548358 -0.793949 -1.017783 -0.567518 ... 0.323484 -0.344244 -0.691047 -0.045406 0.195958 0.517488 0.415739 -0.152447 -0.565572 -0.132678
11081920.com 0.371729 0.415816 0.363134 0.273792 0.455409 0.659386 0.548358 0.534584 0.445447 0.267795 ... 0.323484 -0.774614 0.375237 0.495877 0.195958 0.517488 -1.050505 -0.377193 0.414770 0.591746

5 rows × 4997 columns

/home/jclark/miniconda3/envs/mediacloud/lib/python3.6/site-packages/umap/umap_.py:1419: UserWarning:

Using precomputed metric; transform will be unavailable for new data

CPU times: user 1min 3s, sys: 556 ms, total: 1min 3s
Wall time: 57 s
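
To unpack the normalization in the cell above on a toy matrix (the counts are illustrative only): co-tweet counts become per-column shares, shares become dissimilarities, and the columns are then z-scored.

```python
import pandas as pd

# Illustrative co-tweet counts between three made-up sites (symmetric, dense).
counts = pd.DataFrame(
    [[0.0, 8.0, 2.0],
     [8.0, 0.0, 1.0],
     [2.0, 1.0, 0.0]],
    index=['a', 'b', 'c'], columns=['a', 'b', 'c'])

# Same steps as the cell above: per-column shares -> dissimilarity -> z-score.
shares = counts / counts.sum()          # fraction of each site's co-tweets
dissim = 1 - shares                     # frequent co-sharing -> small value
distances = (dissim - dissim.mean()) / dissim.std()

# Sites a and b co-tweet heavily, so their standardized distance is smallest.
```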
In [15]:
plot(embedded, distances)

This graph can be interpreted in much the same way as Gephi graphs, except this layout algorithm is a lot more careful in setting the distances between nodes. Again, points are media sources colored by mean ideology and sized by the number of unique sharers. The actual coordinates are meaningless - they're just useful when referring to locations.

Observations

  • Far right is really pulling away
  • Easily observable clusters:
    • far right (4.0, -5.5)
    • local news (5.0, -3.0)
    • national news (6.3, -3.5)
    • Mostly Spanish, but unclear mix (2.4, -0.8)
    • UK media (2.7, -2.2)
    • African American media (5.4, -0.2)
    • animal-focused (6.4, 0.1)
    • nerdy (4.7, -0.1)
    • environment-focused (5.9, -0.4)
    • sports (4.1, 0.3)
  • There's a lot of info encoded here that might be useful in other things.

Future Work

  • This distance normalization procedure was one of many I tried. It produces the best-looking results, but it's pretty difficult to interpret, and I haven't fully thought that through yet.
  • Using this embedding to cluster