Finding the Political Center of Twitter Accounts

Updated Jan 16, 2019

We now have ideology estimates for a large number of Twitter accounts. The problem is that 0.0 represents the mean of the account ideologies, not anything about the center of the political spectrum. It would be nice if 0.0 corresponded to something about the world rather than the mean of the accounts we happen to have. To estimate that, we'll look at how accounts describe themselves and compare those self-descriptions to their estimated ideologies to find the center, following Barberá's "Birds of the Same Feather Tweet Together".

The general outline is:

  1. Look for two groups of political keywords in account descriptions that correspond to the poles, e.g. "conservative" and "liberal".
  2. Find where usage of the two groups of keywords is approximately equal. That's the "center".
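Before running this on the real account data below, the crossing-point idea in step 2 can be sketched on synthetic data. Everything here is made up for illustration: two overlapping score distributions, a kernel density fit to each, and a search for the point where the densities are equal.

```python
# Toy sketch of the approach with synthetic data (not the real account
# estimates): two groups whose score distributions overlap, and we find
# the point where their densities cross.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
left_scores = rng.normal(-1.0, 0.8, size=5000)   # hypothetical "left" accounts
right_scores = rng.normal(1.0, 0.8, size=5000)   # hypothetical "right" accounts

l_kde = stats.gaussian_kde(left_scores)
r_kde = stats.gaussian_kde(right_scores)

# The crossing is where the absolute density difference is minimized.
res = optimize.minimize_scalar(
    lambda x: abs(l_kde(x)[0] - r_kde(x)[0]),
    bounds=(-1, 1), method='bounded')
print(f'Estimated center: {res.x:.3f}')
```

By symmetry the crossing here should come out near 0.0; with the real data the interesting question is how far from 0.0 it lands.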
In [1]:
%matplotlib inline
import gzip, pickle, collections, itertools, random, json, re

import pandas as pd
import plotly.offline as plotly
import plotly.graph_objs as go
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbs
from scipy import stats, optimize

plotly.init_notebook_mode()

This is the command I used on my home machine to pull out user descriptions that included political keywords.

In [2]:
filter_terms = ['conservative', 'gop', 'republican', 'liberal', 'progressive', 'democrat', 'moderate', 'independent']
# Case-insensitive regex match is simpler than chaining contains() filters
jq_filter = r'test("({})"; "i")'.format('|'.join(filter_terms))
print(f"jq -c 'select(.description | {jq_filter}) | [.id_str, .description]' data/users_with_ideo_estimates.ndjson | gzip > user_descs_with_political_keywords.ndjson.gz")
jq -c 'select(.description | test("(conservative|gop|republican|liberal|progressive|democrat|moderate|independent)"; "i")) | [.id_str, .description]' data/users_with_ideo_estimates.ndjson | gzip > user_descs_with_political_keywords.ndjson.gz
In [3]:
acct_descs = {}
with gzip.open('data/user_descs_with_political_keywords.ndjson.gz', 'r') as f:
    for line in f:
        aid, desc = json.loads(line)
        acct_descs[int(aid)] = desc.strip()
        
acct_descs = pd.DataFrame.from_dict(acct_descs, orient='index', columns=['description'])

Here we assign each account to an ideological group based on the first ideological keyword in its description.

In [4]:
ideo_group_terms = {
    'conservative': 'right',
    'gop': 'right',
    'republican': 'right',
    'liberal': 'left',
    'progressive': 'left',
    'democrat': 'left',
    'moderate': 'center',
    'independent': 'center'
}
pattern = re.compile(f"({'|'.join(ideo_group_terms.keys())})", re.I)

def desc_to_ideo_group(desc):
    match = pattern.search(desc)
    if match is None:
        raise ValueError(f'"{desc}" did not contain a keyword')
    return ideo_group_terms[match.group(1).lower()]

acct_descs['ideo_group'] = acct_descs['description'].apply(desc_to_ideo_group)
acct_ideos = pd.read_csv('data/cleaned_user_ideology_estimates_20180705.csv.gz', index_col=0)
accts = acct_ideos.join(acct_descs, how='inner')
accts.sample(5)
Out[4]:
                       theta  normed_theta                                        description  ideo_group
836244598532026370  0.851510     -0.752313  Writer, author, journalist. Politics junkie, l...        left
17140485            1.024982     -0.918466                Vietnam Vet, Retired, Proud Liberal        left
904875612          -1.344085      1.350637  God Bless America and God Bless Donald Trump, ...       right
785122911908757504 -0.653317      0.689016   Swing voter. Voted for Clinton and Obama.  Tur...      right
964123514855919617 -2.054256      2.030842  #Trump #conservative #constitution #MAGA\nI lo...       right
In [5]:
accts['ideo_group'].value_counts()
Out[5]:
right     46418
left      42801
center     8973
Name: ideo_group, dtype: int64
In [6]:
ideo_range = (accts['normed_theta'].min(), accts['normed_theta'].max())
ax = accts[accts['ideo_group'] == 'center']['normed_theta'].plot.density(alpha=0.8, color='g', xlim=ideo_range)
ax = accts[accts['ideo_group'] == 'right']['normed_theta'].plot.density(alpha=0.8, ax=ax, color='r')
_ = accts[accts['ideo_group'] == 'left']['normed_theta'].plot.density(alpha=0.8, ax=ax, color='b')

It looks like they do cross, which is good. Let's estimate where they cross relative to our ideology estimates.

In [7]:
num_bins = 40
l_hist, l_edges = np.histogram(accts.loc[accts['ideo_group'] == 'left','normed_theta'], bins=num_bins, density=True, range=ideo_range)
r_hist, r_edges = np.histogram(accts.loc[accts['ideo_group'] == 'right','normed_theta'], bins=num_bins, density=True, range=ideo_range)
hists = pd.DataFrame({'left': l_hist, 'right': r_hist, 'bin_ideo_center': pd.Series(l_edges).rolling(2).mean().round(3)[1:]})
_ = hists.plot.bar(x='bin_ideo_center', figsize=(16, 6))
In [8]:
l_kernel = stats.gaussian_kde(accts.loc[accts['ideo_group'] == 'left', 'normed_theta'])
r_kernel = stats.gaussian_kde(accts.loc[accts['ideo_group'] == 'right', 'normed_theta'])

fig = plt.figure()
ax = fig.add_subplot(111)

x_eval = np.linspace(ideo_range[0], ideo_range[1], num=500)
ax.plot(x_eval, l_kernel(x_eval), 'b-')
_ = ax.plot(x_eval, r_kernel(x_eval), 'r-')
In [9]:
# gaussian_kde.evaluate returns a length-1 array, so wrap it to return a
# scalar rather than relying on the optimizer result being indexable
print('Left minimum:', optimize.minimize_scalar(lambda x: l_kernel.evaluate(x)[0], bounds=(-1, 2), method='bounded').x)
print('Right minimum:', optimize.minimize_scalar(lambda x: r_kernel.evaluate(x)[0], bounds=(-1, 2), method='bounded').x)
center = optimize.minimize_scalar(lambda x: np.abs(l_kernel.evaluate(x) - r_kernel.evaluate(x))[0], bounds=(-1, 2), method='bounded').x
print(f'Where left and right cross: {center}')
Left minimum: 1.2184007900650722
Right minimum: 0.09588414105472824
Where left and right cross: 0.3951960351597004

The center is about 0.395
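With the crossing point in hand, recentering any score on this scale is just a shift. A minimal sketch with made-up values standing in for the real `normed_theta` column and the `center` estimated above:

```python
import pandas as pd

# Hypothetical scores standing in for real normed_theta values.
center = 0.395
scores = pd.Series([-0.918, 0.395, 2.031], name='normed_theta')

# Shift so that 0.0 marks the keyword-based center instead of the sample
# mean; the sign then says which side of the empirical center a score
# falls on.
recentered = scores - center
print(recentered.tolist())
```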

How the Center Maps to Media Scores

Let's look at how that center relates to our media source ideology scores. They're on the same scale, so the center might be meaningful if it's simply lifted and applied to the media scores.

In [10]:
df = pd.read_csv('media_source_ideologies_all_data.csv', index_col=0)
df['diff_from_center'] = (np.abs(df['mean_sharer_ideo'] - center))
interesting_cols = ['mean_sharer_ideo', 'stddev_sharer_ideo', 'num_sharers', 'num_uniq_urls', 'diff_from_center']
interesting_domains = ['english.alarabiya.net', 'aljazeera.com', 'americanthinker.com', 'bbc.com',
    'bbc.co.uk', 'bloomberg.com', 'bostonglobe.com', 'breitbart.com',
    'buzzfeed.com', 'cbc.ca', 'cbsnews.com', 'chicagotribune.com', 'cnbc.com',
    'cnn.com', 'csmonitor.com', 'dailycaller.com', 'dailykos.com',
    'dailymail.co.uk', 'economist.com', 'forbes.com', 'foreignpolicy.com',
    'fortune.com', 'foxnews.com', 'haaretz.com', 'hindustantimes.com',
    'huffingtonpost.com', 'huffpost.com', 'independent.co.uk', 'infowars.com',
    'latimes.com', 'miamiherald.com', 'motherjones.com', 'msnbc.com',
    'nationalreview.com', 'nbcnews.com', 'newsweek.com', 'newyorker.com',
    'npr.org', 'nydailynews.com', 'nypost.com', 'nytimes.com', 'pbs.org',
    'politico.com', 'propublica.org', 'realclearpolitics.com','reuters.com',
    'rollcall.com', 'rt.com', 'salon.com', 'news.sky.com', 'slate.com',
    'sputniknews.com', 'theatlantic.com', 'theguardian.com', 'thehill.com',
    'time.com', 'usatoday.com', 'vox.com', 'washingtonpost.com',
    'washingtontimes.com', 'weeklystandard.com', 'westernjournal.com', 'wsj.com',
    'zerohedge.com']
display(df.where(df['num_sharers'] > 1000).sort_values('diff_from_center').loc[:,interesting_cols].head(10))
_ = (df.loc[interesting_domains, interesting_cols]['mean_sharer_ideo'] - center).sort_values(ascending=False).plot.barh(figsize=(12, 20))
                         mean_sharer_ideo  stddev_sharer_ideo  num_sharers  num_uniq_urls  diff_from_center
nypost.com                         0.3960               1.236       8120.0        31812.0          0.000804
mysanantonio.com                   0.3987               1.297       1579.0        33690.0          0.003504
news.sky.com                       0.3896               1.258       2790.0         9522.0          0.005596
usmagazine.com                     0.3870               1.300       1218.0         8453.0          0.008196
mirror.co.uk                       0.3850               1.273       2835.0         8895.0          0.010196
archives.gov                       0.3818               1.269       1042.0          352.0          0.013396
tmz.com                            0.3794               1.275       3332.0        10874.0          0.015796
pittsburgh.cbslocal.com            0.4111               1.285       1241.0         1013.0          0.015904
nbcmiami.com                       0.3787               1.294       1418.0         1294.0          0.016496
stripes.com                        0.3760               1.250       2151.0         6495.0          0.019196

Observations

  • This seems too far right when compared with the media source scores, but the method seems sound.
  • When calculating the center here, we're doing things in density space, so left and right are the same "size". When taking the mean user score to come up with the media score, there are fewer accounts on the right than there are on the left. This would move all the media scores to the left relative to this center.
  • The Twitter account "center" need not be the same as the media source "center". It could be I just shouldn't compare the two.
  • It could be the case that conservatives on Twitter are more conservative than conservatives in the rest of the country. Barberá observes this in "Understanding the Political Representativeness of Twitter Users".
  • It could be the case that politically engaged conservatives on Twitter are more extreme than politically engaged liberals, which makes sense now that I type it. We're assuming uniform engagement here, but it's entirely possible that extreme liberals don't use Twitter while extreme conservatives use it a lot.
  • The use of the various ideological terms could differ between groups.
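The second observation above (density space treats both sides as the same "size") can be checked directly by weighting each kernel by its group's account count before finding the crossing. A sketch on synthetic data; the group sizes and distributions here are hypothetical, exaggerated so the shift is visible:

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)
# Hypothetical stand-ins for the real score samples, with a deliberately
# large imbalance between group sizes.
left_scores = rng.normal(-1.0, 0.8, size=8000)
right_scores = rng.normal(1.0, 0.8, size=2000)

l_kde = stats.gaussian_kde(left_scores)
r_kde = stats.gaussian_kde(right_scores)

def crossing(weight_by_counts):
    # With weighting, we compare raw frequencies n * f(x) instead of
    # normalized densities f(x).
    lw = len(left_scores) if weight_by_counts else 1
    rw = len(right_scores) if weight_by_counts else 1
    res = optimize.minimize_scalar(
        lambda x: abs(lw * l_kde(x)[0] - rw * r_kde(x)[0]),
        bounds=(-1, 1), method='bounded')
    return res.x

# Count-weighting pushes the crossing away from the larger group's side.
print('density crossing:', crossing(False))
print('count-weighted crossing:', crossing(True))
```

With the left group four times larger, the count-weighted crossing sits noticeably to the right of the density crossing, which is the direction of the discrepancy noted above.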