Political Ideology Estimates of Twitter Accounts in France

Updated Mar 7, 2019

Changes

  • Mar 7: Added "Decomposing Unique Follower Sets Matrix" section and incorporated number of followers into graphs

Description

The goal here is to replicate Pablo Barberá's work on estimating the political ideologies of Twitter accounts from their follower lists. The general approach is:

  1. Select a set of political "elite" Twitter accounts that should span the ideological space
  2. Collect all followers of those accounts
  3. Run correspondence analysis on the followers-by-elites binary matrix, where an entry is 1 if the follower follows the elite and 0 otherwise
  4. Externally validate the first n dimensions of the decomposed matrix

As a first pass, the "elite" space is made up entirely of députés and sénateurs (members of the French National Assembly and Senate). We've already collected the list of followers for each of these accounts.
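
To make step 3 concrete, here is a minimal sketch of correspondence analysis written directly in NumPy and run on a toy followers-by-elites matrix. It follows the textbook formulation (SVD of the standardized residuals of the correspondence matrix); the function name `correspondence_analysis` and the toy data are illustrative assumptions, and the signs or scaling of the coordinates may differ from what `prince` returns.

```python
import numpy as np

def correspondence_analysis(X, n_components=2):
    """Textbook correspondence analysis of a non-negative matrix X.

    Assumes every row and every column of X has a non-zero sum.
    """
    X = np.asarray(X, dtype=float)
    P = X / X.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardized residuals: (p_ij - r_i * c_j) / sqrt(r_i * c_j)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    U, sigma, Vt = U[:, :n_components], sigma[:n_components], Vt[:n_components]
    row_coords = (U * sigma) / np.sqrt(r)[:, None]      # principal row coordinates
    col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]   # principal column coordinates
    return row_coords, col_coords, sigma

# Toy data (hypothetical): rows are followers, columns are elites.
# Two blocs of followers, plus one follower who straddles both blocs.
toy = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
])
rows, cols, sigma = correspondence_analysis(toy, n_components=2)
print(rows.round(3))
```

Followers with identical follow lists get identical coordinates, and the first dimension puts the two blocs on opposite sides of the origin.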

In [1]:
%matplotlib inline
import os, logging

import pandas as pd
import numpy as np
import seaborn as sns
import prince, tqdm, scipy.sparse

sns.set()
In [2]:
with open('data/elite_screen_names.txt', 'r') as f:
    elite_screen_names = [line.strip() for line in f]
len(elite_screen_names)
Out[2]:
872
In [3]:
elite_to_followers = {}
for sn in tqdm.tqdm_notebook(elite_screen_names):
    try:
        with open(f'data/followers/{sn}.txt', 'r') as f:
            elite_to_followers[sn] = [int(line) for line in f]
    except FileNotFoundError:
        logging.warning(f'Follower list not found for {sn}')
WARNING:root:Follower list not found for fmeunier19
WARNING:root:Follower list not found for AlainVasselle
WARNING:root:Follower list not found for e807Limon

fmeunier19's and AlainVasselle's accounts are protected. Account e807Limon does not exist.

Data Cleaning

We want each of our elites to have a minimum number of followers. Later on we split the followers into two groups by engagement (to better estimate the ideological space), and every elite needs to have followers in both groups. On the first pass, this was a problem for the Twitter account DidierParis.

In [4]:
elite_to_num_followers = {e: len(f) for e, f in elite_to_followers.items()}
pd.Series(elite_to_num_followers).sort_values().head()
Out[4]:
DidierParis         5
MIZZONJeanMari1     7
SPanonacle         25
genevievejean84    36
ydecourson         53
dtype: int64
In [5]:
MIN_ELITE_FOLLOWERS = 6

for elite in list(elite_to_followers.keys()):
    if elite_to_num_followers[elite] < MIN_ELITE_FOLLOWERS:
        logging.warning(f'@{elite} has {elite_to_num_followers[elite]} followers and will be removed')
        del elite_to_followers[elite]
WARNING:root:@DidierParis has 5 followers and will be removed

Follower x Elite Matrix Construction

In [6]:
%%time

# Canonical ordering of elites and followers
elites = list(set(elite_to_followers))
followers = list(set([follower for followers in elite_to_followers.values() for follower in followers]))
elite_to_i = {e: i for i, e in enumerate(elites)}
follower_to_i = {f: i for i, f in enumerate(followers)}
data, follower_idx, elite_idx = [], [], []
for e, fs in elite_to_followers.items():
    for f in fs:
        data.append(1)
        elite_idx.append(elite_to_i[e])
        follower_idx.append(follower_to_i[f])
CPU times: user 17.5 s, sys: 664 ms, total: 18.1 s
Wall time: 18.1 s
In [7]:
follower_matrix = scipy.sparse.coo_matrix((data, (follower_idx, elite_idx)), dtype=np.int8)
follower_matrix = follower_matrix.tocsr()
follower_matrix.shape
Out[7]:
(4481707, 868)
In [8]:
follower_df = pd.DataFrame(follower_matrix.toarray()).rename(dict(enumerate(followers))).rename(dict(enumerate(elites)), axis=1)
In [9]:
num_followers = follower_df.sum()
num_followers.name = 'num_followers'
num_followers.sort_values(ascending=False).head(10)
Out[9]:
MLP_officiel      2186032
jlmelenchon       1991122
manuelvalls       1054789
BrunoLeMaire       425704
BGriveaux          229937
dupontaignan       192105
GilbertCollard     184918
CCastaner          173337
mounir             168773
jpraffarin         166911
Name: num_followers, dtype: int64
In [10]:
num_followed = follower_df.sum(axis=1)
num_followed.name = 'num_followed'
num_followed.sort_values(ascending=False).head(10)
Out[10]:
712961701642108928    804
3234062294            787
4136470043            782
965885841607446528    761
4892797133            755
554174318             750
252233750             742
947563355442737153    740
2863374502            739
718409564333424640    732
Name: num_followed, dtype: int64

We're interested in the most engaged accounts, here measured by the number of political elites they follow. Let's see what that distribution looks like.

In [11]:
num_elite_followed = follower_df.sum(axis=1)
In [12]:
ax = pd.Series(num_elite_followed).value_counts().sort_values(ascending=False).head(20).plot.bar(logy=True)
_ = ax.set_ylabel('# Twitter accounts')
_ = ax.set_xlabel('# political elites followed')

Let's filter to accounts following 3 or more political elites. That will both trim the data down to a more manageable size and hopefully make the space more representative of ideology.

In [13]:
MIN_FOLLOWED_FOR_ENGAGED = 3
engaged_df = follower_df.reindex(follower_df.index[num_elite_followed >= MIN_FOLLOWED_FOR_ENGAGED])
engaged_df.shape
Out[13]:
(1069150, 868)
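
The same filter can also be applied to the sparse matrix directly, which avoids building the dense frame when memory is tight. The sketch below runs on a toy stand-in; `toy_matrix` and `toy_follower_ids` are hypothetical names, not objects from this notebook.

```python
import numpy as np
import scipy.sparse

# Toy stand-in for the followers-by-elites matrix (rows = followers).
toy_matrix = scipy.sparse.csr_matrix(np.array([
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
], dtype=np.int8))
toy_follower_ids = np.array([101, 102, 103, 104])  # hypothetical account ids

MIN_FOLLOWED = 3

# Row sums give the number of elites each account follows.
# Accumulate in int64 so the int8 storage dtype cannot overflow.
n_followed = np.asarray(toy_matrix.sum(axis=1, dtype=np.int64)).ravel()

# Select the engaged rows by integer position and keep their ids in step.
keep = np.flatnonzero(n_followed >= MIN_FOLLOWED)
engaged_matrix = toy_matrix[keep]
engaged_ids = toy_follower_ids[keep]
print(engaged_matrix.shape, engaged_ids)
```

Whether the downstream decomposition accepts a sparse input depends on the library version, so this only covers the filtering step.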

Decomposing Engaged Follower x Elite Matrix

In [14]:
%%time
ca = prince.CA(n_components=3)
ca = ca.fit(engaged_df)
CPU times: user 3min 42s, sys: 49.2 s, total: 4min 31s
Wall time: 2min 1s
In [15]:
%%time
engaged_coords = ca.row_coordinates(engaged_df)
CPU times: user 36.7 s, sys: 11.2 s, total: 47.8 s
Wall time: 43 s
In [16]:
_ = sns.pairplot(engaged_coords, height=5, aspect=1.8, plot_kws={'alpha': 0.1}, diag_kws={'bins': 50})
In [17]:
engaged_coords[0].value_counts().sort_values(ascending=False).head()
Out[17]:
-1.227023    156965
-1.125949     39792
-1.138645     26753
-1.134111     13722
-1.139916     11966
Name: 0, dtype: int64
In [18]:
def get_following(account_id):
    return follower_df.loc[account_id,follower_df.loc[account_id,:] == 1].index.values
In [20]:
list(map(get_following, engaged_coords[engaged_coords[0].round(6) == -1.227023].sample(5).index.values))
Out[20]:
[array(['MLP_officiel', 'jlmelenchon', 'manuelvalls'], dtype=object),
 array(['MLP_officiel', 'jlmelenchon', 'manuelvalls'], dtype=object),
 array(['MLP_officiel', 'jlmelenchon', 'manuelvalls'], dtype=object),
 array(['MLP_officiel', 'jlmelenchon', 'manuelvalls'], dtype=object),
 array(['MLP_officiel', 'jlmelenchon', 'manuelvalls'], dtype=object)]
In [22]:
list(map(get_following, engaged_coords[engaged_coords[0].round(6) == -1.125949].sample(5).index.values))
Out[22]:
[array(['MLP_officiel', 'BrunoLeMaire', 'jlmelenchon', 'JVPlace',
        'jclagarde', 'manuelvalls'], dtype=object),
 array(['MLP_officiel', 'BrunoLeMaire', 'jlmelenchon', 'JVPlace',
        'jclagarde', 'manuelvalls'], dtype=object),
 array(['MLP_officiel', 'BrunoLeMaire', 'jlmelenchon', 'JVPlace',
        'jclagarde', 'manuelvalls'], dtype=object),
 array(['MLP_officiel', 'BrunoLeMaire', 'jlmelenchon', 'JVPlace',
        'jclagarde', 'manuelvalls'], dtype=object),
 array(['MLP_officiel', 'BrunoLeMaire', 'jlmelenchon', 'JVPlace',
        'jclagarde', 'manuelvalls'], dtype=object)]

There are a lot of accounts with exactly the same score, because a lot of accounts follow exactly the same set of elites. This isn't a problem per se, but it can be a limitation in follow-on analysis. (Sets of politicians who are followed because they are popular or influential muddy the ideological space we're looking for.) This limitation can be worked around in two ways: raising the minimum number of followed elites required to consider an account "engaged", or doing that plus adding more elite accounts in a principled way. We're investigating the second.
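
One way to see how concentrated the follow-sets are is to count how many accounts share each one. This is a minimal sketch on a toy frame with the same layout as `follower_df` (accounts as rows, elites as columns); the names `toy_df`, `follow_sets`, and `set_counts` are illustrative.

```python
from collections import Counter

import pandas as pd

# Toy data (hypothetical): 1 means the account follows that elite.
toy_df = pd.DataFrame(
    [[1, 1, 1, 0],
     [1, 1, 1, 0],
     [1, 1, 1, 0],
     [0, 1, 1, 1],
     [1, 1, 0, 1]],
    index=[101, 102, 103, 104, 105],
    columns=['elite_a', 'elite_b', 'elite_c', 'elite_d'],
)

# Turn each row into a hashable tuple of the elites it follows.
follow_sets = [tuple(toy_df.columns[row == 1]) for row in toy_df.to_numpy()]

# Count how many accounts share each exact follow-set.
set_counts = Counter(follow_sets)
most_common_set, n_accounts = set_counts.most_common(1)[0]
print(most_common_set, n_accounts)
```

On the real data this loop would run over about a million rows, so it is meant as a diagnostic rather than something to run repeatedly.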

In [23]:
elite_coords = ca.column_coordinates(engaged_df)
elite_coords.sample(5)
Out[23]:
                       0         1         2
JoelGiraud05    1.060830 -0.212918 -0.011426
MoniqueIborra   0.999604 -0.179385 -0.095716
mnlienemann     0.367596  1.006506  0.297542
olivierpaccaud  1.018135 -0.495472  1.377454
gdarrieussecq   0.857916 -0.270739 -0.552622
In [24]:
_ = sns.pairplot(elite_coords.join(np.log(num_followers)), height=5, aspect=1.8,
                 plot_kws={'alpha': 0.3, 'size': np.log(num_followers), 'hue': np.log(num_followers)}, diag_kws={'bins': 50})
In [25]:
not_engaged_df = follower_df.reindex(follower_df.index[num_elite_followed < MIN_FOLLOWED_FOR_ENGAGED])
not_engaged_df.shape
Out[25]:
(3412557, 868)
In [26]:
%%time
not_engaged_coords = ca.transform(not_engaged_df)
not_engaged_coords.shape
CPU times: user 1min 59s, sys: 33.2 s, total: 2min 32s
Wall time: 2min 18s
In [27]:
_ = sns.pairplot(not_engaged_coords, height=5, aspect=1.8, plot_kws={'alpha': 0.3}, diag_kws={'bins': 50})

Decomposing Unique Follower Sets Matrix

Looking at the graphs above, the number of followers still seems to play a large role in determining where elites fall in the space. I think some of this might be coming from the heavy repetition of a few identical follow-sets. For example, the set ['manuelvalls', 'jlmelenchon', 'MLP_officiel'] is the exact follow-set of about 157K accounts. I'd guess the prominence of those accounts explains their co-occurrence better than ideological similarity does. As a first pass at compensating for this, I'm only going to look at unique follow-sets of elites, i.e. ['manuelvalls', 'jlmelenchon', 'MLP_officiel'] is only going to count once instead of 157K times.
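
To illustrate the effect, the toy sketch below compares elite-by-elite co-occurrence counts before and after dropping duplicate follow-sets. The data and the helper `cooccurrence` are hypothetical and are only meant to show how a single heavily repeated set can dominate the counts.

```python
import pandas as pd

# Toy data (hypothetical): four accounts share the same follow-set,
# two other accounts each have a distinct one.
toy_df = pd.DataFrame(
    [[1, 1, 0]] * 4 + [[0, 1, 1], [1, 0, 1]],
    columns=['elite_a', 'elite_b', 'elite_c'],
)

def cooccurrence(df):
    """Elite-by-elite matrix: entry (i, j) counts accounts following both."""
    return df.T.dot(df)

before = cooccurrence(toy_df)
after = cooccurrence(toy_df.drop_duplicates())

print(before)
print(after)
```

Before deduplication the pair (elite_a, elite_b) co-occurs four times and every other pair once; afterwards all three pairs co-occur exactly once.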

In [28]:
%%time
unique_follower_sets_df = engaged_df.drop_duplicates()
CPU times: user 15.5 s, sys: 2.88 s, total: 18.4 s
Wall time: 18.4 s
In [29]:
unique_follower_sets_df.shape
Out[29]:
(400997, 868)
In [30]:
%%time
ca_unique = prince.CA(n_components=3)
ca_unique = ca_unique.fit(unique_follower_sets_df)
CPU times: user 1min 24s, sys: 16.5 s, total: 1min 41s
Wall time: 43.9 s
In [31]:
%%time
unique_sets_coords = ca_unique.row_coordinates(unique_follower_sets_df)
CPU times: user 12.8 s, sys: 4.45 s, total: 17.3 s
Wall time: 15.1 s
In [32]:
_ = sns.pairplot(unique_sets_coords, height=5, aspect=1.8, plot_kws={'alpha': 0.1}, diag_kws={'bins': 50})
In [33]:
unique_sets_coords[0].value_counts().sort_values(ascending=False).head()
Out[33]:
-0.820511    1
 0.040295    1
-0.552672    1
 0.237615    1
-0.627044    1
Name: 0, dtype: int64
In [34]:
elite_unique_sets_coords = ca_unique.column_coordinates(unique_follower_sets_df)
elite_unique_sets_coords.sample(5)
Out[34]:
                         0         1         2
NadiaHai78        1.440964  0.484406 -0.344192
AGenetet          1.207474  0.349901 -0.234444
Jacquin_Olivier   0.409559  0.257227  0.964587
BorisVallaud     -0.032740  0.558542  0.083208
JocelyneGuidez    0.653951 -0.236811  1.256129
In [35]:
_ = sns.pairplot(
    elite_unique_sets_coords.join(np.log(num_followers)),
    height=5,
    aspect=1.8,
    plot_kws={'alpha': 0.5, 'size': np.log(num_followers), 'hue': np.log(num_followers)},
    diag_kws={'bins': 50})
In [36]:
elite_unique_sets_coords.describe()
Out[36]:
                0           1           2
count  868.000000  868.000000  868.000000
mean     0.518953    0.055064    0.536072
std      0.669025    0.623870    0.679345
min     -1.581902   -1.076943   -0.821578
25%      0.119173   -0.439015   -0.063888
50%      0.445705    0.098343    0.421353
75%      1.099590    0.394563    1.145328
max      1.645325    3.043696    2.139162

Exporting Data

In [ ]:
elite_coords.to_csv('data/political_elite_account_coordinates.csv.gz')
engaged_coords.to_csv('data/engaged_account_coordinates.csv.gz')
not_engaged_coords.to_csv('data/not_engaged_account_coordinates.csv.gz')
elite_unique_sets_coords.to_csv('data/political_elite_account_unique_follower_sets_coordinates.csv.gz')
unique_sets_coords.to_csv('data/engaged_account_unique_follower_sets_coordinates.csv.gz')