~ / Brian Norlander / projects /
Reddit Bot Classifier

Published on December 03, 2018.

Detecting Bots on Reddit

Overview

Here is my Official report

Here is my code on GitHub

In my last semester at the Hong Kong University of Science and Technology, I took an independent research course for credit under the supervision of Professor David Rossiter, which let me lead a semester-long solo project.

The focus of my project was detecting Russian bots on Reddit. I built a classifier that analyzed many thousands of posts, comments and user metadata from a list of known Russian accounts. The results were very good, with accuracy and precision often well over 0.80 (see my Official report for a more detailed analysis).

My paper is published on Prof. Rossiter's website.

Collecting Data

The first step was to collect the user data from Reddit.

I had a list of 944 known Russian accounts from Reddit's 2017 Transparency Report, which I later used as the ground truth for my classifiers. These accounts made posts and comments starting in approximately April 2015, and some continued to make submissions as late as April 2018. For comparison, I selected normal user accounts that were active during the same period.

I extracted the following data for each user:

  • username: Username of account
  • created_utc: Day of account creation
  • comments: All of the comments from the account. Each comment has body text and a timestamp.
  • posts: All of the posts from the account. Each post has a title, description and a timestamp.
  • comment_karma: The total number of upvotes for all of the comments from the user.
  • link_karma: The total number of upvotes for all of the posts from the user.
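As a rough illustration, the per-user record described in this list could be modelled as below. This is only a sketch: the class names and the `to_document` helper are hypothetical, and storing timestamps as Unix seconds is an assumption rather than something taken from the project code.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Comment:
    body: str         # comment text
    created_utc: int  # assumed: Unix timestamp in seconds (UTC)


@dataclass
class Post:
    title: str
    description: str  # body text of the post; may be empty
    created_utc: int  # assumed: Unix timestamp in seconds (UTC)


@dataclass
class User:
    username: str
    created_utc: int    # account creation time
    comment_karma: int  # total upvotes on the user's comments
    link_karma: int     # total upvotes on the user's posts
    comments: List[Comment] = field(default_factory=list)
    posts: List[Post] = field(default_factory=list)

    def to_document(self) -> dict:
        """Flatten the record (including nested comments and posts) into a plain dict."""
        return asdict(self)
```

A plain dict like the one returned by `to_document` is a convenient shape for a document store, since nested comments and posts become nested lists of dicts.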

I wrote the extraction scripts in Python. To collect user metadata I used PRAW, the popular Python wrapper for the Reddit API. To collect user posts and comments I used a third-party API called Pushshift, which had no limit on how many comments and posts you could extract (listings fetched through PRAW were capped at 1,000 items).
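Getting past a per-request cap usually means paging through an account's history. The sketch below shows one common pattern, walking backwards in time with a "before" cursor. It is not the project's actual script: `iter_all_items` and the `fetch_page` callable are hypothetical names, and the fetcher is injected so that the loop does not depend on any particular HTTP client or on the exact Pushshift request format.

```python
from typing import Callable, Dict, Iterator, List, Optional

# A page fetcher takes (author, before, size) and returns a list of items,
# each carrying a "created_utc" field. In practice it would wrap an HTTP call
# to the archive; here it is a parameter so the loop can be exercised offline.
Fetcher = Callable[[str, Optional[int], int], List[Dict]]


def iter_all_items(author: str, fetch_page: Fetcher, page_size: int = 100) -> Iterator[Dict]:
    """Yield every item for `author`, requesting successively older pages."""
    before: Optional[int] = None  # None means "start from the newest item"
    while True:
        page = fetch_page(author, before, page_size)
        if not page:
            return  # an empty page means the history is exhausted
        yield from page
        # Ask next for items strictly older than the oldest one seen so far.
        before = min(item["created_utc"] for item in page)
```

One caveat of this cursor style: if several items share the exact timestamp at a page boundary, some of them can be skipped, so a real script may want to de-duplicate by item id and overlap the pages slightly.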

Finally, I stored all of the data locally in MongoDB, where I created the collections and data objects for User, Comment, Post, etc.

Classification

Once I collected the user data I could then build a classifier.

I created classifiers on four attributes: post title, comment text, post subreddit, and comment subreddit. The comment text classification saw mixed results while all other methods had very high accuracy and precision.
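The post does not say which model was used for each attribute, so the following is only an illustrative stand-in: a small multinomial naive Bayes text classifier written with the standard library, plus a helper that computes accuracy and precision. The names `tokenize`, `NaiveBayes` and `accuracy_and_precision` are hypothetical, and the labels "bot" and "normal" are assumed.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over word counts."""

    def fit(self, texts: Sequence[str], labels: Sequence[str]) -> "NaiveBayes":
        self.class_counts = Counter(labels)
        self.word_counts: Dict[str, Counter] = {c: Counter() for c in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text: str) -> str:
        total_docs = sum(self.class_counts.values())
        scores = {}
        for c, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[c].values())
            score = math.log(n_docs / total_docs)  # class prior
            for w in tokenize(text):
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log(
                        (self.word_counts[c][w] + 1) / (total_words + len(self.vocab))
                    )
            scores[c] = score
        return max(scores, key=scores.get)


def accuracy_and_precision(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str = "bot"
) -> Tuple[float, float]:
    """Accuracy over all items, and precision for the `positive` label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    flagged = [t for t, p in zip(y_true, y_pred) if p == positive]
    precision = sum(t == positive for t in flagged) / len(flagged) if flagged else 0.0
    return correct / len(y_true), precision
```

The same class could be applied to subreddit names instead of words by treating each subreddit an account posted or commented in as a single token.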

Detailed classification results are in my Official report.

Click on any of the pictures below to view an interactive web page of my results.

Note: the interactive pages linked from the images below load VERY slowly.

Post Title Visualization: This graph shows the words in a post title that most strongly indicate whether the user is a bot or a normal user. The blue dots signify a normal user and the red dots signify a bot. The further to the right a word is, the more characteristic it is of the corpus.

Comment Text Visualization: This graph shows the words in the text of a comment that most strongly indicate whether the user is a bot or a normal user. The blue dots signify a normal user and the red dots signify a bot. The further to the right a word is, the more characteristic it is of the corpus.

Post Subreddit Visualization: This graph shows which subreddits, when posted in, indicate that the post is likely to originate from a bot or from a normal user. The blue dots are indicative of a normal user and the red dots are indicative of a bot. The further to the right a subreddit name is, the more common it is in the corpus.

Comment Subreddit Visualization: This graph shows which subreddits, when commented in, indicate that the comment is likely to originate from a bot or from a normal user. The blue dots are indicative of a normal user and the red dots are indicative of a bot. The further to the right a subreddit name is, the more common it is in the corpus.
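One simple way to score how strongly a token (a word or a subreddit name) leans towards one class is a smoothed log-ratio of its relative frequency in the two corpora. The sketch below is a generic illustration and not necessarily how the plots above were produced; `log_ratio` and its `alpha` smoothing parameter are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def log_ratio(
    bot_tokens: Iterable[str], normal_tokens: Iterable[str], alpha: float = 1.0
) -> Dict[str, float]:
    """Smoothed log-ratio of each token's relative frequency, bot vs. normal.

    Positive scores lean towards bots, negative scores towards normal users,
    and a score near zero means the token is about equally common in both.
    """
    bot, normal = Counter(bot_tokens), Counter(normal_tokens)
    vocab = set(bot) | set(normal)
    # Add `alpha` pseudo-counts per token so unseen tokens do not produce log(0).
    bot_total = sum(bot.values()) + alpha * len(vocab)
    normal_total = sum(normal.values()) + alpha * len(vocab)
    return {
        tok: math.log((bot[tok] + alpha) / bot_total)
        - math.log((normal[tok] + alpha) / normal_total)
        for tok in vocab
    }
```

Sorting the resulting dictionary by score gives the tokens most indicative of each class at either end of the list.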

Account Activity Analysis

These graphics show that the Reddit bot accounts were active during Moscow business hours, while the activity of normal Reddit accounts roughly follows American time zones. America has by far the most Reddit accounts.

These two graphs show the time of day at which comments were made by the bot accounts and by normal accounts. Times are in GMT (London).
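An hour-of-day activity profile like the ones plotted here can be computed directly from the timestamps. The following is a minimal sketch that assumes timestamps are Unix seconds; `hourly_histogram` and the `MOSCOW` constant are names introduced for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Iterable, List

# Moscow has used a fixed UTC+3 offset (no daylight saving) since late 2014,
# which covers the period in which the accounts were active.
MOSCOW = timezone(timedelta(hours=3))


def hourly_histogram(timestamps: Iterable[int], tz: timezone = timezone.utc) -> List[int]:
    """Count events per hour of day (index 0 to 23) in the given time zone."""
    counts = Counter(datetime.fromtimestamp(ts, tz=tz).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]
```

Computing the histogram once with the default UTC zone and once with `MOSCOW` makes it easy to check whether a peak falls inside local working hours.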

Normal users
Bot users

These two graphs show the time of day at which posts were made by the bot accounts and by normal accounts. Times are in GMT (London).

Normal users
Bot users

Number of Comments and Posts Per Account

These graphics show that, compared with normal accounts, the Reddit bot accounts typically have a higher number of posts relative to comments.
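The comparison can be reduced to a single number per account, for example the share of an account's submissions that are posts. This is a small sketch with hypothetical helper names, not code from the project.

```python
from statistics import median
from typing import List, Tuple


def post_share(n_posts: int, n_comments: int) -> float:
    """Fraction of an account's submissions that are posts (0.0 if it has none)."""
    total = n_posts + n_comments
    return n_posts / total if total else 0.0


def median_post_share(accounts: List[Tuple[int, int]]) -> float:
    """Median post share over a non-empty list of (n_posts, n_comments) pairs."""
    return median(post_share(posts, comments) for posts, comments in accounts)
```

Comparing `median_post_share` for the bot group against the normal group gives a quick summary of the trend, and the per-account value could later serve as one feature among several.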

These two graphs show the number of comments per account for bots and normal users.

Normal users
Bot users

These two graphs show the number of posts per account for bots and normal users.

Normal users
Bot users

In conclusion, it seems that the Russian bot accounts tend to conduct their activity during Moscow working hours, while the activity of most typical Redditors aligns with American time zones. Additionally, bot accounts appear to have a high number of posts relative to comments when compared with normal users. These two trends are by no means enough to classify an account, but they do provide additional meaningful information that could be added to an aggregate classifier later on.

In the future I would like to add an aggregate classifier that combines all of the methods previously described. I would also like to analyze the actual content of the text more deeply, for example with sentiment analysis and by trying to detect accounts run by non-native English speakers.

View Reddit Bot Classifier on GitHub
