Wordcloud on Ashes Series

ASHES – Desire for Domination

“England have only three major problems. They can’t bat, they can’t bowl and they can’t field.” – Martin Johnson (England’s tour of Australia 1986-7)

With the ongoing Ashes series gathering steam, I decided to marry analytics with data to gather some interesting insights. For this, I pulled tweets carrying the hashtag #ashes from Twitter. The data processing was done in R, a statistical software environment.

A wordcloud, or tag cloud, shows how often words occur in a text document using an intuitive and easy visualization: the larger the text, the greater the frequency. Words drawn at the same size occur equally often. Color tracks frequency only when it is not randomized; the wordcloud() call in this post sets random.color, so color carries no meaning here.

[Image: Ashes]

Technical Details & Code
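As a small illustration of the counts behind a wordcloud, here is a minimal base R sketch that tallies how often each word occurs. The sample sentences are invented for the example and are not real tweets.

```r
# Hypothetical sample text standing in for a document (not real tweets)
docs <- c("ashes test day one",
          "england bat first in ashes test",
          "ashes ashes ashes")

# Split on whitespace and flatten into one vector of words
words <- unlist(strsplit(docs, "\\s+"))

# Count occurrences and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

The most frequent word ends up first in freq, and it is the one a wordcloud would draw largest.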

I have broken the overall process down into a few steps for ease of reading.

  1. The text mining program uses five R packages: ROAuth, twitteR, tm, wordcloud and RJSONIO. Install the requisite packages and get authorized to access content from Twitter.
  2. The authorization process completes when the program asks you to enter the ‘token’ (the PIN that Twitter gives you).
  3. Next, pull the tweets with the specified hashtag, setting the number of tweets that you want.
  4. The data then needs to be cleaned, after which the minimum frequency and maximum word limits can be set to plot the wordcloud.
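To show what the cleaning in step 4 amounts to, here is a rough base R sketch that does not need the tm package. The tweets and the short stopword list are made up for illustration; tm's own stopword list is much longer.

```r
# Hypothetical tweets, invented for illustration
tweets <- c("What a catch! #Ashes",
            "England need 200 to win the #ashes.",
            "The Ashes: day 3")

# A tiny stand-in stopword list (assumption; not tm's list)
stop_words <- c("a", "the", "to", "what")

clean <- tolower(tweets)                    # lower-case everything
clean <- gsub("[[:punct:]]", "", clean)     # strip punctuation, including '#'
words <- unlist(strsplit(clean, "\\s+"))    # split into individual words
words <- words[!words %in% stop_words]      # drop the stopwords

# Frequency table of the cleaned words, most frequent first
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

Without the lower-casing and punctuation steps, "#Ashes", "#ashes." and "Ashes:" would be counted as three different words.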

Please note: although the code requests 1500 tweets, the Twitter API returned only 799.

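#Installing Packages
install.packages(c("ROAuth", "twitteR", "tm", "wordcloud", "RJSONIO"))
#Loading Packages
library(ROAuth)
library(twitteR)
library(tm)
library(wordcloud)
library(RJSONIO)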
#Registering on Twitter API
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
#Important step for Windows users
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
#Follow the link https://twitter.com/apps/new to get your consumer key and secret.
consumerKey <- "Enter your Consumer Key"
consumerSecret <- "Enter your consumer secret key"
Cred <- OAuthFactory$new(consumerKey = consumerKey, consumerSecret = consumerSecret,
                         requestURL = reqURL, accessURL = accessURL, authURL = authURL)
#The handshake prints a URL; open it, record the PIN given to you and provide it on the console
Cred$handshake(cainfo = "cacert.pem")
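save(Cred, file = "twitter_auth.Rdata")
#Register the credentials for this session (older twitteR releases that authenticate through ROAuth)
registerTwitterOAuth(Cred)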

#Extracting tweets
Ashes <- searchTwitter('#ashes', n = 1500, lang = 'en', cainfo = "cacert.pem")
#Keep only the text of each returned status object
Ashes <- sapply(Ashes, function(x) x$getText())
#Create a corpus
Ashes_corpus <- Corpus(VectorSource(Ashes))
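#Cleaning of data
#content_transformer() keeps the corpus structure intact with tm 0.6 and later
Ashes_corpus <- tm_map(Ashes_corpus, content_transformer(tolower))
Ashes_corpus <- tm_map(Ashes_corpus, removePunctuation)
Ashes_corpus <- tm_map(Ashes_corpus, removeWords, stopwords("english"))
#Selecting color palettes for wordcloud
pal2 <- brewer.pal(8, "Pastel2")
#Plot words that occur at least 5 times, in random order and with random colors
wordcloud(Ashes_corpus, scale = c(4, 1), min.freq = 5, random.order = TRUE, random.color = TRUE, colors = pal2)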
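
To get a feel for what min.freq and scale do in the wordcloud() call above, here is a minimal base R sketch. The word counts are invented, and the linear mapping from frequency to text size is an assumption made for illustration, not a statement of how the wordcloud package computes sizes internally.

```r
# Hypothetical word counts, invented for illustration
freq <- c(ashes = 120, england = 45, australia = 40, wicket = 12, rain = 3)

min_freq <- 5       # same threshold as min.freq = 5 in the call above
scale <- c(4, 1)    # largest and smallest text size, as in scale = c(4, 1)

# Keep only the words that occur at least min_freq times
kept <- freq[freq >= min_freq]

# Assumed linear mapping from relative frequency to text size
size <- scale[2] + (scale[1] - scale[2]) * kept / max(kept)
print(round(size, 2))
```

Under this assumed mapping the most frequent word is drawn at size 4, rarer words shrink toward size 1, and "rain" is dropped because it falls below the threshold.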


The following resources have been used for this post.

1. Tweetsent

2. Mining Twitter with R

3. One R tip a day

