A function for sentiment analysis (on Twitter data, for example)

# based on https://github.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107

score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
    require(plyr)
    require(stringr)

    # we got a vector of sentences. plyr will handle a list or a vector as an "l" for us
    # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
    scores = laply(sentences, function(sentence, pos.words, neg.words) {

        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)
        # and convert to lower case:
        sentence = tolower(sentence)

        # split into words. str_split is in the stringr package
        word.list = str_split(sentence, '\\s+')
        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)

        # compare our words to the dictionaries of positive & negative terms
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)

        # match() returns the position of the matched term or NA
        # we just want a TRUE/FALSE:
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)

        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score = sum(pos.matches) - sum(neg.matches)

        return(score)
    }, pos.words, neg.words, .progress=.progress )

    scores.df = data.frame(score=scores, text=sentences)
    return(scores.df)
}
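
To see what the scoring step does, here is a minimal sketch in base R that walks a single sentence through the same cleaning and matching logic. The two tiny word lists and the example sentence are made up for illustration; they are not part of the opinion lexicon.

```r
# Toy lexicons (hypothetical, for illustration only)
pos.words <- c("good", "great", "love")
neg.words <- c("bad", "awful", "hate")

sentence <- "I love this great city, but the traffic is awful!"

# Same cleaning steps as in score.sentiment(): strip punctuation,
# control characters and digits, then convert to lower case
clean <- gsub("[[:punct:]]", "", sentence)
clean <- gsub("[[:cntrl:]]", "", clean)
clean <- gsub("[[:digit:]]+", "", clean)
clean <- tolower(clean)

# Split on whitespace (base R strsplit() instead of stringr::str_split())
words <- unlist(strsplit(clean, "[[:space:]]+"))

# match() returns a position or NA; !is.na() turns that into TRUE/FALSE
pos.hits <- !is.na(match(words, pos.words))
neg.hits <- !is.na(match(words, neg.words))

# Two positive words ("love", "great") minus one negative word ("awful")
score <- sum(pos.hits) - sum(neg.hits)
score
```

Note that each word is counted on its own, so negations such as "not good" still count as positive with this approach.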

Analysing Twitter data with the twitteR package

R code:

library(twitteR)
library(tm)
library(wordcloud)
library(stringr)
library(plyr)

#Log in to Twitter (you can find these details on https://apps.twitter.com)
#Replace the strings between "" with the codes from your own Twitter account.
consumer_key <- "Consumer Key (API Key)"
consumer_secret <- "Consumer Secret (API Secret)"
access_token <- "Access Token"
access_secret <- "Access Token Secret"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

#Some examples
userTimeline("utwente", n=10) # tweets from the timeline of a specific user
searchTwitter("utwente", n = 10) # search for keywords

#Discover twitteR object
?searchTwitter
tweetsList <- searchTwitter("utwente", n = 10)
tweet <- tweetsList[[1]]
tweet$getScreenName()
tweet$getText()
tweet$favoriteCount
tweet$retweetCount

#Harvest tweets based on keyword.
mach_tweets <- searchTwitter("#prayforparis", n=1500, lang="en")

#Extract the text from the tweets in a vector
#See http://davetang.org/muse/2013/04/06/using-the-r_twitter-package/ for an
#approach in Windows.
mach_text <- sapply(mach_tweets, function(x) x$getText())
# note: the "utf-8-mac" encoding is only available on macOS
mach_text <- iconv(mach_text, to = "utf-8-mac")

###Some initial cleaning
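# Remove URLs: drop everything from "http://", "https://" or "ftp://" up to the next whitespace
# (a greedy ".*" here would also delete all text that follows the URL)
mach_text <- gsub("(f|ht)tps?://[^[:space:]]+", "", mach_text, ignore.case = TRUE)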

# Remove @UserName
#mach_text <- gsub("@\\w+", "", mach_text)

# Create a corpus
mach_corpus <- Corpus(VectorSource(mach_text))

# create a term-document matrix, applying some transformations
tdm <- TermDocumentMatrix(mach_corpus,
    control = list(removePunctuation = TRUE,
        stopwords = c("prayforparis", "paris", "http", "https", stopwords("english")),
        removeNumbers = TRUE, tolower = TRUE))

## further exploration of the term-document matrix
# frequent words
findFreqTerms(tdm, lowfreq = 100)

# associations: which terms are correlated with "syria"?
findAssocs(tdm, terms = "syria", corlimit = 0.3)

## define tdm as matrix
tdMatrix <- as.matrix(tdm)

# get word counts in decreasing order
word_freqs <- sort(rowSums(tdMatrix), decreasing=TRUE)

# create a data frame with words and their frequencies
df <- data.frame(word=names(word_freqs), freq=word_freqs)

# plot wordcloud
pdf("wcParis.pdf")
wordcloud(df$word, df$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"), min.freq = 20)
dev.off()

#### Sentiment analysis ####
# based on https://github.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107
# download Opinion Lexicon (Hu and Liu, KDD-2004) http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar

listPosWords <- scan("positive-words.txt", what = "character", comment.char = ";")
listNegWords <- scan("negative-words.txt", what = "character", comment.char = ";")

sentScoreTweets <- score.sentiment(mach_text, listPosWords, listNegWords, .progress = "text")
hist(sentScoreTweets$score)
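
Besides the histogram, it can help to tabulate how many tweets come out negative, neutral or positive. The sketch below is one possible way to do that; the `scores` vector is made-up data that stands in for `sentScoreTweets$score`, so the numbers say nothing about the real tweets.

```r
# Hypothetical scores, standing in for sentScoreTweets$score
scores <- c(-2, -1, 0, 0, 1, 1, 3)

# Label each tweet by the sign of its score
sentiment <- factor(sign(scores),
                    levels = c(-1, 0, 1),
                    labels = c("negative", "neutral", "positive"))

# Counts and proportions per category
counts <- table(sentiment)
counts
round(prop.table(counts), 2)
```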

Time & location of the next meeting

Because of the low response, we decided to change the location of the next meeting (so you no longer have to cycle to far-away Boekelo).

The meeting will now take place on 15 December (next Monday) at 12 o’clock in the Cubicus building, room C232a.

The February meeting has been postponed to 26 February (time, place and topic to be announced).

We hope to see you then!

Next TRUG meetings!

I am very happy to finally (!) announce the next two TRUG meetings!

At the next meeting, Stéphanie van den Berg will give a presentation on a pipeline for analysing twin data, using the R package knitr. The pipeline automatically reads in the data, runs the analysis and creates a LaTeX file with the results of the analysis. The LaTeX file is then automatically converted into a PDF file.

Depending on how many of you are able to come, the meeting will take place on December 15th (Monday) or December 18th (Thursday). Please indicate your preference:

We will inform you by e-mail about the exact date and location as soon as there is a clear preference for one of the dates. As usual, the meeting will take place at Stéphanie’s little farm in Boekelo (time to be announced).

The meeting after that will take place on February 12th (Thursday). The presenter will be Sukaesi Marianti (topic to be announced).

There is of course still a lot of time left to make up your mind, but if you already know that you can (or cannot) attend, please let us know:

The dplyr package

In the June ’14 meeting of the Twente R User Group, Martin Schmettow gave a presentation on the R package ‘dplyr’. The package runs fast, can deal transparently with remote data and leads to readable code. Furthermore, it interfaces well with the plyr and ggplot packages. You can find the slides of the presentation here: Dplyr package.
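
To give an impression of the readable code that dplyr leads to, here is a small sketch that is not taken from the slides. It uses the `mtcars` data set that ships with R; the threshold of 100 hp and the summary columns `n_cars` and `mean_mpg` are arbitrary choices for illustration.

```r
library(dplyr)

# Chain dplyr verbs with the pipe: each step reads as one action
by_cyl <- mtcars %>%
  filter(hp > 100) %>%                        # keep cars with more than 100 hp
  group_by(cyl) %>%                           # one group per number of cylinders
  summarise(n_cars = n(),                     # number of cars in each group
            mean_mpg = mean(mpg)) %>%         # average fuel economy per group
  arrange(desc(mean_mpg))                     # most economical group first

by_cyl
```

Each verb takes a data frame and returns a data frame, which is why the steps can be chained in this way.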

Next meeting: The dplyr package

At the next meeting, on Wednesday 25 June 2014, Martin Schmettow will give a presentation on the R package dplyr. The dplyr package provides useful tools for efficiently manipulating datasets in R. For those who are familiar with the package plyr, dplyr is the ‘next iteration’ of plyr. It focuses on data frames and is faster and easier to use than plyr.

We are meeting at Stéphanie’s little farm in Boekelo at 17.30. A group of TRUG members is cycling to Boekelo together. If you want to join us, we are meeting at the entrance of the Cubicus building at 17.00.