Implementation of Sentiment Analysis of Customer Reviews

Now we will implement sentiment analysis of customer reviews in the R programming language.

Dataset Used: TripAdvisor Hotel Reviews

Step 1: Install and Load Required Packages

First, we need to install and load the required R packages. These packages provide the necessary functions for text mining, sentiment analysis, visualization, and data manipulation.

R
# Install required packages
install.packages("tm")         # Text Mining package
install.packages("SnowballC")  # Snowball stemmer for text processing
install.packages("syuzhet")    # Sentiment analysis package
install.packages("tidyverse")  # Comprehensive data manipulation package
install.packages("wordcloud")  # Word cloud generation package
install.packages("ggplot2")    # Powerful visualization package


# Load required packages
library(tm)
library(SnowballC)
library(syuzhet)
library(tidyverse)
library(wordcloud)
library(ggplot2)

Step 2: Read and Inspect the Data

Next, we read the CSV file containing the reviews and inspect its structure.

R
# Reading in data
data <- read.csv("tripadvisor.csv", header = TRUE)
# Check structure of csv file
str(data)

Output:

'data.frame':    20491 obs. of  3 variables:
 $ S.No. : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Review: chr  "nice hotel expensive parking got good deal stay hotel anniversary \tarrived late evening took advice previous r"| __truncated__ "ok nothing special charge diamond member hilton decided chain shot 20th anniversary seattle \tstart booked suit"| __truncated__ "nice rooms not 4* experience hotel monaco seattle good hotel n't 4* level.positives large bathroom mediterranea"| __truncated__ "unique \tgreat stay \twonderf"| __truncated__ ...
 $ Rating: int  4 2 3 5 5 5 5 4 5 5 ...

str(data) displays the structure of the dataframe: the type of each column and a preview of its values.

Step 3: Create and Inspect the Corpus

We convert the review column to a character vector and create a corpus, which is a collection of text documents.

R
# Convert review column of dataframe to character vector
corpus <- iconv(data$Review)
# Create corpus from character vector above
corpus <- Corpus(VectorSource(corpus))
# Inspect first five rows
inspect(corpus[1:5])

Output:

<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 5

[1] nice hotel expensive parking got good deal stay hotel anniversary \tarrived late evening took advice previous
reviews did valet parking \tcheck quick easy \tlittle disappointed non-existent view room room clean nice size
\tbed comfortable woke stiff neck high pillows \tnot soundproof like heard music room night morning loud
bangs doors opening closing hear people talking hallway \tmaybe just noisy neighbors \taveda bath products
\tdid not goldfish stay nice touch taken advantage staying longer \tlocation great walking distance shopping
\toverall nice experience having pay 40 parking night \t

Step 4: Clean the Corpus

We clean the text data by converting it to lowercase, removing punctuation, numbers, stopwords, and extra whitespaces, and performing stemming.

R
# Convert the text to lower case
cleaned_corpus <- tm_map(corpus, content_transformer(tolower))
# Remove punctuation
cleaned_corpus <- tm_map(cleaned_corpus, removePunctuation)
# Remove numbers
cleaned_corpus <- tm_map(cleaned_corpus, removeNumbers)
# Remove common English stopwords
cleaned_corpus <- tm_map(cleaned_corpus, removeWords, stopwords('english'))
# Remove extra whitespaces
cleaned_corpus <- tm_map(cleaned_corpus, stripWhitespace)
# Reduce words to their root form (stemming)
cleaned_corpus <- tm_map(cleaned_corpus, stemDocument)

# Inspect first five rows after cleaning
inspect(cleaned_corpus[1:5])

Output:

<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 5

[1] nice hotel expens park got good deal stay hotel anniversari arriv late even took advic previous review valet
park check quick easi littl disappoint nonexist view room room clean nice size bed comfort woke stiff neck high
pillow soundproof like heard music room night morn loud bang door open close hear peopl talk hallway mayb
just noisi neighbor aveda bath product nice goldfish stay nice touch taken advantag stay longer locat great
distanc shop overal nice experi pay park night

Step 5: Sampling the Data

Sample a subset of the reviews to create a smaller, more manageable dataset for analysis.

R
# Sample a subset of the data
set.seed(123)  # for reproducibility
sampled_reviews <- sample(data$Review, 200)  # adjust the sample size as needed
sampled_corpus <- Corpus(VectorSource(iconv(sampled_reviews)))

Step 6: Clean the Sampled Corpus

Clean the sampled corpus similarly to how we cleaned the full corpus.

R
# Clean the sampled corpus
cleaned_sampled_corpus <- tm_map(sampled_corpus, content_transformer(tolower))
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, removePunctuation)
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, removeNumbers)
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, removeWords, 
                                 stopwords('english'))
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, stripWhitespace)
cleaned_sampled_corpus <- tm_map(cleaned_sampled_corpus, stemDocument)

Step 7: Create Sparse Term Document Matrix

Create a Term Document Matrix (TDM), which tm stores in sparse form to keep memory usage manageable. Here we apply TF-IDF weighting, so each entry reflects how distinctive a term is for a document rather than its raw count.

R
# Create sparse term document matrix
tdm_sparse <- TermDocumentMatrix(cleaned_sampled_corpus, 
                                 control = list(weighting = weightTfIdf))
tdm_m_sparse <- as.matrix(tdm_sparse)

Step 8: Analyze Term Frequencies

Convert the matrix to a dataframe of aggregate term weights and display the top-scoring terms. Because we used TF-IDF weighting, the freq column holds summed TF-IDF weights rather than raw counts.

R
# Show frequency of terms
term_freq <- rowSums(tdm_m_sparse)
term_freq_sorted <- sort(term_freq, decreasing = TRUE)
tdm_d_sparse <- data.frame(word = names(term_freq_sorted), freq = term_freq_sorted)
# Show top 5 most frequent words
head(tdm_d_sparse, 5)

Output:

       word     freq
locat locat 2.893330
great great 2.753243
hotel hotel 2.679352
good good 2.373272
stay stay 2.363538

head(tdm_d_sparse, 5) displays the five terms with the highest aggregate TF-IDF weight in the sampled, cleaned corpus.

Step 9: Sentiment Analysis

We use three different methods (syuzhet, bing, afinn) to perform sentiment analysis on the text data.

R
# Convert review column of dataframe to character vector
text <- iconv(data$Review)
# Generating sentiment scores

# Syuzhet method
syuzhet_vector <- get_sentiment(text, method = "syuzhet")
# See first six values of vector
head(syuzhet_vector)
# See summary statistics of vector
summary(syuzhet_vector)

# Bing method
bing_vector <- get_sentiment(text, method = "bing")
# See first six values of vector
head(bing_vector)
# See summary statistics of vector
summary(bing_vector)

# Afinn method
afinn_vector <- get_sentiment(text, method = "afinn")
# See first six values of vector
head(afinn_vector)
# See summary statistics of vector
summary(afinn_vector)

Output:

[1]  3.25 10.70  5.10  8.75  6.30 12.20

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-12.250   3.050   5.650   6.127   8.550  52.750

[1]  3 11  5  9  7  7

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-23.000   2.000   6.000   5.931   9.000  43.000

[1] 14 28  5 21 15 23

  Min. 1st Qu. Median   Mean 3rd Qu.   Max.
-38.00   7.00  14.00  14.29  21.00 107.00

  • head(syuzhet_vector), head(bing_vector), and head(afinn_vector) will display the first few sentiment scores from each method.
  • summary(syuzhet_vector), summary(bing_vector), and summary(afinn_vector) will display the summary statistics of the sentiment scores from each method.

Step 10: Compare Sentiment Methods

We compare the sentiment scores from the three methods.

R
# Compare first row of each vector (sign function creates common scale)
rbind(
  sign(head(syuzhet_vector)),
  sign(head(bing_vector)),
  sign(head(afinn_vector))
)

Output:

A matrix: 3 × 6 of type dbl
1 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1

This displays the sign (+1 positive, 0 neutral, -1 negative) of the first six sentiment scores from each method. Here all three methods agree that the first six reviews are positive.
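Beyond the first six reviews, we can check how often the three lexicons agree in sign across the whole dataset. A minimal sketch, reusing syuzhet_vector, bing_vector, and afinn_vector from Step 9:

R
# Proportion of reviews where all three methods agree on the sign
# (positive, neutral, or negative) of the sentiment score
agreement <- mean(sign(syuzhet_vector) == sign(bing_vector) &
                  sign(bing_vector)    == sign(afinn_vector))
agreement

A value close to 1 would indicate that the choice of lexicon rarely changes the overall verdict on a review, while a lower value suggests the methods disagree on borderline cases.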

Step 11: Visualization of Sentiment Analysis for Customer Reviews in R

Now we will create several visualizations of the sentiment analysis results.

Word Cloud

We create a word cloud to visualize the most frequent terms in the reviews.

R
wordcloud(words = tdm_d_sparse$word, freq = tdm_d_sparse$freq, 
          min.freq = 5, max.words = 100, colors = brewer.pal(8, "Dark2"))

Output:

Word Cloud

The output is a word cloud in which each word's font size is proportional to its weight in the term matrix: higher-weighted words appear larger and more prominent. Colors are drawn from the Dark2 Brewer palette according to word frequency. A word cloud is a quick, intuitive way to spot the most common terms in a corpus and identify patterns at a glance.

Sentiment Histogram

We create a histogram to visualize the distribution of sentiment scores using the Syuzhet method.

R
# Scoring the full dataset can be slow, so we score the sampled reviews instead
text_sampled <- iconv(sampled_reviews)
syuzhet_vector_sampled <- get_sentiment(text_sampled, method = "syuzhet")

ggplot(data.frame(syuzhet_vector_sampled), aes(x = syuzhet_vector_sampled)) + 
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black") + 
  labs(title = "Sentiment Distribution using Syuzhet Method (Sampled Data)", 
       x = "Sentiment Score", y = "Frequency") + 
  theme_minimal()

Output:

Sentiment Histogram

The output is a histogram plot illustrating the distribution of sentiment scores obtained from the Syuzhet sentiment analysis method applied to the sampled dataset. Each bar in the histogram represents a range of sentiment scores, and the height of the bar indicates the frequency of occurrence of sentiment scores within that range. This visualization allows for a quick assessment of the overall sentiment distribution within the sampled text data.

Bar Plot of Emotions

To create a bar plot of emotions, we use the NRC emotion lexicon (via get_nrc_sentiment()) to count emotion words, then plot the totals with ggplot2.

R
nrc_sampled <- get_nrc_sentiment(text_sampled)
nrct_sampled <- data.frame(t(nrc_sampled))
nrcs_sampled <- data.frame(rowSums(nrct_sampled))
nrcs_sampled <- cbind(sentiment = rownames(nrcs_sampled), nrcs_sampled)
rownames(nrcs_sampled) <- NULL
names(nrcs_sampled) <- c("sentiment", "frequency")
nrcs_sampled <- nrcs_sampled %>% mutate(percent = frequency / sum(frequency))
# The first eight rows are the emotions; the last two are the
# positive/negative polarity counts, which we exclude here
nrcs2_sampled <- nrcs_sampled[1:8, ]
colnames(nrcs2_sampled)[1] <- "emotion"

ggplot(nrcs2_sampled, aes(x = reorder(emotion, -frequency), y = frequency, 
                          fill = emotion)) + 
  geom_bar(stat = "identity") + 
  labs(title = "Emotion Distribution (Sampled Data)", x = "Emotion", y = "Frequency") + 
  theme_minimal() + 
  scale_fill_brewer(palette = "Set3")

Output:

Emotion Bar Plot

The output is a bar plot illustrating the distribution of emotions based on sentiment analysis using the NRC lexicon on the sampled dataset. Each bar represents a different emotion, and the height of the bar indicates the frequency of that emotion within the text data. The colors of the bars are determined by the specified color palette, allowing for easy visualization of different emotions.

Bar Plot of Most Popular Words

Creating a bar plot of the most popular words in a text dataset involves visualizing the frequency distribution of words within the corpus.

R
# Bar plot of most popular words

# Take the top 10 terms into a new data frame so the full
# tdm_d_sparse data frame stays intact
top_words <- tdm_d_sparse[1:10, ]
top_words$word <- reorder(top_words$word, top_words$freq)
ggplot(top_words, aes(x = word, y = freq, fill = word)) + 
  geom_bar(stat = "identity") + 
  coord_flip() + 
  labs(title = "Most Popular Words", x = "Word", y = "Frequency") + 
  theme_minimal()

Output:

Bar plot of most popular word

The output is a horizontal bar plot illustrating the frequency of the top 10 most popular words in the text data. Each bar represents a word, and the length of the bar indicates the frequency of that word in the dataset. The colors of the bars are determined by the words themselves, providing visual differentiation between them. This visualization helps in identifying the most common words in the text data.

Pie Chart of Sentiment Distribution

Creating a pie chart of sentiment distribution involves visualizing the proportion of different sentiment categories within a dataset.

R
#Pie Chart of Sentiment Distribution

library(ggplot2)
library(RColorBrewer)

# Create a data frame with sentiment and count
sentiment_df <- data.frame(
  sentiment = c("Positive", "Negative", "Neutral"),
  count = c(sum(syuzhet_vector_sampled > 0), sum(syuzhet_vector_sampled < 0), 
            sum(syuzhet_vector_sampled == 0))
)

# Create a pie chart
ggplot(sentiment_df, aes(x = "", y = count, fill = sentiment)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Sentiment Distribution", x = "", y = "") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

Output:

Pie Chart of Sentiment Distribution

The output is a pie chart illustrating the distribution of sentiment categories within the dataset. Each segment of the pie chart represents a sentiment category (“Positive”, “Negative”, “Neutral”), and the size of each segment corresponds to the count of that sentiment category in the dataset. The colors of the segments are determined by the specified color palette, allowing for easy differentiation between sentiment categories.

From the results above, we can see that the majority of reviews are positive, with trust, joy, and anticipation the most frequently expressed emotions. On the whole, the customers in this TripAdvisor dataset were happy with their hotel stays.

Challenges and Considerations

  1. Handling Subjective Language and Sarcasm: Contextual understanding is crucial to accurately interpret sentiment, especially in cases of nuanced or sarcastic language.
  2. Dealing with Imbalanced Datasets: Customer review datasets often exhibit class imbalance, with a majority of reviews being neutral or positive. Proper sampling techniques or class weighting can mitigate this issue.
  3. Domain-specific Lexicons: Generic sentiment lexicons may not capture domain-specific nuances. Building or fine-tuning lexicons for specific industries or product categories can improve accuracy.
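As a sketch of the third point, syuzhet's get_sentiment() accepts a custom lexicon (a data frame with "word" and "value" columns) via method = "custom". The lexicon below is a small hypothetical example for illustration only; a real domain lexicon would be built or fine-tuned from labeled data.

R
library(syuzhet)

# Hypothetical hotel-domain lexicon: words and hand-assigned valences
hotel_lexicon <- data.frame(
  word  = c("clean", "spacious", "noisy", "dated", "walkable"),
  value = c(1, 1, -1, -1, 0.5)
)

# Score short texts by summing the values of matched lexicon words
get_sentiment(c("clean spacious room", "noisy dated lobby"),
              method = "custom", lexicon = hotel_lexicon)
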
