1. Text Analysis of SafeBoda App Google Play Store Reviews in R

SafeBoda is an app with a social mission: to improve welfare and livelihoods in Africa by empowering people. SafeBoda also works on providing value to consumers and drivers through additional financial services, payments and other on-demand services to keep Africa moving forward.

In this notebook, I take a deep dive into the reviews to uncover what customers think about the application, what they like and dislike about it, and to surface some patterns.

2. Source of the Data

The dataset is freely available on the Google Play Store and was scraped with Beautiful Soup, a Python library for scraping websites, then loaded into R for further text analysis. The scraping script is not included here because this blog is focused on R, but you can achieve the same with rvest.
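
If you want to stay entirely in R, the same idea looks roughly like the sketch below. This is only a sketch: the app id in the URL and the ".review-text" CSS selector are assumptions, and the live Play Store page is rendered with JavaScript, so a static read_html() call may return nothing and you may need RSelenium or the page's JSON endpoints instead.

library(rvest)

## Sketch only: the selector is hypothetical and the Play Store page is
## JavaScript-rendered, so this may need RSelenium in practice
page <- read_html("https://play.google.com/store/apps/details?id=com.safeboda")
review_text <- page %>%
  html_nodes(".review-text") %>%   ## hypothetical CSS selector
  html_text(trim = TRUE)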

3. Load the Libraries

Some of the Packages used in the Analysis

library(tidyverse)  ## data wrangling and plotting
library(Amelia)     ## missingness map
library(rebus)      ## readable regular expressions
library(bbplot)     ## BBC-style ggplot themes
library(tidytext)   ## text mining
library(tidymodels) ## modeling
library(lubridate)  ## working with dates and times
library(patchwork)  ## patching multiple graphs together
library(ggthemes)   ## extra ggplot themes
library(knitr)      ## report rendering and tables
library(emo)        ## for emojis

4. Read in the Dataset

Let’s read the dataset into R. Some columns, like user names and user images, will be left out for obvious privacy reasons; they also won’t be necessary in this analysis, but you can always find them on the Google Play Store 🔏

reviews <- readr::read_csv(file.choose())  %>% 
    select(-c(reviewId, userName, userImage, appId))

reviews_copy <- reviews

5. Quick Glimpse of the Dataset

glimpse(reviews_copy)
## Rows: 14,290
## Columns: 8
## $ content              <chr> "I don't understand why my account is blacklis...
## $ score                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ thumbsUpCount        <dbl> 12, 0, 0, 1, 74, 28, 134, 1, 15, 3, 11, 0, 6, ...
## $ reviewCreatedVersion <chr> "3.3.14", "3.3.14", "3.3.16", "3.3.15", "3.3.1...
## $ at                   <dttm> 2020-06-04 13:44:25, 2020-06-26 10:48:06, 202...
## $ replyContent         <chr> "Hello Michael Alvin, we sincerely apologize f...
## $ repliedAt            <dttm> 2020-06-08 08:37:42, 2020-06-27 09:53:50, 202...
## $ sortOrder            <chr> "most_relevant", "most_relevant", "most_releva...

6. Missingness Map

Some of the columns are missing observations for obvious reasons, e.g. the company doesn’t reply to every single review, so the reply columns will have missing values.

missmap(reviews_copy, col = c("Black", "Yellow"))
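
If you prefer exact counts to a plot, a quick dplyr summary of the missing values per column tells the same story (a small sketch using the columns already loaded):

## Count the missing values in each column
reviews_copy %>% 
  summarise(across(everything(), ~ sum(is.na(.x))))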

7. Some Basic Cleanup and Processing

Let’s extract the year, month and major version number into separate columns; these will be helpful for further analysis down the road.

## This will come in handy when I'm modelling
pattern <- DGT %R% optional(DGT)

reviews_processed <- reviews_copy %>% 
        # na.omit(reviewCreatedVersion) %>% 
        mutate(version_extracted = str_extract(reviewCreatedVersion, pattern = pattern)) %>%
        mutate(version_nmbr = as.numeric(version_extracted)) %>% 
        mutate(year = year(at),
               month = month(at, label = TRUE), 
               week_day = wday(at, label = TRUE))
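
As a quick sanity check, the rebus pattern DGT %R% optional(DGT) matches one digit optionally followed by a second digit, so it pulls out just the major version number:

## The pattern captures the leading (major) version number
str_extract(c("3.3.14", "12.1.0"), pattern)
## [1] "3"  "12"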

8. What Are the Most Commonly Used Words in the Reviews?

Top 30 most common words in the reviews

Stop words, and also words like “Safe Boda”, “Safe” and “Boda”, are filtered out as they are way too common and don’t bring much value to this analysis ✂️

reviews_processed %>% 
  unnest_tokens(word, content) %>% 
  anti_join(stop_words, by="word") %>% 
  filter(!word %in% c("app", "safe", "boda", "safeboda")) %>% 
  count(word, sort = TRUE) %>% 
  head(30) %>% 
  mutate(word = fct_reorder(word, n)) %>% 
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  labs(x="", y="Count")

9. What Are the Most Common Positive and Negative Words?

Using the Bing lexicon, each word is classified as positive or negative; these are the top 20 most common negative and positive words.

reviews_processed %>% 
  unnest_tokens(word, content) %>% 
  inner_join(get_sentiments("bing"), by = "word") %>% 
  anti_join(stop_words, by="word") %>% 
  select(word, sentiment) %>% 
  count(word, sentiment, sort = TRUE) %>% 
  ungroup() %>% 
  group_by(sentiment)  %>% 
  top_n(20) %>% 
  ungroup() %>% 
  mutate(word = fct_reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") + 
  coord_flip() +
  labs(y = "Contribution to Sentiment", x="")

The word “safe” tops the positive words, though that is clearly because of the company name.

9.1 Which Words Contribute to the Sentiment Scores?

What exactly contributes most to the different sentiments like anger, disgust, trust and joy?

reviews_processed %>%
    unnest_tokens(word, content) %>% 
    anti_join(stop_words, by="word") %>% 
    inner_join(get_sentiments("nrc"), by = "word") %>% 
    # Count by word and sentiment
    count(word, sentiment) %>% 
    filter(sentiment %in% c("anger", "disgust", "trust", "joy")) %>% 
    # Group by sentiment
    group_by(sentiment) %>%
    # Take the top 10 words for each sentiment
    top_n(10) %>%
    ungroup() %>%
    mutate(word = reorder(word, n)) %>%
    # Set up the plot with aes()
    ggplot(aes(word, n, fill=sentiment)) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ sentiment, scales = "free") +
    coord_flip() +
    theme_fivethirtyeight()

“Money” looks to be the biggest driver of the anger sentiment.

9.2 How Sentiment Changes Over Time

How have the different sentiments fared over the years? Let’s look at positive, negative, trust and anger.

sentiment_by_time <- reviews_processed %>%
    unnest_tokens(word, content) %>% 
    anti_join(stop_words, by="word") %>% 
    # Define a new column using floor_date()
    mutate(date = floor_date(at, unit = "3 months")) %>%
    # Group by date
    group_by(date) %>%
    mutate(total_words = n()) %>%
    ungroup() %>%
    # Implement sentiment analysis using the NRC lexicon
    inner_join(get_sentiments("nrc"), by="word")


sentiment_by_time %>%
    # Filter for positive and negative words
    filter(sentiment %in% c("positive", "negative", "trust", "anger")) %>%
    # Count by date, sentiment, and total_words
    count(date, sentiment, total_words) %>%
    ungroup() %>%
    mutate(percent = n / total_words) %>%
    # Set up the plot with aes()
    ggplot(aes(date, percent, color = sentiment))+
    geom_line(size = 1.5) +
    geom_smooth(method = "lm", se = FALSE, lty = 2) +
    expand_limits(y = 0) +
    theme_fivethirtyeight()

Positive sentiment and trust from customers have been growing since 2017. 👍

9.3 What is the Average Rating for a Word?

These are words that appeared more than 100 times.

## Best avg Rating
reviews_processed %>%
  unnest_tokens(word, content) %>% 
  anti_join(stop_words, by="word") %>% 
  group_by(word) %>% 
  summarize(avg_rating = mean(score, na.rm = TRUE),
            n = n()) %>%
  filter(n > 100) %>% 
  arrange(desc(avg_rating))
## # A tibble: 50 x 3
##    word       avg_rating     n
##    <chr>           <dbl> <int>
##  1 perfect          4.92   233
##  2 fantastic        4.91   115
##  3 excellent        4.90   334
##  4 amazing          4.82   218
##  5 wonderful        4.77   122
##  6 convenient       4.74   204
##  7 transport        4.72   166
##  8 cheap            4.72   134
##  9 reliable         4.68   221
## 10 easy             4.67   265
## # ... with 40 more rows
## Worst avg Rating
reviews_processed %>%
  unnest_tokens(word, content) %>% 
  anti_join(stop_words, by="word") %>% 
  group_by(word) %>% 
  summarize(avg_rating = mean(score, na.rm = TRUE),
            n = n()) %>%
  filter(n > 100) %>% 
  arrange(avg_rating)
## # A tibble: 50 x 3
##    word     avg_rating     n
##    <chr>         <dbl> <int>
##  1 error          1.83   186
##  2 version        2.03   109
##  3 phone          2.18   149
##  4 update         2.24   224
##  5 location       2.25   233
##  6 download       2.25   102
##  7 rider          2.53   197
##  8 slow           2.61   158
##  9 takes          2.70   121
## 10 trip           2.73   105
## # ... with 40 more rows

Unsurprisingly, words like “error” and “location” get a very low average rating. 💩

PART 2

So far we’ve treated words as individual units and considered their relationship to sentiment. However, many interesting text analyses are based on the relationships between words, e.g. examining which words tend to immediately follow others.

10. Visualizing a Network of Bigrams

Let’s visualize all of the relationships among words simultaneously, rather than just the top few at a time.
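
First, a tiny illustration of how unnest_tokens() splits text into bigrams; the sentence below is made up, not taken from the reviews:

## Toy example: tokenising a single made-up sentence into bigrams
tibble(content = "the app takes forever to load") %>% 
  unnest_tokens(bigrams, content, token = "ngrams", n = 2)
## # A tibble: 5 x 1
##   bigrams      
##   <chr>        
## 1 the app      
## 2 app takes    
## 3 takes forever
## 4 forever to   
## 5 to load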

library(igraph)
library(ggraph)
library(widyr)

set.seed(12345)

bigrams_ratings <- reviews_processed %>%
  unnest_tokens(bigrams, content, token = "ngrams", n = 2) %>% 
  select(bigrams, everything())
  # sample_n(10) %>% 
  # pull(bigrams)

bigrams_ratings_separated <- bigrams_ratings %>% 
  separate(bigrams, into = c("word1", "word2"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>% 
  count(word1, word2, sort = TRUE)

bigram_graph <- bigrams_ratings_separated %>% 
  filter(n > 10) %>% 
  graph_from_data_frame()


a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

“App” is one of the common centers of nodes and is often followed by words like “amazing”, “lovely”, “cool” and “beautiful”.

We also see pairs or triplets along the outside that form common short phrases like “takes forever”, “unknown error” and “code verification”.

10.1 Words Preceded by Not, No, Never, Without

By performing sentiment analysis on the bigram data, we can examine how often sentiment-associated words are preceded by “not” or other negating words like “no”, “never” and “without”.

negation_words <- c("not", "no", "never", "without")
AFINN <- get_sentiments("afinn")
bigrams_ratings %>%
  separate(bigrams, into = c("word1", "word2"), sep = " ") %>% 
  filter(word1 %in% negation_words)  %>%   
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word1, word2, value, sort = TRUE) %>% 
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution))) %>%
  head(30) %>% 
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(word2, n * value, fill = n * value > 0)) +
  geom_col(show.legend = FALSE) +
  xlab("Words preceded by \"not\"") +
  ylab("Sentiment value * number of occurrences") +
  coord_flip() +
  labs(title = "Words Preceeded by NOT...")

  # facet_wrap(~word1, ncol = 2)

The bigrams “not good” and “not happy” are overwhelmingly the largest causes of misidentification, making the text seem much more positive than it is, while phrases like “not bad” and “not problem” sometimes make the text seem more negative than it is.
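
One way to act on this, sketched below rather than something done in the original analysis, is to flip the AFINN value whenever the sentiment word is preceded by a negation word:

## Sketch: reverse the AFINN value for words preceded by a negation word
bigrams_ratings %>%
  separate(bigrams, into = c("word1", "word2"), sep = " ") %>% 
  inner_join(AFINN, by = c(word2 = "word")) %>%
  mutate(value = if_else(word1 %in% negation_words, -value, value))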

10.2 Word Cloud

Text analysis is never complete without a word cloud. 😄

library(wordcloud)

reviews_processed %>%
  unnest_tokens(word, content) %>% 
  anti_join(stop_words, by="word") %>% 
  count(word) %>%
  with(wordcloud(word, n, max.words = 200))
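
A nice variation, assuming the bing lexicon and the reshape2 package, is a comparison cloud that splits the words by sentiment; this is only a sketch and was not part of the original analysis:

library(reshape2)

## Sketch: positive vs negative comparison cloud using the bing lexicon
reviews_processed %>%
  unnest_tokens(word, content) %>% 
  anti_join(stop_words, by = "word") %>% 
  inner_join(get_sentiments("bing"), by = "word") %>% 
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("firebrick", "forestgreen"), max.words = 100)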

Future Work

  1. A Sentiment Model to Predict a Rating Based on the Content of the Review.

  2. Work on an Interactive Web Application to bring the Analysis to Life for any Application on the Google Play Store.

  3. An R Package for easier and further Analysis.