Chapter 14 Text Mining

As you dive deeper into fundraising analytics, you may find you do not have the luxury of datasets that have already been organized, cleaned, and neatly prepared for you. Embracing this reality, you will first need to acquire data relevant to your purpose (why) and business questions before you begin your data analysis.

In addition to loading data from spreadsheets and databases (see Loading Data), the web offers a wealth of publicly available data for acquisition, preparation, exploration, analysis, and translation into actionable insights and recommendations using any of the data analytics methods previously covered.

R, with the installation of a couple additional packages, can connect directly to the web and harness the power of public data to further qualify, contextualize, and enhance your capacity to build data-driven solutions. In addition, the creative use of public data sources may inspire you to investigate new questions you might not have previously considered during your original analysis and research.

Let’s explore some text mining recipes using R packages that highlight innovative ways to rethink, redesign, and elevate traditional deliverables and information collection processes, including prospect reports, benchmark reports, and contact reports.

14.1 Bio Generation

Forbes is a household name synonymous with success, prominence, and wealth for both individuals and businesses.

As Bresler (2016) stated in the forbesListR package documentation, “Forbes is the preeminent maintainer of lists covering a wide range of business related topics including sports, entertainment, individual wealth, and locations. The lists are chock-full of phenomenal data that can be analyzed, visualized, and merged with other data.” Specifically, the forbesListR package provides “an easy way to access the data contained in lists maintained by the fine folks at Forbes.”

You can use the following recipe to extract wealthy individual entities from the Forbes 400 list (2017) and dynamically generate a prospect bio, including net worth, rank, geography, and so on. First, let’s load the libraries.

# Load forbesListR, dplyr, rvest, stringr
library(forbesListR)
library(dplyr)
library(rvest)
library(stringr)

The following recipes are for demonstration purposes only. While we want to show how you can extract data from the web, we don’t endorse any behavior that violates the terms of service or diminishes the intellectual property of any website. Read and follow the terms of service and copyright guidelines of each website.

Then, using the get_year_forbes_list_data function, get all of the Forbes 400 data.

forbes400_data <- get_year_forbes_list_data(
  list = "Forbes 400", year = 2017)

Next, let’s add a row number as an ID, and let’s take only five of the 400 to make sure everything works.

forbes400_data <- mutate(forbes400_data, uniqueid = 1:n()) 
forbes400_data_ss <- head(forbes400_data, 5)

Let’s use the rvest library functions to download the Forbes bios.

get_bio <- function(url) {
  read_html(url) %>% 
    html_nodes(css = "#contentwrapper > 
               div.content > 
               div.featured > 
               div.featured-text > 
               ul") %>%
    html_text(trim = TRUE) %>%
    gsub("[\n\t]", "", .)
}

# url.bio.forbes holds each person's Forbes bio-page URL
forbes400_data_ss <- rowwise(forbes400_data_ss) %>% 
  mutate(bio = get_bio(url.bio.forbes))

Then, let’s download the photos of people in the Forbes list.

if (!(dir.exists("imgs"))) dir.create("imgs")

forbes400_data_ss %>%  
  do(d = download.file(url = .$url.image,
                       destfile = paste0("imgs/",
                                         .$uniqueid, ".jpg")))
Next, we will prepare LaTeX markup to create our final bio PDF file. LaTeX is a typesetting markup language and system for creating high-quality, great-looking documents.

sanitizeLatexS <- function(str) {
  gsub('([#$%&~_\\^\\\\{}])', '\\\\\\1', str, perl = TRUE)
}
# Create a bio page for every person
forbes400_data_ss %>% 
  arrange(rank) %>%  
  mutate(
    mrkup = paste0(
      "\\section{", sanitizeLatexS(name), "}\n",
      "\\includegraphics[scale=0.6]{imgs/", uniqueid, "}\\\\\n",
      "\\textbf{Forbes Rank:} ", rank, "\\\\\n",
      "\\textbf{Net Worth:} \\$", net_worth.millions, "M\\\\\n",
      "\\textbf{Residence:} ", state,  "\\\\\n",
      "\\textbf{Forbes Bio:}", "\\\\\n", 
      sanitizeLatexS(str_sub(bio, end = 1000)), "\\\\\n",
      "\\textbf{Source:} \\url{", url.bio.forbes, "}\\\\\n")) %>%
  do(prnt = writeLines(text = .$mrkup,
                       con = "./forbesbios.tex"))

file_conn <- file('forbesbios.tex', 'r+') 
file_contents <- readLines(file_conn) 
writeLines(
  text = c(
    paste("\\documentclass{article}",
          "\\usepackage{graphicx}",
          "\\usepackage[urlbordercolor={1 1 1},urlcolor=red,colorlinks]{hyperref}",
          "\\begin{document}", sep = "\n"), 
    file_contents, 
    "\\end{document}"), 
  con = file_conn) 
close(file_conn)

Finally, we compile the LaTeX code to create a PDF.

tools::texi2pdf("forbesbios.tex")

Once you run the complete code, the final PDF will look like Figure 14.1.


FIGURE 14.1: Forbes bio example

14.2 Endowment Benchmarking

In higher-education fundraising, endowment figures are often used as a proxy or baseline measurement of institutional resource levels, with the assumption that larger endowment figures translate into increased support for educational programs, research activities, and so on. You may already be familiar with or responsible for institutional benchmarking survey requests. While the purpose of benchmark reporting varies, these surveys usually involve a comparative analysis of peer institutional metrics for various purposes, including campaign-planning activities.

Suppose you receive a time-sensitive inquiry to conduct a benchmarking analysis of endowment levels of peer educational institutions across the U.S. Rather than searching for this information one by one, you would presumably prefer to find a more efficient solution to reduce the time and effort required to complete this project.

In the following recipe, we will use R to connect to the web and retrieve endowment information for various higher-education institutions.

Let’s install and load the XML and httr packages.

# Install XML and httr packages
install.packages("XML", repos = "https://cran.rstudio.com")
install.packages("httr", repos = "https://cran.rstudio.com")

Now, let’s load the XML and httr packages to connect R to the web.

# Load XML package
library(XML)

# Load httr package
library(httr)

Next, let’s read publicly available information directly from Wikipedia:

# Extract HTML data from a webpage
url <- "https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_the_United_States_by_endowment"
webpage <- GET(url, user_agent("httr"))

Next, let’s extract and store this information into a dataframe:

endowments <- readHTMLTable(content(webpage, "text"), header = TRUE)

Finally, let’s display endowment information from 2007 to 2016 for the first 10 colleges.

# Display First 10 Colleges Endowment Figures 2007-2016
endows <- endowments[[1]]
head(endows, n = 10)

We need to perform further clean-up to remove the dollar signs and other extra characters.

remove_unnecessary_chars <- function(x){
  ret <- gsub(x = x, pattern = "\\[\\d+\\]", replacement = "")
  ret <- gsub(x = ret, pattern = "[\r\n]", replacement = "")
  gsub(x = ret, pattern = "\\$", replacement = "")
}

endows <- mutate_all(endows,
                     .funs = funs(remove_unnecessary_chars)) %>%
  mutate_at(.vars = -1,
            .funs = as.numeric) %>%
  gather(key = "year", value = "value", -1) %>%
  mutate(year = as.numeric(str_trim(str_replace(string =  .$year, 
                                       pattern = "\\(billions USD\\)",
                                       replacement = "")))) %>%
  mutate(value = value * 10^9)

In this recipe, we used R to pull publicly available endowment information from Wikipedia, which allows you to spend more time focusing on how to most effectively summarize and present your insights rather than having to look up this information manually.

Modify the existing recipe to retrieve your institution’s endowment figures as well as those of its peer institutions. If you don’t work in higher education, modify the code to pull down the endowment information of your alma mater or of a university you admire.

Let’s continue to explore how we can use R to extract web-based text information to generate actionable insights and recommendations.

14.3 Geo-Coded Prospect Identification

Suppose you have a list of over 30,000 constituents to prioritize for prospect research and lead generation for frontline fundraisers. Let’s also imagine you have outdated and sparsely populated wealth rating information in your current database. While your first proposed solution might be to conduct a bulk wealth screening, let’s also assume there is limited capacity and resources, which motivates a different approach.

To manage an inquiry of this scale or beyond, you will need a practical, efficient, and repeatable method to identify the best prospects. To put this inquiry into perspective, 30,000 constituents at a rate of 30 minutes of research per constituent translates to 900,000 minutes, which is 15,000 hours or approximately 7.8 years of work… and, hence, the motivation to find a practical solution.
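The arithmetic above is easy to reproduce in a few lines of R (the 40-hour week and 48-week year used to convert hours into work-years are our assumptions):

```r
constituents <- 30000
minutes_each <- 30

total_minutes <- constituents * minutes_each   # 900,000 minutes
total_hours <- total_minutes / 60              # 15,000 hours
work_years <- total_hours / (40 * 48)          # at 40 h/week, 48 weeks/year
work_years
#> [1] 7.8125
```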

Assuming you’ve already explored the various exploratory data analysis (EDA) and machine learning (ML) methods previously covered, another solution is to acquire public data that can be used as a proxy for wealth capacity. One potential proxy for wealth capacity is the median real estate value associated with a constituent’s residential ZIP code, which is driven by census data.

While using median price ZIP code data as a proxy for wealth capacity estimates is notably sensitive to the accuracy and integrity of the constituent address information in your database, it provides a helpful way to prioritize and segment prospects. This approach assumes that prospects with higher wealth capacity tend to live on average in more expensive geographic areas relative to other prospects. This is certainly not always the case, but it is a useful starting point and filter for prospect research and development.

A quick search for “wealthiest zips 2017” returns a Forbes article “Full List: America’s Most Expensive ZIP Codes 2017,” which we can use to identify affluent constituent geographies. If you are looking for a simple web-based text mining solution to acquire this data, you may prefer to use the rvest package developed by Hadley Wickham. The rvest package was designed for simple web scraping and, unlike the httr and XML packages, does not require a deep understanding of the structure of web-based data objects such as XML and JSON.

The following recipe shows how to use the rvest package to extract Forbes 2017 wealthy ZIP codes and blend geo-coded wealth information acquired via the web with our example donor file.

We will also use regular expressions (also known as regex) to parse the text. Regexes are very powerful and help match complex patterns in strings.
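To preview how this works, here is a small illustration of the two stringr patterns we will use below, applied to a single hypothetical listing string (the string is made up for demonstration; the patterns are the ones from the recipe):

```r
library(stringr)

s <- "94027 ATHERTON CA Median Price: $9,686,154"

# The ZIP code is the first five characters
str_sub(s, end = 5)
#> [1] "94027"

# "(?<=\\$)[0-9,]+" matches digits and commas immediately preceded by a dollar sign
str_extract(s, "(?<=\\$)[0-9,]+")
#> [1] "9,686,154"
```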

# Install rvest
install.packages("rvest", repos = "https://cran.rstudio.com")

Let’s load all the libraries first.

# Load rvest, stringr, tidyr, readr, dplyr
library(rvest)
library(stringr)
library(tidyr)
library(readr)
library(dplyr)

Then, let’s read the HTML code of the web page with all the wealthy ZIP codes.

# Web page for extraction
wealth_zips <- read_html("")

Then, using the rvest library, we’ll extract the list with all the details.

wealth_zips_df <- wealth_zips %>% 
  html_nodes("ol") %>% 
  html_nodes("li") %>% 
  html_nodes("ol") %>% 
  html_nodes("li") %>% 
  html_text() %>% 
  tibble(value = .)

head(wealth_zips_df)
#> A tibble: 6 x 1
#>                                                                                               value
#>                                                                                               <chr>
#> 1 "94027 ATHERTON CA Median Price: $9,686,154 \n    
#> 2 33462 MANALAPAN FL Median Price: $8,368,431 Days o
#> 3 94022 LOS ALTOS HILLS CA Median Price: $7,755,000 
#> 4 94301 PALO ALTO CA Median Price: $7,016,631 Days o
#> 5 94957 ROSS CA Median Price: $6,939,423 Days on Mar
#> 6 11962 SAGAPONACK NY Median Price: $6,852,692 Days 

Using stringr library functions and regex, extract the ZIP code and its median home price from the extracted string.

wealthy_zips_medprice <- transmute(
  wealth_zips_df,
  zip = str_sub(value, end = 5),
  MedianPrice = str_extract(value, "(?<=\\$)[0-9,]+"))

wealthy_zips_medprice <- mutate(
  wealthy_zips_medprice,
  MedianPrice = as.numeric(str_replace_all(MedianPrice, ",", "")))

To filter the donor sample data down to ZIP codes with median house prices over $1,000,000, let’s join the wealthy ZIPs with the donor sample data.

donor_data <- read_csv("data/DonorSampleDataML.csv",
                       col_types = cols(ZIPCODE = col_character()))

select_vars <- c('MARITAL_STATUS', 'GENDER', 
               'ALUMNUS_IND', 'PARENT_IND',
               'HAS_INVOLVEMENT_IND', 'DEGREE_LEVEL',
               'PREF_ADDRESS_TYPE', 'EMAIL_PRESENT_IND',
               'ZIPCODE', 'AGE', 'TotalGiving', 'DONOR_IND')

donor_data <- select(donor_data, one_of(select_vars))

# Merge with SuperZip Index
dd_superzips <- inner_join(donor_data, 
                      wealthy_zips_medprice,
                      by = c("ZIPCODE" = "zip"))

Finally, let’s filter the sample data.

prospects <- filter(dd_superzips, 
                    TotalGiving >= 10000 & ALUMNUS_IND == "Y" &
                      HAS_INVOLVEMENT_IND =="Y" & 
                      EMAIL_PRESENT_IND =="Y" &
                      PREF_ADDRESS_TYPE =="HOME" & 
                      MedianPrice > 1000000 & 
                      AGE >= 40)

Let’s inspect our sample data.

# Inspect prospect sample
glimpse(prospects)

#> Observations: 14
#> Variables: 13
#> $ MARITAL_STATUS      <chr> "Unknown", "Marri...
#> $ GENDER              <chr> "Female", "Female...
#> $ ALUMNUS_IND         <chr> "Y", "Y", "Y", "Y...
#> $ PARENT_IND          <chr> "N", "N", "N", "N...
#> $ HAS_INVOLVEMENT_IND <chr> "Y", "Y", "Y", "Y...
#> $ DEGREE_LEVEL        <chr> "Undergrad", "Und...
#> $ PREF_ADDRESS_TYPE   <chr> "HOME", "HOME", "...
#> $ EMAIL_PRESENT_IND   <chr> "Y", "Y", "Y", "Y...
#> $ ZIPCODE             <chr> "90265", "90265",...
#> $ AGE                 <int> 44, 60, 54, 52, 4...
#> $ TotalGiving         <int> 12850, 12309, 302...
#> $ DONOR_IND           <chr> "Y", "Y", "Y", "Y...
#> $ MedianPrice         <dbl> 4266731, 4266731,...

In this recipe, we used the rvest package to extract Forbes 2017 wealthy ZIP codes and use them as a wealth capacity proxy and filter for our example donor file. We selected the following prospect criteria: Alumni prospects who are 40 years or older with total giving of $10,000 or more, institutional involvement flag, active email address, and preferred home address ZIP with median price above $1,000,000. The resultant output is a set of 14 prospect leads for recommended research, outreach, and qualification.

14.4 Social Media Analytics

Social media is a powerful online platform that gives voice to a variety of ideas, opinions, and feedback. Platforms such as Twitter are dynamic forums driven by a broad community of users (over 300 million users as of 2017) who can instantly tap into a global conversation. Whether you actively use Twitter or other social media, you should recognize there is a wealth of social media information that you can explore and analyze to extract relevant insights.

In the following recipe, we will explore how to use R and some text mining packages to see what people are currently saying about machine learning, artificial intelligence, and deep learning on Twitter.

In order to extract tweets, you will need a Twitter account. If you don’t have one, you can sign up here. Once you have an account, you can use your Twitter login ID and password to create a Twitter application here. For detailed instructions on how to configure your Twitter account so you can pull data using R, you can refer to this article.

First, install the twitteR and rtweet packages to extract Twitter data and the tm and wordcloud packages to perform text mining analysis.

# Install twitteR, rtweet, tm, wordcloud
install.packages("twitteR", repos = "https://cran.rstudio.com")
install.packages("rtweet", repos = "https://cran.rstudio.com")
install.packages("tm", repos = "https://cran.rstudio.com")
install.packages("wordcloud", repos = "https://cran.rstudio.com")

Next, load the Twitter and text mining packages.

# Load twitteR, rtweet, tm, wordcloud
library(twitteR)
library(rtweet)
library(tm)
library(wordcloud)

Now, define your Twitter authentication credentials.

# Twitter Authentication
requestURL = "https://api.twitter.com/oauth/request_token"
accessURL = "https://api.twitter.com/oauth/access_token"
authURL = "https://api.twitter.com/oauth/authorize"
consumerKey = "INSERT_YOUR_CONSUMER_KEY_HERE"
consumerSecret = "INSERT_YOUR_CONSUMER_SECRET_HERE"
accessToken = "INSERT_YOUR_ACCESS_TOKEN_HERE"
accessSecret = "INSERT_YOUR_ACCESS_SECRET_HERE"
setup_twitter_oauth(consumerKey, consumerSecret, 
                    accessToken, accessSecret)

Next, let’s search Twitter for machine learning, AI, and deep learning topics.

# Search Twitter 
tweets <- searchTwitter("MachineLearning AND AI AND DeepLearning", 
                        lang = "en", n = 1000)

Next, let’s extract tweets.

# Extract Text
tweets.txt <- sapply(tweets, function(x) x$getText())

Let’s convert tweets to plain text format.

# Convert Tweets to Plain Text
tweets.txt <- plain_tweets(tweets.txt)

Next, let’s clean up our tweet formatting.

# Remove Retweet
tweets.txt <- gsub("^RT", "", tweets.txt)

# Remove @UserName
tweets.txt <- gsub("@\\w+", "", tweets.txt)

# Remove Links
tweets.txt <- gsub("http\\w+", "", tweets.txt)

# Collapse Tabs and Repeated Spaces
tweets.txt <- gsub("[ \t]{2,}", " ", tweets.txt)

Now, let’s build a text corpus object.

# Create Text Corpus
tweets.corpus <- Corpus(VectorSource(tweets.txt))

Next, let’s pre-process our tweet corpus for analysis.

# Text Pre-Processing
tweets.corpus <- tm_map(tweets.corpus, content_transformer(tolower))
tweets.corpus <- tm_map(tweets.corpus, removePunctuation)
tweets.corpus <- tm_map(tweets.corpus, stripWhitespace)
tweets.corpus <- tm_map(tweets.corpus, removeWords, stopwords())

Let’s check a sample of tweets.

# Inspect Pre-Processed Text Sample
inspect(tweets.corpus[1:5])

Finally, let’s build a wordcloud of tweets.

# Build Wordcloud from Text
wordcloud(tweets.corpus, random.order = FALSE,  min.freq = 20)

FIGURE 14.2: Twitter wordcloud: MachineLearning DeepLearning AI

In this recipe, we used social media and text mining packages such as twitteR, rtweet, and tm to systematically pull and analyze 1,000 tweets (documents) from Twitter to understand what people are currently saying about #MachineLearning, #DeepLearning and #AI.

Based on the wordcloud visualization we created, higher-frequency words are plotted in larger text and arranged closer to the center to highlight top themes, which, perhaps unsurprisingly, include deep learning, big data, data science, artificial intelligence, and healthcare. Other trending topics include predator vision drones using artificial intelligence to spot poachers; top virtual reality and internet of things (IOT) business trends; financial services tutorials on how to use deep learning, machine learning, neural networks, blockchain, and other cryptocurrency technologies; real estate startups using machine learning, and so on.

Using the current recipe as a baseline, modify the code to search Twitter for recent conversations about your institution or social media campaigns. Store these tweets into a vectorized text corpus object using the tm package and build a wordcloud that reflects a snapshot or pulse of the most popular topics and trends being discussed.

14.5 Text Classification

Naive Bayes, which we introduced in the Machine Learning Concepts chapter, is one of the most commonly used machine learning methods for text classification tasks such as spam filtering. While naive Bayes makes some unrealistic assumptions of variable independence, it is both simple and effective for many real-world classification problems.
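To make the idea concrete before we work with real contact reports, here is a toy sketch (with made-up word-presence features, not the book’s data) of how the naiveBayes function from e1071 learns class-conditional probabilities from labeled examples and scores a new observation:

```r
library(e1071)

# Toy training data: word presence ("Yes"/"No") for two terms per message
yn <- c("No", "Yes")
train_x <- data.frame(
  gift    = factor(c("Yes", "Yes", "No", "No"), levels = yn),
  decline = factor(c("No", "No", "Yes", "Yes"), levels = yn))
train_y <- factor(c("Positive", "Positive", "Negative", "Negative"))

# Laplace smoothing avoids zero probabilities for unseen term/class pairs
model <- naiveBayes(train_x, train_y, laplace = 1)

# A new message that mentions "gift" but not "decline"
new_msg <- data.frame(
  gift    = factor("Yes", levels = yn),
  decline = factor("No", levels = yn))
predict(model, new_msg)
#> [1] Positive
#> Levels: Negative Positive
```

The classifier multiplies the per-term likelihoods as if the terms were independent, which is exactly the “naive” assumption discussed above.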

To help introduce you to text mining methods, we created a synthetic data set of prospect contact reports, which you will use to build a text classification model (also known as a classifier).

In the following recipe, you will build a text classifier, predict whether a contact report represents a positive prospect interaction, and evaluate these results using the tm package and naiveBayes algorithm included with the e1071 package.

First, let’s load our file manipulation and text mining libraries.

# Load readr
library(readr)

# Load dplyr
library(dplyr)

# Load tm
library(tm)

# Load wordcloud
library(wordcloud)

# Load e1071
library(e1071)

# Load caret package
library(caret)

Next, let’s load and prepare our sample contact file.

# Load Data
contact_data <- read_csv("data/DonorSampleContactReportData.csv")

# Drop 'ID', 'MEMBERSHIP_ID', etc.
pred_vars <- c('Staff Name', 'Method', 'Substantive', 
               'Donor', 'Outcome')

# Convert features to factor
contact_data <- mutate_at(contact_data,
                     .vars = pred_vars,
                     .funs = as.factor)

# Select Variables (keep the 'Summary' text field for text mining)
contact_data <- select(contact_data, 
                       one_of(c(pred_vars, 'Summary')))

Next, let’s split our contact data into training and test datasets.

# Split 70% of contact_data into training
# data and 30% into test data
cd_index <- sample(2, nrow(contact_data), replace = TRUE, 
    prob = c(0.7, 0.3))
cd_trainset <- contact_data[cd_index == 1, ]
cd_testset <- contact_data[cd_index == 2, ]

Let’s confirm our training and test dataset sizes and proportions.

# Confirm size of training and test
# datasets
dim(cd_trainset)
dim(cd_testset)

# Check proportions of $Outcome in training and test
prop.table(table(cd_trainset$Outcome))
prop.table(table(cd_testset$Outcome))
# Train dataset: 70% Negative, 30% Positive
# Test dataset: 63% Negative, 37% Positive

Now, let’s convert our test and training contact reports to corpus data objects.

# Convert test and training contact report to corpus
cd_trainset_corpus <- Corpus(
  VectorSource(cd_trainset$Summary))
cd_testset_corpus <- Corpus(
  VectorSource(cd_testset$Summary))

Next, let’s pre-process the contact report text.

# Pre-processing contact report corpora
cd_trainset_corpus <- tm_map(
  cd_trainset_corpus, tolower)
cd_trainset_corpus <- tm_map(
  cd_trainset_corpus, removeWords, 
  stopwords())
cd_trainset_corpus <- tm_map(
  cd_trainset_corpus, removePunctuation)
cd_trainset_corpus <- tm_map(
  cd_trainset_corpus, stripWhitespace)
cd_testset_corpus <- tm_map(
  cd_testset_corpus, tolower)
cd_testset_corpus <- tm_map(
  cd_testset_corpus, removeWords, 
  stopwords())
cd_testset_corpus <- tm_map(
  cd_testset_corpus, removePunctuation)
cd_testset_corpus <- tm_map(
  cd_testset_corpus, stripWhitespace)

Let’s read a sample of example contact reports.

# Inspect contact report corpora sample
inspect(cd_trainset_corpus[1:3])

Now, let’s build document term matrices (DTM) for analysis.

# Build Document Term Matrices (DTM) 
cd_trainset_dtm <- DocumentTermMatrix(
  cd_trainset_corpus)
cd_testset_dtm <- DocumentTermMatrix(
  cd_testset_corpus)

Let’s inspect a couple contact report DTM samples.

# Inspect Contact Report DTM sample
inspect(cd_trainset_dtm[1:5, 1:5])

Next, let’s explore associations between terms and plot term distribution.

# Explore Associations Between Terms
findAssocs(cd_trainset_dtm, "meeting", 0.2)

# Plot Zipf distribution of trainset DTM
Zipf_plot(cd_trainset_dtm)

Now, let’s create word clouds for positive and negative contact outcomes.

# Create Word clouds for Positive and Negative Contact Outcomes
positive_interaction <- subset(
  cd_trainset, Outcome=="Positive")
negative_interaction <- subset(
  cd_trainset, Outcome=="Negative")

wordcloud(positive_interaction$Summary,
  random.order = FALSE, 
  min.freq = 2, scale = c(3,1))

wordcloud(negative_interaction$Summary,
  random.order = FALSE, 
  min.freq = 2, scale = c(3,1))

Let’s store a dictionary of frequent terms.

# Store Dictionary of Frequent Terms
cd_dict <- findFreqTerms(cd_trainset_dtm, 2)

Let’s use frequent terms to limit our training and test datasets.

# Limit Training and Test to Frequent Terms
cd_train <- DocumentTermMatrix(cd_trainset_corpus,
  list(dictionary = cd_dict))
cd_test <- DocumentTermMatrix(cd_testset_corpus,
  list(dictionary = cd_dict))

Let’s convert the DTM term counts into categorical (“Yes”/“No”) presence indicators, which is the form naiveBayes expects.

# Convert Counts to Factors
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  factor(x, levels = c(0, 1), labels = c("No", "Yes"))
}

# Convert Column Counts for DTMs
cd_train <- apply(cd_train, MARGIN = 2, convert_counts)
cd_test <- apply(cd_test, MARGIN = 2, convert_counts)

Now, let’s build our naive Bayes classification model.

# Build Naive Bayes classification Model
cd_naivebayes <- naiveBayes(cd_train, 
                            cd_trainset$Outcome,
                            laplace = 0.5)

# Examine Naive Bayes model
# cd_naivebayes

Finally, let’s use our model to make classification predictions.

# Make Naive Bayes predictions
cd_prediction <- predict(cd_naivebayes, cd_test)

Let’s explore the accuracy of the model.

# Create NB crosstab
naivebayes.crosstab <- table(cd_prediction, cd_testset$Outcome)

# Confusion Matrix
confusionMatrix(naivebayes.crosstab, positive="Positive")

In this recipe, we built a classification model using the naive Bayes algorithm to classify contact reports as positive or negative interactions based on training (input) text data. The naive Bayes method estimates the likelihood of new observation (test) data belonging to various labeled classes (groups). The text classifier we built using example contact report data classified approximately 88% of the contact reports correctly as a positive or negative interaction based on training labels.
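For reference, the accuracy reported by confusionMatrix is simply the share of on-diagonal (correctly classified) counts in the crosstab. Using illustrative counts (not the actual output above), the calculation looks like this:

```r
# Illustrative 2x2 crosstab of predicted vs. actual outcomes
crosstab <- matrix(c(10, 1, 
                      1, 5), nrow = 2, byrow = TRUE,
                   dimnames = list(predicted = c("Negative", "Positive"),
                                   actual    = c("Negative", "Positive")))

# Accuracy = correct predictions (diagonal) / all predictions
accuracy <- sum(diag(crosstab)) / sum(crosstab)
accuracy
#> [1] 0.8823529
```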

In the following recipe, we will explore how to analyze sentiment within text documents that don’t have existing labels in our training (input) data.

14.6 Sentiment Analysis

R offers multiple packages for performing sentiment analysis (also known as opinion mining or emotion AI), which is the process of using machine learning, natural language processing, and text mining techniques to identify, extract, quantify, and study subjective information such as opinions and attitudes.

The purpose of sentiment analysis is to computationally determine user attitudes or emotional reactions to an occurrence, interaction, or event using collected data, surveys, and so on. Sentiment analysis application contexts include customer reviews, survey response analysis, social media analytics, healthcare customer experience research, and so on.

One popular package is sentiment140 (loaded as the sentiment package), which is hosted on GitHub and can be installed with the following command.

# Install sentiment140 package from GitHub
devtools::install_github("okugami79/sentiment140")

In the following recipe, we will build on the previous example and conduct sentiment analysis using the example contact report file.

Let’s load file manipulation and text mining libraries.

# Load readr
library(readr)

# Load dplyr
library(dplyr)

# Load ggplot
library(ggplot2)

# Load sentiment package
library(sentiment)

Next, let’s load and prepare our sample contact report data.

# Load Data
contact_data <- read_csv("data/DonorSampleContactReportData.csv")

pred_vars <- c('Staff Name', 'Method', 'Substantive', 
               'Donor', 'Outcome')
# Convert features to factor
contact_data <- mutate_at(contact_data,
                     .vars = pred_vars,
                     .funs = as.factor)

# Select Variables (keep the 'Summary' text field for sentiment scoring)
contact_data <- select(contact_data, 
                       one_of(c(pred_vars, 'Summary')))

Let’s conduct sentiment analysis on the contact reports.

# Sentiment Analysis
contact_data <- mutate(contact_data, 
  polarity = sentiment(Summary)$polarity)

Next, let’s summarize the sentiment score results.

# Summarize Sentiment Analysis 
table(contact_data$polarity)
#> negative  neutral positive 
#>        9      180        7

Now, let’s convert the sentiment polarity labels to a numeric “score” variable with values between -1 and 1.

# Create Sentiment Analysis Score
contact_data <- mutate(contact_data,
  score = ifelse(
    polarity == "positive", 1,
    ifelse(polarity == "negative", -1, 0)))
result <- aggregate(score ~ Donor, data = contact_data, sum)

Let’s select positive or negative reports.

# Select positive or negative reports
result.pon <- filter(result, score != 0)

Finally, let’s plot the sentiment results.

# Sentiment Plot
p <- ggplot(result.pon, 
            aes(x = score, 
                y = Donor, 
                colour = score)) + 
   geom_point() + 
  scale_color_continuous("Sentiment Score", 
                         low = "red2", 
                         high = "green3")

p + xlab("Contact Sentiment") + ylab("Donor") 

In this recipe, we used the sentiment140 package originally designed for Twitter sentiment text analysis to conduct out-of-the-box, “headache-free” sentiment scoring on our example contact report file. Unlike the previous recipe, where we built a text classifier from scratch using the naive Bayes algorithm, the sentiment140 package uses a context-free grammar (CFG) language model that is pre-tuned for English and Spanish tweets (140 characters) and requires no natural language processing (NLP) training.

For additional information about CFG models, you can check out this article. To read more about Sentiment140, check out this link and paper.

14.7 Summary

In this chapter, we explored how to use web and text mining packages to connect directly to the web and extract public data to build data-driven solutions. In addition, we blended public data sources with an example donor file to enhance our analysis. Specifically, we showed how to redesign information collection processes and elevate some traditional deliverables, including prospect reports, benchmark reports, and contact reports.

In the next chapter, we will explore how to analyze and explore relationship data using social network analysis and advanced visualization techniques.

If you’re enjoying this book, consider sharing it with your network by running source("") in your R console.

— Ashutosh and Rodger


Bresler, Alex. 2016. ForbesListR: Access Forbes List Data.