
R for Data: Text Preprocessing In R


It is not surprising that while the world has grown used to analyzing structured content and drawing insights from it, almost 80 percent of all data is still unstructured, and that data holds an even greater share of the critical information from which insights can be drawn.


1. Text Preprocessing in R
1.1 Loading data
1.2 Loading libraries
2. Steps of text preprocessing
2.1 Corpus
2.2 Removing numbers
2.3 Removing punctuation
2.4 Stripping whitespace
2.5 Lowercasing
2.6 Removing stopwords
2.7 Stemming

Text Preprocessing in R -

The real power of R is felt when we look at its packages: there is a package for almost every specific task, and text mining is no exception. In this post, we will use the following packages:
  1. tm, a framework for text mining applications.
  2. SnowballC, a text stemming library.
  3. wordcloud, for making wordcloud visualizations.
  4. ggplot2, one of the best data visualization libraries.

1. Loading data - We create a data frame of our own, filled with messy content that the preprocessing steps will clean up. It contains a few short statements as text, on which we will do further processing to derive value.


m <- data.frame(
  text = c("I want to    go to school but 34 #",
           "I love chocolates 345 @",
           "we should stop getting worried so   easily   "),
  stringsAsFactors = FALSE  # keep the text as character, not factor
)

1            I want to    go to school but 34 #
2                       I love chocolates 345 @
3 we should stop getting worried so   easily   

2. Loading libraries

We load the main libraries used in the various preprocessing steps. First we need to make sure these libraries are installed; if they are not, we can use install.packages() to install them explicitly and then load them with library().
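A minimal setup for the four packages listed above (install.packages() only needs to be run once per machine):

install.packages(c("tm", "SnowballC", "wordcloud", "ggplot2"))  # run once
library(tm)         # text mining framework
library(SnowballC)  # stemming
library(wordcloud)  # wordcloud visualization
library(ggplot2)    # general plotting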



Steps of text preprocessing - Text preprocessing, in general, refers to cleaning data so that it is ready for analysis and for further applications such as computing word frequencies or building wordclouds, all aimed at deriving value from textual data. It is not easy, however: textual data contains a lot of mess, and using it effectively requires a fairly supervised, step-by-step approach.

1. Corpus - In dictionary terms, a corpus is "a collection of text or speech that has been brought together according to a certain set of predetermined criteria". It arranges the text so that preprocessing can be applied to it and, later, a document-term matrix can be built: a matrix with one row per document and one column per term, recording how often each term occurs.


post = Corpus(VectorSource(m$text)) # creating corpus
post  # printing corpus
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 3

2. Remove numbers
In this step we remove digits, since numbers rarely provide important information when the task at hand is to derive value from text. Note that removeNumbers only removes digits; numbers written out as words remain as they are. Data is messy, so it is important to strip out impurities like these that are not text.


post = tm_map(post, removeNumbers)

writeLines(as.character(post[[1]])) # inspect doc number one

I want to    go to school but  #   

As you can see, the number 34 was removed from the first document, and digits are stripped from the other documents as well.

3. Remove punctuation - removePunctuation strips punctuation marks from the documents. Punctuation accounts for a large share of the messiness in text: after letters, punctuation characters are often the most common symbols, yet here we only need the words themselves.


post = tm_map(post, removePunctuation)

 writeLines(as.character(post[[1]]))# Inspect  document no 1
I want to    go to school but   

4. Strip whitespace - stripWhitespace collapses multiple whitespace characters into a single space. This matters for building the document-term matrix later: extra spaces, or stray symbols hidden in the whitespace, could otherwise be picked up as spurious terms, so the proper option is to remove them.


post = tm_map(post, stripWhitespace)

I want to go to school but 

5. Lowercase - R is case-sensitive, so it is better to convert everything to lowercase: "School" and "school" are then counted as the same term, and regular expressions only need to match one case, which is convenient. We generally keep text in lowercase.


 post <- tm_map(post,content_transformer(tolower))

i love chocolates

6. Remove stopwords - Stopwords are words that appear in high proportion but rarely contribute anything toward the analytical purpose, such as "a", "the", and "if". They add grammatical sense to English sentences, but here we are after the words that carry business context, not the common ones that merely glue sentences together. So we remove them, taking care to keep the few special ones that have business value attached.


post <- tm_map(post, removeWords, stopwords("english"))

 want  go  school  

 love chocolates 

  stop getting worried  easily 

7. Stemming - A corpus contains words that share a common root: offering, offered, and offer all reduce to the root offer. Stemming replaces all such words with their common root form, which reduces complexity while preserving meaning.

Stemming algorithms use a chopping approach: they cut off suffixes at the end of words. The downside is that this can create non-words (worri and easili below), so a better approach known as lemmatization, which maps each word to its dictionary form, is often used instead.
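The chopping behaviour is easy to see directly with SnowballC's wordStem(), which is what stemDocument uses under the hood:

wordStem(c("offering", "offered", "offer"))  # all three chop down to "offer"
wordStem(c("worried", "easily"))             # "worri" "easili" - not real words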


post <- tm_map(post,stemDocument)

stop get worri easili

These are the basic steps performed before we go on to build the document-term matrix and do further analysis: term frequencies, wordclouds, and sentiment analysis.
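As a sketch of that next step, assuming the cleaned corpus post from above:

dtm  <- DocumentTermMatrix(post)  # one row per document, one column per term
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)  # term frequencies
wordcloud(names(freq), freq, min.freq = 1)  # words sized by frequency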