Introduction
While the world has grown comfortable analyzing structured content and drawing insights from it, an estimated 80 percent of all data is still unstructured, and that text holds an even greater share of critical information waiting to be turned into insight.

Text Preprocessing in R -

1. Loading data - We create a data frame of our own, deliberately filled with messy content that the preprocessing steps will clean up. It holds a few short statements of text, and it is this data frame that we process further to derive value from.

2. Loading libraries - We load the main libraries used in the various preprocessing steps. If they are not yet installed, install them first with install.packages() and then attach them with library(), as shown below.
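A minimal sketch of these two steps, assuming the tm and SnowballC packages; the sample sentences in text_df are placeholders standing in for the article's own messy data frame:

    # Install once if needed (uncomment on first run):
    # install.packages(c("tm", "SnowballC"))

    library(tm)         # core text-mining framework
    library(SnowballC)  # Porter stemmer, used later for stemming

    # A small, deliberately messy data frame of text
    text_df <- data.frame(
      doc = c("There were 34 attendees at   the offering!!",
              "Offered, offering, and offer share a root word...",
              "A sentence full of stopwords like the, a, and if."),
      stringsAsFactors = FALSE
    )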
Steps of Text Preprocessing -

Text preprocessing, in general, refers to cleaning data, that is, making the available text ready for analysis so that further applications such as computing word frequencies or building wordclouds can derive value from it. This is not trivial: textual data contains a great deal of mess, and using it effectively requires a fairly hands-on, supervised approach.

1. Corpus - In dictionary terms, "a corpus is a collection of text or speech that has been brought together according to a certain set of predetermined criteria". It arranges the text so that preprocessing can be carried out, or in simple words, so that a document-term matrix can later be created, representing each term in the text as a one-hot-encoded column of a matrix.

2. Remove numbers - In this step we remove numbers, since digits rarely provide useful information when the task is to derive value from text. Note that this only removes digits; numbers written out as words remain as they are. The data is messy, so it becomes imperative to strip out all impurities that are not text. For example, the number 34 is removed from the first document, and numbers are removed from the other documents likewise.

3. Remove punctuation - This removes punctuation marks from the documents. Punctuation accounts for a major chunk of the messiness in text, appearing throughout and second only to letters in proportion, but here we need only the words, not the punctuation around them.

4. Strip whitespace - This removes extra whitespace from the documents, which in turn means the document-term matrix can be formed cleanly; otherwise stray spaces, or symbols hidden inside whitespace, might end up as entries in the matrix, so the proper option is to remove it.

5. Lowercase - R is case-sensitive, so it is considered better to convert everything to lowercase. This serves us well later, since regular expressions and string matching only match reliably when the text is in a consistent case. We generally keep the text in lowercase.

6. Remove stopwords - Stopwords are words that appear in high proportion but rarely contribute anything toward the analytical purpose, for example "a", "the", and "if". They add grammatical sense to English sentences, but here we are after the words that carry business context, not the common ones that merely glue sentences together. It is therefore important to remove them, while taking care to keep the few special ones that have business value attached. All six steps are applied in order in the sketch below.
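A sketch of steps 1 through 6 using the tm package, continuing from the text_df created above. The transformations are standard tm functions; this ordering (lowercasing before stopword removal, so the lowercase stopword list matches) is one common choice, not the only one:

    # 1. Build a corpus from the character column of the data frame
    corpus <- VCorpus(VectorSource(text_df$doc))

    # 2. Remove digits (spelled-out numbers are untouched)
    corpus <- tm_map(corpus, removeNumbers)

    # 3. Remove punctuation marks
    corpus <- tm_map(corpus, removePunctuation)

    # 4. Collapse runs of whitespace into single spaces
    corpus <- tm_map(corpus, stripWhitespace)

    # 5. Lowercase; tolower is a base R function, so wrap it in
    #    content_transformer() to preserve the corpus structure
    corpus <- tm_map(corpus, content_transformer(tolower))

    # 6. Drop English stopwords; extend or trim this list to keep
    #    the business-specific terms your context demands
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    # Inspect the cleaned first document
    writeLines(as.character(corpus[[1]]))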
7. Stemming - A corpus often contains words that share a common root: offering, offered, and offer, for example, all reduce to the same stem. Stemming replaces each of these variants with its common root form, which reduces complexity while preserving meaning. Stemming algorithms take a chopping approach, cutting suffixes off the ends of words, which has the downside of sometimes producing non-words; a more careful alternative, known as lemmatization, addresses this. These are the basic steps to complete before building the document-term matrix and moving on to further analysis such as term frequencies, wordclouds, and sentiment analysis.
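A short sketch of stemming and the resulting document-term matrix, again assuming the corpus built above; stemDocument applies the Porter stemmer provided by SnowballC:

    # 7. Stem each word to its root: offering/offered/offer -> offer
    corpus <- tm_map(corpus, stemDocument)

    # Build the document-term matrix used downstream for term
    # frequencies, wordclouds, and sentiment analysis
    dtm <- DocumentTermMatrix(corpus)
    inspect(dtm)

    # Term frequencies across the whole corpus
    freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
    head(freq)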