
R for Data: String Manipulation in R


Introduction 

Data is everywhere. It is both an important input and an outcome in many fields and organizations, and it comes in various forms.
Most of the data that is collected and reported is unstructured, so it is important to be able to extract information and generate insights from it.

Much of this unstructured data is text, so working with strings is an essential skill. R has packages that make string handling efficient and smooth.


TABLE OF CONTENTS

1. Introduction
2. Libraries
3. Stringr basics
4. Stringr functions
     4.1 str_length
     4.2 str_c
     4.3 str_sub
     4.4 str_to_upper and str_to_lower
     4.5 str_sort
     4.6 str_view




2. Libraries
To work with strings, we need the following packages, which make string manipulation in R much easier.

LOADING LIBRARIES

library(tidyverse)  # attaches the core tidyverse packages, stringr included
library(stringr)    # loaded explicitly here for clarity


3. Stringr basics

To better understand how the stringr package works, let us perform some basic operations on a string.


STRINGR PACKAGE

a = " we should wake up "
> a
[1] " we should wake up "


There are different ways to print the string stored in a variable, as shown below. We generally use writeLines() to print the raw text of the string.

PRINTING STRING

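> a = " We are going to learn string manipulation in R "
> writeLines(a)
 We are going to learn string manipulation in R 
> a
[1] " We are going to learn string manipulation in R "

> b = "  \"Are u ready  to learn R \" \"How you like it\" "
> writeLines(b)
  "Are u ready  to learn R " "How you like it" 
> b
[1] "  \"Are u ready  to learn R \" \"How you like it\" "

# printing b directly also shows the escaping backslashes (\); writeLines() does not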


 

Other special characters, such as "\n" (newline), can be written in the same way, and non-English characters can be encoded with Unicode escapes.

SPECIAL CHARACTERS

x <- "\u00b5"
> x
[1] "µ"

y <-  c("go","for","it") # storing multiple strings
> y 
[1] "go"  "for" "it" 
> writeLines(y) # prints raw content
go
for
it
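As a small illustration of the escape sequences mentioned above, the sketch below uses only base R. The variable name s and its text are made up for this example.

```r
# "\n" starts a new line, "\t" inserts a tab, "\\" is one literal backslash
s <- "first line\nsecond\tline with a backslash: \\"

s              # printing the variable shows the escape sequences as typed
writeLines(s)  # writeLines() shows the characters they stand for

# splitting on the newline gives the two lines as separate strings
pieces <- strsplit(s, "\n", fixed = TRUE)[[1]]
```

Printing s keeps everything on one line with the escapes visible, while writeLines(s) breaks the text into two lines.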


4. Stringr functions -  The stringr package has many useful functions for working with different aspects of strings.
All functions in the stringr package start with the prefix str_, which makes them easy to remember.

4.1 str_length - It calculates the length of a string, that is, the number of characters in it.

STRING LENGTH

g = "we need to calculate length"
> str_length(g)
[1] 27
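str_length also works on a whole vector at once. The following is a minimal sketch, assuming stringr is installed; the vector words is made up for this example.

```r
library(stringr)

# one length per element; a missing value stays missing
words <- c("go", "for", "it", NA)
lens <- str_length(words)    # 2 3 2 NA

# length is counted in characters, not bytes
micro <- "\u00b5"
n_chars <- str_length(micro) # 1
```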

4.2 str_c - It combines strings and lets us choose the separator placed between them.

STR_C

vc = str_c("x", "y", sep = ", ")
> vc
[1] "x, y"
> vc2 = str_c("x", "y", sep = "__ ")
> vc2
[1] "x__ y"
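Besides sep, str_c has a collapse argument that joins a vector of strings into a single string. A short sketch, with made-up variable names:

```r
library(stringr)

parts <- c("x", "y", "z")

# collapse turns the whole vector into one string
joined <- str_c(parts, collapse = ", ")   # "x, y, z"

# a single string is recycled against a longer vector
labelled <- str_c("col_", parts)          # "col_x" "col_y" "col_z"

# sep is applied between the arguments, collapse between the results
paired <- str_c(parts, c("1", "2", "3"), sep = "-", collapse = " | ")
# "x-1 | y-2 | z-3"
```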


4.3 str_sub - Sometimes we need only a certain part of a string. str_sub extracts it by start and end position, just as we extract specific rows and columns through indexing. Negative positions count backwards from the end of the string.

STR_SUB

x <- c("Apple", "Slows", "Down","Iphones","Deliberately")
> str_sub(x,1,3)
[1] "App" "Slo" "Dow" "Iph" "Del"
> str_sub(x,-4,-1)
[1] "pple" "lows" "Down" "ones" "tely"
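str_sub can also be used on the left side of an assignment to change part of a string. The sketch below lowers the first letter of each word; the vector is a shortened copy of the one above.

```r
library(stringr)

x <- c("Apple", "Slows", "Down")

# replace the first character of every string with its lower-case form
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x   # "apple" "slows" "down"

# asking for more characters than exist returns as much as possible
short <- str_sub("a", 1, 5)   # "a"
```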

4.4 str_to_upper and str_to_lower - These stringr functions convert strings to upper case and to lower case.

CHANGING CASE

str_to_upper(c("i am going to a party", "ı"))
[1] "I AM GOING TO A PARTY" "I"    

str_to_lower(c("WHERE YOU HAVE BEEN"))
[1] "where you have been"


However, changing case is more complicated than it might first appear, because different languages have different rules for it. We therefore pick a locale, which tells the case functions which language's rules to apply.

LOCALE

str_to_upper(c("it", "writing"), locale = "tr")
[1] "İT"      "WRİTİNG"

 


In the output above, the capital I carries a dot because in Turkish (locale = "tr") the upper case of the dotted i is İ, while I is the upper case of the dotless ı. In the same way, we can pass a different locale to apply the case rules of other languages.
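To see the effect of the locale side by side, the sketch below converts the same single letters under the default English rules and under the Turkish rules. The variable names are made up for this example.

```r
library(stringr)

# default locale is "en": i <-> I
upper_en <- str_to_upper("i")                  # "I"

# Turkish pairs the dotted letters i <-> İ and the dotless letters ı <-> I
upper_tr <- str_to_upper("i", locale = "tr")   # "İ" (U+0130)
lower_tr <- str_to_lower("I", locale = "tr")   # "ı" (U+0131)
```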

4.5 str_sort - It sorts strings alphabetically. Strings are compared by their first letter, and if two strings share the first letter, by their second letter, and so on.

STR_SORT

x = c("go"," went", "gate","assign")
> str_sort(x, locale = "en") 
[1] " went"  "assign" "gate"   "go"   

> str_sort(x, locale = "haw")  # choosing Hawaiian locale
[1] " went"  "assign" "gate"   "go"   

In the output above, gate comes before go in the English locale because their first letters are the same and the second letter of gate, "a", comes before the second letter of go, "o". The string " went" is placed first because it begins with a space, which sorts before any letter.
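str_sort has further arguments that change the order. The sketch below shows decreasing and numeric; the file names are made up, and the numeric argument assumes a reasonably recent version of stringr.

```r
library(stringr)

x <- c("go", "went", "gate", "assign")

# reverse alphabetical order
desc <- str_sort(x, decreasing = TRUE)     # "went" "go" "gate" "assign"

files <- c("file10", "file2", "file1")

# plain sorting compares character by character, so "10" comes before "2"
plain <- str_sort(files)                   # "file1" "file10" "file2"

# numeric = TRUE compares runs of digits as numbers
natural <- str_sort(files, numeric = TRUE) # "file1" "file2" "file10"
```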


4.6 str_view - It searches strings for a pattern and shows where the pattern matches.

STR_VIEW

str_view(x,"nt")  # the match "nt" in " went" is highlighted in the viewer

go
went
gate
assign
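str_view is meant for looking at matches interactively. When the result is needed inside a program, the related stringr functions str_detect and str_subset can be used with the same pattern, as in this sketch:

```r
library(stringr)

x <- c("go", " went", "gate", "assign")

# TRUE for every string that contains the pattern
has_nt <- str_detect(x, "nt")     # FALSE TRUE FALSE FALSE

# keep only the strings that contain the pattern
only_nt <- str_subset(x, "nt")    # " went"

# "^" anchors the pattern to the start of the string
starts_g <- str_detect(x, "^g")   # TRUE FALSE TRUE FALSE
```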