We live in a post-big-data world. Nowadays it is no longer about how to store the massive amounts of data we capture, but about how we draw insights from it.
It is now very common to see jobs advertised for new roles such as Data Scientist and the like. These new jobs are about analysing the data our organisation has access to, presenting it in useful, communicative ways, modelling it to make predictions, and ultimately making better-informed decisions.
Gone are the days when we could just whip up a spreadsheet and plot some graphs in it. Trying to load a spreadsheet with millions of rows and thousands of columns would at best slow your personal computer to a crawl. We need to deal with this data more efficiently, by writing code.
The landscape of programming languages is ever-changing. In recent years the R programming language has been gaining popularity amongst data analysts and data scientists. The reason is that it provides an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes:
- an effective data handling and storage facility,
- a suite of operators for calculations on arrays, in particular matrices,
- a large, coherent, integrated collection of intermediate tools for data analysis,
- graphical facilities for data analysis and display either on-screen or on hardcopy, and
- a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
Enough talking, show me the code!
R comes with three basic data types: numbers, strings, and logical values. On top of those we can build vectors and a more restricted data structure called a factor.
We can run basic arithmetic and logical operations on our data.
    1 + 1   # [1] 2
    1 > 2   # [1] FALSE
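We can create vectors of values. Every element of a vector has the same type.
    # create a vector
    numbers = c(1, 2, 3)
    words = c("one", "two", "three")
    # logical constants are TRUE and FALSE; T and F are shorthand aliases
    logits = c(TRUE, FALSE, T, F)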
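As a small sketch of how vectors behave (the variable names `doubled`, `is_big` and `mixed` are made up for illustration), arithmetic and comparisons apply element by element, and mixing types inside `c()` coerces everything to the most general type:

```r
# Illustrative vectors; the names are made up for this sketch
numbers <- c(1, 2, 3)
words <- c("one", "two", "three")

# Each vector holds a single type
class(numbers)   # "numeric"
class(words)     # "character"

# Arithmetic and comparisons are applied element by element
doubled <- numbers * 2   # 2 4 6
is_big <- numbers > 1    # FALSE TRUE TRUE

# Mixing types in c() coerces every element to the most general type
mixed <- c(1, "two", TRUE)
class(mixed)     # "character"
```

This coercion is why a vector is not the right container for values of different types.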
We can create factors. A factor is like a vector, but we predefine the valid values it can contain, called levels. By default R uses the values we provide to generate the set of levels, but we can specify it ourselves too. The purpose of explicitly specifying levels is so that we can later add values to our factor even if those values were not present in the original vector.
    # create a factor with default levels
    myFactors = factor(c("one", "two", "three"))
    # [1] one   two   three
    # Levels: one three two

    # create a factor with specific levels
    myFactors = factor(c("one", "two", "three"), levels = c("one", "two", "three", "four"))
    # [1] one   two   three
    # Levels: one two three four
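To see why declaring levels up front matters, here is a minimal sketch (the names `sizes` and `other` are hypothetical). Assigning a value that is among the declared levels works, while assigning a value outside the levels produces `NA` with a warning:

```r
# A factor with an extra, currently unused level "four"
sizes <- factor(c("one", "two", "three"),
                levels = c("one", "two", "three", "four"))

# Assigning a value that is a declared level works
sizes[4] <- "four"
as.character(sizes)   # "one" "two" "three" "four"

# A factor with default levels only knows the values it was built from
other <- factor(c("one", "two", "three"))

# Assigning an undeclared value generates NA; the warning is silenced here
suppressWarnings(other[3] <- "four")
is.na(other[3])       # TRUE
```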
We can create data frames. Think of them as tables, or matrices, where each column is a vector or a factor. We can label these columns, and later on add more rows to the data frame.
    # Data Frame Example
    Example.df = data.frame(Num = c(1, 2, 3),
                            Char = c("One", "Two", "Three"),
                            Fact = factor(c("a", "b", "c"), levels = c("a", "b", "c", "d")))
    # before R 4.0.0 the Char column is automatically converted to a Factor,
    # but we can reinstate it as a simple character Vector
    Example.df$Char = as.character(Example.df$Char)
    # we can create new "rows" of data; a list keeps each value's type,
    # whereas c() would coerce every value to a string
    NewRow = list(4, "Four", "d")
    # and append it to the data set (rbind returns a new data frame)
    rbind(Example.df, NewRow)
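One detail worth spelling out is that `rbind` does not modify the data frame in place, so the result has to be assigned to be kept. The sketch below uses a hypothetical `df` and `new_row`, and builds the new row as a one-row data frame so that each column keeps its type:

```r
# stringsAsFactors = FALSE is the default from R 4.0.0 onwards; it is set
# explicitly here so the sketch behaves the same on older versions
df <- data.frame(Num = c(1, 2, 3),
                 Char = c("One", "Two", "Three"),
                 stringsAsFactors = FALSE)

# A one-row data frame with matching column names keeps each column's type
new_row <- data.frame(Num = 4, Char = "Four", stringsAsFactors = FALSE)

# rbind returns a new data frame; assign it to keep the result
df <- rbind(df, new_row)

nrow(df)        # 4
class(df$Num)   # "numeric"
```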
Data Frames are very important: they are the main data structure we use to manipulate and plot data. We can load Data Frames from files, like CSV or other formats, and also from databases, URLs and more. Once our data is in a data frame, we can access it in many ways.
    # Rows
    Example.df[1,]
    Example.df[2,]
    Example.df[c(1,2),]
    # Columns
    Example.df[,1]
    Example.df[,2]
    Example.df[,c(1,2)]
    # Import Data
    Mydata <- read.csv(file.choose(), header = T)
    # Plot Data with out of the box tools
    plot(Sales ~ TV, data = Mydata)
    boxplot(Mydata)
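Columns can also be selected by name, and rows by a logical condition. The following sketch assumes a tiny CSV with `TV` and `Sales` columns; the values are invented for illustration and are parsed from a string so that the example does not depend on a file:

```r
# Hypothetical CSV content, parsed from a string instead of a file
csv_text <- "TV,Sales
230.1,22.1
44.5,10.4
17.2,9.3"
ads <- read.csv(text = csv_text, header = TRUE)

# Columns can be selected by name as well as by position
ads$Sales      # a plain numeric vector
ads[, "TV"]    # the same idea with bracket syntax

# Rows can be filtered with a logical condition
high_sales <- ads[ads$Sales > 10, ]
nrow(high_sales)   # 2
```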
R comes with a battery of useful functionality out of the box, especially for exploring our data. Two functions in particular are worth mentioning: summary and str.
Using our first example data frame, we can get a summary of its values. It calculates useful statistics for any numeric column and frequency counts for factors. The output below was produced with Char stored as a factor; for a plain character column, summary only reports its length, class and mode.
    # Summary
    summary(Example.df)
          Num         Char   Fact
     Min.   :1.0   One  :1   a:1
     1st Qu.:1.5   Three:1   b:1
     Median :2.0   Two  :1   c:1
     Mean   :2.0             d:0
     3rd Qu.:2.5
     Max.   :3.0
The str function will describe each of the columns in the dataset, and give us the first few values of each.
    # str
    str(Example.df)
    'data.frame': 3 obs. of 3 variables:
     $ Num : num 1 2 3
     $ Char: Factor w/ 3 levels "One","Three",..: 1 3 2
     $ Fact: Factor w/ 4 levels "a","b","c","d": 1 2 3
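What summary reports depends on the type of each column. A minimal sketch, using a hypothetical data frame `df` that is separate from the example above:

```r
df <- data.frame(Num = c(1, 2, 3),
                 Char = c("One", "Two", "Three"),
                 stringsAsFactors = FALSE)

# Numeric column: minimum, quartiles, median, mean and maximum
num_summary <- summary(df$Num)
num_summary[["Median"]]   # 2

# Factor: one count per level
level_counts <- summary(factor(df$Char))
level_counts[["One"]]     # 1

# Character column: only length, class and mode are reported
summary(df$Char)
```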
This is but a brief introduction to R; its capabilities go far beyond what can be included in a short post. It is worth noting that, as with any other programming language, much of R's strength lies in its libraries, which extend the core capabilities even further.
We can find libraries for plotting, data management, and more. Some of the most popular libraries are
We can do more with R too: it supports machine learning out of the box. We can fit models using linear or logistic regression, and use them to predict new values. We can then plot the training data and the predicted outputs to visualise how well our model works. Or we can get a fairly detailed description of the model using the summary function. But that is left for another post.
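As a small taste of that workflow, here is a minimal sketch of fitting a linear regression and predicting new values. The data frame `train`, its columns `x` and `y`, and all the numbers are invented for illustration:

```r
# Hypothetical training data: y is roughly 2 * x + 1 with small noise
train <- data.frame(x = c(1, 2, 3, 4, 5),
                    y = c(3.1, 4.9, 7.2, 9.0, 10.8))

# Fit a simple linear regression of y on x
model <- lm(y ~ x, data = train)

# Predict the response for new, unseen x values
new_data <- data.frame(x = c(6, 7))
predicted <- predict(model, newdata = new_data)

# Inspect coefficients, residuals and fit statistics
summary(model)
```

For logistic regression the same pattern applies with `glm(..., family = binomial)` in place of `lm`.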