Erik Lundevall Zara

Posted on Dec 21, 2020

R - the Javascript of data science

#r #datascience

After having used R in a side project for a couple of months, this is part reflections and part introduction to the language. Why the comparison to Javascript? I find similarities in R and Javascript in terms of how the language can surprise you, a mix of ugly and pretty neat bits, both being pretty dynamic languages and with extensive ecosystems that contain both bad, inconsistent and great pieces of software. Focus is more on results rather than being sound and pure in some areas. Also, both R and Javascript were inspired and influenced by Scheme and had their syntax adjusted to be more similar to other non-Lisp languages.

Around this time last year, R was not on my radar. My wife had been doing some research work with Excel as the tool used for exploratory analysis and plotting. The work in Excel would be cumbersome to maintain when the number of observations and measurements increased and adding complexity to clean, analyze and visualize data.
As a software engineer, it seemed to be a better choice to use a programming language for this type of task.
I had just before this started to dabble a bit with Julia at work to perform consistency analysis between some data stores, which has worked quite well. Julia is a language that is growing in science areas, and this looked like a good fit.
I did the initial work in Julia, and I like it a lot. It is a great language. However, parts of the ecosystem were not as mature as I had hoped, especially if I handed this over to a non-software engineer.
When I researched things about Julia, I also encountered comparisons and information about R. Pretty much all comparisons were favourable to Julia in terms of performance, but the computations needed were not on a scale that this was a big deal. When my wife mentioned that one of her colleagues had learned to program a bit in R, then that looked like an option to investigate further.
I did that, and this was a lightbulb moment - a number of the libraries and features that I had found in Julia had similarities in R, plus it was pretty easy to translate it to R for me, even with limited R knowledge. R being the older language, some of the inspiration for Julia related features and ecosystem may have been from R. It was a familiarity that helped my transition into R.
I found that I could get results faster with R, which had a lot to do with that there were more and better documentation, packages as well as examples to be found.
For the languages themselves, I still like Julia better as a software engineer. But the R ecosystem provides a lot of benefits with many resources are useful for the occasional programmer. R is a bit different if you have experience with other, perhaps more general-purpose, programming languages. It is a domain-specific language (DSL) for statistical computing in its heart, or maybe a tool for statistical computing which has a programming language attached to it. If you learn R, you have to learn about statistics, if you do not already know it. There are areas outside of statistics computing where you could use R, but these are exceptions to the rule.

Solving the right problem

When I first suggested that I could help with developing software to perform the exploratory analysis, I was mainly thinking about the calculations required to convert raw measurement data to the desired type of data to use in next phase of the analysis work. It was well-defined what was needed. It was also easy - but what I did not realize initially was that this would only be a small part of the total amount of work required for the analysis, probably less than 5% of the total amount of work.
The first stumbling block was actually to get the data in formats that would be useable for further analysis. On the surface, the data looked very well structured, and by large, it was also. But there were edge cases and exceptions that are very easy for a human to understand how it fits together, but which required extra work to get it in a machine-interpretable format. The conversion was more work than expected, but not too bad. But the part that took most effort was the visualization of the data.
There have been well over a hundred graphs generated in different ways, tweaked, thrown away and re-made in various ways to get a better understanding of the data as well as finding suitable ways to communicate results properly. I used/abused a couple of different plotting packages to get the desired results. For the most part, it was possible to search for a solution to a visualization problem and find a solution - not always a pretty one, though.

Appreciation of R community

The whole journey into R makes me appreciate its ecosystem, all the different tools and packages that people have created. There are some excellent resources to get started with R and find useful material. Cheatsheets, books, courses, videos - there is a substantial amount of material to use, with some examples included here below. I might be wrong here, but it seems that R and its ecosystem and community very much changed and grown in the past 5-10 years. I assume that this has a lot to do with the growing interest and usage of data science. There are many meetups devoted to R, a few conferences and also organizations such as R Ladies, which promote gender diversity within the R community.

Part 2 - Diving into R

R is a somewhat young language, with some old heritage. It is essentially a modernized and open source version of the language S, which has its roots from Bell Laboratories in the 70s and 80s. R itself had its 1.0 release in 2000, about 20 years ago. One of the goals of S was to provide an interactive experience for people to perform analysis and gradually get into programming if this analysis got more complex. I have not used S myself, but I think R does a pretty good job there and my understanding is that this type of functionality was right there already in the beginning.

If you start with R, you are not just going to get R itself, but also some tooling to make you more productive. The primary IDE (Integrated Development Environment) in the R space is RStudio, an open-source IDE specifically targeting R development. It is a capable tool and can be installed easily along with R itself. It is also available as a cloud service in the form of RStudio Cloud - which also is available for free use. The experience is pretty much identical to a local installation.

There are other options also, like the R language for Intellij plugin, which is useable with at least the PyCharm product from Jetbrains (including the community edition of PyCharm). If you are familiar with and like the Intellij suite of products, then this is a good option also. If you have not used PyCharm or Intellij before, then start with RStudio. RStudio is only for R development, so if you want to mix with other languages, then IntelliJ products may be a better option.

R is a dynamically typed, kind-of functional language which supports multiple paradigms - you can do imperative programming and some flavours of object-oriented programming since there are a few object systems in the language itself or add-on packages, including - S3, S4, R6 and RC. S3 (not the AWS storage service) is probably the most common one of these and perhaps the least rigid. It also has a fair bit of metaprogramming capabilities.
Also, various functional programming features are not necessarily part of the core language and the standard libraries but added through community-created packages. Depending on what software packages people use, some coding styles can be quite different. For example, code written using "base" R can look quite different from code using the tidyverse family of packages when it comes to data processing. I am a big fan of tidyverse and using packages included there as well as other packages supporting this style of code has been quite beneficial, I think. They do however introduced some added pitfalls in the syntax also. I believe the benefits outweigh the drawbacks.

The interactive experience - REPL

In R, the interactive part of the language experience is a vital element. As with other REPLs, you can type expressions and have them evaluated to see the results. You also have a built-in help system, accessible via the help() function, but also via simply typing "?" in front of what topic for which you want help or documentation. For example, if you need help with the function "demo", you type "?demo" and it will show the documentation for this function. If you type "??demo" it will search through all documentation for anything that matches the word "demo" and displays a list of those topics.

The demo function is also a nice feature. Executing "demo()" in the REPL will show a list of demos that R distribution includes. Running "demo(package = .packages(all.available = TRUE))" will show a list of demos from all available packages, including any packages you have installed. Specifying the name of the demo as the parameter to the demo() function will execute that demo.

Another neat feature when you start with R is the "data()" function. The R distribution includes many datasets that you can use to play around and test various functions and features in R to perform analysis. Using these data sets makes it easier to try out and experiment with different functionality in R without spending a lot of time finding a suitable data set to work with or understanding a data set when reading an explanation.

A key element in statistical computing and data science is visualization. R has a plotting system built into the base R distribution, which does a decent job. You can do at least simple visualizations right away, without anything extra.

For example, starting R and enter the expression

plot(airmiles)

will give you the graph below:

Or to use a more popular dataset that is included with R, the iris dataset:

plot(iris)

What this shows is that R can derive a lot of information from the data provided and pick something that may be a suitable default.

There are multiple other graphics packages though, where perhaps ggplot2 may be considered the gold standard to compare to, at least when it comes to non-interactive graphics. I used ggplot2 for most of the visualizations I made, although sometimes with extension packages to ggplot2 and in a few cases other plotting packages.

While the interactive experience in the base R command line can be reasonably good, it does improve with an IDE-like interface, such as with RStudio. I guess that for many people, it is RStudio that is the primary tool for them and which uses R the language to accomplish the work. The people who work at RStudio also are behind some quite useful software packages for R, which all are open-source.

Notebooks

When talking about the data science space, one cannot avoid the notebook format, popularized by the Jupyter notebooks. It is a document type with a mix of text and interactive program code which has become quite popular. It started in the Python space and has expanded to other languages. The name Jupyter comes from the three languages Julia, Python and R. However, many more languages have support for Jupyter notebooks nowadays.
Textual representation in such notebooks is Markdown, for the most part. But the internal format for Jupyter notebooks is JSON-based, and it is a bit wonky to work with version control software - not so easy to see what has changed.
In the R ecosystem, there are variations introduced to improve both the markdown format as well as notebook usage. RMarkdown is an extension of Markdown with some added functionality to make it more suitable for scientific articles, for example. R Notebooks is an alternative to Jupyter notebooks, which essentially are the same as RMarkdown documents. R Notebooks are thus more version control friendly, and RMarkdown provides a richer text writing experience than a number of the regular markdown dialects. It also supports other languages than R, although in a bit more limited way than the Jupyter alternative.

Final words

For me, Javascript and R are somewhat similar beasts, although in quite different areas. A lot of people who only know Javascript/R, love them. They are both undeniably useful in their primary areas and extensive ecosystems. It gets the job done. They have quirky and ugly bits which you have to learn to live with if you are using them or use tools to help with that. Plus, they make me cringe a bit, as a software engineer.

I like this quote from Bo Cowgill (Google), who sums it up:

The best thing about R is that it was developed by statisticians. The worst thing about R is that ... it was developed by statisticians.

A few useful links around R and its ecosystem:

DEV Community