Tomer Ben David

Posted on Jan 31, 2018

Which programming language is better R, Scala or Python?

#datascience #python #scala #machinelearning

I recently answered this question on Quora. I didn't phrase the question, but it's a good starting point. I usually stay away from language debates, as you will see, but this one really interested me. I have debated it with myself a lot and researched it, because I wanted to know which language I should use for my next data project. Here are my personal insights. (Please let me know what you think! :)

This is how I treat the R vs. Scala vs. Python "which to choose" saga: I use each language for its main strength. Here is the recipe. This is my personal view and usage of the languages.

Use R as a replacement for a spreadsheet. Together with RStudio it makes a killer statistics, plotting and data analytics application.

You can take log files, parse them, graph them, pivot them and filter them, all with great support from RStudio. It’s a killer data analysis language and workspace that you should study as a replacement for spreadsheet work.

Do you want to grep some lines from a text file? No problem, just use: dateLines <- grep(x = mylog, pattern = "^--", value = TRUE). It’s a double-edged sword, though: R is easy to write once you know which command you need, but it is often very difficult to figure out what the correct command is. Practice is the key, and so is note taking. That takes time, so consider whether you have it. If not, just use R as your little spreadsheet from time to time until you get better with it. Save a note or doc with useful R commands, and you will find that with a few commands plus a few plotting commands you are a small king in its realm. This grep example is only one of a million crazy abilities; with matrix manipulation, plotting and RStudio you will be doing analytics on data like crazy.
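For comparison with the next language in this recipe, here is a rough Python sketch of the same filter. It assumes `mylog` is a list of log lines; the sample lines below are made up for illustration.

```python
import re

# Hypothetical sample data standing in for the lines of a log file.
mylog = [
    "-- 2018-01-31 10:00:00",
    "INFO service started",
    "-- 2018-01-31 10:05:00",
    "WARN disk usage high",
]

# Keep only the lines that start with "--",
# similar to grep(..., value = TRUE) in R, which returns the matching elements.
date_lines = [line for line in mylog if re.search(r"^--", line)]

print(date_lines)
```

The R one-liner is shorter, which is part of why R works well as a quick data scratchpad.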

If you have no time for the above, I still highly recommend that you install RStudio and use it from time to time to get the hang of it. I know of nothing else so far that is as good for quick data analysis and quick statistics. Just give it a shot and try to replace your routine calculations and quick data manipulation tasks with it.

You can also move on and do machine learning in R. It has extremely powerful libraries for that (rpart, caret, e1071, …), and by all means, if you and your teams are fluent with it, feel free to do so. Personally, I use it only for speculation, quick analysis and quick models, and I stop there. It can be very quick, but this is where I turn to language number 2: Python.

Use Python for small to medium sized data processing applications. Python, though it introduced optional type hints in recent releases (which is awesome), is an interpreted language (just like R), but it's more of a standard programming language, so you get the great benefit of speed of programming: you just write your code and run it. The caveat is that you don’t have the amazing compiler and features (the good ones, not the kitchen-sink ones) from Scala.

Therefore, as long as your project is small to medium sized, Python is going to be very helpful. You will use NLTK, matplotlib, NumPy and pandas, and you will have a great time and a happy path learning and using them. This will take you on the fast route to machine learning, with great examples bundled in the libraries.

I’m not saying you cannot do it with R or Scala with great success. I’m saying that for my personal use this is the best, most intuitive way to do it: I use each language for what it does best.

When I want a quick analysis of a CSV, I turn to R. When I want a bulletproof, fast app that scales over time, I use Scala. If my project is expected to be a big one with many developers, this is where I turn to language/framework number 3: Java/Scala.

Use Scala (or Java) for larger, robust projects to ease maintenance. While many would argue that Scala is bad for maintenance, I would argue that this is not necessarily the case. Java and Scala, being strongly typed and compiled, are great languages for large-scale projects. You have libraries such as Spark and OpenNLP for your machine learning and big data. They are robust and they work at scale. It’s true that coding takes longer than in Python, but maintenance and onboarding of new personnel are easier, at least in my experience.

Data is modeled with case classes.

Proper function signatures.

Proper immutability.

Proper separation of concerns.

While these practices can be applied in any of the languages above, they come more naturally with Scala/Java.
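To make the four points above concrete outside Scala, here is a minimal Python sketch that approximates them with a frozen dataclass and type hints. The `LogEntry` type and the `only_level` function are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical record type, roughly what a Scala case class would model."""
    timestamp: str
    level: str
    message: str


def only_level(entries: List[LogEntry], level: str) -> List[LogEntry]:
    """Return the entries with the given level; the input list is not modified."""
    return [entry for entry in entries if entry.level == level]


entries = [
    LogEntry("2018-01-31T10:00:00", "INFO", "service started"),
    LogEntry("2018-01-31T10:05:00", "WARN", "disk usage high"),
]

warnings = only_level(entries, "WARN")

# A frozen instance cannot be assigned to; "changing" it means
# building a new instance with replace().
escalated = replace(warnings[0], level="ERROR")
```

Unlike the Scala compiler, Python does not enforce these annotations when the code runs; a separate checker such as mypy is needed for that, which is one reason the practices feel less natural there.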

But if you don’t have time or want to work with them all, then this is what I would do:

R - Research, plot, data analysis.

Python - small/medium scale project to build models and analyze data, fast startup or small team.

Scala/Java - Robust programming with many developers and teams. Fewer machine learning utilities than Python and R, but it makes up for that with better code maintainability for teams of many developers.

It’s a challenge to learn them all, and I’m still in the middle of that challenge. It’s a true headache, but in the end you benefit. If you want only one of them, I would ask:

Am I managing a project with many teams and many workers, where stability rather than speed is the top priority? Java/Scala.
Do I have a few personal projects, or need quick results and quick machine learning at a startup? Python.
Do I just want to hack on data analysis on my laptop and enhance my spreadsheet data analysis and machine learning skills? R.

Top comments (9)

Jason C. McDonald • Jan 31 '18 • Edited

In my experience, a well-designed Python project is just as maintainable, if not more so, as one in any other language, regardless of size...and that's coming from a full-time C++ developer. The fact it is an "interpreted language" (and, actually, that's a popular half-truth we spend a lot of time mopping up after) has no bearing on its practicality. In the nearly seven years I've worked in Python, I have found absolutely no feature of compiled languages that is not reasonably matched in Python.

I'll also say, Java has a lot of issues that negatively impact performance, maintainability, and clean coding practice. I spend more time than I'd like as a trainer unteaching terrible habits formed in Java programming. Practically speaking, it is harder by an order of magnitude to write good Java code than it is to write good Python code.

In short, Java has well-earned the derision it carries from every corner of the coding world besides its own fanbase. ;)

That said, I won't say anything about Scala, as I have never used it.

I would say that, for large projects where you really need a "true" compiled language, C++14 or C++17 should be towards the top of your list of languages to consider. The last two versions of the language are lightyears removed from the traditional C++ that so many know and hate.

Tomer Ben David • Feb 1 '18 • Edited

I would really like to have that experience as well, but mine was different. My main issue with dynamically typed languages is that when I see a function signature, I have no idea what the arguments mean unless I can guess correctly from the argument names, their documentation or the context. While this may sound trivial, I have seen it create maintenance problems over and over again when my peers or I had to maintain large projects created by other people and teams (multiple teams where people with different skills leave and join, large codebases, and maintainers who did not write the code).

If I take an arbitrary function from github I might see this:

def parse_content_range(content_range, resumed_from):
github.com/jakubroztocil/httpie/bl...

Now let's say I look only at the function signature and ask:

What does the function return?
What is content_range?
What is resumed_from?
What is the data at stake?
What does the function do?

These are the most fundamental questions I ask about every function or piece of code I read or edit: for a function, its input and its output. The code gives me no hints other than docs and argument names, and nothing verifies them for me as an explicit, mandatory step.

The documentation is free text that nobody (such as a compiler) confirms, so I have to trust that it is correct, examine the internal code, ask someone, or try the function. I don't want to do that; I want someone (in my case the compiler) to do as much of that for me as possible.

Without looking at the documentation I cannot really know, again assuming I'm maintaining a project that other people wrote (which, in my experience, is the core of maintenance).

I would really love to be able to maintain such code written by others, but it's much more difficult without the compiler helping me to know and confirm the most basic things about each and every function in my code:

What is the input of the function, is it correct?
What is the output of the function, is it correct?

I really love Python, I love R, I love Scala/Java, I love JavaScript, TypeScript, shell scripts, ... I try to use each for its best use case. Again, this is my personal view and experience, and others might have a different one. I wish there were one silver bullet, but there are several, so I'm faced with using each bullet for the target I think it would hit best. This is what I tried to express in the post.

Evan Oman • Feb 1 '18

Great summary of why I like static typing too. To be fair Python 3.5+ supports Type Hints which are a step in the right direction (however these are far from standard so most of your points still stand).
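As a sketch of what type hints add to the signature discussed above, consider the annotated version below. It is a simplified illustration, not httpie's actual implementation: the `str` and `int` types, the return value (total size in bytes) and the validation rules are assumptions made for the example.

```python
import re


def parse_content_range(content_range: str, resumed_from: int) -> int:
    """Return the total size in bytes announced by a header value
    such as "bytes 100-199/200" (illustrative logic only)."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", content_range)
    if match is None:
        raise ValueError(f"unsupported Content-Range: {content_range!r}")

    first, last, total = (int(group) for group in match.groups())

    # The server must continue exactly where the earlier download stopped.
    if first != resumed_from:
        raise ValueError(f"expected range to start at {resumed_from}, got {first}")
    if first > last or last >= total:
        raise ValueError(f"inconsistent Content-Range: {content_range!r}")

    return total


total_size = parse_content_range("bytes 100-199/200", resumed_from=100)
```

With the annotations, the questions "what is content_range?", "what is resumed_from?" and "what does the function return?" can be answered from the signature alone, and a checker such as mypy can confirm callers respect it. Python itself still does not enforce the hints at runtime.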

Jason C. McDonald • Feb 1 '18 • Edited

Exactly, I was going to mention that type hints are becoming rather standard for function arguments.

While I would say your concerns are valid, @tomerbendavid , I would like to point out that Python is practically a paradigm all its own. You approach design from a different angle than you do from C-like languages; once you are used to this paradigm, none of the above issues continue to be factors. ;)

Of course, you can always just Systems Hungarian until you're used to static typing. ;)

Evan Oman • Feb 1 '18 • Edited

As a current Java dev who would like to be able to switch languages later I would love to hear what you think are the top bad Java programming habits?

Jason C. McDonald • Feb 1 '18 • Edited

I'd say one of the top ones is a blind dependency on the standard library, above and beyond many other languages. The language is structured in such a way that you basically have to rely on standard library elements whole-hog, where many other languages (including C++) would allow you to use parts-and-pieces as needed for your performance needs. Seeing as most of the standard library is deeply broken in terms of performance, this makes for some very bad messes.

Another is the terrible habit of deeply nesting multiple unnecessary namespaces. Almost every Java program I've seen has its source inside no less than six otherwise empty nested folders. Java coders often try to replicate this in C++ projects, and I have them go back and remove 85% of their namespaces.

I'll need to poke into the language again to remember a few of the others that are just beyond the reach of my immediate memory. When you do academic damage control from a language on a regular basis, you don't make a point to memorize its syntax and idiosyncrasies, for the sake of sanity. ;-)