A Long-Lasting War: Python or R for Data Analysis

Written by Allison Zhang, Data Engineer, Virtulytix

The debate over whether to use Python or R for data analysis dates back to 2015. Before jumping to a conclusion, you need to know how Python and R are used for data analysis and how they differ.

Origins

According to Wikipedia, Python is an open-source, interpreted, high-level programming language for general-purpose programming. In recent years, as its libraries have matured, Python has become increasingly popular for data analysis. Data analysis is only one of the things Python is used for, so programmers who already know Python can step into the data analysis world more easily.

R, on the other hand, was designed from the start as a language and environment for statistical computing and graphics. It was initially used mostly by scholars and researchers, but it has recently become popular in the business world.

Trends

The battle between Python and R was at its fiercest before 2017. The chart below shows weekly Google search interest in the United States from 2015 to the present. Before April 2017, R consistently drew more interest than Python, and this was even more the case before 2015. Since then the two lines have intertwined, mostly because packages such as pandas and matplotlib now allow Python to do almost everything R can.

PicA.png

Industry Preference

Then why is there still a battle between Python and R? The main reason is that experts from different fields use different languages. Put these experts together and they will give you different answers to this question, each listing the advantages and disadvantages of their preferred language.

Though large companies tend to use both Python and R, different industries still have their own preferences. Burtch Works surveyed data scientists and predictive analytics professionals about their tool preferences, and two charts from that survey help answer our question.

Though the 2018 report has not been published yet, we can still see the trends. SAS still dominates the pharmaceutical industry. The high-tech industry is moving toward Python. The financial industry, by contrast, has chosen to stay with R. So Python and R are both popular in the data analysis field, different industries have their own preferences, and these preferences tend to change over time.

Screen Shot 2018-06-06 at 2.54.15 PM.png

Data Analysis Example

The best way to illustrate the differences between Python and R is to use them. The attached iris flower data set is a classic in the academic world, widely used to teach beginners how to do data analysis. Here we use it to run the same analysis in both Python and R.

The iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. It consists of 50 samples from each of three species of iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured on each sample: the length and width of the sepals and of the petals, in centimeters. Using the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from one another.
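To make the structure of the data concrete, here is a minimal sketch in Python that groups a handful of iris rows by species and averages each feature. It inlines only six rows (two per species) rather than the full 150, and the column names (`sepal_length`, `species`, and so on) and helper functions are assumptions chosen for illustration, not the names used in the attached file. It sticks to the standard library so it runs anywhere; with pandas, the same summary would typically be a `groupby` followed by `mean`.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Six rows of the iris data set (two per species), inlined for illustration.
# Column names are assumed; the full data set has 150 rows.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""

FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def load_rows(text):
    """Parse CSV text into dicts, converting the four features to floats."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {name: float(record[name]) for name in FEATURES}
        row["species"] = record["species"]
        rows.append(row)
    return rows


def mean_by_species(rows):
    """Return {species: {feature: mean value}} rounded to two decimals."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["species"]].append(row)
    return {
        species: {
            name: round(mean(r[name] for r in group), 2) for name in FEATURES
        }
        for species, group in grouped.items()
    }


rows = load_rows(SAMPLE_CSV)
summary = mean_by_species(rows)
for species, means in summary.items():
    print(species, means)
```

Even on this tiny sample, the petal measurements separate the three species far more clearly than the sepal measurements do, which is the kind of pattern a discriminant model exploits.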

Screen Shot 2018-06-06 at 2.46.51 PM.png

A data analysis follows the steps shown below. There is no business problem associated with this example, and we do not need to go out and acquire the data.

picf.png
Screen Shot 2018-06-06 at 3.02.12 PM.png
Screen Shot 2018-06-06 at 3.04.09 PM.png

We don't need to dig further to see that Python and R have a lot in common. It is therefore up to you to choose the language that best fits your data analysis needs.