Python vs. R in Data Science

Illustrated point-by-point differences.


Introduction

Python and R are two of the mainstream languages in data science. Broadly speaking, Python is a language for programmers, whereas R is a language for statisticians. In a data science context, the two overlap significantly in their capabilities for regression analysis and machine learning, so your choice of language will depend largely on the environment in which you are operating. In a production environment, Python integrates with other languages much more seamlessly and is therefore the usual choice. R, however, is much more common in research environments because of its more extensive selection of libraries for statistical analysis.

Key Differences:

Data Science Libraries (Python Vs R)

Each category below lists the Python libraries first, followed by the R packages.

DATA WRANGLING

Python: pandas
R: data.table, dplyr, plyr

Python is particularly strong in this area, largely thanks to the extensive pandas library.
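
To give a feel for what data wrangling with pandas looks like, here is a minimal sketch that derives a column, filters rows, and aggregates by group. The `sales` table and its column names are made up for illustration and do not come from the original article.

```python
import pandas as pd

# Hypothetical sales records, invented purely for illustration.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south", "north"],
        "units": [10, 4, 7, 12, 3],
        "price": [2.5, 3.0, 2.5, 3.0, 2.5],
    }
)

# Derive a new column, keep only the larger orders, then aggregate per region.
sales["revenue"] = sales["units"] * sales["price"]
large_orders = sales[sales["units"] >= 4]
summary = (
    large_orders.groupby("region", as_index=False)
    .agg(total_units=("units", "sum"), total_revenue=("revenue", "sum"))
    .sort_values("total_revenue", ascending=False)
)
print(summary)
```

The chained style here is roughly what a dplyr pipeline of `mutate`, `filter`, `group_by`, and `summarise` would express in R.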

DATABASE CONNECTIONS

Python: mysql-connector-python, psycopg2, SQLAlchemy
R: RMySQL, RPostgreSQL

Both Python and R have several libraries available to connect to a SQL database, import data, and run queries, among other common tasks.
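
The connect, load, and query cycle looks much the same across Python database drivers, because they follow the DB-API 2.0 interface. The sketch below uses the standard library's `sqlite3` module with an in-memory database as a stand-in, so it runs without a server. The `measurements` table is hypothetical, and the placeholder syntax is driver-specific: `sqlite3` uses `?`, whereas psycopg2 and mysql-connector-python use `%s`.

```python
import sqlite3

# An in-memory SQLite database stands in for a MySQL or PostgreSQL server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a hypothetical table and load a few rows.
cur.execute("CREATE TABLE measurements (site TEXT, reading REAL)")
cur.executemany(
    "INSERT INTO measurements (site, reading) VALUES (?, ?)",
    [("a", 1.5), ("a", 2.5), ("b", 4.0)],
)
conn.commit()  # make the inserts permanent

# Run an aggregate query and fetch the results as a list of tuples.
cur.execute(
    "SELECT site, AVG(reading) FROM measurements GROUP BY site ORDER BY site"
)
averages = cur.fetchall()
conn.close()
print(averages)
```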

MACHINE LEARNING

Python: PyBrain, PyLearn2, scikit-learn, statsmodels
R: caret, randomForest, rpart, neuralnet

scikit-learn is a popular Python library for running machine learning algorithms, and Python's faster processing speed makes it the more suitable language for this purpose.
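
As a rough illustration of the scikit-learn workflow, the sketch below trains a random forest on the iris dataset that ships with the library and reports accuracy on a held-out split. The choice of model, dataset, and split size is mine, not the original article's.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the bundled iris dataset (no download required).
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for evaluation, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the model and score it on the unseen data.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators share the same `fit` and `score` methods, so swapping in a different model usually means changing only the line that constructs it.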

REGRESSION ANALYSIS

Python: NumPy, scikit-learn, SciPy, statsmodels
R: lmtest, car

Both languages are capable of conducting advanced statistical analysis, including regression analysis. However, the associated packages in R are more extensive and offer more flexibility in this area.

TIME SERIES

Python: Prophet, PyFlux, statsmodels
R: MASS, tseries, forecast

As with regression analysis, both languages can analyze time series data. However, the R packages for such analysis are more extensive.

VISUALIZATION

Python: matplotlib is the dominant plotting library. Others include Plotly, Pygal, Bokeh, and Seaborn.
R: ggplot2 is the dominant plotting library. You can also use Plotly in R, as well as caret, igraph, and highcharter.
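
A minimal matplotlib sketch is shown below. The data are made up, and the figure is rendered to an in-memory PNG with the non-interactive Agg backend so the script runs without a display. In a notebook or desktop session you would more likely call `plt.show()` instead.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt

# Made-up data for illustration.
x = [0, 1, 2, 3, 4]
y = [value ** 2 for value in x]

# Build the figure with the object-oriented interface.
fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(x, y, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Quadratic growth")
ax.legend()

# Render to PNG bytes in memory rather than writing a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```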

Conclusion

I hope this post gives you some idea of the differences between Python and R in the field of data science.

Reference: Python vs. R for Data Science by Michael Grogan

Published by O’Reilly Media, Inc., 2018
