Why should a Data Scientist know C/C++?


Is it true to say that SAS and R are for building model prototypes, and that a programming language like C++ is required to implement them?

If this is correct, which of the two buckets do I place Python in?


It depends on which Python libraries you want to know.

A Python data scientist should clearly know pandas, seaborn, NumPy, SciPy, and the IPython notebook.

Not knowing C++ is not an issue if you use Rcpp.


Sometimes R code just isn’t fast enough. You’ve used profiling to figure out where your bottlenecks are, and you’ve done everything you can in R, but your code still isn’t fast enough. In this chapter you’ll learn how to improve performance by rewriting key functions in C++. This magic comes by way of the Rcpp package, a fantastic tool written by Dirk Eddelbuettel and Romain Francois (with key contributions by Doug Bates, John Chambers, and JJ Allaire). Rcpp makes it very simple to connect C++ to R. While it is possible to write C or Fortran code for use in R, it will be painful by comparison. Rcpp provides a clean, approachable API that lets you write high-performance code, insulated from R’s arcane C API.

Typical bottlenecks that C++ can address include:

Loops that can’t be easily vectorised because subsequent iterations depend on previous ones.

Recursive functions, or problems which involve calling functions millions of times. The overhead of calling a function in C++ is much lower than that in R.

Problems that require advanced data structures and algorithms that R doesn’t provide. Through the standard template library (STL), C++ has efficient implementations of many important data structures, from ordered maps to double-ended queues.
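To make the first bottleneck concrete, here is a minimal sketch of a loop where each iteration depends on the previous one (simple exponential smoothing). It is plain standard C++ so it compiles on its own; the function name `exp_smooth` and the parameter `alpha` are my own illustrative choices, not something from the quoted chapter. With Rcpp, the same loop would typically take and return an `Rcpp::NumericVector` and carry a `// [[Rcpp::export]]` comment so it can be called from R.

```cpp
#include <cstddef>
#include <vector>

// Exponential smoothing: out[i] depends on out[i - 1], so the loop
// cannot be written as one element-wise vectorised operation.
// `exp_smooth` and `alpha` are hypothetical names used for illustration.
std::vector<double> exp_smooth(const std::vector<double>& x, double alpha) {
    std::vector<double> out(x.size());
    if (x.empty()) {
        return out;  // nothing to smooth
    }
    out[0] = x[0];  // seed the recurrence with the first observation
    for (std::size_t i = 1; i < x.size(); ++i) {
        // Each step mixes the new value with the previous smoothed value.
        out[i] = alpha * x[i] + (1.0 - alpha) * out[i - 1];
    }
    return out;
}
```

In R, a loop like this runs through the interpreter on every iteration, which is where the time goes. In C++ it compiles to a tight loop over the underlying array.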



Just to add to what @ajay_ohri has already mentioned, it all depends on what prototype you are building. If the model is for executing marketing campaigns, any of these tools is good enough. A delay of a few seconds would not matter.

However, if you want real-time output, such as credit risk assessment at the application stage for credit cards, every second is critical. Such models are better implemented in C/C++.

Hope this helps.