Is it possible to work on both the software (Hadoop, MapR, Spark) and the hardware (FPGA, GPU, multicore) aspects of Big Data?




I am a graduate student in Computer Engineering and a newbie in the world of Data Science, but this area has got me interested and I wish to learn more and develop my knowledge and skills. I now have a very basic understanding of Big Data terminology: Hadoop, MapReduce, Pig, YARN, Spark, etc.

However, I want to pursue an interesting area: optimizing the underlying hardware for efficient Big Data processing. I read that a combination of FPGAs, GPUs, or so-called many-/multicore computing is well suited for Big Data and for solving Data Science related problems.

So if anyone has worked in this area or is pursuing research in it, please help/guide me by answering these questions:

  1. Is it possible for a person to work on both the software (Hadoop, MapR, Spark, etc.) and hardware (FPGA, GPU, many/multicore computing) aspects of Big Data?

  2. If I want to pursue this line of research, what skills do you suggest I develop? For example:

  • Software: Hadoop, MapR, Pig, Spark, YARN, etc.
  • Hardware: FPGA, GPU
  • Programming: VHDL, Verilog, OpenCL
  • Scripting languages
  • Courses: Computer Architecture, Data Science, Data Visualization, Parallel Programming, etc.

  3. Are there any research groups or people you know who are working in this area?

  4. General advice, suggestions, or links to more resources.

Thanks in advance!


What do you mean by “optimizing underlying hardware for efficient Big Data processing”? Do you want to design your own hardware specialized for big data? Or do you want to optimize data science processes to work with big data?


Thank you for your reply. No, I don't want to design new hardware. I want to map the various data science processes to run efficiently on a combination of different hardware units such as FPGAs, GPUs, etc.


OK, now I get it. It's difficult to say; it seems like a very steep road. It's not my domain of study, but I'll list some subjects that you are likely required to know:

  • Machine Learning: this is the most important requirement. Nothing else matters if you don't understand the underlying mechanics of the learning algorithms. You just can't optimize a process that is unknown. Unfortunately, for what you want to do there is no shortcut; you will have to understand every bit and piece there is to understand.
    Material that I recommend: Learning from Data, Elements of Statistical Learning, and Deep Learning. These cover the most widely used learning techniques in detail. Also search for online learning algorithms.
  • Deep Learning: Deep Learning requires a bullet point on its own. It's currently the hottest thing out there regarding ML, and it's a process hog; it's just too damn slow. Besides the book I recommended above, also search for material on TensorFlow and Theano (the most popular deep learning frameworks), CUDA, and parallel computing in general.
  • Distributed Computing: This one is a given. Regardless of how you can optimize the processes, you can’t scale without distributed computing. Hadoop, Spark and Flink are probably safe choices. If you want to contribute, you would also need to program in Java and Scala.
  • Databases: What is most notable about big data is that it usually doesn't fit in memory, and with disk access being so slow, smart use of databases seems to be key to good performance. Not sure where to begin; just keep this in mind.
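On the distributed computing point, the MapReduce model that Hadoop implements (and that Spark generalizes) can be sketched in plain Python. This is only an illustrative single-machine simulation of the map/shuffle/reduce phases, with function names of my own choosing, not a real Hadoop or Spark API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Mapper: emit a (word, 1) pair for every word in one document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # would do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data on a cluster", "big cluster big data"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
```

The point of the model is that the mapper and reducer see only local pieces of the data, so the framework can run them on many machines in parallel.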

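On the databases point, the usual workaround when data doesn't fit in memory is to stream it in bounded chunks and keep only a small amount of running state. A minimal sketch of that idea in plain Python, where the generator stands in for a database cursor or file reader:

```python
def read_in_chunks(values, chunk_size):
    # Stand-in for a database cursor or file reader: yields bounded
    # chunks so the full dataset never has to sit in memory at once.
    for i in range(0, len(values), chunk_size):
        yield values[i:i + chunk_size]

def streaming_mean(chunks):
    # Keep only O(1) state (running total and count) across chunks.
    total, count = 0.0, 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    return total / count

data = list(range(1, 101))  # pretend this lives on disk
mean = streaming_mean(read_in_chunks(data, chunk_size=10))
```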
I hope you are not disappointed with my answer. From the OP, it seems like you want to work at a lower computational level, but I just don't see many opportunities for that in the realm of Data Analysis/Science. Pretty much every learning algorithm is a series of matrix/vector operations, which are already highly optimized. So what is left is to optimize the learning algorithms themselves, so that they perform fewer operations during training.
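To see why "a series of matrix/vector operations" is the right mental model, here is a hedged sketch of batch gradient descent for least-squares linear regression in plain Python; every training step reduces to dot products and scalings, which is exactly what optimized BLAS routines and GPU kernels accelerate:

```python
def dot(u, v):
    # Inner product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

def gradient_descent(xs, ys, lr=0.01, steps=2000):
    # Fit y ≈ w*x + b by repeatedly computing the gradient of the
    # mean squared error; each step is only vector arithmetic.
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        preds = [w * x + b for x in xs]
        errs = [p - y for p, y in zip(preds, ys)]
        grad_w = 2.0 * dot(errs, xs) / n  # vector-vector product
        grad_b = 2.0 * sum(errs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
w, b = gradient_descent(xs, ys)
```

A library implementation would replace the list comprehensions with batched matrix operations over many features at once, but the structure of the computation is the same.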


caiotaniguchi, Thank you very much for your interest and reply.