Should I use Virtual Machines for data science projects?

virtual_machine

#1

I am an explorer of various data science and web development tools.

I used to frequently install various software on my machine. You will find R, RStudio, Python 2.x, Python 3.x, Orange Canvas, Weka and many other tools installed on it.

It is very difficult to maintain all of them without running into conflicts. One of my friends suggested using virtual machines for data science tools instead of installing them directly on my machine (e.g. using virtualenv in Python).

Is it a good choice? If yes, which virtual machine images with tools pre-installed are available for me to use?

Regards,
Chris


#2

Chris,

I think it is a good option, especially if you have started seeing conflicts between installations. Managing Python 2 and 3 simultaneously can be tricky, particularly when you want to update various libraries regularly.

Here is a list of virtual machine images I am aware of:

  1. SAS University Edition - You can go to sas.com and register a profile. After that, you can download a virtual machine image for University Edition. You need a 64-bit machine to run the virtual machine, though.

  2. If you have decent technical skills, you can use the following Docker images as well:
    a. Data Science for Python: https://registry.hub.docker.com/u/ceshine/python-datascience/
    b. R docker (or rocker as it is called): https://github.com/rocker-org/rocker

  3. You can also look at this project: Data Science Toolbox (http://datasciencetoolbox.org/)

  4. Cloudera provides a Quickstart Virtual Machine download. It comes with Mahout and Spark included: http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-3-x.html

  5. Another option could be to rent a machine on AWS and then run RStudio / Python on it.

Hope this gives you enough ways to start playing with various tools without them interfering with each other.

Regards,
Kunal


#4

Might I recommend Anaconda? It’s a free, cross-platform ‘distribution’, so to speak, which installs by default Python, IPython/Notebook, the Spyder IDE and a ton of scientific computing packages (the SciPy stack). You can install multiple versions of Python in separate environments without each affecting the others. If you don’t want to install all the packages available, you can opt for Miniconda instead. It’s a much smaller download to begin with, and you can choose later what else to install. I hear that support for R is also in the works.
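
In case it helps, here is a rough sketch of how the separate environments could be scripted from Python. It only wraps the `conda create` command line. The environment names `py2` and `py3`, the helper names, and the `dry_run` switch are my own placeholders, not anything Anaconda prescribes. By default it just prints the commands and runs nothing.

```python
import shutil
import subprocess
from typing import List


def conda_create_command(env_name: str, python_version: str) -> List[str]:
    """Build the `conda create` call for one isolated environment."""
    return ["conda", "create", "--yes", "--name", env_name,
            f"python={python_version}"]


def create_env(env_name: str, python_version: str, dry_run: bool = True) -> List[str]:
    """Print the command; run it only if dry_run is False and conda is on PATH."""
    command = conda_create_command(env_name, python_version)
    print(" ".join(command))
    if not dry_run and shutil.which("conda") is not None:
        subprocess.run(command, check=True)
    return command


# Hypothetical environment names, one per Python line mentioned in the question.
commands = [create_env("py2", "2.7"), create_env("py3", "3")]
```

Once an environment exists, `conda activate py2` (or `source activate py2` on older conda versions) switches to it.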


#5

BTW, with virtualenv you are still installing everything on your own machine; it is not a virtual machine. It just lets you set up ‘walls’ between different Python installations so that they don’t conflict. (Anaconda (see previous post) makes this a lot easier.)
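
To make the ‘walls’ idea concrete, here is a minimal sketch. It uses the standard library `venv` module (Python 3) rather than the third-party virtualenv package, so treat it as an illustration of the same idea, not of virtualenv itself. The environment names `project_a` and `project_b` and the helper functions are made up for the example.

```python
import os
import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def make_env(parent: Path, name: str) -> Path:
    """Create an isolated environment called `name` under `parent`."""
    env_dir = parent / name
    # with_pip=False keeps the sketch fast and offline; use True for real work.
    # Symlinks mirror the default of `python -m venv` on non-Windows systems.
    venv.create(env_dir, with_pip=False, symlinks=(os.name != "nt"))
    return env_dir


def env_python(env_dir: Path) -> Path:
    """Return the interpreter inside the environment (layout differs on Windows)."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def reported_prefix(env_dir: Path) -> str:
    """Ask the environment's own interpreter where it thinks it lives."""
    result = subprocess.run(
        [str(env_python(env_dir)), "-c", "import sys; print(sys.prefix)"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Two throwaway environments in a temporary directory on the same machine.
workspace = Path(tempfile.mkdtemp())
env_a = make_env(workspace, "project_a")
env_b = make_env(workspace, "project_b")
prefix_a = reported_prefix(env_a)
prefix_b = reported_prefix(env_b)
print(prefix_a)
print(prefix_b)
```

Each interpreter reports its own prefix, so packages installed into one environment are not visible from the other, even though both live on the same machine.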


#6

Virtual machines managed with a tool like Vagrant can be a really good choice if you work on Windows. Installing all the packages directly on Windows becomes too cumbersome. Python comes preinstalled on most Linux distributions and is generally easier to work with there. A dual-boot setup usually makes it difficult to access files across operating systems, whereas Vagrant makes this very easy through folders shared between the host and the VM.

Hope this helps.
Tavish


#7

Try this as well: http://datasciencetoolbox.org/


#8

You can also have a look at this:

https://registry.hub.docker.com/u/madrossan/r-extended/

I have not used it personally, though.


#9

Yes, Kunal, Docker is a better choice than a VM. Containers are more advanced and faster than VMs. Nowadays most IaaS providers are opting for Docker containers rather than virtual machines.


#10

A cloud image is better than a VM, in my opinion. However, a VM is better than installing everything natively on the host, because if something goes wrong, at least your productivity does not suffer from a corrupted OS. So a VM sandbox is the best.