Most useful data mining libraries in Python



I have covered the basics of Python using the course from Codecademy, and I am now comfortable with the fundamentals and with object-oriented programming.

It has been suggested that I continue by learning the libraries used for data mining. Which libraries should I learn, and in how much detail? If someone could explain the role each of these libraries plays, it would help me understand how they are used.

Best Regards


Here is a list of must-know libraries if you want to use Python for data mining:

  1. NumPy - stands for Numerical Python. Most commonly used for n-dimensional arrays and random number generation

  2. SciPy - stands for Scientific Python. Builds on NumPy and can be used for Fourier transforms, linear algebra and other scientific routines

  3. Matplotlib - provides MATLAB-like plotting functionality in Python

  4. Pandas - brings the DataFrame to Python. It gives you easy ways to aggregate and pivot data

  5. Scikit-learn - the most widely used general-purpose library for machine learning in Python

  6. Regular Expressions (Python's built-in re module) - you will use them for data munging and pattern extraction
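
To give a feel for how a few of these fit together, here is a minimal sketch that uses regular expressions to parse raw text, Pandas to aggregate and pivot it, and NumPy for a simple array calculation. The records, field names and variable names are made up for illustration, and it assumes NumPy and Pandas are installed.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical raw records, invented for this illustration.
raw = [
    "store=A; month=Jan; sales=120",
    "store=A; month=Feb; sales=135",
    "store=B; month=Jan; sales=90",
    "store=B; month=Feb; sales=110",
]

# Regular expressions: pull the three fields out of each string.
pattern = re.compile(r"store=(\w+); month=(\w+); sales=(\d+)")
rows = [pattern.match(line).groups() for line in raw]

# Pandas: hold the parsed records in a DataFrame.
df = pd.DataFrame(rows, columns=["store", "month", "sales"])
df["sales"] = df["sales"].astype(int)

# Pandas: pivot so each store is a row and each month is a column.
pivot = df.pivot_table(index="store", columns="month", values="sales", aggfunc="sum")

# Pandas: aggregate total sales per store.
total_by_store = df.groupby("store")["sales"].sum()

# NumPy: work on the underlying array of values.
sales = df["sales"].to_numpy()
mean_sales = float(np.mean(sales))
```

Printing pivot and total_by_store shows the reshaped and aggregated data. From here, Matplotlib could plot the result and Scikit-learn could fit a model on it.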

Additional libraries which you might find useful:

  • BeautifulSoup - for parsing HTML pages when you need to extract data from the web
  • Pattern - for NLP, machine learning and network analysis
  • Statsmodels - for descriptive statistics and hypothesis testing
  • NetworkX & igraph - for graph-based data manipulation
  • os - for working with the operating system (files, directories, environment variables) from inside Python
  • urllib - for opening URLs, such as web pages or files
  • NLTK - the Natural Language Toolkit, for natural language processing
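
As a rough sketch of the two standard-library entries above, the snippet below uses os to build a path and list a directory, and urllib to open a URL. To stay runnable offline it opens a local file:// URL pointing at a temporary file; a real web page would be opened the same way with an http or https URL. The file name and its contents are made up for illustration.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen

# os: build a path inside a fresh temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.txt")

# Write a small hypothetical file to read back later.
with open(path, "w", encoding="utf-8") as f:
    f.write("hello from a local file\n")

# os: list the directory contents.
file_names = os.listdir(tmp_dir)

# urllib: open a URL and read its contents.
# A file:// URL is used here so that no network access is needed.
url = Path(path).as_uri()
with urlopen(url) as response:
    text = response.read().decode("utf-8")
```

After this runs, file_names holds the single file that was created and text holds what was read back through urllib.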

You can read more about these libraries here:

This should be a good list to learn from and explore.




You might also find this link useful:

Besides Python, it has a good collection of resources for other languages used in machine learning.