Discussions for article "A Complete Tutorial to Learn Data Science with Python from Scratch"

data_science
python

#1

Hi All,

The article “A Complete Tutorial to Learn Data Science with Python from Scratch” is quite old now, and you might not get a prompt response from the author.

We would request you to post your queries here to get them resolved.

A brief description of the article:

This article gives a step-by-step guide for beginners who wish to start their journey in data science using Python. It includes an introduction to Python, its libraries, and its data structures. It also explains and implements three of the most common ML algorithms: logistic regression, decision tree, and random forest.


#2

Hey,

Many of the code snippets in that tutorial have either become obsolete or don’t work as explained by the author.

Kindly look into it, as this could be quite confusing for beginners.

Thank you!


#3

Hi @vibhuk16,

Thanks for letting us know. The code has been updated.

Happy learning!!


#4

The code needs a few minor updates.
The cross_validation module is deprecated, so
from sklearn.cross_validation import KFold should change to
from sklearn.model_selection import KFold

n_folds has changed to n_splits:
kf = KFold(n_splits=5)
for train, test in kf: should change to
for train, test in kf.split(data[predictors]):
I am not sure whether we should pass data[predictors] or some other value, but it runs without errors.
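
On the question of what to pass to split(): as far as I understand the model_selection API, KFold no longer takes the number of samples in its constructor. It works out the row count from whatever array or DataFrame is handed to split(), and yields positional row indices for each fold. A minimal sketch on a toy array (X is a made-up stand-in for data[predictors], not the tutorial's data):

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy stand-in for data[predictors]: 10 rows, 2 columns (hypothetical data).
X = np.arange(20).reshape(10, 2)

# The constructor takes the number of folds only, not the sample count.
kf = KFold(n_splits=5)

# split() yields one (train, test) pair of positional row indices per fold.
folds = list(kf.split(X))
for train, test in folds:
    print("train rows:", train, "test rows:", test)
```

Since only the number of rows matters for the split, passing data[predictors] or data itself should give the same folds.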


#5

Hi @nadeeshtv,

Thanks for pointing it out. We will update the same in the article.


#6

I am still getting this error:

TypeError Traceback (most recent call last)
in ()
2 model = LogisticRegression()
3 predictor_var = ['Credit_History']
----> 4 classification_model(model, df,predictor_var,outcome_var)

in classification_model(model, data, predictors, outcome)
19
20 #Perform k-fold cross-validation with 5 folds
---> 21 kf = KFold(data.shape[0], n_splits=5)
22
23 error = []

TypeError: __init__() got multiple values for argument 'n_splits'


#7

Hi,
I am getting a similar error here.
Please post if you find a solution.


#8

@abhijitaradhye
@data_crat

I got it to work using @nadeeshtv’s advice!

I changed the code to:

kf = KFold(n_splits=5)
error = []
for train, test in kf.split(data[predictors]):

Since I am still very new to Python, I can’t assure you this is the correct code to use. I can only tell you that with this code I could run the program without errors, and it gave me the same result as in the original tutorial.

EDIT: I just realized that it is not running well this way after all, as it seems to take into account only the first predictor variable instead of all predictor variables in the list. If anyone knows how to resolve this, please let me know!
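
One check that may help narrow this down: kf.split() only yields row positions, so in principle it should not affect which columns are used. The sketch below uses a made-up DataFrame (the column names pred_a, pred_b and target are hypothetical) and records the columns seen in each fold. Note also that the first call in the tutorial passes predictor_var = ['Credit_History'], which is a single column to begin with.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical frame with two predictor columns (names are made up).
data = pd.DataFrame({
    "pred_a": range(10),
    "pred_b": range(10, 20),
    "target": [0, 1] * 5,
})
predictors = ["pred_a", "pred_b"]

kf = KFold(n_splits=5)
fold_columns = []
for train, test in kf.split(data[predictors]):
    # Rows are subset by fold; the column selection is left untouched.
    train_predictors = data[predictors].iloc[train, :]
    fold_columns.append(list(train_predictors.columns))
    print(train_predictors.shape)
```

If every fold reports all the expected columns, the split itself is probably not the cause, and the predictor list passed to classification_model would be the next thing to inspect.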


#9

Hi @nadeeshtv,

My model is working perfectly fine with cross_validation and n_folds. What is the error that you get?


#10

Hi @abhijitaradhye, @data_crat

Please use from sklearn.cross_validation import KFold together with n_folds. I copy-pasted the code from the article, and here are the results:


#11

Hi,

In the tutorial “A Complete Tutorial to Learn Data Science with Python from Scratch” by KUNAL JAIN, I have used this code:

import numpy as np   #Needed for np.mean below
#Import models from scikit learn module:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold   #For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

#Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome):
  #Fit the model:
  model.fit(data[predictors],data[outcome])
  
  #Make predictions on training set:
  predictions = model.predict(data[predictors])
  
  #Print accuracy
  accuracy = metrics.accuracy_score(predictions,data[outcome])
  print ("Accuracy : %s" % "{0:.3%}".format(accuracy))

  #Perform k-fold cross-validation with 5 folds
  kf = KFold(data.shape[0], n_folds=5)
  error = []
  for train, test in kf:
    # Filter training data
    train_predictors = (data[predictors].iloc[train,:])
    
    # The target we're using to train the algorithm.
    train_target = data[outcome].iloc[train]
    
    # Training the algorithm using the predictors and target.
    model.fit(train_predictors, train_target)
    
    #Record error from each cross-validation run
    error.append(model.score(data[predictors].iloc[test,:], data[outcome].iloc[test]))
 
  print ("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

  #Fit the model again so that it can be referred to outside the function:
  model.fit(data[predictors],data[outcome]) 

and also

outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, df,predictor_var,outcome_var)

I am getting this error

TypeError: __init__() got multiple values for argument 'n_splits'

It seems to be indicating this line is the problem

---> 24 kf = KFold(data.shape[0], n_splits=5)

I would really appreciate any suggestions or help

Thank you


#12

Hi @alexoc,

I used the same code you shared in the above post and it works perfectly fine for me. It’s probably because of the difference in versions. Could you try making changes as suggested by @nadeeshtv? Let me know if that solves the problem.
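
For anyone hitting the same TypeError, a likely explanation is that in the model_selection version of KFold the first positional parameter is n_splits. The old-style first argument data.shape[0] then gets bound to n_splits, and the keyword n_splits=5 supplies it a second time. A small sketch that reproduces this (n_samples is a hypothetical stand-in for data.shape[0]):

```python
from sklearn.model_selection import KFold

n_samples = 100  # hypothetical stand-in for data.shape[0]

try:
    # Old-style call: the sample count is passed positionally, so it is
    # bound to n_splits, and n_splits=5 then supplies the same argument again.
    KFold(n_samples, n_splits=5)
    message = None
except TypeError as exc:
    message = str(exc)
    print("TypeError:", message)

# New-style call: only the number of folds goes to the constructor.
kf = KFold(n_splits=5)
print(kf.get_n_splits())
```

Dropping the first argument and iterating over kf.split(...) instead of kf, as suggested earlier in this thread, should avoid the error.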


#13

Hi,
You may be using an old version of the library.
sklearn.cross_validation is deprecated. Please see https://github.com/amueller/scipy_2015_sklearn_tutorial/issues/60

For the change from n_folds to n_splits, please see the latest documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html

In the old documentation, it was n_folds. Please see the documentation for scikit-learn version 0.17 (apparently, I cannot put more than 2 links in a post).
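
Putting the changes from this thread together, here is a rough sketch of how the tutorial's classification_model might look against sklearn.model_selection. This is not the author's updated code. The return statement is my own addition so the scores can be inspected, and the DataFrame is synthetic data that only borrows the column names Credit_History and Loan_Status from the tutorial.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold


def classification_model(model, data, predictors, outcome):
    """Fit the model, print training accuracy and 5-fold CV accuracy."""
    model.fit(data[predictors], data[outcome])
    predictions = model.predict(data[predictors])
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    print("Accuracy : {0:.3%}".format(accuracy))

    # New API: the fold count goes to the constructor, the data to split().
    kf = KFold(n_splits=5)
    scores = []
    for train, test in kf.split(data[predictors]):
        model.fit(data[predictors].iloc[train, :], data[outcome].iloc[train])
        scores.append(model.score(data[predictors].iloc[test, :],
                                  data[outcome].iloc[test]))
    cv_score = np.mean(scores)
    print("Cross-Validation Score : {0:.3%}".format(cv_score))

    # Refit on all rows so the model can be used outside the function.
    model.fit(data[predictors], data[outcome])
    return accuracy, cv_score  # added for inspection; not in the tutorial


# Synthetic stand-in for the loan data (values are random, not real).
rng = np.random.default_rng(0)
credit = rng.integers(0, 2, size=200)
flip = rng.random(200) < 0.2
df = pd.DataFrame({
    "Credit_History": credit,
    "Loan_Status": np.where(flip, 1 - credit, credit),
})

accuracy, cv_score = classification_model(
    LogisticRegression(), df, ["Credit_History"], "Loan_Status")
```

The scores printed here say nothing about the real loan data; the point is only that the function runs end to end with n_splits and kf.split().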