How do I process new data and mark it as fraud or not fraud with a Random Forest model?


#1

I have created a tuned and cross validated predictive classification model (Random Forest) in scikit-learn using a training set of 50K observations that were all marked as “FRAUD” or “NO FRAUD”. I would now like to run a new set of observations through the tuned model that have not been classified yet and classify them by adding a column to the data set that marks each observation as “FRAUD” or “NO FRAUD” as each observation passes through the model.

I am using a Python notebook for the project and am looking for a code example that takes the unclassified observations, classifies them, and attaches the resulting disposition to the data set.

All help is greatly appreciated.


#2

Hi,

For predictive modeling in Python, scikit-learn is a simple package that offers many algorithms behind one interface. Using any of them takes three steps:

  1. Initializing the model
  2. Fitting it to the training data
  3. Predicting new values

Algorithms in scikit-learn share a few common named functions, once they are initialized. You can always find out more about them in the documentation for each model.
some-model-name.fit( )
some-model-name.predict( )
some-model-name.score( )
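
As a rough illustration of that shared interface, here is a minimal sketch that runs the same three calls on two different classifiers. The synthetic data from make_classification and the choice of LogisticRegression as the second model are assumptions made only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 rows, 5 numeric features, binary label
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X, y)                    # learn from the labeled rows
    predictions = model.predict(X)     # one predicted label per row
    scores[name] = model.score(X, y)   # mean accuracy for classifiers
    print(name, predictions[:5], scores[name])
```

Note that for classifiers score( ) reports mean accuracy on the data passed in, so scoring on the training rows as done here will look optimistic.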

After reading the training data set and separating it into independent (x) and dependent (y) arrays, convert any non-numeric variables into numbers. Then follow the process below:
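
One possible way to do that conversion, sketched with pandas. The column names (amount, channel, label) and the tiny data frame are made up for illustration and are not from the original data set.

```python
import pandas as pd

# Hypothetical raw training data; column names are illustrative only
raw = pd.DataFrame({
    "amount": [12.5, 980.0, 43.1, 5600.0],
    "channel": ["web", "phone", "web", "store"],
    "label": ["NO FRAUD", "FRAUD", "NO FRAUD", "FRAUD"],
})

# Dependent variable (y): map the text labels to 0/1
y = raw["label"].map({"NO FRAUD": 0, "FRAUD": 1}).values

# Independent variables (x): one-hot encode the text column
x_frame = pd.get_dummies(raw.drop(columns="label"), columns=["channel"])
x = x_frame.values.astype(float)

print(x_frame.columns.tolist())
```

Any new data has to be encoded into the same columns before it is passed to the model. Something like new_frame.reindex(columns=x_frame.columns, fill_value=0) is one way to line them up, assuming new_frame went through the same get_dummies step.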

Import the random forest package
from sklearn.ensemble import RandomForestClassifier

Create the random forest object which will include all the parameters for the fit
forest = RandomForestClassifier(n_estimators = 100)

Fit the model to the training data and its fraud labels, which builds the decision trees
forest = forest.fit(x,y)

Apply this model on test data to predict
output = forest.predict(test_data)

The output will be an array with a length equal to the number of observations in the test set, holding one predicted label per observation in the same row order.
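
To attach those predictions to the data set as a new column, something along these lines should work. It is only a sketch: the random stand-in data, the feature names f1 to f3, the column name prediction, and the 0/1 label encoding are all assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Stand-in for the labeled training data (0 = no fraud, 1 = fraud)
x = rng.rand(200, 3)
y = (x[:, 0] > 0.8).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(x, y)

# Stand-in for the new, unclassified observations
feature_cols = ["f1", "f2", "f3"]  # hypothetical column names
new_data = pd.DataFrame(rng.rand(10, 3), columns=feature_cols)

# predict() returns one label per row, in the same row order
output = forest.predict(new_data[feature_cols].values)

# Add the disposition as a text column next to the features
new_data["prediction"] = np.where(output == 1, "FRAUD", "NO FRAUD")
print(new_data)
```

Because the predictions come back in row order, a direct column assignment is enough here and no join is needed.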

Regards,
Steve


#3

So is it as simple as adding the one line marked below and then combining the “output” results back into a data frame, using the row number as the join key? My code is attached.

#Model Random Forest
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

#max_features='auto' meant 'sqrt' for classifiers and was removed in newer scikit-learn releases
model_RF = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
    max_depth=None, max_features='sqrt', max_leaf_nodes=None,
    min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, n_estimators=10,
    oob_score=False, random_state=None, verbose=0,
    warm_start=False)
model_RF.fit(features_train, target_train)
print(model_RF)
#make predictions (on the training data)
expected_RF = target_train
predicted_RF = model_RF.predict(features_train)
#added this line of code based on your suggestion--------------------------------
output = model_RF.predict(features_test)
#summarize the fit of the model (all metrics below are computed on the training data)
#with 0/1 labels this mean squared error equals the misclassification rate
mse_RF = np.mean((predicted_RF - expected_RF)**2)
#score() returns mean accuracy for a classifier, not R-squared
print("Training accuracy", model_RF.score(features_train, target_train))
print("MSE", mse_RF)

print(metrics.classification_report(expected_RF, predicted_RF, target_names=['No Fraud', 'Fraud']))
print(metrics.confusion_matrix(expected_RF, predicted_RF))
print("Model Accuracy = ", accuracy_score(expected_RF, predicted_RF)*100, "%")
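
The metrics in the code above compare predictions on the training rows with the training labels, which tends to overstate how well the model does on unseen data. A minimal sketch of scoring the held-out rows instead, assuming synthetic stand-in data and the same variable names as above:

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic stand-in for the real features and 0/1 fraud labels
features = rng.rand(500, 4)
target = (features[:, 0] + 0.1 * rng.randn(500) > 0.7).astype(int)

features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.40, random_state=0)

model_RF = RandomForestClassifier(n_estimators=10, random_state=0)
model_RF.fit(features_train, target_train)

# Predict only on rows the model has not seen during fitting
output = model_RF.predict(features_test)

# Compare held-out predictions with the held-out labels
test_accuracy = metrics.accuracy_score(target_test, output)
print(metrics.confusion_matrix(target_test, output))
print("Held-out accuracy = %.1f%%" % (test_accuracy * 100))
```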


#4

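#establish training and test sets = 60/40 split
#(.ix was removed from pandas; .iloc does the same positional selection)
from sklearn.model_selection import train_test_split
features_train, features_test, target_train, target_test = train_test_split(
    new_df_data.iloc[:, 1:].values, new_df_data.iloc[:, 0].values, test_size=0.40, random_state=0)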
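
On the question of joining results back by row number: one option is to leave off .values when splitting, so the split pieces keep the data frame's row index, and then join the predictions on that index. This is only a sketch; the stand-in new_df_data, its column names, and the predicted column name are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Stand-in for new_df_data: target in column 0, features after it
new_df_data = pd.DataFrame(rng.rand(100, 4), columns=["target", "a", "b", "c"])
new_df_data["target"] = (new_df_data["a"] > 0.5).astype(int)

# Passing DataFrame/Series objects (no .values) preserves the row index
features_train, features_test, target_train, target_test = train_test_split(
    new_df_data.iloc[:, 1:], new_df_data.iloc[:, 0],
    test_size=0.40, random_state=0)

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(features_train, target_train)

# Wrap the predictions in a Series that carries the test rows' index
predicted = pd.Series(model.predict(features_test),
                      index=features_test.index, name="predicted")

# Join on the index; rows that were used for training get NaN
result = new_df_data.join(predicted)
print(result.head(10))
```

Joining on the index avoids relying on row position, which matters here because train_test_split shuffles the rows before splitting.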