Data Set for Practice - Build model with data.table and H2O

data.table
h2o

#1

Hello all,

This data set was used for practice in the article: http://www.analyticsvidhya.com/blog/2016/05/h2o-data-table-build-models-large-data-sets

The article demonstrates the use of data.table and H2O to build models on large data sets. These packages work efficiently and help users get around machine memory limitations. A lot has already been said in the article.

You can download the data set and practice along with me. A one-time login is required to download the data.

Below is the complete problem statement and data used in the article:


Problem Statement

A retail company, “ABC Private Limited”, wants to understand customer purchase behaviour (specifically, purchase amount) for various products across different categories. They have shared a purchase summary of various customers for selected high-volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and the total purchase_amount from last month.
Now, they want to build a model to predict the purchase amount of a customer for various products, which will help them create personalized offers for customers on different products.

Your model's performance will be evaluated on your predictions of the purchase amount for the test data (test.csv), which contains the same kind of data points as the training data but without the purchase amount. Your submission needs to be in the format shown in “SampleSubmission.csv”.

Submissions are scored on the root mean squared error (RMSE). RMSE is very common and is a suitable general-purpose error metric.
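
For readers who want to check their own scores before submitting, here is a minimal sketch of how RMSE can be computed in base R. The function name `rmse` and the toy numbers are assumptions for illustration; they are not taken from the contest data or the article.

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Toy numbers for illustration only (not from the contest data)
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 330)
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.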


Note: This thread will expire on 15th May 2016.
Edit: This thread has expired now. Data is no longer available for download.
Edit: Data set is available for download again.

Download Link:

http://datahack.analyticsvidhya.com/contest/black-friday

#3

@Manish Saraswat: Thank you very much for creating this tutorial. I am currently working on a deep learning project in H2O itself, using R, on biomedical data sets, so this tutorial is a godsend to me! However, as I was running your code, I came across the error messages below and I am unable to understand how to correct them. I have used exactly the same names and letter cases as in the tutorial. Your suggestions are much appreciated!

combin[,prop.table(table(Gender))]Gender
Error: unexpected symbol in “combin[,prop.table(table(Gender))]Gender”

combin[,prop.table(table(Age))]Age
Error: unexpected symbol in “combin[,prop.table(table(Age))]Age”


#4

@shaw38 I am glad you found it helpful.
Referring to your error, I am not sure if there is some mistake here, but can you try this:

combin[,prop.table(table(Gender))]
combin[,prop.table(table(Age))]

I am not sure why Gender and Age are written outside their respective brackets. Also, make sure you’ve loaded the data.table package successfully.
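
To make the bracket placement concrete, here is a small self-contained sketch. It assumes the data.table package is installed, and it uses a made-up four-row stand-in for the tutorial's `combin` table; the variable names `gender_share` and `age_share` are hypothetical.

```r
library(data.table)

# Toy stand-in for the tutorial's `combin` table (values are made up)
combin <- data.table(
  Gender = c("M", "F", "M", "M"),
  Age    = c("26-35", "26-35", "18-25", "36-45")
)

# Proportion of rows per level; the whole expression stays inside the brackets
gender_share <- combin[, prop.table(table(Gender))]
age_share    <- combin[, prop.table(table(Age))]

print(gender_share)
print(age_share)
```

Anything typed after the closing bracket, such as a trailing `Gender`, is parsed as a separate symbol, which is what triggers the "unexpected symbol" error.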


#5

Issue resolved. I had misread the line of code and thought ‘Age’ and ‘Gender’ had to be written outside the brackets as well. Writing it the way you suggested in the reply above solved the issue. Thank you!


#6

Hi Manish,

Can you please share the data set again?


#7

Hi @yashodhanbhatt

Sorry, the data set can’t be made live again as the deadline has passed.
Do keep an eye on AV article mails so that you don’t miss out on such opportunities in future.

Regards
Manish


#8

Hello Manish. Can you help me with this query?
I am using H2O’s Deep Learning function to predict mortality on a patient data set, where the response variable is binary (i.e. values are either ‘0’ or ‘1’). My data set has 16384 columns, with no column names. All the columns are numeric; there are no strings, characters, etc. When I execute the following code, I get mortality predictions as decimal values (as shown below), whereas I only need predictions that are either 0 or 1. How would I use Deep Learning to do this? Kindly help. Thanks a lot!

Part of the data set (in an Excel sheet):
0.131 0.297 0.633 0.492 0.704 0.747 0.491 0.698 0.738 0.481 0.771 0.532
0.311 0.496 0.001 0 0.638 0.009 0.991 0.44 0.414 0.009 0.021 0.999
0.773 0.01 0.032 0.01 0.006 0.042 0.988 0.993 1 0.549 0.577 0.99
0.719 0.534 0.028 0.008 0 0.569 0.983 0.985 0.025 0.022 0.6 0.374

Code:
snips.train<- h2o.importFile("C:/Users /snp/snp_trainingset_70_13.csv")
snips.test<-h2o.importFile("C:/Users/ /snp/snp_testset_69_12.csv")
dim(snips.train)
#[1] 83 16384
dim(snips.test)
#[1] 81 16384

y.dep<-16384
x.indep<-1:16383
system.time(dlearning.model3<-h2o.deeplearning(y=y.dep,x=x.indep,training_frame=snips.train,activation="RectifierWithDropout",hidden=c(1200,50),epochs=100))

user system elapsed
6.42 0.25 668.32

h2o.performance(dlearning.model3)

** Reported on training data. **

Description: Metrics reported on full training frame

MSE: 0.01964181
R2 : 0.851305
Mean Residual Deviance : 0.01964181

predict.dl2<-as.data.frame(h2o.predict(dlearning.model3,snips.test))

submi_dlearning3<-data.frame(Predicted_Mort=predict.dl2$predict)

write.csv(submi_dlearning3,file="submi_dlearning3.csv",row.names=FALSE)

Part of the current output:
Predicted_Mort
0.131749662
0.14698337
0.155728288
0.130509461
0.130420171
0.133914652
0.134027134
0.124258962
0.136126049
0.136019254
0.122301849
.
.
.


#9

Hi @shaw38

Good to know that you’ve reached the stage of using deep learning algorithms. I’ve looked at your code and query. Below are a few observations and ideas you could use to obtain improved predictions:

  1. Since your data set doesn’t have column names, you should name them. A simple way of naming them is V1, V2, V3 and so on. You can use:
    > colnames(snips.train) <- paste0("V",1:16384)

  2. After you’ve named the columns, your data set has 16384 named variables, which you can easily use for reference purposes.

  3. You can apply a threshold to your predicted values to convert them into binary (0 or 1) output. You can do it using a simple ifelse command:
    > predict.binary <- ifelse(predict.dl2$predict > 0.5, 1, 0)

Note: I’ve taken the threshold value as 0.5. For optimal value, you should look at the ROC curve.
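
As a rough illustration of the thresholding idea and of picking a cut-off from ROC-style quantities, here is a base R sketch. The probabilities and labels are made up, all variable names are hypothetical, and maximising TPR minus FPR (Youden's J) is only one of several possible ways to choose a threshold; it is swapped in here as an example rather than taken from the article.

```r
# Toy predicted values and true labels (made up for illustration)
prob   <- c(0.10, 0.35, 0.60, 0.30, 0.45, 0.80, 0.90)
actual <- c(0,    0,    0,    1,    1,    1,    1)

# Fixed cut-off: values above 0.5 become 1, the rest 0
pred_05 <- ifelse(prob > 0.5, 1, 0)

# Scan candidate cut-offs and keep the one maximising TPR - FPR (Youden's J)
cutoffs <- sort(unique(prob))
youden <- sapply(cutoffs, function(t) {
  pred <- as.integer(prob >= t)
  tpr <- sum(pred == 1 & actual == 1) / sum(actual == 1)  # true positive rate
  fpr <- sum(pred == 1 & actual == 0) / sum(actual == 0)  # false positive rate
  tpr - fpr
})
best_cutoff <- cutoffs[which.max(youden)]
pred_best   <- as.integer(prob >= best_cutoff)

print(pred_05)
print(best_cutoff)
```

The cut-off should be chosen on labelled validation data, not on the test set being submitted. Separately, H2O treats a numeric response as a regression target; converting the response column to a factor with as.factor before training makes h2o.deeplearning fit a classifier, in which case the predict column contains class labels directly.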


#10

Hi @Manish … I know I’m late. Can you please make the data set available again? That would be a great help.

Regards
Abhilash


#11

Hi @naik_abhilash
The data set can’t be made available again. We regret the inconvenience caused.
Maybe you should follow us by email and on Facebook so that you don’t miss our latest updates.


#12

I am not able to download the data set. Please help me.


#13

@Purnendu
This thread has expired and the data set is not available now. For future reference, please subscribe to the mailing list so that you do not miss out on any data sets.

Regards,
Shashwat