An h2o script in R to get you into the top 30 percentile for the Digit Recognizer competition




I had been trying to break the 96% accuracy barrier on the Digit Recognizer problem for a long time, but nothing seemed to work until I finally got my hands on the deep learning module of the h2o library in R. It helped me break into the 97% accuracy range. The code is short and sweet (compared to the other things I had tried before). I am sharing it here on AV in case anyone wants a head start.

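library(h2o)
## Start a local H2O instance with at least 3 GB of memory:
localH2O = h2o.init(ip = "localhost", port = 54321, startH2O = TRUE, min_mem_size = "3g")
## Import MNIST CSV as H2O:
mnistPath = '/home/shuvayan/Downloads/Kaggle/DGr/train.csv'
mnist.hex = h2o.importFile(path = mnistPath, destination_frame = "mnist.hex")
## The right-hand side of this assignment was missing from the post; pulling the
## imported frame into an R data frame is an assumed reconstruction.
train <- as.data.frame(mnist.hex)
## Make the label a factor so the model is trained as a classifier:
train$label <- as.factor(train$label)
train_h2o <- as.h2o(train)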

#Training a deep learning model:---------------------------------------------------------#
model <- h2o.deeplearning(x = 2:785,
                   y = 1,
                   training_frame = train_h2o,
                   activation = "RectifierWithDropout", 
                   input_dropout_ratio = 0.2,
                   hidden_dropout_ratios = c(0.5,0.5),
                   balance_classes = TRUE, 
                   hidden = c(800,800),
                   epochs = 500)

#Predict on test data:
test_h2o <- h2o.importFile(path = '/home/shuvayan/Downloads/Kaggle/DGr/test.csv', destination_frame = "test_h2o")
yhat <- h2o.predict(model, test_h2o)
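# Build the submission file: one row per test image (the test set has 28000 rows).
# The arguments to cbind() were garbled in the post; the call below is an assumed reconstruction.
ImageId <- seq_len(28000)
predictions <- cbind(as.data.frame(ImageId), as.data.frame(yhat[, 1]))
names(predictions) <- c("ImageId", "Label")
write.table(predictions, file = "DNN_pred.csv", row.names = FALSE, sep = ",", quote = FALSE)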
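
Before submitting, it can help to check the accuracy locally on rows the model never saw. The sketch below is one way to do that, not part of the script above: the 80/20 split, the seed, the reduced epoch count, and the helper names `accuracy` and `holdout_accuracy` are all assumptions for illustration. The second function expects a running H2O cluster and a frame with a factor column named `label`, such as `train_h2o`.

```r
# Fraction of predictions that match the true labels (plain R, no H2O needed).
accuracy <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  mean(as.character(predicted) == as.character(actual))
}

# Hypothetical helper: train on 80% of the rows and score the remaining 20%.
# Assumes h2o.init() has already been called and `frame` has a factor column `label`.
holdout_accuracy <- function(frame, epochs = 10) {
  splits <- h2o::h2o.splitFrame(frame, ratios = 0.8, seed = 1)
  fit <- h2o::h2o.deeplearning(x = setdiff(names(frame), "label"),
                               y = "label",
                               training_frame = splits[[1]],
                               activation = "RectifierWithDropout",
                               input_dropout_ratio = 0.2,
                               hidden_dropout_ratios = c(0.5, 0.5),
                               hidden = c(800, 800),
                               epochs = epochs)
  # Pull predictions and true labels back into R and compare them.
  pred <- as.data.frame(h2o::h2o.predict(fit, splits[[2]]))$predict
  truth <- as.data.frame(splits[[2]])$label
  accuracy(pred, truth)
}
```

Calling `holdout_accuracy(train_h2o)` would then return a single number between 0 and 1. A small epoch count keeps the check quick; it is only a rough guide to what the full 500-epoch model would score.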



I think there is a script on Kaggle that uses a Python neural net built with the nolearn library and increases the data size by rotating the images. It will take you to somewhere close to 0.985.
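
For anyone staying in R, the rotation idea can be sketched in base R. This is not the Kaggle script mentioned above, just a minimal illustration: the function name `rotate_digit`, the nearest-neighbour sampling, and the 10-degree angle are assumptions. It treats each row of train.csv as a 28x28 image stored row by row.

```r
# Rotate one 28x28 digit (a length-784 vector in row-major order)
# by `degrees` about the image centre, using nearest-neighbour sampling.
rotate_digit <- function(pixels, degrees, size = 28) {
  img <- matrix(pixels, nrow = size, ncol = size, byrow = TRUE)
  theta <- degrees * pi / 180
  centre <- (size + 1) / 2
  grid <- expand.grid(row = 1:size, col = 1:size)
  # Inverse mapping: for every output pixel, find the source pixel it came from.
  dy <- grid$row - centre
  dx <- grid$col - centre
  src_row <- round(centre + dy * cos(theta) - dx * sin(theta))
  src_col <- round(centre + dy * sin(theta) + dx * cos(theta))
  inside <- src_row >= 1 & src_row <= size & src_col >= 1 & src_col <= size
  out <- matrix(0, nrow = size, ncol = size)  # areas rotated in from outside stay black
  out[cbind(grid$row[inside], grid$col[inside])] <-
    img[cbind(src_row[inside], src_col[inside])]
  as.vector(t(out))  # back to row-major order
}

# Example: a synthetic image with a single vertical bar, rotated by 10 degrees.
bar <- matrix(0, 28, 28)
bar[5:24, 14] <- 255
digit <- as.vector(t(bar))
rotated <- rotate_digit(digit, 10)
```

Appending a few rotated copies of every training row (with the same label) is what makes the dataset several times bigger, which is also why memory becomes the bottleneck.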


Hi @aayushmnit,

Yep, there is, but I do not know Python as of now, so :smile:
Actually, for this competition we had to use deep learning, and the h2o library in R performs better than the traditional R packages for it, so I thought of checking h2o out. :slightly_smiling:



What’s your machine configuration, and how much time does this code take to run? I increased the dataset size tenfold, so it became impossible for me to use R with my machine configuration at the time, and I switched to Python for it. These days I use both and take advantage of both worlds :smiley:


Hi @aayushmnit,

My machine is old: an i5 with 4 GB of RAM running Ubuntu.
It took around 4.5 hrs with this model.
What about Python?


Hi @shuvayan,

I used that Kaggle script, which makes the data tenfold bigger and then trains a neural net using the nolearn library in Python. My machine was a 4 GB i5 Windows laptop with the Anaconda environment, and it took around 2 hours to run.


It’s a great script, @shuvayan. My two cents: you could use convolutional neural networks (CNNs), since they have fewer trainable parameters (a CNN shares weights), which in turn reduces training time.
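
To put rough numbers on the shared-weights point, here is a small parameter count in R. The dense network is the 800-800 one from the original post; the CNN layout (two 5x5 convolution layers with 32 and 64 filters, 2x2 pooling after each, "same" padding, one dense output layer) is a hypothetical example chosen for illustration, not a model anyone in this thread reported using.

```r
# Weights plus biases in a fully connected layer.
dense_params <- function(n_in, n_out) n_in * n_out + n_out

# Weights plus biases in a convolutional layer: each filter is reused at every position.
conv_params <- function(kernel, channels_in, filters) {
  kernel * kernel * channels_in * filters + filters
}

# The network from the original post: 784 inputs -> 800 -> 800 -> 10 classes.
mlp_total <- dense_params(784, 800) + dense_params(800, 800) + dense_params(800, 10)

# Hypothetical small CNN (an assumption for illustration only).
# Two rounds of 2x2 pooling shrink 28x28 to 14x14 and then 7x7,
# so the final dense layer sees 7 * 7 * 64 inputs.
cnn_total <- conv_params(5, 1, 32) + conv_params(5, 32, 64) + dense_params(7 * 7 * 64, 10)

cat("Dense network parameters:", mlp_total, "\n")  # 1276810
cat("Small CNN parameters:    ", cnn_total, "\n")  # 83466
```

Parameter count is only one factor in training time, since convolutions still do a lot of arithmetic per image, so treat the ratio as a rough guide.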

Using Python with nolearn (and a decent GPU), I could get an accuracy of 98-99% on the validation set within a minute. Here’s the link to my code.


Hi @shuvayan
You can look at ConvNets. They give good accuracy on image classification problems. I used a CNN with Keras on the Digit Recognizer problem and moved beyond 99% accuracy with a simple model.
I ran it on my 4-year-old laptop with a decent GPU, and it took around 1 min per epoch.
I used the same model on the Identify the Digits problem, and it gave 99%+ accuracy.