Error: cannot allocate vector of size - Gradient Boosting Algorithm

r

#1

Hi all,

With reference to the post “Learn Gradient Boosting Algorithm for better predictions (with codes in R)”:
I am trying to run the code from the blog, but I am unable to because of a memory limitation.

I am running it on a 64-bit machine with 8 GB of RAM and 59 GB of free space on the C drive.

library(caret)
library(Metrics)
setwd("C:/Users/mcvi/Desktop/Modelling/Analytics Vidya/Datascience 3x/Raw data")

complete <- read.csv("train.csv", stringsAsFactors = TRUE)
train <- complete[complete$Disbursed == 1,]
score <- complete[complete$Disbursed != 1,]

set.seed(999)
ind <- sample(2, nrow(train), replace=T, prob=c(0.60,0.40))
trainData<-train[ind==1,]
testData <- train[ind==2,]

set.seed(999)
ind1 <- sample(2, nrow(testData), replace=T, prob=c(0.50,0.50))
trainData_ens1<-testData[ind1==1,]
testData_ens1 <- testData[ind1==2,]

table(testData_ens1$Disbursed)[2]/ nrow(testData_ens1)
fitControl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
trainData$outcome1 <- ifelse(trainData$Disbursed == 1, "Yes","No")
set.seed(33)
memory.limit(size=90000)
gbmFit1 <- train(as.factor(outcome1) ~ ., data = trainData[,-26], method = "gbm", trControl = fitControl,verbose = FALSE)

While running the line above, I get the error “Error: cannot allocate vector of size 56.4 Gb”.

Let me know if I am doing anything wrong in the code.


#2

Hi Vishwanath,

Did you remove the ID variable before fitting the GBM model?

With stringsAsFactors = TRUE, the ID variable will have been converted to a factor. Since it is an identifier variable (with a lot of unique values), it becomes a factor whose number of levels equals the number of observations (which is pretty huge!).

So the error is most likely due to that.

Please try removing the ID variable before fitting the model and see if you can reproduce the same error.

gbmFit1 <- train(as.factor(outcome1) ~. - ID, data = trainData[,-26], method = "gbm", trControl = fitControl,verbose = FALSE)
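To see why an identifier column is so costly, note that the formula interface expands each factor into one dummy column per level (minus a baseline). Here is a minimal sketch on made-up data; the toy columns and sizes are assumptions, not the real train.csv:

```r
# Toy data: 1,000 rows with a unique ID stored as a factor (hypothetical columns).
set.seed(1)
n <- 1000
toy <- data.frame(
  ID      = factor(sprintf("ID%04d", seq_len(n))),
  income  = rnorm(n),
  outcome = factor(sample(c("No", "Yes"), n, replace = TRUE))
)

# The formula interface builds a dense design matrix with one column per factor level.
with_id    <- model.matrix(outcome ~ ., data = toy)
without_id <- model.matrix(outcome ~ . - ID, data = toy)

dim(with_id)     # 1000 x 1001: intercept + 999 ID dummies + income
dim(without_id)  # 1000 x 2:    intercept + income

# Rough size of a dense numeric matrix: rows * columns * 8 bytes.
est_gb <- function(rows, cols) rows * cols * 8 / 1024^3
est_gb(nrow(with_id), ncol(with_id))
```

The same arithmetic applied to the real row count and the number of unique IDs gives a rough idea of where a multi-gigabyte allocation can come from.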

#3

The error means that R could not obtain a block of memory of the requested size, typically because the available RAM is exhausted.
There are two ways to go about this:
First, after loading the data in R, try the below:

gc()

Calling the garbage collector releases memory held by objects that are no longer referenced, which may increase the available memory. If that doesn’t work, the second option is to reduce the size of the data and do the processing in batches.
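A minimal sketch of both ideas, using a simulated data frame as a stand-in for the real file (the object names and the 10% sample size are assumptions for illustration):

```r
# Stand-in for a large data frame (the real data would come from read.csv()).
big <- data.frame(x = rnorm(1e5), y = rnorm(1e5))

# Draw a 10% random sample of rows to fit on first.
set.seed(999)
small <- big[sample(nrow(big), size = nrow(big) %/% 10), ]

# Drop the large object, then ask R to release memory it no longer references.
rm(big)
invisible(gc())

nrow(small)                               # 10000
format(object.size(small), units = "MB")  # size of what is left in memory
```

Note that gc() only frees objects that have already been removed or overwritten; it cannot shrink objects that are still in use.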

Hope this helps! :smile:


#4

Hi,

Debarati points to what I believe is the problem, ID, but it is not the only one. City and employer will also be factors if they are part of train.csv, which is likely since you have 26 columns. Even DOB and Lead_Creation date will be factors, because they are not in the standard R date format!
Another point: what do you want to achieve? If I understood your code properly, every observation in the training data will be set to "Yes" through outcome1, because train keeps only the rows where Disbursed == 1. If I am right, boosting will give you nothing interesting, as all the observations are positive.
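These checks can be sketched in base R. The toy values, the column name Lead_Creation_Date, and the date format "%d/%m/%Y" are assumptions for illustration; the real file may differ:

```r
# Toy stand-in for the loaded data (hypothetical values and date format).
toy <- data.frame(
  City               = c("Delhi", "Mumbai", "Pune", "Delhi"),
  DOB                = c("23/05/1978", "07/10/1985", "10/10/1981", "30/11/1987"),
  Lead_Creation_Date = c("15/05/2015", "04/05/2015", "19/05/2015", "09/05/2015"),
  Disbursed          = c(0, 1, 0, 0),
  stringsAsFactors   = TRUE
)

# Number of levels per factor column: large counts mean many dummy columns.
is_fac <- sapply(toy, is.factor)
sapply(toy[is_fac], nlevels)

# Convert date strings to Date so they are no longer treated as factors.
toy$DOB <- as.Date(as.character(toy$DOB), format = "%d/%m/%Y")
toy$Lead_Creation_Date <- as.Date(as.character(toy$Lead_Creation_Date),
                                  format = "%d/%m/%Y")

# Check that the outcome has both classes before modelling.
table(toy$Disbursed)
```

If table() shows a single class, the classifier has nothing to learn, whichever algorithm is used.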
Hope this helps.
Alain


#5

Thanks, Debarati, for the help. Yes, the memory requirement in the error went down somewhat.

I removed ID from the training set and re-ran the code. I am now getting the error below.

gbmFit1 <- train(as.factor(outcome1) ~. - ID, data = trainData[,-26], method = "gbm", trControl = fitControl,verbose = FALSE)

Error: cannot allocate vector of size 14.1 Gb


#6

This error means that R tried to allocate a single object of around 14.1 GB and you don’t have enough RAM for that. Try model = FALSE as one of your model parameters and see if that helps. This parameter excludes a copy of the training data from being stored in the model, which mostly doesn’t affect your predictions. Check the help page of your modelling function (for example ?lm) to learn more about this argument.
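As a small illustration of what this argument does, here is a sketch with lm() on the built-in mtcars data, where model = FALSE is a documented argument. Whether train() with method = "gbm" accepts it is not confirmed in this thread; in the gbm package, the argument that controls storing the data is keep.data, so treat the transfer to GBM as an assumption.

```r
fit_with    <- lm(mpg ~ ., data = mtcars)                 # keeps the model frame
fit_without <- lm(mpg ~ ., data = mtcars, model = FALSE)  # drops it

# The fitted object is smaller without the stored copy of the data.
object.size(fit_with)
object.size(fit_without)

# Predictions on new data are unaffected by dropping the model frame.
all.equal(predict(fit_with, mtcars), predict(fit_without, mtcars))
```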


#7

Thanks, Amit, for the update. The memory issue is resolved now.
I am getting a different error now:

gbmFit1 <- train(as.factor(outcome1) ~. -Employer_Name, data = trainData[,-22],trControl = fitControl,method = "gbm", verbose = FALSE)

Something is wrong; all the Accuracy metric values are missing:

    Accuracy       Kappa
 Min.   : NA   Min.   : NA
 1st Qu.: NA   1st Qu.: NA
 Median : NA   Median : NA
 Mean   :NaN   Mean   :NaN
 3rd Qu.: NA   3rd Qu.: NA
 Max.   : NA   Max.   : NA
 NA's   :9     NA's   :9

Error in train.default(x, y, weights = w, ...) : Stopping

In addition: There were 50 or more warnings (use warnings() to see the first 50)
> warnings()

Warning messages:
1: In gbm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, … :
variable 6: Salary_AccountAbhyuday Co-op Bank Ltd has no variation.


#8

This has happened to me a few times. Two things you can try here:

  1. Load library(pROC) again.
  2. Make sure there are no NA values in your dataset. Check sum(is.na(trainData)) for this.
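The NA check, together with a check for the "has no variation" warning shown above, can be sketched in base R. The toy data frame and its column names are made up for illustration:

```r
# Toy data with missing values and one constant column (hypothetical columns).
toy <- data.frame(
  income = c(10, NA, 30, 40),
  flag   = c(1, 1, 1, 1),      # constant column: "has no variation"
  age    = c(25, 31, NA, 40)
)

sum(is.na(toy))       # total number of missing values: 2
colSums(is.na(toy))   # missing values per column

# Columns with at most one distinct non-missing value.
constant <- sapply(toy, function(col) length(unique(na.omit(col))) <= 1)
names(toy)[constant]  # "flag"

# Keep complete rows and drop the constant columns.
clean <- toy[complete.cases(toy), !constant, drop = FALSE]
clean
```

Dropping rows is only one way to handle missing values; imputation may be preferable when many rows are affected.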

#9

Hi

In your fitControl, add summaryFunction = twoClassSummary. Since you want to optimise on ROC, tell train so by passing metric = "ROC".
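A minimal sketch of that setup follows, run on data simulated with caret's twoClassSim() rather than the real training set. It assumes the caret, gbm, and pROC packages are installed. Note that twoClassSummary also needs classProbs = TRUE, and the outcome levels must be valid R names (such as "Yes"/"No", not "0"/"1").

```r
library(caret)

# Simulated two-class data (stand-in for the real training set).
set.seed(33)
dat <- twoClassSim(200)   # outcome column is the factor `Class`

fitControl <- trainControl(
  method          = "repeatedcv",
  number          = 4,
  repeats         = 4,
  classProbs      = TRUE,             # twoClassSummary needs class probabilities
  summaryFunction = twoClassSummary   # reports ROC, Sens, Spec
)

gbmFit <- train(
  Class ~ ., data = dat,
  method    = "gbm",
  trControl = fitControl,
  metric    = "ROC",                  # pick tuning parameters by area under the ROC curve
  verbose   = FALSE
)

gbmFit$results[, c("interaction.depth", "n.trees", "ROC")]
```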

Hope this helps.

Alain


#10

Thanks a lot Lesaffrea :smile:

It got fixed. I got an AUC of about 0.85:

Area under the curve: 0.8498


#11

Very good results, well done.
Alain