Error in `$<-.data.frame`

r

#1

Hi,

I am using the CART algorithm to predict a score and a class for my data set, but when I try to predict scores for the validation data set, I get the error below.

Error in $<-.data.frame(*tmp*, "predict.class", value = c(1L, 1L, :
replacement has 4607 rows, data has 2012
In addition: Warning message:
'newdata' had 2012 rows but variables found have 4607 rows

Validation data set str

'data.frame': 2012 obs. of 13 variables:
 $ X : int 1625 796 397 1879 2901 4728 3596 6150 3906 2182 ...
 $ IT.Exp : int 108 55 99 118 45 132 28 51 113 108 ...
 $ Non.IT.Exp : int 0 0 0 0 0 0 0 0 0 0 ...
 $ Candidate.Last.Employer.Name: Factor w/ 728 levels "A N IT SOLUTIONS PVT LTD",..: 169 449 270 603 350 590 257 534 294 636 ...
 $ Interview.city : Factor w/ 14 levels "BANGALORE","CHANDIGARH",..: 3 6 1 3 6 1 6 1 1 9 ...
 $ Educational.Type : Factor w/ 107 levels "Accountancy",..: 28 65 38 89 17 73 25 91 65 91 ...
 $ Education : Factor w/ 52 levels "ACCOUNTS","APPLIED",..: 13 25 18 39 11 30 13 41 25 41 ...
 $ Source : Factor w/ 5 levels "AGENCIES","CAREER PORTAL",..: 5 5 1 4 2 5 2 5 4 4 ...
 $ Source.Name.Updated : Factor w/ 20 levels "CAREER PORTAL",..: 14 14 8 10 1 14 1 14 10 9 ...
 $ SFO : Factor w/ 2 levels "NO","YES": 1 1 2 1 1 2 1 1 2 1 ...
 $ Joined : Factor w/ 2 levels "1","2": 1 1 1 1 1 2 1 1 2 1 ...
 $ Last.Employer : Factor w/ 452 levels "A","AAKIT","ABB",..: 97 280 169 383 209 378 161 330 185 394 ...
 $ random : num 0.7 0.7 0.7 0.7 0.701 ...

Development data set str

'data.frame': 4607 obs. of 16 variables:
 $ X : int 2793 1884 2759 3471 3481 1009 6106 5588 6014 6124 ...
 $ IT.Exp : int 46 48 114 49 226 49 40 36 60 43 ...
 $ Non.IT.Exp : int 0 0 0 0 0 0 0 0 0 0 ...
 $ Candidate.Last.Employer.Name: Factor w/ 1439 levels "1.\tCGI INFORMATION MANAGEMENTS AND CONSULTANTS PVT LTD",..: 964 1311 1216 1393 1233 1393 43 475 302 1331 ...
 $ Interview.city : Factor w/ 17 levels "BANGALORE","BHUBANESWAR",..: 1 1 7 7 1 7 15 1 15 4 ...
 $ Educational.Type : Factor w/ 149 levels "Accountancy",..: 115 119 10 119 119 21 38 91 119 87 ...
 $ Education : Factor w/ 71 levels "ACCOUNTS","AERONAUTICAL",..: 48 51 6 51 51 13 18 35 51 33 ...
 $ Source : Factor w/ 5 levels "AGENCIES","CAREER PORTAL",..: 1 4 4 4 5 1 5 5 1 5 ...
 $ Source.Name.Updated : Factor w/ 21 levels "CAREER PORTAL",..: 13 11 11 11 15 8 15 15 9 15 ...
 $ SFO : Factor w/ 2 levels "NO","YES": 1 1 2 2 2 1 2 1 1 1 ...
 $ Joined : Factor w/ 2 levels "NO","YES": 1 1 1 2 2 1 1 1 1 1 ...
 $ Last.Employer : Factor w/ 818 levels "1.\tCGI","14GLOBAL",..: 539 746 700 797 701 797 14 264 156 761 ...
 $ random : num 4.72e-05 6.55e-05 8.59e-05 1.97e-04 3.00e-04 ...
 $ predict.class : Factor w/ 2 levels "NO","YES": 1 1 1 1 1 1 1 1 1 1 ...
 $ predict.score : num [1:4607, 1:2] 0.99 0.771 0.99 0.771 0.771 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr "1" "2" "3" "4" ...
  .. ..$ : chr "NO" "YES"
 $ deciles1 : num 1 9 1 9 9 9 9 9 9 9 ...


#2

I found the solution. Previously I was building the model with only the significant variables, while the data frame still contained every variable from the file. Fitting the model was not a problem, but predicting scores for the validation data threw the error because, as the message indicates, the variables used for modeling and for scoring were different. I therefore created a separate data set containing only the significant variables, and after that scoring the validation data worked without any problem.
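For anyone landing here with the same error, below is a minimal sketch of that fix using `rpart` as the CART implementation (an assumption, since the thread does not say which package was used). The data are simulated stand-ins, and `make_data`, `keep`, `dev.model` and `val.model` are hypothetical names, with the "significant" variables picked arbitrarily. The point is that both data sets are reduced to the same named columns, the formula refers to columns by bare name with `data =`, and `predict()` then returns one row per row of `newdata`.

```r
library(rpart)  # CART implementation (assumed; not named in the thread)

set.seed(1)

# Simulated stand-in for the development / validation data in the post
make_data <- function(n) {
  d <- data.frame(
    IT.Exp     = sample(20:230, n, replace = TRUE),
    Non.IT.Exp = sample(0:24, n, replace = TRUE),
    SFO        = factor(sample(c("NO", "YES"), n, replace = TRUE),
                        levels = c("NO", "YES")),
    random     = runif(n)  # a column that is not used in the model
  )
  d$Joined <- factor(ifelse(d$IT.Exp > 100 & d$SFO == "YES", "YES", "NO"),
                     levels = c("NO", "YES"))
  d
}

dev <- make_data(400)
val <- make_data(150)

# Keep only the response and the variables used for modeling
# (which variables count as "significant" is an assumption here)
keep      <- c("Joined", "IT.Exp", "Non.IT.Exp", "SFO")
dev.model <- dev[, keep]
val.model <- val[, keep]

# Use bare column names plus `data =` so predict() can look the
# same names up in `newdata`
fit <- rpart(Joined ~ ., data = dev.model, method = "class")

pred.class <- predict(fit, newdata = val.model, type = "class")
pred.score <- predict(fit, newdata = val.model, type = "prob")[, "YES"]

# Both results have one entry per validation row, so the assignment works
val.model$predict.class <- pred.class
val.model$predict.score <- pred.score
```

A quick check before assigning is `length(pred.class) == nrow(val.model)`. If that is `FALSE`, `predict()` did not find the model variables in `newdata` and used the development data instead, which is what the warning about 2012 versus 4607 rows is reporting.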