How are Rows increasing when i try to do feature engineering?




I am begginer in Data science and I am working in Big market sale dataset using R.

Now when I do Data manipulation and adding “Outlet_year” columns from “Outlet_Establishment_Year”.
then after combine “Outlet_Year” column with Main dataset.

So it should add one variable means only add 1 column name “Outlet_year” only but when I do this step then column added but also rows increasing and it reach 23590924.

How rows increasing I don’t understand. Can any one please tell me whats wrong in it?

Thanks in Advance


@premsheth - Can you please share your code.


Thanks for reply
My code is following:

path for setting dictionary

path <- "/home/tsunami/R/Dataset_with_solution/big_market_sales"

#load dataset
train <- read.csv(“Train.csv”)
test <- read.csv(“Test.csv”)

#load essential libraries
loadlibraries = function(){
print(“The libraries have been loaded”)

Data mining ( Imputing MIssing Values in OUtlet_weight)


Length of both Vector is not same so we need to create one more column for Item_OUtlet_sales

#in Test Vector

test$Item_Outlet_Sales <- 1
combi <- rbind(train,test)

Median is highly robust to Outlier so we replace all missing value from median

combi$Item_Weight[$Item_Weight)] <- median(combi$Item_Weight, na.rm = TRUE)

If we can see iteam_visibility Vs iteam_Outlet_sales graph there is some 0 values

ANd it is not feasible because Item cant occupies shelf space in Store

combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0, median(combi$Item_Visibility),combi$Item_Visibility)

Exploration for categorical Data.

1) In OUtlet_Size there are 4 levels but for first level there is no name so we need to give name for it

levels(combi$Outlet_Size)[1] <- “Others”

2) In Item_fat_content There is some mis matched like LF = Low Fat and low fat = Low Fat

we need to give revalue.

combi$Item_Fat_Content = revalue(combi$Item_Fat_Content, c(“LF” = “Low Fat”, “reg” = “Regular”))
combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content, c(“low fat” = “Low Fat”))

Data Manipultaion

#1) Count of Outlet Identifier
a <- combi%>%
names(a)[2] <- "Outlet_Count"

Now combine a with Outlet_Identifier

combi <- full_join(a,combi,by = “Outlet_Identifier”)

#2) Count of Item Identifier
b<- combi%>%
names(b)[2] <- "Item_count"

#combine with main dataset
combi <- merge(b,combi,by = “Item_Identifier”)

#3) Outlet Year it will show how many years Outlet old
c <- combi%>%
mutate(Outlet_Year = 2013 - combi$Outlet_Establishment_Year)
combi <- full_join(c, combi)

After adding C vector rows increasing
if i write by = Outlet_Establishment_year still same problem


@premsheth - Problem occurs when you are doing full join on combi <- full_join(c, combi) . Instead of this try left join or use sqldf
. I hope it will solve your problem.


Thanks for reply.
I tried full_sqldf function but its give me error
Error: could not find function “full_sqldf”

I installed “sqldf” package also and again try to run this function. but it gives same error.

Also I tried only “sqldf” function. same error

Can you please tell me in which package this function?


You can use library sqldf
i am giving you a sample code for Sqldf
df9 = sqldf(‘select df1., df2. from df1 left join df2 on df1.ID = df2.ID’)

just specify your data frames and columns you need.For left join decide which common attribute you will take,


Sorry I am not able to understand sqldf function.

But I have independent column “c” as one variable. Now I need to add this column variable in my combine data frame “combi”.

How can i add this “c” variable in my dataset?

i tried following code:
combi = sqldf(‘select c, combi from c left join combi on c.ID = combi.ID’)
error in statement: no such column: c

I think sqldf works like take one column from first dataset and add in second dataset.

But in my problem I have single column and need to add in dataset.


the main problem you are facing is you are joining a dataset C with a big data frame combi , because of that you are getting the size of data set around 23590924 .


why dont you use this combi$years <-(2013 -combi1$Outlet_Establishment_Year)
to find out how many years Outlet is old


Yes but I have to join.

If u can see my code I added two column in big data but when I try to join third one it gives me this problem.


Thank you so much HUNAIDKHAN sir learn me to do things simple way.

problem resolved.


Your welcome


but the question still remains…

combi <- full_join(a, combi, by = “Outlet_Identifier”)
combi <- merge(b, combi, by = “Item_Identifier”)

        mutate(Outlet_Year = 2013 - combi$Outlet_Establishment_Year)

combi <- full_join(c, combi)

three additions were done…
the code for the first addition is different from the second…i dont know why this was done…
How is full_join same as/different than merge?

and secondly, why is there a “mutate” command?
i havent read till now what mutate does, but since there is a subtraction going on here, cant we simply write:
Outlet_Year = 2013 - combi$Outlet_Establishment_Year
and also the code for the third addition is different from the first two additions…

the increase in size of the variable is because we get 14204 * 14204 records…
somehow 14204 additions are getting done for each record… i dont know how


The following codes dont work, i dont know why

i have used temp1 instead of combi

  1. temp1 = merge(c, temp1, by = “Outlet_Establishment_Year”)
  2. temp1 <- full_join(c, temp1)
  3. temp1 <- full_join(c, temp1, by = “Outlet_Establishment_Year”)

the following code works, but i was unable to change the column name from c$Outlet_Age…haha
temp1=cbind(temp1,c$Outlet_Age) #works

the best solution, courtesy HUNAIDKHAN2000 was:
temp$Outlet_Age = 2013 -temp$Outlet_Establishment_Year
works perfectly, thanks!

Can anybody explain why the other codes dont work properly?
we get 14204*14204 records
and the size of the variable goes into GBs


Your welcome