How are Rows increasing when i try to do feature engineering?

dataexploration
r

#1

Hello,

I am begginer in Data science and I am working in Big market sale dataset using R.

Now when I do Data manipulation and adding “Outlet_year” columns from “Outlet_Establishment_Year”.
then after combine “Outlet_Year” column with Main dataset.

So it should add one variable means only add 1 column name “Outlet_year” only but when I do this step then column added but also rows increasing and it reach 23590924.

How rows increasing I don’t understand. Can any one please tell me whats wrong in it?

Thanks in Advance


#2

@premsheth - Can you please share your code.


#3

Thanks for reply
My code is following:

path for setting dictionary

path <- "/home/tsunami/R/Dataset_with_solution/big_market_sales"
setwd(path)

#load dataset
train <- read.csv(“Train.csv”)
test <- read.csv(“Test.csv”)

#load essential libraries
loadlibraries = function(){
library(ggplot2)
library(dplyr)
library(plyr)
print(“The libraries have been loaded”)
}
loadlibraries()

Data mining ( Imputing MIssing Values in OUtlet_weight)

dim(train)
dim(test)

Length of both Vector is not same so we need to create one more column for Item_OUtlet_sales

#in Test Vector

test$Item_Outlet_Sales <- 1
combi <- rbind(train,test)

Median is highly robust to Outlier so we replace all missing value from median

combi$Item_Weight[is.na(combi$Item_Weight)] <- median(combi$Item_Weight, na.rm = TRUE)
table(is.na(combi$Item_Weight))

If we can see iteam_visibility Vs iteam_Outlet_sales graph there is some 0 values

ANd it is not feasible because Item cant occupies shelf space in Store

combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0, median(combi$Item_Visibility),combi$Item_Visibility)

Exploration for categorical Data.

1) In OUtlet_Size there are 4 levels but for first level there is no name so we need to give name for it

levels(combi$Outlet_Size)[1] <- “Others”

2) In Item_fat_content There is some mis matched like LF = Low Fat and low fat = Low Fat

we need to give revalue.

combi$Item_Fat_Content = revalue(combi$Item_Fat_Content, c(“LF” = “Low Fat”, “reg” = “Regular”))
combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content, c(“low fat” = “Low Fat”))
table(combi$Item_Fat_Content)

Data Manipultaion

#1) Count of Outlet Identifier
a <- combi%>%
group_by(Outlet_Identifier)%>%
tally()
names(a)[2] <- "Outlet_Count"
head(a)

Now combine a with Outlet_Identifier

combi <- full_join(a,combi,by = “Outlet_Identifier”)
dim(combi)

#2) Count of Item Identifier
b<- combi%>%
group_by(Item_Identifier)%>%
tally()
names(b)[2] <- "Item_count"
head(b)

#combine with main dataset
combi <- merge(b,combi,by = “Item_Identifier”)
dim(combi)

#3) Outlet Year it will show how many years Outlet old
c <- combi%>%
select(Outlet_Establishment_Year)%>%
mutate(Outlet_Year = 2013 - combi$Outlet_Establishment_Year)
head©
combi <- full_join(c, combi)
dim(combi)

After adding C vector rows increasing
if i write by = Outlet_Establishment_year still same problem


#4

@premsheth - Problem occurs when you are doing full join on combi <- full_join(c, combi) . Instead of this try left join or use sqldf
. I hope it will solve your problem.


#5

Thanks for reply.
I tried full_sqldf function but its give me error
Error: could not find function “full_sqldf”

I installed “sqldf” package also and again try to run this function. but it gives same error.

Also I tried only “sqldf” function. same error

Can you please tell me in which package this function?


#6

You can use library sqldf
i am giving you a sample code for Sqldf
library(sqldf)
df9 = sqldf(‘select df1., df2. from df1 left join df2 on df1.ID = df2.ID’)

just specify your data frames and columns you need.For left join decide which common attribute you will take,


#7

Sorry I am not able to understand sqldf function.

But I have independent column “c” as one variable. Now I need to add this column variable in my combine data frame “combi”.

How can i add this “c” variable in my dataset?

i tried following code:
combi = sqldf(‘select c, combi from c left join combi on c.ID = combi.ID’)
error in statement: no such column: c

I think sqldf works like take one column from first dataset and add in second dataset.

But in my problem I have single column and need to add in dataset.


#8

the main problem you are facing is you are joining a dataset C with a big data frame combi , because of that you are getting the size of data set around 23590924 .


#9

why dont you use this combi$years <-(2013 -combi1$Outlet_Establishment_Year)
to find out how many years Outlet is old


#10

Yes but I have to join.

If u can see my code I added two column in big data but when I try to join third one it gives me this problem.


#11

Thank you so much HUNAIDKHAN sir learn me to do things simple way.

problem resolved.


#12

Your welcome


#13

but the question still remains…

combi <- full_join(a, combi, by = “Outlet_Identifier”)
combi <- merge(b, combi, by = “Item_Identifier”)

        mutate(Outlet_Year = 2013 - combi$Outlet_Establishment_Year)

combi <- full_join(c, combi)

three additions were done…
the code for the first addition is different from the second…i dont know why this was done…
How is full_join same as/different than merge?

and secondly, why is there a “mutate” command?
i havent read till now what mutate does, but since there is a subtraction going on here, cant we simply write:
Outlet_Year = 2013 - combi$Outlet_Establishment_Year
and also the code for the third addition is different from the first two additions…

the increase in size of the variable is because we get 14204 * 14204 records…
somehow 14204 additions are getting done for each record… i dont know how


#14

The following codes dont work, i dont know why

i have used temp1 instead of combi

  1. temp1 = merge(c, temp1, by = “Outlet_Establishment_Year”)
  2. temp1 <- full_join(c, temp1)
  3. temp1 <- full_join(c, temp1, by = “Outlet_Establishment_Year”)

the following code works, but i was unable to change the column name from c$Outlet_Age…haha
temp1=cbind(temp1,c$Outlet_Age) #works

the best solution, courtesy HUNAIDKHAN2000 was:
temp$Outlet_Age = 2013 -temp$Outlet_Establishment_Year
works perfectly, thanks!

Can anybody explain why the other codes dont work properly?
we get 14204*14204 records
and the size of the variable goes into GBs


#15

Your welcome


#16

Hi dear Premsheth, i’m facing the same problem now, can u please show me how did you solve it. Thanks.


#17

HI @kolawole ,

I had just substract Outlet_Establishment_Year from 2013
combi$Outlet_Age = 2013 -temp$Outlet_Establishment_Year
You can see last @HUNAIDKHAN2000 comment. Thats answer.


#18

one last thing, can you send me the full code for this part? Thank you i mean from the begining of Outlet Years to Combi " combi <- full_join(c, combi)"


#19

Hi @kolawole,

You can refer the below mentioned course to have feature engineering ideas and their implementation in R:


#20

Hi Pulkits thank you!