Black Friday Data Hack - Reveal your approach

hackathon
black_friday

#1

Dear all,

Thanks for your participation in this exciting hackathon. The learning on a dataset like this cannot be complete until you share and discuss your approaches. So, let's do that!

Some pointers to keep the discussion specific:

  1. Which approaches did you try during the hackathon?
  2. What worked and what did not work?
  3. Did you create any additional features / variables?
  4. What approaches would you have tried if you had some more time?
  5. What problems did you face? How did you solve them? Do you want answers to any unanswered questions you might have?
  6. What data cleaning did you do? Outlier treatment? Imputation?
  7. Any suggestions for us (Team AV) to improve the experience further?

Finally, what was the best moment for you during the hackathon?

Regards,
Kunal


#2

#3

My code can be found here: https://github.com/rouseguy/BlackFridayDataHack

My best model on the public LB (~2475) was a simple average of 3 xgboost models: A) depth 8, 1450 trees; B) depth 12, 800 trees; C) depth 6, 3000 trees. The CV score for this model was around 2490.
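
For anyone who wants to reproduce the idea, here is a minimal sketch of averaging differently configured XGBoost models with the scikit-learn wrapper; only the depths and tree counts come from the post, the learning rate and sampling settings are placeholder assumptions.

import numpy as np
from xgboost import XGBRegressor

# Depths and tree counts from the post; learning rate and sampling are assumed.
configs = [
    dict(max_depth=8,  n_estimators=1450),
    dict(max_depth=12, n_estimators=800),
    dict(max_depth=6,  n_estimators=3000),
]

def average_xgb(X_train, y_train, X_test):
    """Fit each configuration and return the simple average of the predictions."""
    preds = []
    for cfg in configs:
        model = XGBRegressor(learning_rate=0.05, subsample=0.8,
                             colsample_bytree=0.8, **cfg)
        model.fit(X_train, y_train)
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)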

Another approach I tried was stacking. Stacked models gave me the best CV score, but on the public LB they scored around 2630.

I decided to trust my CV score more than the public LB and submitted the stacked model output as my final solution.
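
The post does not spell out the stacking setup; as a generic illustration only, out-of-fold stacking usually looks something like the sketch below (the base models, fold count and linear meta-learner here are assumptions, not the author's exact configuration).

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

def stack_predict(X, y, X_test, base_models, n_splits=5):
    """Out-of-fold stacking: base-model predictions become features for a meta-model."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    oof = np.zeros((len(X), len(base_models)))
    test_meta = np.zeros((len(X_test), len(base_models)))
    for j, model in enumerate(base_models):
        for tr_idx, va_idx in kf.split(X):
            model.fit(X[tr_idx], y[tr_idx])
            oof[va_idx, j] = model.predict(X[va_idx])
        model.fit(X, y)                       # refit on all data for test-set predictions
        test_meta[:, j] = model.predict(X_test)
    meta = LinearRegression().fit(oof, y)     # level-2 model trained on out-of-fold predictions
    return meta.predict(test_meta)

# e.g. stack_predict(X, y, X_test, [XGBRegressor(max_depth=8), XGBRegressor(max_depth=12)])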

Random Forest and ExtraTrees both gave public leaderboard results around 2520 (CV of around 2600). Ensembling them with the xgboost model outputs didn't get me a better public LB score (though, truth be told, I didn't experiment enough with that).

I also tried L1/L2-regularized linear regression models, but the CV score was very bad (~5000). The advantage was that the models ran really fast; neither of them took more than 3 seconds to run.
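
For reference, L1/L2-regularized linear baselines like these are only a few lines with scikit-learn (the alpha values below are arbitrary, not taken from the post).

import numpy as np
from sklearn.linear_model import Ridge, Lasso      # L2 and L1 penalised linear regression
from sklearn.metrics import mean_squared_error

def linear_baselines(X_train, y_train, X_val, y_val):
    """Fit quick L2/L1 baselines and report the validation RMSE."""
    for name, model in [("Ridge (L2)", Ridge(alpha=1.0)),
                        ("Lasso (L1)", Lasso(alpha=0.1))]:
        model.fit(X_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
        print(f"{name}: RMSE = {rmse:.1f}")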


#4

I’ve written about my model and approach on my blog: http://rohanrao91.blogspot.in/2015/11/black-friday-data-hack.html
The code is on my GitHub: https://github.com/rohanrao91/AnalyticsVidhya_BlackFriday

Hope you enjoy reading and take back something with you :smile:

Looking forward to reading about the approaches of other competitors.


#5

My system specs: 3 GB RAM, quad core, Windows 7, Python 2.7 (Anaconda).

My first challenge: when I used Python to impute -999 for missing values, my system often hung. As a workaround, I did the same in R (it took seconds), wrote the result out as a CSV and then imported that CSV in Python.
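
For what it's worth, the same -999 imputation is normally a one-liner in pandas as well; a minimal sketch (the file names are illustrative):

import pandas as pd

train = pd.read_csv("train.csv")
train = train.fillna(-999)      # replace every missing value with the -999 sentinel
train.to_csv("train_imputed.csv", index=False)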

For the modeling part, I had created a customized function for grid-search-cum-local-validation for Random Forest, keeping in mind the past AV hackathons. But what stumped me was the size of the data this time; I had to spend a lot of time deconstructing my function into smaller parts and running them separately in smaller chunks while keeping an eye on my laptop's "CPU Usage" and "Memory Utilisation" graphs :P.

After running the Random Forest for the better half of a day, I finally optimized its parameters. My local CV was stable at 26XX but my public LB score was 41XX. I didn't ensemble enough due to memory constraints; maybe that's why the variation was so large.
Then I tried to import xgb in Python, but there was some error. So I tried the next best thing, "gbm". It improved my local CV to 25XX but degraded my public LB to 52XX, which again left me baffled.

While following the "CPU Utilization" graph in Windows 7, I observed an interesting fact. While running the Random Forest, all 4 processors worked diligently at 100% throughout, but with "gbm" CPU and memory utilization were barely 50%. Random Forest invariably ended up crashing my system, but "gbm" spared it :P. Knowing which algorithm is computationally economical in which scenario would definitely be handy while working with the limited resources of a client's server.

My key takeaway from this hackathon is "large data size considerations". Machine learning algorithms were more or less a black box to me, but now it feels like I took a fleeting peek under the hood :smiley:.

After this pounding at the hands of real-world-sized data, I will have to either upgrade my system, familiarize myself with AWS, rely on feature engineering, or prepare a "Plan B" for this kind of situation.
All in all it was fun and I learnt something new.


#6

I have a Windows machine with an i5 processor and 4 GB RAM. I used Python as the programming language and XGBoost as my model.

For CV I just kept a holdout set using train_test_split (test_size=0.33). I used early stopping to find the best number of iterations, then retrained the model on the full training set for that many iterations. A single xgb model with absolutely no feature engineering is capable of reaching ~2475 on the LB. Combining 3 xgb models didn't really have much impact (it improved only to ~2465 on the LB). In the end I did a post-mortem analysis of my predictions and found some negative values. I reset these to 0, and it gave about a 0.3-0.4 improvement on the LB.
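
A minimal sketch of that workflow with the xgboost scikit-learn wrapper (the hyper-parameters shown are placeholders, not the ones used in the post):

import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

def fit_with_early_stopping(X, y, X_test):
    # 1. Hold out 33% of the data and find the best number of boosting rounds.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.33, random_state=42)
    model = XGBRegressor(n_estimators=3000, learning_rate=0.1, max_depth=8)
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
              early_stopping_rounds=50, verbose=False)
    best_rounds = model.best_iteration + 1   # best_iteration is zero-based

    # 2. Retrain on the full training set for that many rounds.
    final = XGBRegressor(n_estimators=best_rounds, learning_rate=0.1, max_depth=8)
    final.fit(X, y)

    # 3. Clip negative predictions to zero, since purchase amounts cannot be negative.
    return np.clip(final.predict(X_test), 0, None)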

The gap between the top 4 & the rest of the pack was massive. This proves that you can become a Tsonga or Berdych by tuning XGB, but to reach the level of Federer & Djokovic you definitely need to excel at feature engineering.


#7

My approach was very simple, and my current LB score is likewise, i.e. 50:

replaced missing values with 9999

removed outliers with |z-score| > 2
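
A small sketch of that outlier filter, assuming the z-score is computed on the Purchase target:

import pandas as pd
from scipy import stats

train = pd.read_csv("train.csv")
# Keep only rows whose Purchase lies within 2 standard deviations of the mean.
z = stats.zscore(train["Purchase"])
train = train[abs(z) <= 2]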

I created dummy variables for:
dummy_var = ['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status']

Label-encoded the variables mentioned below:
columns = ['Product_Category_1', 'Product_Category_2', 'Product_Category_3', 'Product_ID']

Final variables that I used in the models:

variables = ['Occupation', 'Gender_F', 'Gender_M', 'Age_0-17', 'Age_18-25', 'Age_26-35', 'Age_36-45',
             'Age_46-50', 'Age_51-55', 'Age_55+', 'City_Category_A', 'City_Category_B', 'City_Category_C',
             'Stay_In_Current_City_Years_0', 'Stay_In_Current_City_Years_1', 'Stay_In_Current_City_Years_2',
             'Stay_In_Current_City_Years_3', 'Stay_In_Current_City_Years_4+', 'Marital_Status_0',
             'Marital_Status_1', 'new_Product_ID', 'new_Product_Category_1', 'new_Product_Category_2',
             'new_Product_Category_3']

target_var = ['value_customer']

Then I ran a simple GradientBoosting model with 1000 estimators and 5 folds and got an LB score of 2736.
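
A condensed sketch of that pipeline in scikit-learn; the 1000 estimators and the variable groups come from the post, while the target is assumed to be the competition's Purchase column (renamed value_customer above) and the rest is illustrative:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import GradientBoostingRegressor

train = pd.read_csv("train.csv").fillna(9999)

# One-hot encode the low-cardinality categoricals.
dummy_var = ["Gender", "Age", "City_Category", "Stay_In_Current_City_Years", "Marital_Status"]
train = pd.get_dummies(train, columns=dummy_var)

# Label-encode the high-cardinality ones into new_* columns.
for col in ["Product_Category_1", "Product_Category_2", "Product_Category_3", "Product_ID"]:
    train["new_" + col] = LabelEncoder().fit_transform(train[col].astype(str))

X = train.drop(columns=["User_ID", "Purchase", "Product_ID",
                        "Product_Category_1", "Product_Category_2", "Product_Category_3"])
y = train["Purchase"]

model = GradientBoostingRegressor(n_estimators=1000)
model.fit(X, y)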

Linear regression, Extra Trees and other models performed very poorly.

Then I had a feeling everybody must be using XGBoost. I tried to install it, but with no success.

I would appreciate it if somebody could help me install XGBoost (Windows 8.1).

Thanks!


#8

Here is my code for a single XGB model in R (machine config: i5, 3 GB RAM, Windows 7). LB score around 2487.

XGB in R (Local CV ~2485)

rm(list=ls())

setwd("C:/Users/19501/Documents/AnalyticsVidhya/BFData")

library(readr)
train <- read_csv("train.csv")
test <- read_csv("test.csv")

train[is.na(train)] <- -999
test[is.na(test)] <- -999

library(sqldf)

# Some basic feature engineering

prod_count <- sqldf("select User_ID, count(distinct Product_ID) as Product_Count from train group by User_ID")
cust_count <- sqldf("select Product_ID, count(distinct User_ID) as User_Count from train group by Product_ID")

train_new <- sqldf("select a.*, b.Product_Count from train a left join prod_count b on a.User_ID = b.User_ID")
test_new <- sqldf("select a.*, b.Product_Count from test a left join prod_count b on a.User_ID = b.User_ID")

train_new2 <- sqldf("select a.*, b.User_Count from train_new a left join cust_count b on a.Product_ID = b.Product_ID")
test_new2 <- sqldf("select a.*, b.User_Count from test_new a left join cust_count b on a.Product_ID = b.Product_ID")

prod_cat1_count <- sqldf("select User_ID, count(distinct Product_Category_1) as Product_Count_1 from train where Product_Category_1 <> '-999' group by User_ID")
prod_cat2_count <- sqldf("select User_ID, count(distinct Product_Category_2) as Product_Count_2 from train where Product_Category_2 <> '-999' group by User_ID")
prod_cat3_count <- sqldf("select User_ID, count(distinct Product_Category_3) as Product_Count_3 from train where Product_Category_3 <> '-999' group by User_ID")

new_feat_prod <- merge(prod_cat1_count, prod_cat2_count)
new_feat_prod <- merge(new_feat_prod, prod_cat3_count)

train.new <- sqldf("select a.*, b.* from train_new2 a left join new_feat_prod b on a.User_ID = b.User_ID")
test.new <- sqldf("select a.*, b.* from test_new2 a left join new_feat_prod b on a.User_ID = b.User_ID")

train.new[is.na(train.new)] <- -999
test.new[is.na(test.new)] <- -999

feature.names <- c("Product_ID",
                   "Gender",
                   "Age",
                   "Occupation",
                   "City_Category",
                   "Stay_In_Current_City_Years",
                   "Marital_Status",
                   "Product_Category_1",
                   "Product_Category_2",
                   "Product_Category_3",
                   "Product_Count",
                   "User_Count",
                   "User_ID",
                   "Product_Count_1",
                   "Product_Count_2",
                   "Product_Count_3")

# Encoding of the Age variable

train.new[which(train.new$Age=="0-17"),"Age"] <- 17
train.new[which(train.new$Age=="18-25"),"Age"] <- 25
train.new[which(train.new$Age=="26-35"),"Age"] <- 35
train.new[which(train.new$Age=="36-45"),"Age"] <- 45
train.new[which(train.new$Age=="46-50"),"Age"] <- 50
train.new[which(train.new$Age=="51-55"),"Age"] <- 55
train.new[which(train.new$Age=="55+"),"Age"] <- 65

# Encoding of the Stay_In_Current_City_Years variable

train.new[which(train.new$Stay_In_Current_City_Years=="0"),"Stay_In_Current_City_Years"] <- 1
train.new[which(train.new$Stay_In_Current_City_Years=="1"),"Stay_In_Current_City_Years"] <- 2
train.new[which(train.new$Stay_In_Current_City_Years=="2"),"Stay_In_Current_City_Years"] <- 3
train.new[which(train.new$Stay_In_Current_City_Years=="3"),"Stay_In_Current_City_Years"] <- 4
train.new[which(train.new$Stay_In_Current_City_Years=="4+"),"Stay_In_Current_City_Years"] <- 10

# Same encodings for the test set

test.new[which(test.new$Age=="0-17"),"Age"] <- 17
test.new[which(test.new$Age=="18-25"),"Age"] <- 25
test.new[which(test.new$Age=="26-35"),"Age"] <- 35
test.new[which(test.new$Age=="36-45"),"Age"] <- 45
test.new[which(test.new$Age=="46-50"),"Age"] <- 50
test.new[which(test.new$Age=="51-55"),"Age"] <- 55
test.new[which(test.new$Age=="55+"),"Age"] <- 65

test.new[which(test.new$Stay_In_Current_City_Years=="0"),"Stay_In_Current_City_Years"] <- 1
test.new[which(test.new$Stay_In_Current_City_Years=="1"),"Stay_In_Current_City_Years"] <- 2
test.new[which(test.new$Stay_In_Current_City_Years=="2"),"Stay_In_Current_City_Years"] <- 3
test.new[which(test.new$Stay_In_Current_City_Years=="3"),"Stay_In_Current_City_Years"] <- 4
test.new[which(test.new$Stay_In_Current_City_Years=="4+"),"Stay_In_Current_City_Years"] <- 10

# Convert the remaining character features to integer codes
for (f in feature.names) {
  if (class(train.new[[f]])=="character") {
    levels <- unique(c(train.new[[f]], test.new[[f]]))
    train.new[[f]] <- as.integer(factor(train.new[[f]], levels=levels))
    test.new[[f]] <- as.integer(factor(test.new[[f]], levels=levels))
  }
}

tra <- train.new[,feature.names]
test <- test.new[,feature.names]

# Custom RMSE evaluation function for xgb.train
RMSE <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  elab <- as.numeric(labels)
  epreds <- as.numeric(preds)
  err <- sqrt(mean((epreds-elab)^2))
  return(list(metric = "RMSE", value = err))
}

# XGBoost

library(xgboost)
set.seed(100)
h <- sample(nrow(train), 10000)

dval <- xgb.DMatrix(data=data.matrix(tra[h,]), label=train.new$Purchase[h])
dtrain <- xgb.DMatrix(data=data.matrix(tra[-h,]), label=train.new$Purchase[-h])
watchlist <- list(val=dval, train=dtrain)
param <- list(objective = "reg:linear",
              #booster = "gblinear",
              eta = 0.15,
              max_depth = 8,
              subsample = 0.7,
              colsample_bytree = 0.7,
              scale_pos_weight = 0.8,
              min_child_weight = 10
)

clf <- xgb.train(params = param,
                 data = dtrain,
                 nrounds = 830,
                 verbose = 1,
                 early.stop.round = 100,
                 watchlist = watchlist,
                 maximize = FALSE,
                 feval = RMSE
)
pred1 <- predict(clf, data.matrix(test[,feature.names]))

submission <- data.frame(User_ID=test.new$User_ID, Product_ID=test.new$Product_ID, Purchase=pred1)
submission_adjust <- submission$Purchase
write_csv(submission, "XgbNew.csv")


#9

Machine: Windows (32-bit) with 2 GB RAM.
Tools: R only.
Algorithms: GBM and XGB

Approach: I used an 80/20 split throughout. I tend to rely more on my hold-out sample score than on CV, although the difference between the two was quite stable this time around.

    1. Simple GBM model without User_IDs and Product_IDs. LB: 28xx
    2. GBM with all parameters and the simple average of Purchase per Product_ID. LB: 26xx
    3. Tried various XGB models with all parameters and a couple of new features:
    • computed an average for Product_IDs by weighting the simple average and ±(0.5, 1) std dev of the Purchase amount (weights decided as a ratio of information gain)
    • added randomness to the above feature using the simple average per Occupation
    • count of Product_IDs

    Best hold-out sample score: 2562. LB: 2592

Did not try enough parameter tuning. Spent most of the time figuring out FE and using simple models. Tried moving to Python on the last day, but was unsuccessful with the installation of XGB.

Key take-aways:

  1. Dealing with significant categorical variables having high cardinality. (My best learning of the competition.)
  2. Tuning models. I feel it's an art in itself and one that should be pursued with diligence. And, given the timelines, striking a balance between spending time on FE vs tuning.
  3. Other tricks like one-hot encoding, mixing similar and dissimilar models, capping and flooring predictions for regression, etc.
  4. Having an array of tools and frameworks to work with if hoping for a top-10 spot.

#10

Hi Everyone,

First of all, thank you AV team again for such a wonderful hackathon. My approach:

  • Looked into the levels of the data and ran a basic random forest to understand feature importance; realized Product_ID was the most important feature
  • Added a new variable in Excel with the mean purchase of each product against each Product_ID (a pandas version is sketched after this list)
  • Converted all categorical variables into one-hot encoded categories
  • Built an XGB model over it and optimized the parameters
  • Got an RMSE of 2465, Public LB rank 7, Private LB rank 5
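
As referenced in the list above, the Excel step for the product-level mean can also be done with a quick pandas groupby; a sketch, assuming the standard competition column names:

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Mean purchase per Product_ID, computed on the train data only.
product_mean = (train.groupby("Product_ID")["Purchase"]
                .mean().rename("Product_Mean").reset_index())

train = train.merge(product_mean, on="Product_ID", how="left")
test = test.merge(product_mean, on="Product_ID", how="left")

# Products unseen in train fall back to the overall mean.
test["Product_Mean"] = test["Product_Mean"].fillna(train["Purchase"].mean())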

Extras:

  • Tried adding the average purchase amount on a city basis; it didn't work out well, so I dropped it
  • Tried a divide-and-rule approach since the data set was huge: divided the dataset by customer gender, but it gave an RMSE of 2674

Please find my code here.

Regards,
Aayush


#11

Hello everybody,

I have written a blog post about my approach, along with my thoughts on the winners' approaches and my learnings, tips and tricks. Do check it out if you are interested - https://medium.com/data-science-analytics/black-friday-data-science-hackathon-4172a0554944#.ggf3jybby

The code is up on my GitHub profile - https://github.com/binga/AnalyticsVidhya_BlackFriday

Please reach out to me if you have any questions regarding the approach.

Thank you.


#12

I used Python for this contest. Instead of pandas I used SFrames, which are said to have some advantages over pandas for large datasets.

I approached this as a recommender problem and used matrix factorization. The library I used for this was graphlab. https://dato.com/products/create/docs/graphlab.toolkits.recommender.html.

My understanding of matrix factorization is that the algorithm treats the problem as a matrix of the target variable, with user IDs and product IDs as the two dimensions. It then fills in the blanks in this matrix using factorization. Additionally, it incorporates features of the users or products, such as gender, product category, etc. These are called 'side features' in recommender jargon.

My final submission was a simple average of the predictions of three such models, each with slightly different hyperparameters. The code for one of the models is below:

import os
import graphlab as gl

# Originally an IPython "cd" magic; os.chdir is the plain-Python equivalent.
os.chdir(r'C:\Users\Nalin\Documents\R\Black Friday')

train = gl.SFrame('train.csv')
test = gl.SFrame('test.csv')

train['Purchase'] = train['Purchase'].astype(float)

# Core interaction data: user, item and the target.
trainBasic = gl.SFrame({'user_id': train['User_ID'], 'item_id': train['Product_ID'], 'Purchase': train['Purchase']})

# User-side and item-side features ("side features" in recommender jargon).
trainUser = gl.SFrame({'user_id': train['User_ID'], 'Gender': train['Gender'], 'Age': train['Age'], 'Occupation': train['Occupation'], 'City_Category': train['City_Category'], 'Stay_In_Current_City_Years': train['Stay_In_Current_City_Years'], 'Marital_Status': train['Marital_Status']})
trainProduct = gl.SFrame({'item_id': train['Product_ID'], 'Product_Category_1': train['Product_Category_1']})

model1 = gl.factorization_recommender.create(trainBasic, target='Purchase', user_data=trainUser, item_data=trainProduct, num_factors=70, side_data_factorization=True, random_seed=50)

# Same layout for the test data, passed in as "new" side data at prediction time.
testBasic = gl.SFrame({'user_id': test['User_ID'], 'item_id': test['Product_ID']})
testUser = gl.SFrame({'user_id': test['User_ID'], 'Gender': test['Gender'], 'Age': test['Age'], 'Occupation': test['Occupation'], 'City_Category': test['City_Category'], 'Stay_In_Current_City_Years': test['Stay_In_Current_City_Years'], 'Marital_Status': test['Marital_Status']})
testProduct = gl.SFrame({'item_id': test['Product_ID'], 'Product_Category_1': test['Product_Category_1']})

predictions = model1.predict(testBasic, new_user_data=testUser, new_item_data=testProduct)

Purchase = gl.SArray(predictions)
User_ID = gl.SArray(test['User_ID'])
Product_ID = gl.SArray(test['Product_ID'])

Submission = gl.SFrame({'User_ID': User_ID, 'Product_ID': Product_ID, 'Purchase': Purchase})
Submission.save('Sub13', format='csv')

#13

Hello Everyone,

I started by applying a simple decision tree to the data after a few sanity checks. This gave me a score of 30XX.

I tried random forest, but it didn't work out well, as even an 8 GB Core i7 configuration failed to complete it.
I used gbm to get my best score of 2931. Tried data splitting and a few other things. Tried a DT and GBM ensemble (it didn't work well).

Key takeaways:

  1. Feature engineering (most important): I excuse myself on the grounds that this was my first competition.
  2. Know-how of the available packages.
  3. Knowledge of other languages: R worked for me in this case, but when datasets are big and desires are bigger (random forests), I need to know something outside R.

Public LB: 73
Private LB: 80

What do you think of that score for a first-timer?

Thanks again to the AV team for this super awesome competition and a perfect data set.


#14

My code is hosted here: https://github.com/vi3k6i5/black_friday_data_hack

I have detailed my approach here: https://medium.com/@vi3k6i5/black-friday-data-science-hackathon-at-analyticsvidhya-com-6489b3298e94#.pmvekr7a0

PS: A lot of my attempts were disastrously bad, which I have skipped. :wink:


#15

@ank_dsm,

The ranking doesn't matter, but the learning does. I am personally in touch with most of the top 10 people and, trust me, nobody participates here just to win; the immense learning takeaways alone are worth a million.

So if you have dedicated time and participated diligently, you are a winner :smile: Starting out the first time is always tough; looking forward to seeing you next time, hopefully on top of the leaderboard.

Happy learning.
Aayush


#16

Hi Kunal,

I missed the hackathon and couldn't participate. Could you please provide the problem statement? The link seems to have been disabled now.

Thanks,
Ranjan


#18

Hi Rohan - thanks for posting your code. Reading through it was a great learning experience.

I do have one question, though. You converted the 'Age' and 'Stay in Current City' variables to numeric. I can understand that doing this could improve the predictive capability of a linear model, as it would reduce the degrees of freedom for such a model. But does the same logic hold true for a tree-based model? Wouldn't converting them to numeric variables actually increase the degrees of freedom for such a model?


#19

The idea behind converting the variables to numeric is only to 'allow' the model to quantify the distance between the values of these variables. For example, someone aged 26-35 is nearly 'half' the age of someone aged 55+. That 'half' can get lost if they are kept as categories.

Ultimately, quantifying these may or may not make a difference depending on the data, but in general I have found that, as far as possible, converting such variables to numeric tends to perform better, irrespective of whether it's a linear model or a tree-based model.
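
In pandas, that conversion is just an explicit mapping; a small illustration (the specific numeric values are a modelling choice, here roughly the upper end of each bracket):

import pandas as pd

train = pd.read_csv("train.csv")

age_map = {"0-17": 17, "18-25": 25, "26-35": 35, "36-45": 45,
           "46-50": 50, "51-55": 55, "55+": 65}
stay_map = {"0": 0, "1": 1, "2": 2, "3": 3, "4+": 4}

train["Age"] = train["Age"].map(age_map)
train["Stay_In_Current_City_Years"] = train["Stay_In_Current_City_Years"].map(stay_map)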

In this contest it made very little difference, but I preferred to keep it this way :smile:

I loved the simplicity and the idea behind your model. Lots of takeaways from your code too. Thanks for sharing!


#20

Hi Rohan,

First of all, job well done. I had a question for you: going through your code, I saw that while doing feature engineering for the test data, you used the same product-mean values derived from the train data. That means any particular ID seen in the train data will carry the same value into the test data as well. The problem statement was to predict Purchase on the test data, yet you derive the product mean from the Purchase values of the train data and append the same values to the test data. Can you please elaborate on this further?


#21

Hi,

The idea behind using Product_Mean is to enable the model to cluster products that move in a similar amount range. Since Product_ID was the most important variable, instead of using the ID itself, which does not provide any information, I replaced it with the mean, which gives an indication of the 'average' purchase amount of that product.

Notice that I have removed Product_ID as a variable. The value used in train and test has to be the same, since it is a feature built from the target variable of the train data.