I am sharing the first baseline solutions for the BigMart sales problem. These are very primitive solutions but good to set the ball rolling.
Just to set the context, baseline solutions are the ones which don’t really need a predictive model. These are the basic solutions against which we should benchmark our first model. I am sharing two baseline solutions.
Step 1 - Setting Up:
    import pandas as pd

    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")
Solution 1: Mean Sales: The most intuitive solution is to predict the mean sales of all products for every row.
    # Determine mean of the output column:
    mean_sales = train['Item_Outlet_Sales'].mean()
    # Initialize submission dataframe with ID variables (copy so we don't write into a view of test):
    base1 = test[['Item_Identifier','Outlet_Identifier']].copy()
    # Assign outcome variable to mean value:
    base1['Item_Outlet_Sales'] = mean_sales
    # Export submission:
    base1.to_csv("submission_baseline1.csv", index=False)
Solution 2: Mean Sales by product: Another intuition might be to predict, for each row, the mean sales of that particular product.
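    # Determine mean sales of each product (pivot_table aggregates with the mean by default):
    mean_item_sales = train.pivot_table(values='Item_Outlet_Sales', index='Item_Identifier')['Item_Outlet_Sales']
    # Initialize submission dataframe with ID variables (copy so we don't write into a view of test):
    base2 = test[['Item_Identifier','Outlet_Identifier']].copy()
    # Assign outcome variable to mean value by product.
    # Products that never appear in train fall back to the overall mean (an added safeguard):
    base2['Item_Outlet_Sales'] = base2['Item_Identifier'].map(mean_item_sales).fillna(train['Item_Outlet_Sales'].mean())
    # Export submission:
    base2.to_csv("submission_baseline2.csv", index=False)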
Actually you don’t need Python to do this. It can simply be done in Excel as well. But if your model scores worse than these baselines, there is definitely something going wrong!
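If you want to compare a model against these baselines before submitting, one option is to score them on a held-out slice of the training data. Here is a minimal sketch of that idea. Note the assumptions: RMSE is used as the metric (check what the leaderboard actually uses), the helper names rmse and score_baselines are made up for this example, and the toy frames stand in for a real train/validation split of train.csv.

```python
import pandas as pd


def rmse(actual, predicted):
    """Root mean squared error between two aligned Series."""
    return float(((actual - predicted) ** 2).mean() ** 0.5)


def score_baselines(train, valid,
                    target="Item_Outlet_Sales", key="Item_Identifier"):
    """Score both baselines on a held-out frame; returns a dict of RMSEs.

    Hypothetical helper for illustration, not part of any library.
    """
    overall_mean = train[target].mean()

    # Baseline 1: predict the overall mean for every row.
    pred_mean = pd.Series(overall_mean, index=valid.index)

    # Baseline 2: predict the per-product mean; products not seen in
    # train fall back to the overall mean (an assumption of this sketch).
    item_means = train.groupby(key)[target].mean()
    pred_item = valid[key].map(item_means).fillna(overall_mean)

    return {
        "mean": rmse(valid[target], pred_mean),
        "mean_by_item": rmse(valid[target], pred_item),
    }


# Toy data standing in for a real split of train.csv.
toy_train = pd.DataFrame({
    "Item_Identifier": ["A", "A", "B", "B"],
    "Item_Outlet_Sales": [100.0, 200.0, 300.0, 400.0],
})
toy_valid = pd.DataFrame({
    "Item_Identifier": ["A", "B", "C"],
    "Item_Outlet_Sales": [150.0, 350.0, 250.0],
})

scores = score_baselines(toy_train, toy_valid)
print(scores)
```

With real data you would split train into two parts, pass them in place of the toy frames, and compute the same metric for your model's predictions on the held-out part. A model that cannot beat both numbers is not adding anything over plain averages.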
Does anyone have a better baseline solution? (Remember no modeling technique to be used - just intuition!)