How to decide on feature engineering while making my prediction model?

sas
kaggle

#1

Hi,

I was wondering if there is a good source of materials on the feature selection process?

I am not talking about PCA or anything automated. I am talking about how you choose which features to use in your prediction model based on correlation analysis or something else, and which transformations you should apply to existing features to derive new ones better suited for prediction. What are the current ways to do this? Are there any case studies on this? What materials (books / courses / articles) can you recommend on this topic?

Right now I am trying to find the best solution for the Kaggle bike sharing competition - http://www.kaggle.com/c/bike-sharing-demand/. This kind of task is the most relevant to my interests; I want to learn some generic principles / workflows of feature selection applicable to tasks like this. The resulting features should be understandable by a business person, not only by automated algorithms that somehow combine them in a weird way without any interpretation. I am not interested in automatic selection of features for image / text recognition.

Any links or thoughts are more than appreciated. Thanks.


#2

SAS Enterprise Miner has comprehensive feature selection nodes (e.g. variable clustering, variable selection, principal components and partial least squares) as well as modeling nodes (e.g. regression and decision tree) for variable selection/reduction. Some use principal component analysis as one component, while others are based on other criteria (e.g. R-squared and chi-squared). The latter use the target (i.e. the dependent variable) for supervised selection. None of these are 100% automated: users need to understand the rationale in order to fill in the options/parameters, and once you have mastered a technique, you should be able to explain its steps. Of course, not many companies have an Enterprise Miner license. However, we can always borrow the conceptual framework of any of these and apply it using some other tool.
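Since the concept does not depend on the tool, here is a minimal sketch of the same supervised-selection idea in Python with scikit-learn, assuming the Kaggle bike-sharing column names (the file name and feature list are just placeholders):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

df = pd.read_csv("train.csv")                       # placeholder file name
X = df[["temp", "atemp", "humidity", "windspeed"]]  # candidate features
y = df["count"]                                     # target (dependent variable)

# Univariate F-test, similar in spirit to the R-squared based selection above
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# Inspect the scores so the selection stays explainable, not a black box
for name, score in zip(X.columns, selector.scores_):
    print(f"{name}: F-score = {score:.1f}")
print("kept:", list(X.columns[selector.get_support()]))
```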


#3

Forgot this one:

SAS Enterprise Miner also has a data transformation node to help select the best transformation from, e.g., log, x squared, inverse and binning, to name a few.
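The same search can be approximated by hand: apply a few candidate transformations to a variable and keep whichever correlates most strongly with the target. A rough sketch, again assuming the Kaggle column names:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")
x, y = df["windspeed"], df["count"]

candidates = {
    "identity": x,
    "log":      np.log1p(x),         # log(1 + x) handles zeros
    "squared":  x ** 2,
    "inverse":  1.0 / (x + 1.0),     # shift avoids division by zero
    "binned":   pd.qcut(x, 4, labels=False, duplicates="drop"),
}

for name, xt in candidates.items():
    print(f"{name:8s} corr with target = {pd.Series(xt).corr(y):.3f}")
```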


#4

@Ishitori Awesome question!

A few disclaimers first:

  • I have not spent a lot of time on the competition, though I had a look at it when it started.
  • The nature of your question is very broad and the answer can vary from situation to situation, so whatever I write here is only part of a far bigger canvas.

One of the things that helps me immensely with feature engineering is thinking about all possible hypotheses and listing them before actually looking at the data. This might sound a bit counter-intuitive, but it has served me well. For example, for this competition, you would start by writing down various hypotheses like:

  • Demand for bikes on weekdays would be driven by office-goers and registered users, while demand on weekends would come from the tourist population.
  • Similarly, there would be hypotheses about the time of day.
  • Another hypothesis would be about the weather (I think there is a "feels like" temperature variable as well).

These hypotheses would also tell you what transformations to use. For example, you might hypothesize that on a wet day, in the evening (when the "feels like" temperature is down), demand should see a steep drop - this should then help you visualize the step function needed for the transformation.
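To make this concrete, here is a minimal sketch of turning such hypotheses into features with pandas (the column names assume the Kaggle bike-sharing data, and the thresholds are made up for illustration):

```python
import pandas as pd

df = pd.read_csv("train.csv", parse_dates=["datetime"])

# Hypothesis 1: weekday demand comes from commuters, weekend demand from tourists
df["is_weekend"] = (df["datetime"].dt.dayofweek >= 5).astype(int)

# Hypothesis 2: time of day matters
df["hour"] = df["datetime"].dt.hour

# Hypothesis 3: a wet evening with a low "feels like" temperature (atemp)
# depresses demand - a crude step function, with arbitrary cut-offs
df["wet_cold_evening"] = (
    (df["weather"] >= 3) & (df["atemp"] < 15) & (df["hour"] >= 17)
).astype(int)
```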

In terms of resources, you can have a look at various machine learning open courses. Here is one of the presentations I found useful some time back:

Hope this helps.


#5

Thanks Kunal,

I completely agree with your idea about hypothesis-based feature engineering. I skimmed through the presentation you sent and found that it is about speech and image recognition - things I am not really interested in :slight_smile: Or do you find the same approach suitable for this kind of business analysis task?

I got your ideas and found some others. Can you tell me how to leverage these ideas? Do you divide the model based on them, or create a new feature to express each idea better?

For example, I found that in 2012 the company had more clients than in 2011. Does that mean I should create a factor for 2011 and 2012? Or should I divide the dataset in two, train a model on each part separately, and then, when predicting, choose the model that corresponds to the observation's year?

The same question applies to the hour column. I found that during rush hours (7-8am and 5-6pm) on working days there is a spike in bike demand. How can I leverage this knowledge? Probably I cannot divide the dataset here and should create a feature instead, right?
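(For reference, a quick way to see this pattern - a sketch assuming Kaggle's train.csv: average demand per hour, split by working day.)

```python
import pandas as pd

df = pd.read_csv("train.csv", parse_dates=["datetime"])
df["hour"] = df["datetime"].dt.hour

# Rows are hours, columns are workingday = 0/1; the 7-8am and 5-6pm
# spikes show up in the workingday = 1 column
print(df.groupby(["workingday", "hour"])["count"].mean().unstack(level=0))
```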

So, my initial question relates to these two examples: how do I find more things like that, and how exactly should I leverage such findings to build a better model? What are the common ways to do so? And when should I split the dataset into separate datasets and treat them individually, and when should I just create a feature?

Are there any other ways to introduce more knowledge into a model besides direct feature creation and dataset splitting? For example, reading this article http://blog.graphlab.com/using-gradient-boosted-trees-to-predict-bike-sharing-demand I found that the author recommends:

…, the evaluation metric is the RMSE in the log domain, so we should transform the target columns into log domain as well.

He converts the counts to a logarithmic scale and, after prediction, converts them back to the normal scale. Does that make sense? Is transforming a variable to a logarithmic scale a popular approach?
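As I understand it, the trick looks something like this (a sketch using scikit-learn instead of the article's GraphLab, with a made-up feature list):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.read_csv("train.csv", parse_dates=["datetime"])
X = df[["temp", "humidity", "windspeed"]]  # toy feature set for illustration
y_log = np.log1p(df["count"])              # target into the log domain

model = GradientBoostingRegressor().fit(X, y_log)
preds = np.expm1(model.predict(X))         # back to the normal scale
```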

Thanks.


#6

And also: for example, the temp and atemp variables are highly correlated. What should you do in that situation? Should you remove one variable in favor of the other?
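(A quick check, assuming Kaggle's train.csv:)

```python
import pandas as pd

df = pd.read_csv("train.csv")
print(df[["temp", "atemp"]].corr())  # close to 1.0 in this dataset
```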


#7

In order to leverage the ideas, it is usually better to create a new feature than to create a different model. You should only look at creating different models if you are unable to create features which capture the hypothesis well.

For the examples you mentioned, you should definitely have a feature for the year in a single model. Similarly, for the second example, you can create a new feature called rush_hours (Y or N). You can also try creating rush_hour_morning and rush_hour_evening if you expect them to behave differently (traffic is more spread out in the evening).
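A minimal sketch of those features in pandas (assuming the Kaggle columns; the rush hours are the ones from your earlier post):

```python
import pandas as pd

df = pd.read_csv("train.csv", parse_dates=["datetime"])
df["year"] = df["datetime"].dt.year
hour = df["datetime"].dt.hour

is_work = df["workingday"] == 1
df["rush_hour_morning"] = (is_work & hour.isin([7, 8])).astype(int)
df["rush_hour_evening"] = (is_work & hour.isin([17, 18])).astype(int)
df["rush_hours"] = (df["rush_hour_morning"] | df["rush_hour_evening"]).astype(int)
```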

Splitting the dataset to create different models should usually be reserved for cases where you want to use a different modeling technique on each subset. There might be other use cases as well.

On the transformations: log transformations are pretty common, especially in cases where the growth is exponential. If the growth is non-linear, you can also look at taking square roots. For cyclic data, you can use trigonometric transformations like sine / cosine or inverse sine / inverse cosine.
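For the cyclic case, a small sketch (numpy / pandas, assuming the Kaggle datetime column) that encodes the hour so 23:00 and 00:00 end up close together:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv", parse_dates=["datetime"])
hour = df["datetime"].dt.hour

df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * hour / 24)
```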

Regards,
Kunal


#8

Thanks for the answer, Kunal.

Can you tell me more about transformations? What is the idea behind them? Are we trying to make the dependency as linear as possible on some scale?