Query regarding ML feature extraction and aggregated features

machine_learning
logistic_regression
feature_engineering

#1

Hopefully I’ve articulated my queries as clearly as possible. I’ve provided sample data below (assume this is 100,000+ rows)

Data characteristics

  • Each row is a unique observation
  • The first 3 columns depict the raw data. “Purchased a car” is the outcome (1 or 0)
  • Remaining columns represent extracted features

Objective

  • To build a binary classification model (I’m using logistic regression) to predict whether a person is likely to buy a car

Feature extraction

  • While analysing the raw data, I constructed additional features that tell me how many people within a similar age range purchased or didn’t purchase a car. For example, for the first entry (age 22), the +/-5% bounds round to 21 and 23. The two “No. of people” features count how many people within that age range did or did not purchase a car (aggregated across the whole dataset)
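
To make the derivation concrete, here is a minimal sketch that recomputes the derived columns for the sample rows in the Data table. The function names (`age_band`, `neighbour_counts`) are made up for illustration, and the round-half-up rule is an assumption inferred from the table (for example, 50 - 5% = 47.5 is shown as 48).

```python
# Sample rows from the Data table: (age, location, purchased)
ROWS = [
    (22, "Penrith", 1), (33, "Peakhurst", 1), (21, "Peakhurst", 1),
    (33, "Peakhurst", 1), (29, "Peakhurst", 1), (18, "Penrith", 1),
    (50, "Penrith", 0), (52, "Penrith", 0), (33, "Penrith", 0),
    (61, "Penrith", 0), (63, "Penrith", 0), (77, "Penrith", 0),
]


def age_band(age, pct=5):
    """Return (lower, upper) = age -/+ pct%, rounded half up to whole years.

    Integer arithmetic avoids floating-point surprises at the .5 boundary.
    """
    lower = (age * (100 - pct) + 50) // 100
    upper = (age * (100 + pct) + 50) // 100
    return lower, upper


def neighbour_counts(age, reference_rows, pct=5):
    """Count people in reference_rows inside the age band, split by outcome."""
    lower, upper = age_band(age, pct)
    in_band = [y for a, _, y in reference_rows if lower <= a <= upper]
    purchased = sum(in_band)
    return purchased, len(in_band) - purchased


# One tuple per row: (age - 5%, age + 5%, no. who purchased, no. who did not)
features = [age_band(a) + neighbour_counts(a, ROWS) for a, _, _ in ROWS]

for (age, loc, y), feat in zip(ROWS, features):
    print(age, loc, y, *feat)
```

Note that each count includes the row's own outcome: the single purchaser counted for the 29-year-old is that row itself. That is one way the label can end up inside the feature.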

What I need advice on

  • Is this method of feature extraction sound, and are there any potential gotchas I ought to be aware of? I’ve pored over numerous forums / articles on feature engineering, but derived features (in particular those that are aggregated from the raw data) are something on which I couldn’t find much literature
  • When it comes to splitting the dataset into test/train … how should this feature be treated? If I were to split the raw data into test/train before feature derivation, the “No. of people” features would look vastly different between test and train
  • Is this misuse of feature extraction?

A possible option … but is this sound?

  • Split the raw data into test/train
  • For the train data, extract these additional “No. of people” features
  • For the test data, extract these additional “No. of people” features using train data. This will ensure that the “No. of people” counts used when validating on test data are reflective of the training dataset
  • When predicting any new observations, the “No. of people” features would need to be computed based on the test dataset
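
The option above could be sketched roughly as follows. This is only an illustration under assumptions: `count_features` is a hypothetical helper, the 9/3 split and the seed are arbitrary, and the data set used as the reference for the counts is passed in as a parameter so either choice can be tried.

```python
import random


def count_features(age, reference, pct=5):
    """Count buyers / non-buyers in `reference` aged within +/- pct% of `age`."""
    lower = (age * (100 - pct) + 50) // 100  # rounded half up
    upper = (age * (100 + pct) + 50) // 100
    in_band = [y for a, y in reference if lower <= a <= upper]
    purchased = sum(in_band)
    return purchased, len(in_band) - purchased


# (age, purchased) pairs taken from the sample table
raw = [(22, 1), (33, 1), (21, 1), (33, 1), (29, 1), (18, 1),
       (50, 0), (52, 0), (33, 0), (61, 0), (63, 0), (77, 0)]

# Step 1: split the raw data first
shuffled = raw[:]
random.Random(0).shuffle(shuffled)
train, test = shuffled[:9], shuffled[9:]

# Step 2: derive the features for the train rows from the train rows
X_train = [count_features(age, train) for age, _ in train]

# Step 3: derive the features for the test rows from the train rows only
X_test = [count_features(age, train) for age, _ in test]

# Step 4: a new observation has no label yet, so its counts must come
# from a labelled reference set (the train rows in this sketch)
new_person = count_features(30, train)
```

Because `count_features` never reads the outcomes of the test rows, changing those outcomes leaves `X_test` unchanged. For the train rows the counts still include each row's own outcome; leaving the row itself out of its own count is one possible variant.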

Data

+-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| Raw data                          | Derived                                                                                                                        |
+-----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| Age | Location  | Purchased a car | Age - 5% | Age + 5% | No. of people within +/-5% of age who purchased | No. of people within +/-5% of age who did not purchase |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 22  | Penrith   | 1               | 21       | 23       | 2                                               | 0                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 33  | Peakhurst | 1               | 31       | 35       | 2                                               | 1                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 21  | Peakhurst | 1               | 20       | 22       | 2                                               | 0                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 33  | Peakhurst | 1               | 31       | 35       | 2                                               | 1                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 29  | Peakhurst | 1               | 28       | 30       | 1                                               | 0                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 18  | Penrith   | 1               | 17       | 19       | 1                                               | 0                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 50  | Penrith   | 0               | 48       | 53       | 0                                               | 2                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 52  | Penrith   | 0               | 49       | 55       | 0                                               | 2                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 33  | Penrith   | 0               | 31       | 35       | 2                                               | 1                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 61  | Penrith   | 0               | 58       | 64       | 0                                               | 2                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 63  | Penrith   | 0               | 60       | 66       | 0                                               | 2                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+
| 77  | Penrith   | 0               | 73       | 81       | 0                                               | 1                                                      |
+-----+-----------+-----------------+----------+----------+-------------------------------------------------+--------------------------------------------------------+

#2

Hi @doraemon_z2000,

First of all, the problem statement is well presented, which makes it easier to understand.

Regarding the features you have created, this is an interesting approach. You can create more such features using the age columns. For instance, create a categorical column that divides rows into age groups such as teen, adult, old, etc.

The train/test split should come after the feature engineering, because the test set should undergo the same preprocessing and feature engineering as the train set. Additionally, for categorical variables, splitting into train and test before one-hot encoding might result in a different number of columns for train and test.
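
The column mismatch mentioned above can be seen in a small sketch. It also shows one common way to avoid it when the split does happen first: learn the category list from the train rows and reuse it for the test rows. The helper names (`fit_categories`, `one_hot`) are hypothetical and the location values are borrowed from the sample table.

```python
def fit_categories(values):
    """Learn the ordered list of distinct categories from a column."""
    return sorted(set(values))


def one_hot(values, categories):
    """Encode each value as a 0/1 vector with one position per known category."""
    return [[1 if v == c else 0 for c in categories] for v in values]


train_loc = ["Penrith", "Peakhurst", "Penrith"]
test_loc = ["Penrith", "Penrith"]  # "Peakhurst" never appears in this split

# Encoding each split independently gives different column counts
naive_train_cols = fit_categories(train_loc)  # two columns
naive_test_cols = fit_categories(test_loc)    # one column

# Reusing the categories learned from train keeps the columns aligned
categories = fit_categories(train_loc)
train_encoded = one_hot(train_loc, categories)
test_encoded = one_hot(test_loc, categories)
```

A category that appears only in the test rows would be encoded as all zeros in this sketch; whether that is acceptable depends on the model and the data.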


#3

Hi Aishwarya,
Thanks heaps for your clear explanation!

The dataset I’ve posted is just something I made up to illustrate my query … but the real dataset I’m using benefits significantly from the way I am constructing the feature.

If I engineer the feature in the manner specified and then use it to predict outcomes for new observations (people), would this be the correct approach?

  • Derive this aggregated feature for any new observations using test / train data
  • Use this to predict the outcome

Have you come across this approach anywhere? Intuitively it makes sense - but I’ve yet to see it done this way anywhere.

Thanks,
Himanshu


#4

I’m not sure what other approach you could take. For the training data and the test data to have the same set of features, you will have to derive the same feature for new observations before predicting their values.

On a side note, I haven’t come across a proper implementation of this approach on a real time dataset (if I do, I will share it in this thread).


#5

Thanks for your help. Much appreciated!