Dealing with multiple levels in Qualitative Data in Machine Learning

While performing data analysis, it is common that we encounter categorical values with numerous levels for example Zip code, States, city etc.,
Analyst’s usual approach is to look for only high frequent levels in variable ad set a threshold levels to ignore rest of them or combine them into one level or ignoring the variable itself and going ahead
with model development.

Are any of these the right approach to deal with categorical variables ?
Can there be any other approach which is perfect in extracting and retaining the right information,which otherwise these levels hide.

This short notes is to discuss the right approach to deal with numerous levels in qualitative variables usually more than 50 levels.

One approach is to find importance of levels is by calculating Frequency and Response rate. These are brief steps to follow (i) Calculate the frequency of every level
within the variable (ii) Add response rate against target variable.

This combination can yield very valuable information about the levels and their impact.

Lets take a practical approach to discuss this, Imagine we are looking at 30,000 zip codes in US both (3 digit ad 5 digit zip codes)
and we are trying to predict customer loan default. (This is just a hypothetical example and by any means does not represent the actual numbers).

These zip codes can be segmented based on whether the frequency of customers are high/low and distribution of default with in the zip code(High/medium/Low).
Consider below numbers for understanding-

Zip code Frequency Response
ZX345 3% 60%

By observing the above date, zip code(ZX345) can be easily segmented as High Response and Low Frequency Bin. A point to note here, High and Low are subjective and might need
business / domain expertise. However, for safe assumption of High/Low impact pick up the range of frequency and response rate
and divide them in to three/two logical bins. For instance, the max value for frequency is 12% and minimum frequency is 1% of population, in case we choose to make three bins, each bin will have 4%
coverage as described below (can be segmented as High and Low, to avoid complexity).

Frequency: Response Rate ( % of distribution of positive class)
1% - 4% -Low 1% - 20%
5% - 8% - Medium 21% - 40%
9% - 12% - High 41% - 60%

Levels in category are coded based on frequency and response rate. This approach will shrink thousands of levels in variables
to 9 levels and to 8 columns when converted to dummies while developing models.

This would help anyone working on data science projects to abstract information from variables,
which otherwise ignored or inappropriately dealt with. This approach is useful when the requirement is to dig deep into low frequency high impact category levels.

© Copyright 2013-2019 Analytics Vidhya