Handling log variables in regression




I am new to data science and am learning regression techniques using R and SAS. I am working on a data set to develop a predictive model of credit card spending. I have close to 130 variables, including some log-transformed variables alongside their originals. There are a number of missing values in the log variables (and in the originals) for two reasons:

  1. A missing value in the original variable.
  2. A value of ‘0’ in the original variable, which makes sense from a business perspective. For example, for toll-free calling last month, some customers may not have made any toll-free calls, so a ‘zero’ in the data set is valid.
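To make the two cases concrete, here is a minimal sketch in Python (standard library only). The function name `log_with_reason` and the sample values are hypothetical, and it assumes the original variable is never negative and that a log that cannot be computed is stored as missing (`None`), as described above.

```python
import math

def log_with_reason(value):
    """Return (log_value, reason); log_value is None when the log is missing.

    Assumes value is None (missing) or a non-negative number.
    """
    if value is None:
        # Case 1: the original variable is itself missing
        return None, "original missing"
    if value == 0:
        # Case 2: a valid zero, whose log is undefined
        return None, "original is a true zero"
    return math.log(value), "ok"

# Hypothetical toll-free call minutes for five customers
toll_free_minutes = [12.0, 0.0, None, 3.5, 0.0]

results = [log_with_reason(v) for v in toll_free_minutes]
reasons = [reason for _, reason in results]
```

Both cases end up as a missing log value, but only the first one is missing in the original data, which is why the two should probably be tracked separately.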

I understand that log transformation is an important aspect of regression techniques, but I would like to understand how to deal with this situation during outlier and missing value treatment.

Please help!


Hi @sahil_dhingra

You have two problems there: outliers and missing values.

  1. Outliers: in many cases we exclude outliers from the model calculation; if you are dealing with extreme values, that is a different matter. To find outliers you have a few methods. One is PCA: check the high-level principal components for the observations with high variance. Be careful, PCA has some assumptions. Clustering is another method: a small cluster could point to outliers (the choice of clustering method is important).

  2. Missing values are another story of their own. Approaches range from the brute-force one of setting each NA to the mean, median, etc. (not recommended, because you build a new distribution) to more sophisticated ones such as combining distributions. If you work in R, Amelia is a good package; it assumes a normal distribution. Another way to build the model is to remove the observations with NA. That is not always advisable, especially if you deal with factors, but if the variables are numerical and you have plenty of observations to build your model, it could be the simplest option.
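As a rough illustration of point 2, the sketch below compares dropping the NA observations with filling them with the mean. It is written in Python with the standard library only; the `spend` values are made up for the example and `None` stands in for NA.

```python
import statistics

# Hypothetical monthly spend values; None marks a missing observation
spend = [120.0, 80.0, None, 200.0, None, 100.0]

# Option A: remove the observations with a missing value
dropped = [x for x in spend if x is not None]

# Option B: fill each gap with the mean of the observed values
mean_value = statistics.mean(dropped)
imputed = [mean_value if x is None else x for x in spend]

# Sample standard deviation under each option
sd_dropped = statistics.stdev(dropped)
sd_imputed = statistics.stdev(imputed)

print(f"mean: {statistics.mean(dropped)} vs {statistics.mean(imputed)}")
print(f"sd:   {sd_dropped:.2f} vs {sd_imputed:.2f}")
```

The mean is unchanged by the fill, but the standard deviation shrinks because the filled rows sit exactly at the centre. That is the sense in which mean imputation builds a new distribution.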

Best regards.


Thanks Alain. That was quite helpful. My other question is more specifically about handling log variables. I have some missing values in the log variables, due either to missing values in the original variables or to ‘0’ values in the original variables. The concern is that the ‘0’ values in the original variables are logically correct and should not be considered missing. I cannot impute these with the mean or median, because ‘0’ is itself a value, yet it causes the log variable to be null. I want to include the log variables in my model as well and see what works best (original or log). In this case, how do I handle the log variables for missing value imputation?