Exploratory Data Analysis



I’m pretty new to data science (esp. predictive modeling). During EDA I find new information, but I have no idea how to use it to improve the accuracy of my model.

I have a model, and I’m looking to improve it. During EDA, I find (through visualizations) that a particular factor (say x1) has four levels. When I plot the target variable against the levels of the factor using a boxplot, I find that the median of the target variable is higher in one of the levels.

How can I test for statistical significance of this phenomenon?
How do I incorporate this newfound information into my model?

This question represents one scenario in which I’m trying to predict the value of a continuous target variable. However, I have been plagued with this issue in the past.

Thank you for the answer!!


Hi @krishnamurthypranesh

For your test of significance, you could split the data into four samples, one per level, and check whether the response is normal. If it is, you can use a t-test (you work with the differences between levels; the power is not great, but it gives a good indication). Alternatively, again if the response is normal or nearly normal, you can fit a linear model and then run an ANOVA on it; this assumes a linear relationship. This could also answer your second question: you build one lm, run the ANOVA, and you will see the influence of the levels.
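
To make the ANOVA idea concrete, here is a minimal sketch in plain Python of the one-way ANOVA F statistic, which for a single factor is the same F you would read off an ANOVA table for a linear model with that factor. The function name `one_way_anova_f` and the data are hypothetical, made up only for illustration. To turn F into a p-value you compare it with an F distribution on the two degrees of freedom; `scipy.stats.f_oneway` does both steps if SciPy is available.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples, one per level."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    # Variation of the level means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own level mean.
    ss_within = sum(
        (y - m) ** 2 for g, m in zip(groups, group_means) for y in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical target values for the four levels of x1 (illustration only).
target_by_level = {
    "level_1": [10.1, 9.8, 10.4, 10.0, 9.9],
    "level_2": [10.3, 10.0, 9.7, 10.2, 10.1],
    "level_3": [12.0, 11.6, 12.3, 11.9, 12.1],
    "level_4": [9.9, 10.2, 10.0, 9.8, 10.3],
}
f_stat, df1, df2 = one_way_anova_f(list(target_by_level.values()))
print(f"F = {f_stat:.2f} on ({df1}, {df2}) degrees of freedom")
```

A large F says the level means differ by more than the within-level noise would explain; it does not say which level is responsible, which is where the pairwise comparisons come in.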

If the response is not normal, which may be the case since you looked at the median, you are now dealing with robust statistics. You can use the Kolmogorov distance, which will give you a “feeling” for the difference. It is not the only method, but it is simple (its power is not great either).
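
As a rough sketch of what the Kolmogorov distance measures: it is the largest vertical gap between the empirical distribution functions of two samples. The helper `ks_distance` and the numbers below are hypothetical, and comparing the standout level against the other three pooled together is just one possible way to set it up. For a p-value, `scipy.stats.ks_2samp` computes the same statistic along with its significance.

```python
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs (0 to 1)."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    distance = 0.0
    # The empirical CDFs only jump at observed values, so it is enough
    # to compare them at every observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance


# Hypothetical data: the level with the higher median vs. the other levels pooled.
standout_level = [12.0, 11.6, 12.3, 11.9, 12.1]
other_levels = [10.1, 9.8, 10.4, 10.0, 9.9, 10.3, 9.7, 10.2]
print(ks_distance(standout_level, other_levels))
```

A distance near 0 means the two distributions look alike, and a distance near 1 means they barely overlap.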

For your model, four levels is not a problem, especially if they are equally distributed; you can use any model. The encoding is often transparent. If not, check caret and the model you want: the caret framework will do the transformation for you.

Hope this helps.
Best regards


Hi @krishnamurthypranesh,

  1. I suggest you look up the p-values for each level of that column.
  2. You can incorporate this in linear regression by making four separate flag columns, one for each level. This can be done with a simple model attribute.
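
For illustration, the flag columns from point 2 could be built by hand as in the sketch below. The function `flag_columns`, the `x1_` prefix, and the level names are hypothetical; in practice a library helper such as `pandas.get_dummies` does the same job. One caveat I am adding here: when the regression has an intercept, the four flags always sum to 1, so one of them is usually dropped (the `drop_first` option below) and the dropped level becomes the baseline.

```python
def flag_columns(values, prefix="x1", drop_first=False):
    """Turn a list of factor levels into 0/1 flag columns, one per level."""
    levels = sorted(set(values))
    if drop_first:
        # Drop one level so the flags are not collinear with an intercept.
        levels = levels[1:]
    return {
        f"{prefix}_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }


# Hypothetical factor column with four levels (illustration only).
x1 = ["a", "c", "b", "d", "c", "a"]
for name, column in flag_columns(x1).items():
    print(name, column)
```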



Thanks for the answer Alain. I’ll use the approach you suggested.


Thanks @rohan00747. Looking at the p-values was the first thing I did, and the phenomenon turned out to be statistically significant.


@rohan00747, I’m using tree-based models, specifically a simple decision tree. Do I use the same method to incorporate the effect of the factor levels, or do I have to follow some other approach?

Thanks for the answer!


@Lesaffrea I am using decision tree regression to model the data, so I’m not using linear regression. Do your suggestions still apply to decision trees?

Thanks for the answer!