Missing data and CART

missing_values
cart

#1

Hi Team,

I have a few queries regarding ML algorithms and data. It would be great if you could provide some feedback.

  1. Which is the best package for imputing missing data (I am currently using mice in R), and how should I deal with missing categorical values? (E.g. the self_work variable is significant in my loan-prediction model; its values are Yes/No and some of them are missing.)
  2. How can we validate whether the imputed values are correct? Is there any statistical test we can perform on those columns to validate the values?
  3. If I have to predict a binary output, how do I choose between logistic regression, CART, and random forest? Do I need to build a model with each algorithm, or is there a test that will help me decide?
    Note: Logistic regression provides more detail, like variable significance, while random forest is usually picked for accuracy rather than interpretability. But what if my aim is purely accuracy? Sometimes logistic regression may give better accuracy than CART or random forest.

Thanks in advance!
Omkar


#2

Hi @omkar18,

You can impute missing values simply using the mean, median, or mode, or you can use other columns to understand the pattern of missingness and treat the gaps accordingly. Here's an article that explains it very clearly (refer to section 2 on missing value treatment).
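For the simple case, here is a minimal base-R sketch. The data frame `df` and the column `income` are illustrative; `self_work` is borrowed from your question:

```r
# Numeric column: replace NAs with the median (more robust to outliers than the mean).
df$income[is.na(df$income)] <- median(df$income, na.rm = TRUE)

# Categorical column: replace NAs with the mode (the most frequent level).
mode_level <- names(which.max(table(df$self_work)))
df$self_work[is.na(df$self_work)] <- mode_level
```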

About choosing the right algorithm: there is no defined set of rules. Understand the requirements of your problem statement and how each algorithm works, and decide from there. You can use a single algorithm or ensemble multiple models.


#3

Hi @omkar18,

  1. I normally don't use any package other than dplyr and stringr.

  2. Regarding missing values, the answer to your question depends on a lot of things, but here is a decision process that may help.
    a. First, understand how many rows have NAs. If these are less than 1% of the total rows, it is usually better to remove those rows. One piece of advice: impute NAs after outlier treatment, because removing outliers also reduces the number of rows with NAs (steps a-d are sketched in code right after this list).

    b. Check each column: if more than 95% of the data in a column is NA, it is often good practice to delete the column itself.
    c. If you cannot delete the rows, the next step is to understand why the NAs are occurring. There are a couple of common reasons, such as non-availability of data and system errors.
    d. Check the column of the data point that has the NA; doing EDA on that column sometimes helps. If the column with NAs is not important enough for your prediction, consider deleting the whole column.
    e. I would not recommend blindly setting NAs to the median, mean, or mode, as it might distort your modelling results. What you can do instead is look at rows similar to the one with the NA: pick two other categorical variables, filter your data on those, and replace the NA with the median, mean, or mode of that subset (steps e and f are sketched after the closing paragraph below).

    f. If you want to replace an NA in a categorical variable, check the ratio of 1s to 0s of the dependent variable for each category, and replace the NA with the category whose ratio is closest to the ratio among the NA rows.
    g. One more option is to go back to the client and ask the reason for the NAs. Based on the client's response, you will know which values to put in.
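To make steps (a)-(d) concrete, here is a rough base-R sketch. The data frame `df` is illustrative, and the 1%/95% thresholds come straight from the steps above:

```r
# (a) Share of rows with at least one NA; drop them if below ~1%.
row_has_na <- apply(df, 1, function(r) any(is.na(r)))
if (mean(row_has_na) < 0.01) df <- df[!row_has_na, ]

# (b)/(d) Share of NAs per column; drop near-empty columns.
na_share <- colMeans(is.na(df))
df <- df[, na_share <= 0.95, drop = FALSE]

# (c) Inspect the remaining NA columns before deciding how to impute.
na_share <- colMeans(is.na(df))
sort(na_share[na_share > 0], decreasing = TRUE)
```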

Of course, this is not an exhaustive list of what you can do with NAs, but it should give you a head start. In my experience, handling NAs is not a standardized process; everyone telling you what to do is speaking from their own experience. It is good to learn from others' experience, but also try coming up with your own ways to impute NAs. Let me know if you find something new and interesting. :slight_smile:
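And here is a sketch of the conditional imputation idea from steps (e) and (f), using dplyr as mentioned above. All column names (`gender`, `education`, `income`, `loan_status`, `self_work`) are hypothetical, and `loan_status` is assumed to be coded 0/1:

```r
library(dplyr)

# (e) Fill a numeric NA from "similar" rows: the median income within the
# same gender/education group rather than the overall median.
df <- df %>%
  group_by(gender, education) %>%
  mutate(income = ifelse(is.na(income), median(income, na.rm = TRUE), income)) %>%
  ungroup()

# (f) Fill a categorical NA with the category whose dependent-variable ratio
# is closest to the ratio observed among the NA rows.
rate_na  <- mean(df$loan_status[is.na(df$self_work)] == 1)
rate_by  <- tapply(df$loan_status == 1, df$self_work, mean)
best_cat <- names(rate_by)[which.min(abs(rate_by - rate_na))]
df$self_work[is.na(df$self_work)] <- best_cat
```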

  3. Logistic regression is computationally faster than random forest, so if there is no significant accuracy improvement from random forest, it is better to go with logistic regression (see the sketch below).
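A rough way to run that comparison in R, assuming a complete data frame `df` with a binary factor target `loan_status` (hypothetical names) and the randomForest package installed:

```r
library(randomForest)

set.seed(42)
idx   <- sample(nrow(df), floor(0.7 * nrow(df)))
train <- df[idx, ]
test  <- df[-idx, ]

# Fit both models on the same split.
logit <- glm(loan_status ~ ., data = train, family = binomial)
rf    <- randomForest(loan_status ~ ., data = train)

# Held-out accuracy; glm models the probability of the second factor level.
pos       <- levels(test$loan_status)[2]
acc_logit <- mean((predict(logit, test, type = "response") > 0.5) ==
                  (test$loan_status == pos))
acc_rf    <- mean(predict(rf, test) == test$loan_status)
c(logistic = acc_logit, random_forest = acc_rf)
```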

Hope my answer helps.


#4

Hi Aishwarya/Vikas (@AishwaryaSingh, @VikasJangra),

Thanks a lot for your inputs; they really help.

I have a couple of queries regarding hypothesis testing:

  1. At which phase do we perform hypothesis tests (z-test, t-test, etc.)? Is it once we receive the data, and what is the significance of this step? My assumption is that it tests whether the provided data is suitable for running a model on, or whether the sample is representative of the actual population.
  2. Currently I am working in marketing analytics and I have to analyse customer behaviour. Can you please let me know which hypotheses I could test here (e.g. if the data consists of products bought, last activity, etc.), and an appropriate algorithm for the same?

Thanks again for your help and support.
Omkar


#5

Hi @omkar18,

  1. You normally run a test when you know something about the data and want to verify that assumption. This usually happens after exploratory data analysis (EDA) has been performed. From the EDA you form an assumption, and you check, at some significance level, whether it holds for the sample.

  2. Choosing between a t-test and a z-test depends on the sample size and what you know about the variance. If your sample size is less than 30 rows and you do not know the population variance, use a t-test, because the standardized sample mean follows a t-distribution in that case. However, if your sample size is greater than 30, you can use a z-test, treating the sample variance as the population variance (see the sketch below).
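A small illustration in base R: `t.test()` covers the small-sample case, and the large-sample z statistic can be computed by hand (all data here is made up):

```r
# Small sample (n < 30), population variance unknown: t-test.
x   <- c(4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4)
mu0 <- 5                                   # hypothesized population mean
t.test(x, mu = mu0)

# Larger sample (n > 30): z-test, treating the sample variance
# as the population variance.
y <- rnorm(200, mean = 5.1, sd = 0.5)
z <- (mean(y) - mu0) / (sd(y) / sqrt(length(y)))
2 * pnorm(-abs(z))                         # two-sided p-value
```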