That’s a very good question. The kernels and solutions you follow on Kaggle or Analytics Vidhya mostly belong to a competition or demonstrate what a tool can do. The datasets for such solutions are readily available. In real life they are not.
In real-life problems, we have to generate hypotheses even before we capture the data. Once we have enough hypotheses, we start gathering the data (some of the data the hypotheses require is difficult to fetch, and those hypotheses remain open). Once we have the data, we test each hypothesis to validate it statistically. ANOVA, the chi-square test and the T/Z tests are all hypothesis tests. Most hypotheses ask whether the relationship between a feature and the target variable is statistically significant. If the test does not support a hypothesis, we leave that feature out of the final model, because the apparent pattern between the feature and the target variable might be due to randomness.
Note: In real-life problems, a feature you select should be both statistically significant and rational from a domain point of view. Otherwise, you drop the feature.
Let’s say you want to predict if India would win a cricket match or not.
You start with various hypotheses:
- When Kohli scores a century, India wins the match most of the time
- When India plays on home ground, India wins the match most of the time
- When the stadium is more than 90% full, India wins the match most of the time
Now, you collect the data and run the relevant tests:
- For the first one, you build a cross table with Kohli scoring/not scoring a century as the rows and India winning/losing as the columns, and perform a chi-square test of independence. You would include the variable only if the test rejects the null hypothesis of independence, that is, if the p-value is below your chosen significance level (0.05 is a common choice).
- Similarly, you can do the chi-square test for India playing home/away against India winning/losing.
- For the third hypothesis, you won’t perform the test because there is no domain rationale. You could not explain the behavior even if it turned out to be statistically significant.
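The chi-square test on the cross table can be sketched in a few lines of Python. This is only an illustration: the counts are made up, the function name `chi_square_2x2` is my own, and the p-value shortcut used here is valid only for a 2x2 table (one degree of freedom) without a continuity correction. In practice you would more likely call `scipy.stats.chi2_contingency`, which applies Yates' correction to 2x2 tables by default and so gives a slightly different number.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    Returns (chi2 statistic, p-value). No continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Made-up counts, for illustration only.
# Rows: Kohli century / no century. Columns: India wins / India loses.
table = [[30, 10],
         [60, 100]]

chi2, p_value = chi_square_2x2(table)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.6f}")

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print("Reject independence: keep the feature (if it also makes domain sense).")
else:
    print("Cannot reject independence: drop the feature.")
```

The same function works for the home/away hypothesis: only the row labels of the cross table change.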
A T/Z test or ANOVA is used to check whether the mean of a continuous feature differs significantly between groups (two groups for a T/Z test, two or more for ANOVA). For instance, you can collect Virat Kohli’s runs in each match and look at the average score he makes when India wins. But before saying his score does affect the match result, you need to perform a Z-test. You have two series: one contains Kohli’s individual runs in the matches India wins, and the other contains his runs in the matches India loses. You have the average and standard deviation of both series, so you can easily perform a Z-test to conclude whether or not the difference between the two averages is statistically significant. With only a handful of matches in either series, a T-test is the safer choice.
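Here is a minimal sketch of that two-sample Z-test using only the Python standard library. The run totals are invented for illustration, the function name `two_sample_z_test` is my own, and the test assumes both series are large enough for the normal approximation to hold (the short lists below are just to keep the example readable).

```python
import math
import statistics

def two_sample_z_test(sample_a, sample_b):
    """Two-sided Z-test for the difference between two sample means.

    Returns (z statistic, p-value). Uses the sample variances as estimates
    of the population variances, which is reasonable only for large samples.
    """
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)

    # Standard error of the difference between the two means
    std_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    z = (mean_a - mean_b) / std_error

    # Two-sided p-value under the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up scores, for illustration only
runs_when_india_wins = [82, 45, 110, 67, 93, 38, 121, 74, 56, 88]
runs_when_india_loses = [12, 35, 48, 7, 61, 23, 30, 54, 18, 41]

z, p_value = two_sample_z_test(runs_when_india_wins, runs_when_india_loses)
print(f"z = {z:.3f}, p-value = {p_value:.6f}")

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print("The difference in average runs is statistically significant.")
else:
    print("The difference could be due to randomness.")
```

If the p-value falls below the significance level, Kohli’s runs become a candidate feature; otherwise the difference in averages is not strong enough evidence to include it.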
Hope you find my answer helpful.