When & How : Statistical Tests - Hypothesis Testing

Hello,
My question is about statistical tests and hypothesis testing. Having read numerous kernels and solutions to various hackathons, I have never noticed anyone stating a hypothesis, verifying it, or doing anything to that effect.

People are in a hurry to load the data, do some jazzy EDA and some feature engineering, and then hook up some algorithm.

My questions are as follows:

  1. When do we test data using the various tests (ANOVA, chi-square, t-test) and check for Type I and Type II errors? I have not seen anyone performing such tests in any solution so far. Can someone share a case study where it has been done effectively and practically?

  2. We make certain assumptions about the project, or assumptions about what can affect the “Target Variable”. How do we use hypothesis testing to accept or discard those assumptions? I need a practical scenario, a run-through, or a solved case study.

Thanks,

Mohit

Hi Mohit,
That’s a very good question. The kernels/solutions you follow on Kaggle/Analytics Vidhya mostly belong to a competition or demonstrate a tool or one of its capabilities. The datasets for such solutions are readily available. In real life they are not. In real-life problems, we have to generate hypotheses even before we capture the data. Once we have enough hypotheses, we start gathering the data (some of the data a hypothesis requires is difficult to fetch, so those hypotheses remain open). Once we get the data, we test every hypothesis we can in order to validate it statistically. ANOVA, the chi-square test and the t/z-tests are all hypothesis tests. Most of these hypotheses are about whether there is a statistically significant relationship between a feature and the target variable. If a test does not show significance (that is, we cannot reject the null hypothesis of no relationship), we do not include the feature in the final model, as any apparent pattern between the feature and the target variable might be due to randomness.

Note: In real-life problems, a feature you select should be both statistically significant and rational from a domain point of view. Otherwise, you drop the feature.

Example:
Let’s say you want to predict if India would win a cricket match or not.
You start with various hypotheses:

  • If Kohli scores a century, India wins the match most of the time
  • If India plays on its home ground, India wins the match most of the time
  • If the stadium is more than 90% full, India wins the match most of the time

Now, you collect the data and do the relevant tests:

  • For the first one, you build a cross table with Kohli scoring/not scoring a century as the rows and India winning/losing as the columns, and perform a chi-square test.
    You would include the variable only if the chi-square test rejects the null hypothesis of independence, i.e. the p-value is below your chosen significance level.
  • Similarly, you can do the chi-square test for India playing home/away against India winning/losing.
  • For the third hypothesis, you won’t perform the test because there is no domain rationale. You could not explain the behavior even if it turned out to be statistically significant.
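To make the first test concrete, here is a minimal sketch of a chi-square test of independence on a 2x2 cross table. The counts are made up purely for illustration (they are not real match data), and the helper name `chi_square_2x2` is my own. It uses only the Python standard library; in practice you would more likely call `scipy.stats.chi2_contingency`, which by default also applies a continuity correction to 2x2 tables, so its numbers can differ slightly from this uncorrected version.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    table[i][j] is the observed count for row category i and column
    category j. No continuity correction is applied.
    Returns (statistic, p_value).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Count we would expect in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected

    # A 2x2 table has 1 degree of freedom, and for 1 degree of freedom
    # the chi-square upper-tail probability equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Made-up counts, for illustration only.
# Rows: Kohli century / no century. Columns: India won / India lost.
observed = [[30, 10],
            [40, 40]]

stat, p_value = chi_square_2x2(observed)
alpha = 0.05  # significance level, chosen before looking at the data

print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject independence: keep the feature as a candidate.")
else:
    print("Cannot reject independence: the pattern may be due to chance.")
```

The same function works for the home/away table; only the counts change.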

A t/z-test or ANOVA is used to check whether the mean of a continuous feature differs significantly between groups (two groups for a t/z-test; two or more for ANOVA). For instance, you can collect Virat Kohli’s runs scored in each match and find his average score when India wins. But before saying that his score does affect the match result, you need to perform a z-test. You have two series: one contains Kohli’s individual runs when India wins, and the other contains his runs when India loses. You have the mean and standard deviation of both series, so you can easily perform a z-test to conclude whether or not the difference between the two averages is statistically significant. (With only a small number of matches in either series, a t-test is the safer choice.)
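As a rough sketch of that two-sample z-test, assuming you already have the two series of runs: the numbers below are invented for illustration and are not Kohli’s actual scores, and `two_sample_z_test` is just a name I picked. It uses the sample standard deviations in place of the population ones, which is only reasonable for large samples; twelve values per series is far too few for real use and is kept short here for readability.

```python
import math
from statistics import NormalDist, mean, stdev


def two_sample_z_test(sample_a, sample_b):
    """Two-sided z-test for a difference in means between two samples.

    Uses sample standard deviations as estimates of the population
    ones, so it is only appropriate when both samples are large.
    Returns (z, p_value).
    """
    mean_a, mean_b = mean(sample_a), mean(sample_b)
    # Standard error of the difference between the two sample means.
    se = math.sqrt(stdev(sample_a) ** 2 / len(sample_a)
                   + stdev(sample_b) ** 2 / len(sample_b))
    z = (mean_a - mean_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Invented scores, for illustration only.
runs_in_wins = [82, 45, 110, 67, 95, 38, 120, 74, 59, 101, 88, 63]
runs_in_losses = [23, 51, 12, 40, 66, 8, 35, 29, 47, 18, 55, 31]

z, p_value = two_sample_z_test(runs_in_wins, runs_in_losses)
alpha = 0.05  # significance level, chosen before looking at the data

print(f"mean in wins   = {mean(runs_in_wins):.1f}")
print(f"mean in losses = {mean(runs_in_losses):.1f}")
print(f"z = {z:.2f}, p = {p_value:.4g}")
if p_value < alpha:
    print("The difference in average runs is statistically significant.")
else:
    print("The difference could plausibly be due to chance.")
```

If you have more than two groups to compare (say win, loss and draw), a one-way ANOVA plays the same role as the z-test does here.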

Hope you find my answer helpful.

Thank you.

Kind Regards,
Viraj Pai

Hi @mohitlearns,
@Metal_Horse has already explained this perfectly, and there is little left for me to add.
I am just supporting his response. Hypothesis or statistical tests are generally not used in competitions because you get a prepared dataset; the organizers might already have done statistical tests while building it. These tests are, however, of much use in research, where you make numerous suppositions along the way. So you end up forming multiple hypotheses (suppositions) and checking their validity.
T/Z or other such tests are used to check the validity of features and the relationships among them. They can come in handy if you create new features and want to statistically validate them before proceeding.
I am not attaching any example here, as @Metal_Horse has already given a great example for your understanding. Hope you are clear now! :slightly_smiling_face:

© Copyright 2013-2019 Analytics Vidhya