Statistics in Machine Learning



Hi Team,

I am new to ML and have started learning statistics (descriptive, inferential, p-value, t-test, z-test and hypothesis testing), and I want to understand when these tests should be applied, i.e. in which phase of a project: before running the model, when checking model performance, etc. It would be great if someone could outline the lifecycle of an ML project (getting the data, checking its eligibility, running the model, etc.) so that I can understand the flow.

Also, which statistical method is used to check whether my data (sample, population) is eligible for testing?

Thanks in advance!



Hi @omkar18,

These tests are used to check whether a given variable has a statistically significant relationship with the target variable. Before building the model, you generate hypotheses, i.e. you list the factors that you think could affect your target variable. You then run these tests on the data to verify each hypothesis: the result tells you whether the effect you expected is actually present and how strong the evidence for it is.
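To make this concrete, here is a minimal sketch of one such check, assuming a numeric feature and a binary target. It uses a two-sample z-test written with only the Python standard library. The data is simulated, and the names (`two_sample_z_test`, `spend_converted`, `spend_not_converted`) are made up for illustration. For small samples, a t-test would be the more usual choice.

```python
import math
import random
from statistics import NormalDist, mean, variance


def two_sample_z_test(group_a, group_b):
    """Return (z, two-sided p-value) for the difference in group means."""
    # Standard error of the difference, from each group's sample variance.
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    z = (mean(group_a) - mean(group_b)) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Simulated (hypothetical) feature values, split by the binary target.
random.seed(0)
spend_converted = [random.gauss(60, 10) for _ in range(200)]
spend_not_converted = [random.gauss(50, 10) for _ in range(200)]

z, p_value = two_sample_z_test(spend_converted, spend_not_converted)
print(f"z = {z:.2f}, p = {p_value:.4g}")
# A small p-value (commonly below 0.05) suggests the feature differs
# between the two target classes, so it may be worth keeping.
```

If the p-value is large, the data gives no strong evidence that the feature differs between the classes. That does not prove the feature is useless, only that this particular test did not detect a difference.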

The general procedure for approaching any machine learning problem can be summarized as:

  1. Understand the problem statement: The first and foremost step is to understand the problem that you are dealing with.

  2. Hypothesis Generation: Once you know the problem, you form your own hypotheses, i.e. you point out the factors that could affect the target. NOTE: Hypothesis generation is done without looking at the dataset.

  3. Data Exploration: Now you look at your dataset. Explore each variable individually as well as its relationship with the other variables. This is where you apply the statistical tests, choosing each test based on the type of variable.

  4. Feature Engineering: You made some hypotheses based on the problem. Now you try to create new features from the available ones that might affect your target variable.

  5. Model Building: Finally, you choose an algorithm and build the model.

These are some basic steps to solve a machine learning problem.
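As a rough illustration of steps 2 to 5, here is a small sketch on a simulated dataset. Everything in it is an assumption made for the example: the house-price data, the column names, the `pearson` helper, and the choice of a one-variable least-squares line as the "model". It uses only the Python standard library.

```python
import random
from statistics import mean

random.seed(42)

# Hypothetical dataset: house area (m^2), number of rooms, and price.
rows = []
for _ in range(100):
    area = random.uniform(40, 200)
    rooms = random.randint(1, 6)
    price = 1000 * area + random.gauss(0, 5000)
    rows.append({"area": area, "rooms": rooms, "price": price})


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Step 2, hypothesis (stated before looking at the data):
# bigger homes cost more.

# Step 3, data exploration: check the hypothesis against the data.
areas = [row["area"] for row in rows]
prices = [row["price"] for row in rows]
r_area = pearson(areas, prices)
print(f"correlation(area, price) = {r_area:.3f}")

# Step 4, feature engineering: derive a new feature from existing ones
# (shown for illustration; the simple model below does not use it).
for row in rows:
    row["area_per_room"] = row["area"] / row["rooms"]

# Step 5, model building: fit price = intercept + slope * area by least
# squares on the first 80 rows, then evaluate on the remaining 20.
train, held_out = rows[:80], rows[80:]
tx = [row["area"] for row in train]
ty = [row["price"] for row in train]
mx, my = mean(tx), mean(ty)
slope = (sum((x - mx) * (y - my) for x, y in zip(tx, ty))
         / sum((x - mx) ** 2 for x in tx))
intercept = my - slope * mx
mae = mean(abs(row["price"] - (intercept + slope * row["area"]))
           for row in held_out)
print(f"slope = {slope:.1f}, held-out mean absolute error = {mae:.0f}")
```

In a real project you would normally use a library for the modelling step and a proper train/test split; the point here is only the order of the steps.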

As for which statistical method to use, it depends on the type of variable: different variable types call for different tests. You can refer to the article below for in-depth knowledge of the different statistical tests:

How to identify potential customers who are ready to convert in to paid?
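As a rough guide to matching the test to the variable types: two numeric variables are usually compared with a correlation coefficient, a numeric variable across two groups with a t-test or z-test (ANOVA for more than two groups), and two categorical variables with a chi-square test of independence. Below is a minimal sketch of the last case for a 2x2 table, using only the Python standard library. The function name `chi_square_2x2` and the counts are hypothetical, and no continuity correction is applied, so the result can differ slightly from library routines that apply one by default.

```python
import math


def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table of counts.

    Returns (chi2, p_value). No continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if the variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square tail probability equals
    # erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows = plan type (A, B), columns = converted (yes, no).
observed = [[30, 10],
            [10, 30]]
chi2, p_value = chi_square_2x2(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2g}")
```

A small p-value here suggests the two categorical variables are not independent, i.e. the categorical feature is related to the target.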