Why and when is hypothesis generation important?

@kunal sir / @aayushmnit / @shuvayan / AV friends.

This is a very basic question, but I really want to understand it. Can you explain the concept of hypothesis generation, or what laying out a hypothesis means? I understand the concept of hypothesis testing (where you use a z-test or t-test depending on the situation and reject or fail to reject the null hypothesis). What I really don’t understand is generating hypotheses before actually looking into the data, or when just looking at the variables.

Could you please explain?



This question might sound basic, but it is one with tremendous value. Let me try and explain this to the extent I can.

What is a Hypothesis?

Simply put, a hypothesis is a possible view or assertion of an analyst about the problem he or she is working upon. It may be true or may not be true.

For example, if you are asked to build a credit risk model to identify which customers are likely to default and which are not, this could be a possible set of hypotheses:

  • Customers with poor credit history in past are more likely to default in future
  • Customers with a high (loan_value / income) ratio are more likely to default than those with a low ratio
  • Customers doing impulsive shopping are more likely to be at a higher credit risk

At this stage, you don’t know which of these hypotheses will turn out to be true.
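To make the second hypothesis concrete, here is a minimal sketch of how it could later be checked once data is in hand. The records, the field order and the cutoff of 1.0 are all invented for illustration and are not from any real portfolio.

```python
# Toy records invented purely for illustration: (loan_value, income, defaulted)
customers = [
    (50_000, 100_000, False),
    (20_000, 80_000, False),
    (90_000, 60_000, True),
    (120_000, 70_000, True),
    (30_000, 90_000, False),
    (80_000, 50_000, False),
]

RATIO_CUTOFF = 1.0  # hypothetical threshold separating "high" from "low" ratio


def default_rate(rows):
    """Share of customers in rows who defaulted (0.0 for an empty group)."""
    if not rows:
        return 0.0
    return sum(1 for _, _, defaulted in rows if defaulted) / len(rows)


# Split customers by the loan_value / income ratio named in the hypothesis.
high = [c for c in customers if c[0] / c[1] > RATIO_CUTOFF]
low = [c for c in customers if c[0] / c[1] <= RATIO_CUTOFF]

print(f"high-ratio default rate: {default_rate(high):.2f}")
print(f"low-ratio default rate:  {default_rate(low):.2f}")
```

On real data you would follow this comparison with a proper statistical test rather than eyeballing two rates. The point is only that a written hypothesis tells you exactly which two variables to pull.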

Why is hypothesis generation important?

The natural question is: why is upfront hypothesis generation important? Let us look at the two broad approaches and how they contrast:

Approach 1: Non-hypothesis driven data analysis (i.e. Boiling the ocean)

In today’s world, there is no end to the data you can capture or the time you can spend looking for more variables. In the credit risk example above, if you don’t form initial hypotheses, you will try to understand every variable available to you. This would include bureau variables (of which there will be hundreds), the company’s internal experience variables and other external data sources. So you are already talking about analyzing 300 - 500 variables. As an analyst, you will spend a lot of time doing this, and the value in doing so is low. Why? Because even if you understand the distribution of all 500 variables, you would still need to understand their correlations and a lot of other information, which can take a very long time. This strategy is typically known as boiling the ocean: you don’t know exactly what you are looking for, and you explore every possible variable and relationship in the hope of using them all, which is very difficult and time consuming.
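To put a rough number on this, the count of pairwise relationships grows much faster than the count of variables. A small sketch (the function name `pairwise_count` is just illustrative):

```python
from math import comb


def pairwise_count(n_variables: int) -> int:
    """Number of distinct variable pairs you would have to examine."""
    return comb(n_variables, 2)


for n in (300, 500):
    print(f"{n} variables -> {pairwise_count(n)} pairwise relationships")
```

With 500 variables that is 124,750 pairs, before you even consider interactions among three or more variables.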

Approach 2: Hypothesis driven analysis

In this case, you first list down a comprehensive set of hypotheses, basically whatever comes to your mind. Next, you see which of the variables needed to test them are readily available or can be collected. This list should give you a set of smaller, specific pieces of analysis to work on. For example, instead of understanding all 500 variables first, you check whether the bureau provides the number of past defaults and use it in your analysis. This saves a lot of time and effort, and if you work through the hypotheses in order of expected importance, you will be able to finish the analysis in a fraction of the time.
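The steps above can be sketched roughly as follows. This is only an illustration: the `Hypothesis` class, the variable names, the importance scores and the set of available variables are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    statement: str
    variables: list[str]  # variables needed to test the hypothesis
    importance: int       # analyst's guess, 1 (low) to 5 (high)


# Step 1: write down hypotheses before touching the data.
hypotheses = [
    Hypothesis("Impulsive shopping -> higher credit risk",
               ["impulse_purchase_share"], 2),
    Hypothesis("High loan_value / income -> more defaults",
               ["loan_value", "income"], 4),
    Hypothesis("Poor credit history -> more defaults",
               ["past_defaults"], 5),
]

# Step 2: check which variables are available (hypothetical list).
available = {"past_defaults", "loan_value", "income"}


def plan(hypotheses, available):
    """Keep hypotheses whose variables all exist, most important first."""
    testable = [h for h in hypotheses if set(h.variables) <= available]
    return sorted(testable, key=lambda h: h.importance, reverse=True)


# Step 3: work through the testable hypotheses in order of importance.
work_plan = plan(hypotheses, available)
for h in work_plan:
    print(h.importance, h.statement, h.variables)
```

Hypotheses that drop out in step 2 are not discarded. They become a list of data you may want to collect later.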

If you have read through the examples closely, the benefit of the hypothesis driven approach should be pretty clear. You can further read the books “The McKinsey Way” and “The Pyramid Principle” to gain more insight into this process.

Common questions which might come to your mind

  • What if I miss out on some information which was there in the variables, but I didn’t form a hypothesis on it: This is where the importance of comprehensive hypothesis generation comes in, even if a hypothesis sounds crazy. If you believe it can have an impact, you should write down that the number of Facebook connections with data analysts can lead to more defaults :slight_smile: Even if you miss out on some variables / information, the time saved by being hypothesis driven would far outweigh the loss.

  • What if I am new to the domain and can’t form hypotheses upfront: You will be surprised by how much you can achieve by being structured and hypothesis driven. If you are completely new to the domain, just spend some time understanding it. This is how consultants at McKinsey & BCG work.

My suggestion is to do as much hypothesis generation as you can upfront in the project and then work through those hypotheses. You will finish in a far shorter period.

Hope this helps.



Thank you very much Kunal. I am going to start my first job soon in the Data Science field. Your article is definitely going to help me.


Best answer I can get to explain the importance of hypothesis generation to my boss :wink: , thank you Kunal


Thanks a lot for explaining it in layman terms. :slight_smile:

Thank you for explaining this, understood the importance of hypothesis!