This question might sound basic, but it touches on something of tremendous value. Let me try to explain it as best I can.
What is a Hypothesis?
Simply put, a hypothesis is a possible view or assertion of an analyst about the problem he or she is working on. It may or may not be true.
For example, if you are asked to build a credit risk model to identify which customers are likely to default and which are not, this could be a possible set of hypotheses:
- Customers with a poor credit history in the past are more likely to default in the future
- Customers with a high (loan_value / income) ratio are more likely to default than those with a low ratio
- Customers who shop impulsively are more likely to be a higher credit risk
At this stage, you don’t know which of these hypotheses will turn out to be true.
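To make this concrete, here is a minimal sketch of how one of these hypotheses (the loan_value / income one) could be checked once you have data. The records, the field names and the threshold of 1.0 are all made up for illustration; they are not from any real portfolio.

```python
# Toy records, invented purely for illustration (not real customer data).
customers = [
    {"loan_value": 50_000, "income": 100_000, "defaulted": False},
    {"loan_value": 30_000, "income": 60_000, "defaulted": False},
    {"loan_value": 80_000, "income": 100_000, "defaulted": True},
    {"loan_value": 150_000, "income": 100_000, "defaulted": True},
    {"loan_value": 200_000, "income": 80_000, "defaulted": True},
    {"loan_value": 120_000, "income": 100_000, "defaulted": False},
]


def default_rate(flags):
    """Share of customers in a group who defaulted (NaN for an empty group)."""
    return sum(flags) / len(flags) if flags else float("nan")


def default_rate_by_ratio(records, threshold):
    """Split customers on loan_value / income and return (low_rate, high_rate)."""
    low, high = [], []
    for record in records:
        ratio = record["loan_value"] / record["income"]
        # The threshold is an arbitrary cut-off chosen for this sketch.
        (high if ratio > threshold else low).append(record["defaulted"])
    return default_rate(low), default_rate(high)


low_rate, high_rate = default_rate_by_ratio(customers, threshold=1.0)
print(f"low ratio: {low_rate:.2f}, high ratio: {high_rate:.2f}")
```

If the high ratio group defaults noticeably more often on real data, the hypothesis is worth carrying into the model. If not, you drop it and move on to the next one.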
Why is hypothesis generation important?
Now, the natural question is: why is upfront hypothesis generation important? Let us try to understand the two broad approaches and how they contrast:
Approach 1: Non-hypothesis driven data analysis (i.e. Boiling the ocean)
In today’s world, there is no end to the data you can capture or the time you can spend looking for more variables. In the credit risk example above, if you don’t form initial hypotheses, you will try to understand every variable available to you. This would include bureau variables (of which there are hundreds), the company’s internal experience variables, and other external data sources. So you are already talking about analyzing 300 - 500 variables. As an analyst, you will spend a lot of time doing this, and the value in doing so is limited. Why? Because even if you understand the distribution of all 500 variables, you would still need to understand their correlations and a lot of other information, which can take a very long time. This strategy is typically known as boiling the ocean: you don’t know exactly what you are looking for, so you explore every possible variable and relationship in the hope of using them all, which is very difficult and time consuming.
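To get a rough feel for the scale, here is a small sketch that counts only the pairwise relationships between variables. This is just one slice of the work (it ignores distributions, interactions between three or more variables, and so on), and the function name is my own.

```python
from math import comb


def pairwise_relationships(n_variables: int) -> int:
    """Number of distinct variable pairs you would need to look at."""
    return comb(n_variables, 2)


for n in (300, 500):
    print(f"{n} variables -> {pairwise_relationships(n)} pairs")
# 300 variables -> 44850 pairs
# 500 variables -> 124750 pairs
```

Even at a few seconds per pair, that is far more than anyone will review by hand.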
Approach 2: Hypothesis driven analysis
In this case, you first list a comprehensive set of hypotheses - basically whatever comes to your mind. Next, you see which of the variables they need are readily available or can be collected. This list gives you a set of smaller, specific pieces of analysis to work on. For example, instead of understanding all 500 variables first, you check whether the bureau provides the number of past defaults and, if it does, use it in your analysis. This saves a lot of time and effort, and if you work through the hypotheses in order of expected importance, you will be able to finish the analysis in a fraction of the time.
If you have read through the examples closely, the benefit of the hypothesis driven approach should be pretty clear. For more insight into this process, you can read the books “The McKinsey Way” and “The Pyramid Principle”.
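The steps in this approach can be sketched roughly as follows. The variable names (such as `num_past_defaults` and `impulse_purchase_share`), the importance ranks and the helper `plan_analysis` are all hypothetical and only there to show the idea of filtering by availability and ordering by expected importance.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    statement: str
    required_variables: tuple
    expected_importance: int  # 1 = most important, based on the analyst's judgement


# Hypothetical set of variables that are readily available or can be collected.
available_variables = {"num_past_defaults", "loan_value", "income"}

hypotheses = [
    Hypothesis("High loan_value / income leads to more defaults",
               ("loan_value", "income"), 2),
    Hypothesis("Impulsive shopping means higher credit risk",
               ("impulse_purchase_share",), 3),
    Hypothesis("Poor credit history leads to more defaults",
               ("num_past_defaults",), 1),
]


def plan_analysis(candidates, available):
    """Keep hypotheses whose variables are available, most important first."""
    testable = [h for h in candidates if set(h.required_variables) <= available]
    return sorted(testable, key=lambda h: h.expected_importance)


plan = plan_analysis(hypotheses, available_variables)
for h in plan:
    print(h.expected_importance, h.statement)
```

Here the impulsive shopping hypothesis drops out of the first pass because its variable is not available, so you are left with two focused pieces of analysis instead of hundreds of variables.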
Common questions which might come to your mind
What if I miss out on some information which was there in the variables, but which I didn’t form a hypothesis on: This is where the importance of comprehensive hypothesis generation comes in - even if a hypothesis sounds crazy. If you believe something can have an impact, write it down, even if it is something like “the number of Facebook connections with data analysts can lead to more defaults”. And even if you do miss out on some variables / information, the time you save by being hypothesis driven would more than make up for it.
What if I am new to the domain and can’t form hypotheses upfront: You will be surprised by how much you can achieve by being structured and hypothesis driven. If you are completely new to the domain, just spend some time understanding it first. This is how consultants at McKinsey & BCG work.
My suggestion is to do as much hypothesis generation as you can upfront in the project and then work through those hypotheses. You will finish in a far shorter period.
Hope this helps.