Forming a hypothesis while data modelling


A hypothesis is essential when building a data model. But when we have a huge dataset, how do we decide what to define as our hypothesis?
Can someone explain with an example?


Please define what you mean by a huge dataset, and tell us which business domain the dataset belongs to.


It depends on what you want to prove and on the purpose of your study. You cannot simply take a dataset and create a random hypothesis for the sake of it. The study or analysis must answer some question(s), and you create the hypothesis based on that question.

For example, suppose I have a dataset of net revenue and marketing cost. Spending more on marketing should increase my Total Revenue (TR). However, my Net Revenue (NR) is Total Revenue minus Marketing Cost (MC), that is, NR = TR - MC. Therefore, NR might not scale linearly with TR. To see whether a higher MC enables me to achieve a higher NR, I will state the null hypothesis as “Higher MC DOES NOT enable me to get a higher NR”. The alternative hypothesis is then “Higher MC DOES enable me to get a higher NR”. This hypothesis is meaningful because it answers a question.
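To make the NR = TR - MC point concrete, here is a small Python sketch. The figures are hypothetical, made up only to illustrate how NR can flatten or fall while TR keeps rising; the names `periods` and `net_revenue` are mine, not from any real dataset.

```python
# Hypothetical figures, invented purely for illustration.
# Each tuple is (marketing_cost, total_revenue) for one period.
periods = [
    (10, 100),
    (20, 130),
    (30, 150),
    (40, 160),
    (50, 165),
]


def net_revenue(total_revenue, marketing_cost):
    """Net Revenue as defined above: NR = TR - MC."""
    return total_revenue - marketing_cost


nr_by_period = [net_revenue(tr, mc) for mc, tr in periods]

for (mc, tr), nr in zip(periods, nr_by_period):
    print(f"MC={mc:>3}  TR={tr:>4}  NR={nr:>4}")
```

In these made-up numbers TR rises in every period, but NR peaks and then drops once the extra marketing spend outgrows the extra revenue. That is exactly why "does higher MC give higher NR?" is a question worth testing rather than assuming.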

If you are still having trouble creating a hypothesis, you can plot the distributions of the variables to get a clearer picture.

I hope this helps.


Thanks, Nishant, for the explanation. So you mean that a hypothesis can be tested either with a graphical representation or with a formal test method, right?



What I mean is that if you are doing this for practice and are not sure what hypothesis to form, you can plot a few scatter charts to get an idea of the relationships between the variables. You will then be in a better position to formulate a hypothesis.
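As a numeric companion to a scatter chart (not a replacement for looking at the plot), you could compute the Pearson correlation between two variables. This is only a rough sketch: the data are hypothetical, `pearson_r`, `mc_values` and `nr_values` are names I made up, and correlation only captures linear association.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical marketing cost and net revenue per period (illustration only).
mc_values = [10, 20, 30, 40, 50]
nr_values = [90, 110, 120, 120, 115]

r = pearson_r(mc_values, nr_values)
print(f"Pearson r between MC and NR: {r:.3f}")
```

A clearly positive or negative value suggests a direction for the hypothesis. A value near zero does not rule out a relationship, since the pattern may be curved, which is where the scatter chart itself helps.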

Hypothesis testing can be done formally with a statistical test such as a t-test. Running the test is one element of a six-step approach to hypothesis testing.
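One way a t-test could look for the marketing example is a two-sample Student's t-test comparing NR in high-MC periods against NR in low-MC periods. The sketch below assumes independent, roughly normal samples with equal variances. The data are hypothetical, the function name `pooled_t_statistic` is mine, and the critical value 1.860 is the one-sided 5% point of the t distribution with 8 degrees of freedom, taken from a standard t table.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Student's two-sample t statistic for mean(a) - mean(b).

    Assumes equal variances in the two groups. Returns (t, degrees_of_freedom).
    """
    na, nb = len(sample_a), len(sample_b)
    df = na + nb - 2
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_var = ((na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)) / df
    std_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(sample_a) - mean(sample_b)) / std_error, df


# Hypothetical net revenue in periods with high vs. low marketing cost.
nr_high_mc = [118, 122, 125, 120, 127]
nr_low_mc = [110, 114, 108, 116, 112]

# H0: mean NR in high-MC periods is NOT higher than in low-MC periods.
# H1: mean NR in high-MC periods IS higher (one-sided test).
t_stat, df = pooled_t_statistic(nr_high_mc, nr_low_mc)

# One-sided 5% critical value for df = 8, from a standard t table.
T_CRITICAL_DF8 = 1.860
reject_null = df == 8 and t_stat > T_CRITICAL_DF8

print(f"t = {t_stat:.3f}, df = {df}, reject H0 at 5%: {reject_null}")
```

If the statistic exceeds the critical value, the null hypothesis is rejected at the 5% level for this made-up data. With real data you would normally let a statistics library compute the p-value and check the test's assumptions first.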