Why is case-control sampling most effective when the prior probabilities of the classes are unequal?

data_science

#1

I am currently studying a data collection approach called case-control sampling.

A case-control study is a type of epidemiological observational study. In an observational study, subjects are not randomized to the exposed or unexposed groups; instead, they are observed to determine both their exposure and their outcome status, so the exposure status is not determined by the researcher.

I have also read that case-control sampling is most effective when there is a large difference in the prior probabilities of the classes. I want to know the reason behind this.


#2

hello @hinduja1234,

As far as I know, case-control studies were primarily developed for the field of biology to model rare events, which generally make up a small proportion of the population. Example: suppose the proportion of males in SA suffering from heart disease is 5%. In a sample of 1000 people you would then expect only about 50 people with heart_disease = 1.
With so few records in one category, the class distribution is skewed, and your model will tend to predict the 0 cases more accurately than the 1s.
For this reason, case-control sampling is helpful: it samples the cases and the controls separately, so you get more data for the cases and the class proportions in the sample no longer follow the skewed prior probabilities.
I might be wrong about some of the points, but it ultimately boils down to having enough data for the algorithm to feed on.
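To make the idea concrete, here is a minimal Python sketch using the 1000-person, 5% example above. It is only an illustration under assumptions: the helper names `case_control_sample` and `corrected_intercept` are hypothetical, the choice of 5 controls per case is arbitrary, and the intercept shift is the standard adjustment for a logistic regression fitted on a case-control sample (the slopes are left unchanged), which is not something the linked article is claimed to use.

```python
import math
import random


def case_control_sample(labels, controls_per_case=1, seed=0):
    """Keep every case (label 1) and a random subset of controls (label 0).

    Returns the sorted indices of the selected records.
    """
    rng = random.Random(seed)
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more controls than are available.
    n_controls = min(len(controls), controls_per_case * len(cases))
    return sorted(cases + rng.sample(controls, n_controls))


def corrected_intercept(fitted_intercept, population_prior, sample_prior):
    """Shift a logistic-regression intercept fitted on the case-control
    sample back to the case rate of the full population."""
    def logit(p):
        return math.log(p / (1.0 - p))

    return fitted_intercept + logit(population_prior) - logit(sample_prior)


# Numbers from the example above: 1000 people, 5% with heart_disease = 1.
labels = [1] * 50 + [0] * 950

# Hypothetical design choice: 5 controls for every case.
idx = case_control_sample(labels, controls_per_case=5)
sample_prior = sum(labels[i] for i in idx) / len(idx)

print(len(idx), round(sample_prior, 3))
```

All 50 cases are kept, so no information about the rare class is lost, while most of the 950 controls are dropped. The larger the gap between the class priors, the more controls can be discarded this way, which is one way to read the claim in the question.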
You can go through this example:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1706071/
Hope this helps!!