Measures of Central Tendency and Dispersion - Limitations

data_science
statistics

#1

Hi,

Please help me understand the situations in which each measure of central tendency and variability fails.

Regards,
Tony


#2

Hi @tillutony

In general, we work with the Gaussian distribution (a.k.a. normal distribution), where we choose the mean over the rest because it is easy to compute. The median is used to measure central tendency for skewed distributions, because there the mean gets pulled toward the long tail. The mode is used when we deal with nominal/categorical/ordinal data.
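To make the above concrete, here is a minimal sketch using Python's standard `statistics` module. The `incomes` and `colours` samples are made up purely for illustration; they are not from any real data set.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, two are very large.
incomes = [20, 22, 25, 27, 30, 32, 35, 40, 120, 250]

mean_income = statistics.mean(incomes)      # pulled toward the long right tail
median_income = statistics.median(incomes)  # stays with the bulk of the data

# Hypothetical categorical sample: a mean is meaningless here, the mode is not.
colours = ["red", "blue", "blue", "green", "blue", "red"]
most_common_colour = statistics.mode(colours)

print("mean:", mean_income)
print("median:", median_income)
print("mode:", most_common_colour)
```

On this sample the mean comes out well above the median, which is the usual signature of right skew.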

When it comes to variability there are two widely accepted choices.

  1. Mean absolute deviation = average of absolute errors
  2. Variance = average of squared errors

Now the function in 1) is non-differentiable at certain points (wherever an error is zero), which makes it harder to work with analytically, so it is often avoided. Still, there are statisticians who argue that it works better than 2). My personal view is that if you feel your data has a lot of outliers, you should probably go with 1), since 2) gives a lot of weight to outliers (|large error| << (large error)^2). Otherwise 2) works just fine.
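A rough sketch of how differently the two measures react to a single outlier. The helper `mean_absolute_deviation` and both samples are hypothetical, written only for this illustration; errors are measured from the mean, and the population variance (`statistics.pvariance`) is used so that both measures are plain averages.

```python
import statistics

def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean."""
    centre = statistics.fmean(values)
    return sum(abs(v - centre) for v in values) / len(values)

# Two made-up samples that differ only in the last value.
clean = [10, 11, 12, 13, 14]
with_outlier = [10, 11, 12, 13, 64]

mad_clean = mean_absolute_deviation(clean)
mad_outlier = mean_absolute_deviation(with_outlier)
var_clean = statistics.pvariance(clean)
var_outlier = statistics.pvariance(with_outlier)

# How many times larger each measure becomes once the outlier is present.
mad_growth = mad_outlier / mad_clean
var_growth = var_outlier / var_clean

print("absolute-error measure grew by a factor of", mad_growth)
print("squared-error measure grew by a factor of", var_growth)
```

The squared-error measure grows far more than the absolute-error one. Keep in mind that variance is in squared units, so the growth factors are the fair comparison here, not the raw numbers.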

Regards


#3

Hi @titusdream

Great question!
For readers' reference, measures of central tendency describe data using key metrics such as the mean, mode and median. Similarly, data sets exhibit variability, or spread. To measure that, we use measures of variability such as range, variance, standard deviation (SD), IQR etc. These measures allow you to say more in fewer words.

This is basic statistics.

But these measures have limitations. Considering the different types of data available today, you can't use a single measure like the mean, median or SD to describe the data every time. Therefore, it's important to understand which measure works best and when!

Let’s see!

Good Scenario
From a statistics point of view, data with a normal distribution is the best case (shown below). In a normal distribution, Mean = Mode = Median. Note that the converse does not hold: other symmetric, single-peaked distributions have this property too.

If you get such data, you are lucky! In that case, you can use any of the mean, mode or median to describe the data. Note: Mean & Median are preferable.

Bad Scenario
Bad things happen with data too. When a data set displays a skewed distribution (shown below), you should use the median as a measure of central tendency. A skewed distribution arises 1) when outliers are present, or 2) when most of the data is concentrated on one side of the graph.

Why Median?
You should use the median because it is barely affected by the presence of outlier values. In other words, it is robust to outliers.
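A quick sketch of that robustness, using a made-up sample and one extreme value added to it:

```python
import statistics

# Hypothetical sample clustered around 50, then the same sample plus one outlier.
values = [48, 50, 51, 52, 49]
values_with_outlier = values + [500]

# How far each measure moves when the outlier is added.
mean_shift = statistics.mean(values_with_outlier) - statistics.mean(values)
median_shift = statistics.median(values_with_outlier) - statistics.median(values)

print("mean moved by:", mean_shift)
print("median moved by:", median_shift)
```

The mean jumps a long way from where most of the data sits, while the median only moves to the midpoint of the two middle values.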

The mode is used when working with categorical data.

For variability, @B.Rabbit has given an apt explanation above. In addition, a boxplot is a good way to visualize variability in most situations. The larger the IQR, the larger the variability.
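If you want the number behind the box in a boxplot, the IQR can be sketched as below. The helper `iqr` and both samples are hypothetical. `statistics.quantiles` needs Python 3.8+ and uses its default "exclusive" method here; other tools may compute quartiles slightly differently.

```python
import statistics

def iqr(values):
    """Interquartile range: distance between the first and third quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Two made-up samples with the same median (50) but different spread.
narrow = [48, 49, 50, 50, 51, 52, 53]
wide = [10, 30, 45, 50, 55, 70, 90]

iqr_narrow = iqr(narrow)
iqr_wide = iqr(wide)

print("IQR of narrow sample:", iqr_narrow)
print("IQR of wide sample:", iqr_wide)
```

Both samples have the same centre, so the IQR is what tells them apart.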


#4

Hi All,

I appreciate all the insights.