Data Exploration - Skewness

machine_learning

#1

Hi

How can one tell whether a variable is right-skewed or left-skewed by looking at the summary results (distribution analysis)?

Regards,
Tony


#2

@tillutony - First, you can plot a histogram to see whether the distribution is symmetric or skewed. A long tail on the right side means the variable is right-skewed (positively skewed); a long tail on the left side means it is left-skewed (negatively skewed). A bell-shaped histogram suggests a symmetrical distribution.

Further, you can quantify the skewness using a tool like MS Excel or R. Excel has a built-in function, SKEW, that calculates the skewness of the data. In R, you can install the e1071 package, which provides a skewness function for this purpose.

Below is sample R code to check the skewness of a random variable:

# Install the required package "e1071"

install.packages("e1071")

# Attach the package to use it

library(e1071)

# Generate a random sample of 100 data points with a mean of 50 and a standard deviation of 5:

random_sample <- rnorm(100, 50, 5)

# Plot the data on a histogram:

hist(random_sample)

# Calculate the skewness:

skewness(random_sample)

This returns a positive or negative number depending on the direction of the skew (positive for right skew, negative for left skew). A rule of thumb for interpreting the result is:

  • If the skewness is between -0.5 and 0.5, the data is fairly symmetrical.
  • If the skewness is between -1 and -0.5 or between 0.5 and 1, the data is moderately skewed.
  • If the skewness is less than -1 or greater than 1, the data is highly skewed.
  • If the skewness is 0, the data is perfectly symmetrical (quite unlikely for real-world data).
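If you would rather not install a package, the rule of thumb above can be sketched in base R. The helper names moment_skewness and classify_skewness are made up for this illustration, and the handling of the exact boundary values (0.5 and 1) is an assumption. The moment-based formula used here can also differ slightly from the default estimator in e1071's skewness on small samples.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5.
# Hypothetical helper, not part of e1071.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]          # drop missing values
  d <- x - mean(x)           # deviations from the mean
  mean(d^3) / mean(d^2)^1.5
}

# Map a skewness value onto the rule-of-thumb labels.
# Boundary values (exactly 0.5 or 1) go to the milder label; that is an assumption.
classify_skewness <- function(s) {
  a <- abs(s)
  if (a <= 0.5) {
    "fairly symmetrical"
  } else if (a <= 1) {
    "moderately skewed"
  } else {
    "highly skewed"
  }
}

# Made-up data with one large value on the right side
x <- c(1, 2, 2, 3, 3, 3, 4, 20)
s <- moment_skewness(x)
print(s)                      # positive, so the tail is on the right
print(classify_skewness(s))
```

The sign of the value gives the direction of the skew and the label gives its rough strength.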

Hope this helps :slightly_smiling_face:


#3

Hi,

I am aware of the above. What I would like to know is how to read the distribution from the output of summary(data).

For example:

summarizeColumns(train)

name              type     na  mean         disp        median  mad      min  max  nlevs
LoanAmount        integer  22  146.4121622  85.5873252  128.0   47.4432  9    700  0
Loan_Amount_Term  integer  14  342.0000000  65.1204099  360.0   0.0000   12   480  0
Credit_History    integer  50  0.8421986    0.3648783   1.0     0.0000   0    1    0
Property_Area     factor   0   NA           0.6205212   NA      NA       179  233  3
Loan_Status       factor   0   NA           0.3127036   NA      NA       192  422  2

This function gives a much more comprehensive view of the data set than the base str() function.
From these outputs, we can make the following inferences:
1. In the data, we have 12 variables, of which Loan_Status is the dependent variable and the rest are independent variables.
2. Train data has 614 observations. Test data has 367 observations.
3. In train and test data, 6 variables have missing values (can be seen in the na column).
4. ApplicantIncome and Coapplicant Income are highly skewed variables.
How do we know that?

Please elaborate.

Regards,
Tony


#4

I am not sure that the summary alone will give you a good picture of the data distribution.

But as a rule of thumb, if the mean is less than the median you can assume a left-skewed distribution, and if the mean is greater than the median you can assume a right-skewed distribution.

So if you look at the summary of ApplicantIncome and Coapplicant Income, you can get an idea of the skewness in these variables.

For further data exploration, you can use boxplots and histograms.
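As a rough sketch of the mean-versus-median check, here it is applied to two rows of the summarizeColumns(train) output posted earlier in the thread. The function name skew_direction is invented for this example. The comparison is only a heuristic and can mislead on discrete or multi-peaked data, so treat the result as a hint to confirm with a plot.

```r
# Heuristic: compare the mean with the median to guess the skew direction.
# Hypothetical helper for illustration only.
skew_direction <- function(mean_value, median_value) {
  if (mean_value > median_value) {
    "right"   # mean pulled up by a long right tail
  } else if (mean_value < median_value) {
    "left"    # mean pulled down by a long left tail
  } else {
    "none"    # mean equals median, no skew indicated
  }
}

# Mean and median copied from the summarizeColumns(train) output in the thread
loan_amount_dir <- skew_direction(146.4121622, 128)   # LoanAmount
loan_term_dir   <- skew_direction(342, 360)           # Loan_Amount_Term

print(loan_amount_dir)
print(loan_term_dir)
```

By this check, LoanAmount looks right-skewed (mean above median) and Loan_Amount_Term looks left-skewed (mean below median).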


#5

Thanks a ton for your valuable input.


#6

Yes! Check the mean and median in the summary stats of the variable.
If mean < median, the variable is left-skewed; if mean > median, it is right-skewed.
Hope that answers your question.


#7

Everything that Manoj_Kumar said is great! However, you might also want to look at the position of the mode and at your min and max. If your min is 10, your mode is 20, and your max is 100, then the data might be right-skewed.
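One possible way to turn the mode idea into a quick check in base R is sketched below. The helpers sample_mode and tail_side are made-up names, and the data is invented to match the min 10, mode 20, max 100 example. This assumes the variable has repeated values; for continuous data you would need to bin it first. Because it relies on min and max, a single outlier can change the answer.

```r
# Most frequent value in x. On ties, the smallest value wins (an arbitrary choice).
sample_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Compare the distance from the mode to the min with the distance to the max.
# Hypothetical helper: returns which side has the longer tail.
tail_side <- function(x) {
  x <- x[!is.na(x)]
  m <- sample_mode(x)
  left_tail  <- m - min(x)
  right_tail <- max(x) - m
  if (right_tail > left_tail) {
    "right"
  } else if (left_tail > right_tail) {
    "left"
  } else {
    "balanced"
  }
}

# Invented data with min 10, mode 20, max 100
x <- c(10, 15, 20, 20, 20, 25, 30, 45, 100)
print(sample_mode(x))
print(tail_side(x))
```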