Data Normality Question

Hi,

let’s say I have a small dataset describing the demographics of a place, with fields/attributes like:

  • Place_Longitude,
  • Place_Latitude,
  • population (per district in the state),
  • median_income,
  • median_house_age,
  • median_house_value,
  • district_location (near sea or inland),
  • total_households (per state district)

I plotted histograms for the entire dataset and saw long-tailed distributions. The data is NOT normal or Gaussian.

Now my questions are:

  • When checking for normality / a Gaussian curve, should we check all the attributes or only a few particular attributes in the dataset? Which ones would those be in my case above?

  • When transforming data to make it normal, do we transform all the available data/attributes or only a few important ones?

  • How should I handle such a transformation to make the data normal: through the scipy.stats module using Box-Cox, or by some other technique? Kindly explain.

PS: Can data scaling / standardization make data normal?

Thanks & Regards,

Mohit

Hi Mohit,

First of all, you must determine what kind of measure you have in each variable; once you do this, it is easy to decide what kind of transformation will be useful.

  • Place_Longitude, geoinfo = (float64)
  • Place_Latitude, geoinfo = (float64)
  • population (per district in the state), count measure = (int32)
  • median_income, metric = (float64)
  • median_house_age, metric = (float64)
  • median_house_value, metric = (float64)
  • district_location (near sea or inland), nominal; you can transform it into a dummy variable (0 and 1), as in the sketch after this list
  • total_households (per state district), count measure = (int32)
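Assuming the data lives in a pandas DataFrame with a column literally named district_location and the two labels shown in the question (both assumptions on my side), a minimal sketch of the dummy encoding could look like this:

```python
import pandas as pd

# Hypothetical toy frame; the column name and category labels are assumptions.
df = pd.DataFrame({"district_location": ["near sea", "inland", "inland", "near sea"]})

# Encode the nominal column as a single 0/1 dummy (1 = near sea, 0 = inland)
df["near_sea"] = (df["district_location"] == "near sea").astype(int)
print(df)
```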

> Have you tested for normality? The K-S test is recommended if n > 30 cases, Shapiro-Wilk if n < 30 cases.
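As a minimal sketch of both tests with scipy.stats, here is how they could be run on one skewed column; the synthetic income data below is only a stand-in for a real column such as median_income:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.8, size=200)  # skewed, synthetic stand-in

# Shapiro-Wilk test (better suited to small samples)
w_stat, w_p = stats.shapiro(income)

# Kolmogorov-Smirnov test against a normal with the sample's mean and std
ks_stat, ks_p = stats.kstest(income, "norm", args=(income.mean(), income.std()))

print(f"Shapiro-Wilk p = {w_p:.4f}, K-S p = {ks_p:.4f}")
# A small p-value (e.g. < 0.05) means the normality hypothesis is rejected.
```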

Answering your questions:

  • When checking for normality / a Gaussian curve, should we check all the attributes or only a few particular attributes in the dataset? Which ones would those be in my case above?
    It depends on what kind of analysis you want to perform.
  • When transforming data to make it normal, do we transform all the available data/attributes or only a few important ones?
    It depends on what kind of analysis you want to perform.
  • How should I handle such a transformation to make the data normal: through the scipy.stats module using Box-Cox, or by some other technique?
    No technique guarantees a normal distribution; whether the transformed data looks normal depends on the data itself, not on the technique used. See the Box-Cox sketch below.
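As a minimal sketch of the Box-Cox route asked about above, using scipy.stats.boxcox on a synthetic, strictly positive column (the data and the column it stands in for are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
house_value = rng.lognormal(mean=12, sigma=0.6, size=500)  # strictly positive, skewed

# boxcox requires strictly positive values; it returns the transformed data
# and the lambda it estimated by maximum likelihood.
transformed, fitted_lambda = stats.boxcox(house_value)
print(f"fitted lambda = {fitted_lambda:.3f}")

# For columns containing zeros or negative values, the Yeo-Johnson transform
# is an alternative with the same calling pattern: stats.yeojohnson(x)
```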

PS: Can data scaling / standardization make data normal? No; even after such a transformation the data may not display a normal distribution.
Scaling / standardization methods are more appropriate for reducing scaling effects, i.e. differences in the magnitude of measures, as in the example below:
age = measured in years, ranging from 1 to 100 (for example)
household income = US$ 100,000.00 annually
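A minimal sketch of why standardization does not help with normality: it only shifts and rescales the values, so the skewness of the distribution is left unchanged (the synthetic income column is an assumption):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
income = rng.lognormal(mean=10, sigma=0.8, size=1000)  # skewed, synthetic stand-in

# Standardize: subtract the mean, divide by the standard deviation
standardized = (income - income.mean()) / income.std()

print(f"skew before: {stats.skew(income):.2f}")
print(f"skew after : {stats.skew(standardized):.2f}")  # same skew, still not normal
```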
