I am trying to replace the missing values in my dataset with the mean of the corresponding variable. I tried a lot of code available on various links but nothing seems to work practically. The most common code on help manuals using PROC STANDARD does not replaces the missing value. There is always “.” in place of missing value in SAS. Can someone please tell me the SAS code for this that will actually work?
please reply…i also want the similar code
For now I have figured out to do this in two step:
Step1) Find mean of the required variables by PROC MEANS.
proc means data=Dataset_Original;
var var1 var2 …;
output out= Title (drop= type freq) mean= / autoname;
Step2) Replace the missing values by the above calculated mean.
if var1 = . then var1=;
if var2= . then var2=;
This will do the job for now!
But better ideas and efficient code are welcome on this thread!
We can perform this using Proc SQL and Data steps in SAS. I have done similar exercise on Age imputation for Titanic Survival competition on Kaggle. There are various methods to identify most relevant value to impute missing values like similar case and general imputation of mean, median and mode, generate a model to predict value and others.
Here I will use general method, impute missing value with average of that variable.
Step1: Populate average age using proc SQL
Proc Sql; Create Table Titanic_Avg_Age as Select t1.*, Mean(t1.age) as avg_age from Statdata.Titanictrain as t1; Quit;
A new table is generated with a additional variable of avg_age.
**Step2:**Created a new variable Age2 and assigned value of variable avg_age if variable age is missing else value of variable age.
After that drop original variable age and rename Age2 as age.
Data Titanic_Wth_Age (Drop=Age rename=(Age2=Age)); set Titanic_Avg_Age; if age=. then Age2=avg_age; else Age2=age; run;
Hope this helps!
Here is the solution to your question:
/This MACRO Imputes Missing Values(If Any) With The Mean/
/Specify the Numeric Variable VARLIST/
/Specify the Source Data Set Name/
%DO J=1 %TO &CONT_VAR_CNT.;
SELECT MEAN(&DNAME1.) INTO :VAR_MEAN_&i.
IF MISSING(&DNAME1.) THEN &DNAME1.=&&VAR_MEAN_&i…;
/*RESULT: YOUR FINAL DATA SET NAME WILL BE “YOUR_SOURCE_DATASET_NAME”_IMPUTED */
@Imran do u feel that replacing missing age with mean is going to be a good approach as outliers will effect the outcome . i think mode can be used . and i feel we should replace the missing values according to grouping of gender .
@abhinavmishra873 Hi Abhinav. I guess you are right. But shouldn’t this dealt in a case by case basis? The percentage of missing data for a variable will help us learn the desired method of imputation. I generally use similar case imputation, where I try to find records having similar values for other variables and then take a mean or mode of those similar cases.
ya i actually i had binned the data values into groups from 0-10 till 70-180 and then i have seen that around 150 values are missing . so i tried to replace it with median .