Variable Importance in Inputs that caused a shift in the output variable

r
classification
machine_learning
correlation
regression

#1

I am working on an interesting problem: finding what caused a shift in the measured output variable by relating it to the input variables in the dataset. I am primarily using R for this task.
(Here is my dataset as a CSV: https://drive.google.com/open?id=0B7UROHet3IQwTURnSHlCcEt2Ykk)

df_sample <- read.csv("Sample.csv")

Here is the code and the plot of the output variable

library(ggplot2)
library(scales)  # provides date_breaks() and date_format()

# DATETIME must already be POSIXct for scale_x_datetime() to work
ggplot(df_sample, aes(x = DATETIME, y = OutVar)) +
  geom_point(size = 2) + geom_line() +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14, face = "bold"),
        plot.title = element_text(face = "bold", size = 20)) +
  xlab("Timestamp") + ylab("Output_Var") +
  scale_x_datetime(breaks = date_breaks("5 days"), labels = date_format("%m/%d"))

As you can clearly see, there is a shift in the output variable. I am trying to find the cause of this shift, that is, to check whether one or more of the input variables in the dataset caused it.
I have started with basic techniques and a few ML algorithms. So far I have tried:

  1. Pair wise plots, correlation using “library(qgraph)”
  2. Random forest for variable importance; the model error I am getting here is high.

I am also wondering whether I should treat this as a classification problem: label the normal points “Good” and the abnormal points “Bad”, then use classification algorithms for prediction. The only problem is that the dataset is very small, just 173 rows, and there are only a few “Bad” points. Can we get a good classification out of this?
Please suggest other techniques or directions for solving this problem.


Correlations for more than 30 variables in 'R'
#2

This could perhaps help.

Alain


#3

Hi @sharathdhamodaran

OK, I did a quick check: you have multicollinearity in this data, which shows up if you fit a simple linear model such as lm.
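
One common way to quantify multicollinearity is the variance inflation factor (VIF): regress each input on all the other inputs and compute 1 / (1 - R^2). Below is a minimal base-R sketch of that idea. The data is simulated purely for illustration (it is not Sample.csv), and the column names `x1`, `x2`, `x3` and the helper `vif_one` are made up for this example. Packages such as car also provide a `vif()` function for fitted models.

```r
# Simulated stand-in data (NOT the poster's Sample.csv): x2 is nearly a copy of x1.
set.seed(1)
n  <- 173
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
inputs <- data.frame(x1, x2, x3)

# VIF for one column: regress it on all the other columns, then 1 / (1 - R^2).
# vif_one is a hypothetical helper name used only in this sketch.
vif_one <- function(col, data) {
  others <- setdiff(names(data), col)
  fit <- lm(reformulate(others, response = col), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vifs <- sapply(names(inputs), vif_one, data = inputs)
print(round(vifs, 1))
```

A VIF well above roughly 5 to 10 is a common rule-of-thumb warning sign. In this toy data the two near-duplicate columns get large values while the independent one stays close to 1.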

Alain


#4

Hi @sharathdhamodaran

On analysing the data, I found that the trend is in the input data itself.



I can see that data is missing from mid-October to the start of December, which is when you see the shift. There is no data at all for this period.
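
A gap like this can also be found programmatically by looking at the time difference between consecutive readings. The sketch below uses made-up timestamps (the dates and the 7-day threshold are assumptions for illustration, not values from the dataset).

```r
# Hypothetical timestamps (NOT the poster's data): daily readings with a hole.
stamps <- as.POSIXct(c("2016-10-01", "2016-10-02", "2016-10-03",
                       "2016-11-20", "2016-11-21"), tz = "UTC")
stamps <- sort(stamps)

# Time difference between consecutive readings, in days.
gap_days <- as.numeric(diff(stamps), units = "days")

# Report every gap longer than an assumed threshold of 7 days.
threshold <- 7
idx  <- which(gap_days > threshold)
gaps <- data.frame(from = stamps[idx],
                   to   = stamps[idx + 1],
                   days = gap_days[idx])
print(gaps)
```

With the real data, `stamps` would be the DATETIME column after conversion to POSIXct.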

The next trend, which you have marked as “Abnormal Activity”, is due to higher values of the input variables themselves, as you can see from the graphs of a few variables I have plotted.

I hope this answers your question.

Regards,
Anant


#5

Hi @anantguptadbl

Thanks for this solution. Yes, there is no data between mid-October and late November. I wanted to check whether a single variable caused the shift in Output_var or whether two or more variables contributed to it. From what I see in your graphs, the input variables in columns 4 and 6 seem to correlate well with the shift. Since there are more than 15 variables in the dataset, is there a way to automate this and compare each of them with out_var?
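
One possible way to automate the comparison is to correlate every numeric input column with the output and rank the results. This is only a sketch on simulated data (not Sample.csv); the column names `In1` to `In3` are invented, and Spearman correlation is an assumption chosen here because it tolerates monotone but non-linear relationships.

```r
# Simulated stand-in for df_sample (NOT the real Sample.csv).
set.seed(42)
n  <- 173
df <- data.frame(In1 = rnorm(n), In2 = rnorm(n), In3 = rnorm(n))
df$OutVar <- 2 * df$In2 + rnorm(n, sd = 0.5)  # In2 drives the output by construction

# Correlate every numeric column except the output with the output.
inputs <- setdiff(names(df)[sapply(df, is.numeric)], "OutVar")
cors   <- sapply(df[inputs], cor, y = df$OutVar,
                 method = "spearman", use = "complete.obs")

# Rank by absolute correlation, strongest first.
ranking <- sort(abs(cors), decreasing = TRUE)
print(round(ranking, 2))
```

A high rank only says the variable moves together with the output; with collinear inputs it does not show which of them caused the shift.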


#6

Thanks @Lesaffrea. I will look into the multicollinearity issue. I also noticed that some variables are on different scales. Do we need to normalize the data before running any correlation plots?
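
A quick check of this: the Pearson correlation coefficient is unchanged by centering and positive rescaling, so standardizing should not alter the correlations themselves (it can still matter for scale-sensitive methods or for plotting variables on a common axis). A small sketch with made-up variables `x` and `y`:

```r
set.seed(7)
x <- rnorm(50, mean = 1000, sd = 200)  # made-up variable on a large scale
y <- 0.01 * x + rnorm(50)              # made-up variable on a small scale

r_raw    <- cor(x, y)
r_scaled <- cor(as.numeric(scale(x)), as.numeric(scale(y)))

# The two coefficients agree up to floating-point error.
print(c(raw = r_raw, scaled = r_scaled))
print(isTRUE(all.equal(r_raw, r_scaled)))
```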


#7

You can split the data by month, or maybe by week, and plot the slope (beta) of each input variable against OutVar for each period. Plot the values for all the independent variables on the same graph. This will give a pretty good understanding of which relationships changed.
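
A rough sketch of the per-period slope idea, for a single input. The data is simulated (not Sample.csv), the dates and the column name `In1` are invented, and the change of slope in December is built in on purpose so there is something to see.

```r
# Simulated stand-in data (NOT the poster's Sample.csv).
set.seed(3)
n   <- 120
day <- seq(as.Date("2016-10-01"), by = "day", length.out = n)
In1 <- rnorm(n)
# By construction the effect of In1 on the output triples from December on.
beta_true <- ifelse(day < as.Date("2016-12-01"), 1, 3)
df <- data.frame(day, In1, OutVar = beta_true * In1 + rnorm(n, sd = 0.3))

# Slope of OutVar on In1 within each calendar month.
df$month <- format(df$day, "%Y-%m")
slopes <- sapply(split(df, df$month), function(d) {
  coef(lm(OutVar ~ In1, data = d))[["In1"]]
})
print(round(slopes, 2))
```

Calling `plot(slopes, type = "b")` would then draw the slope over time. Repeating the `sapply` for each input column and overlaying the lines gives the combined graph described above.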