I am working on an interesting problem: finding the cause of a shift in the measurements of an output variable by correlating it with the input variables in the dataset. I am primarily using R for this task.
(Here is my dataset as a csv https://drive.google.com/open?id=0B7UROHet3IQwTURnSHlCcEt2Ykk)
library(ggplot2)
library(scales)  # provides date_breaks() and date_format()

df_sample <- read.csv("Sample.csv")
# read.csv() imports DATETIME as text; scale_x_datetime() needs POSIXct
# (adjust the format argument if the default parsing fails on your timestamps)
df_sample$DATETIME <- as.POSIXct(df_sample$DATETIME)

Here is the code and the resulting plot of the output variable:

ggplot(df_sample, aes(x = DATETIME, y = OutVar)) +
  geom_point(size = 2) +
  geom_line() +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14, face = "bold")) +
  theme(plot.title = element_text(face = "bold", size = 20)) +
  xlab("Timestamp") +
  ylab("Output_Var") +
  scale_x_datetime(breaks = date_breaks("5 days"),
                   labels = date_format("%m/%d"))
As you can clearly see, there is a shift in the output variable. I am trying to determine whether one or more of the input variables in this dataset caused that shift.
I have started with some basic techniques and a few ML algorithms. Some of the things I have tried:
- Pairwise plots and correlation using library(qgraph)
- Random forest for variable importance (the prediction error I am getting here is high)
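For reference, the two attempts above look roughly like this. This is only a sketch: it assumes all columns other than DATETIME are numeric inputs, and uses the standard qgraph and randomForest packages.

```r
library(qgraph)
library(randomForest)

# Correlation network of the numeric columns (the timestamp is dropped)
num_cols <- df_sample[, sapply(df_sample, is.numeric)]
qgraph(cor(num_cols), layout = "spring", labels = colnames(num_cols))

# Random forest regression on OutVar; importance = TRUE also stores
# permutation importance alongside the default node-impurity measure
set.seed(42)
rf <- randomForest(OutVar ~ . - DATETIME, data = df_sample,
                   importance = TRUE)
print(rf)       # the "% Var explained" line reflects the high OOB error
varImpPlot(rf)  # shows which inputs the forest relies on most
```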
I am also considering treating this as a classification problem: assign "Good" to the optimal points and "Bad" to the abnormal ones, then use classification algorithms for prediction. The only problem is that the dataset is very small (just 173 rows) and the "Bad" points are few as well. Can a good classifier be learned from this?
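To make the idea concrete, here is a sketch of that labelling-plus-classification setup. The cutoff of 100 on OutVar is a placeholder (I would pick the actual value from the plot), and I use leave-one-out cross-validation because the dataset is so small:

```r
library(rpart)

# Hypothetical threshold: points below 100 are labelled "Bad"
df_sample$Label <- factor(ifelse(df_sample$OutVar < 100, "Bad", "Good"))

# Leave-one-out cross-validation with a simple classification tree
n <- nrow(df_sample)
pred <- character(n)
for (i in seq_len(n)) {
  fit <- rpart(Label ~ . - OutVar - DATETIME,
               data = df_sample[-i, ], method = "class")
  pred[i] <- as.character(predict(fit, df_sample[i, ], type = "class"))
}
# Confusion matrix: how often the held-out point is classified correctly
table(Predicted = pred, Actual = df_sample$Label)
```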
Kindly suggest other techniques or directions for solving this problem.