Modelling technique for categorical predictor and continuous target

machine_learning

#1

Hi,

This is my first post, and I hope to learn a lot.

I am working with a dataset which consists of a lot of categorical variables, and the target is continuous. I am in a fix: will linear regression give meaningful insight? I am struggling with dummy variables and with predicting on the test set.

I am also wondering about an idea: read each row as a "word", find the distribution of that word over the entire dataset, and group the target accordingly. Does that make sense?

thanks


How to find the a single categorical variable importance in a set of all independent categorical variables?
#2

Hi @cachu,

Since your target variable is continuous, you can certainly try fitting a linear regression model even when your independent variables are categorical. You can also try other models, such as a decision tree or XGBoost, and compare the scores you get on the test set.
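To make the dummy-variable part concrete, here is a minimal sketch in base R. It assumes a small made-up dataset (the columns Class, Sex and CanJump, all the values, and the names train, test, fit and pred are only for illustration). lm() builds the dummy columns itself when the predictors are factors, so the main thing to watch is that the test set uses the same factor levels as the training set.

```r
# Toy training data (made up for illustration): two categorical predictors,
# one continuous target.
train <- data.frame(
  Class   = factor(c("1st", "2nd", "3rd", "Crew", "1st", "2nd", "3rd", "Crew")),
  Sex     = factor(c("Male", "Male", "Male", "Male",
                     "Female", "Female", "Female", "Female")),
  CanJump = c(1.3, 1.0, 1.7, 1.8, 1.9, 1.0, 1.5, 1.6)
)

# lm() expands factors into dummy (treatment-coded) columns automatically.
fit <- lm(CanJump ~ Class + Sex, data = train)

# The dummy columns lm() actually used: one per non-reference level.
dummy_columns <- colnames(model.matrix(fit))
print(dummy_columns)

# Test data must carry the same factor levels as the training data,
# otherwise predict() cannot map the rows onto the fitted dummies.
test <- data.frame(
  Class = factor(c("1st", "Crew"), levels = levels(train$Class)),
  Sex   = factor(c("Female", "Male"), levels = levels(train$Sex))
)

pred <- predict(fit, newdata = test)
print(pred)
```

If the test set contains a level that never appeared in training, predict() will stop with an error, so such rows need to be handled separately (for example by mapping them to an existing level).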

Can you please explain the second question again? I am not sure if I understand your query.


#3

Hi @AishwaryaSingh,

Thank you for your response.
Consider this dummy data with target Survived.

A tibble: 6 x 5

  Class Sex    Age   CanJump
  <chr> <chr>  <chr>   <num>
1 1st   Male   Child     1.3
2 2nd   Male   Child     1
3 3rd   Male   Child     1.7
4 Crew  Male   Child     1.8
5 1st   Female Child     1.9
6 2nd   Female Child     1

If I make "one word" from the 1st observation, it will look like

String: "1st/Male/Child".

I was wondering how frequently this string occurs in the whole dataset. What is the most frequent string? I would then group_by the dataset on this "string" variable and look into the distribution of the target variable. This is an idea I am trying to form. Please help me with how I can combine the levels of the variables into a string and how I can compare strings. Is there any package in R that does similar things?

thanks


#4

Hi @cachu,

You can create a new variable combining the three existing variables. For example, for the first data point the string would look something like 1_M_C.

You can then find how frequently each string appears and maybe use this variable as an important feature in your prediction. One possible way of creating these strings is by extracting the first letter of each value.

Alternatively, you can locate the data points that satisfy a given condition and label them accordingly. By this I mean that rows which have Class = 1st, Sex = Male, Age = Child can be labeled as 1_M_C.
Similarly, rows which have Class = 1st, Sex = Female, Age = Child can be labeled as 1_F_C.
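The first-letter approach could be sketched in base R roughly as follows. This assumes a data frame df holding the sample rows from this thread; df, first_letter and label are names I made up for illustration.

```r
# Sample rows from this thread, rebuilt as a plain data frame.
df <- data.frame(
  Class   = c("1st", "2nd", "3rd", "Crew", "1st", "2nd"),
  Sex     = c("Male", "Male", "Male", "Male", "Female", "Female"),
  Age     = rep("Child", 6),
  CanJump = c(1.3, 1, 1.7, 1.8, 1.9, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first character of each value, upper-cased.
first_letter <- function(x) toupper(substr(x, 1, 1))

# Combine the first letters of the three categorical columns into one label.
df$label <- paste(first_letter(df$Class),
                  first_letter(df$Sex),
                  first_letter(df$Age),
                  sep = "_")

print(df$label)
```

One caveat: if two levels of the same column start with the same letter, they end up with the same label. In that case it is safer to paste the full values together instead.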


#5

library(dplyr)
library(ggplot2)

# Drop the target, then paste the remaining columns of each row into one string
TrySet <- train %>% select(-one_of("Target"))
code <- factor(apply(TrySet, 1, function(x) paste(x, collapse = "/")))
train$code <- code
# Bar chart of code frequencies; better to skip this part if the dataset is large
train %>% ggplot(aes(x = code)) + geom_bar()
# Frequency of each code, most frequent first
prpcode <- train %>% count(code, sort = TRUE)
# The same counts via group_by/summarize
train %>% group_by(code) %>% summarize(App = n())

The code above does that part. I think I need to somehow compare the strings and collapse them using a similarity measure (I don't know how to achieve this part). Maybe this way I can conclude which observation is most frequent and which observations can be termed outliers.

Do you know of any work done this way, or am I just on a wild goose chase?

Thanks for your patience …


#6

Hi,

If you are able to create a separate column combining the data from the three columns (as we discussed previously), finding the frequency of occurrence of each string should not be a problem.
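As a rough sketch in base R, assuming the combined column is called code and the target is called CanJump (the data frame df and its values below are made up for illustration):

```r
# Toy data: a combined code column plus a continuous target.
df <- data.frame(
  code    = c("1_M_C", "1_M_C", "2_M_C", "1_F_C", "1_M_C", "2_M_C"),
  CanJump = c(1.3, 1.0, 1.7, 1.8, 1.9, 1.0),
  stringsAsFactors = FALSE
)

# How often each code occurs, most frequent first.
freq <- sort(table(df$code), decreasing = TRUE)
print(freq)

# The most frequent code.
most_frequent <- names(freq)[1]

# Mean of the target within each code, to compare groups.
by_code <- aggregate(CanJump ~ code, data = df, FUN = mean)
print(by_code)
```

Swapping mean for another function (median, sd, and so on) gives a different view of how the target is distributed within each code.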


#7

table() or count() can solve the problem, but I get too many levels. The next step should be to try to collapse some of them and create clusters. On a side note, there is the AVF algorithm (and nAVF). I am wondering whether they serve the same purpose.
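To make the two ideas concrete, here is a rough base R sketch on made-up data. The first part lumps rare codes into an "Other" level using an arbitrary threshold. The second part is my understanding of the AVF (Attribute Value Frequency) score: for each row, average how often each of its attribute values occurs in its own column, and treat the lowest scores as outlier candidates. The names df, min_count, lumped and avf_score are assumptions for illustration, not from any package.

```r
# Toy categorical data (made up for illustration).
df <- data.frame(
  Class = c("1st", "1st", "1st", "2nd", "2nd", "Crew"),
  Sex   = c("Male", "Male", "Male", "Male", "Female", "Female"),
  Age   = c("Child", "Child", "Child", "Child", "Child", "Adult"),
  stringsAsFactors = FALSE
)

# Part 1: collapse rare codes into a single "Other" level.
code <- do.call(paste, c(df, sep = "/"))   # one string per row
code_counts <- table(code)                 # frequency of each string
min_count <- 2                             # arbitrary threshold, tune as needed
lumped <- ifelse(as.vector(code_counts[code]) >= min_count, code, "Other")
print(table(lumped))

# Part 2: AVF-style score (hypothetical helper, sketch only).
avf_score <- function(data) {
  # Replace each value by the number of times it occurs in its column.
  freqs <- lapply(data, function(col) as.vector(table(col)[as.character(col)]))
  # Score of a row = mean of those frequencies; low score = unusual row.
  rowMeans(do.call(cbind, freqs))
}

score <- avf_score(df)
print(score)

# Row with the lowest score, i.e. the strongest outlier candidate.
print(which.min(score))
```

The two are not quite the same thing: lumping reduces the number of levels, while the AVF score ranks individual rows by how unusual their values are, so it does not need the combined string at all.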


#8

Nothing will make any sense if you don't give the business objective you are trying to solve. Give the objective first; then people can tell you the best way to solve it.