Question regarding overfitting and underfitting

r
machine_learning
pca
random_forest
feature_engineering

#1

Hi,
1.) How do I know if my model is overfitting or underfitting in R? What are the signs?
2.) In R, the cor() function accepts only numerical attributes. How do I get the correlation among categorical attributes? Is there a function for this?
3.) What is the difference between feature selection and dimensionality reduction?
I know that in R, specifying the importance=TRUE parameter in the randomForest function gives you the variable importance (mean decrease in accuracy and in Gini impurity). PCA is a dimensionality reduction technique which transforms your feature space to new dimensions. How do I get the subset of important features using PCA?
Please help me understand the differences between feature selection and dimensionality reduction.
4.) Is there any rule of thumb on when to use PCA based on the number of attributes?
Sorry if these questions are too basic.
Thanks


#2

Hello @chakravarty,

When you have this many questions (basic or not, they are important), it is a good idea to post them as separate questions :slight_smile:
I will try to answer these as simply as possible here and also try to point you to more detailed articles on the topics.

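1. Cross-validation will give you the best idea of how your model performs on unseen data. It splits the data into k partitions (folds), trains on k-1 of them and tests on the remaining one, rotating until every fold has been used for testing once. Compare the error on the training folds with the error on the held-out fold: if the model does well on the training data but much worse on the held-out data, it is overfitting; if it does poorly on both, it is underfitting. More detailed explanation here: http://www.analyticsvidhya.com/blog/2015/11/improve-model-performance-cross-validation-in-python-r/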
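To make the signs concrete, here is a minimal sketch in base R. The simulated data, the polynomial models and the helper names (`rmse`, `cv_poly`) are all my own assumptions for illustration, not something from your problem. It fits polynomials of increasing degree and reports the average training and validation RMSE over 5 folds.

```r
# Simulated data (an assumption for illustration): a noisy sine curve.
set.seed(42)
n <- 100
dat <- data.frame(x = runif(n, 0, 2 * pi))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Assign every row to one of 5 folds; the same folds are reused for all models.
folds <- sample(rep(seq_len(5), length.out = n))

# Hypothetical helper: k-fold CV for a polynomial regression of a given degree.
# Returns the training and validation RMSE averaged over the folds.
cv_poly <- function(data, degree, folds) {
  res <- sapply(sort(unique(folds)), function(i) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    fit   <- lm(y ~ poly(x, degree), data = train)
    c(train = rmse(train$y, predict(fit, newdata = train)),
      valid = rmse(test$y,  predict(fit, newdata = test)))
  })
  rowMeans(res)
}

degrees <- c(1, 3, 15)
results <- t(sapply(degrees, function(d) cv_poly(dat, d, folds)))
rownames(results) <- paste0("degree_", degrees)
print(round(results, 3))
```

What to look for in the printed table: a model that is too simple (degree 1 here) has high error in both columns, which is underfitting. A model that is too flexible typically has a low training error but a clearly higher validation error, which is overfitting. The same comparison works for any model, including randomForest; only the fitting and prediction lines change.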

2. You can use the cor function on categorical variables by converting them to numeric, but this is only meaningful when the categories have a natural order; for unordered categories the numeric codes are arbitrary. There are other methods, like the chi-squared test (chisq.test() in R), which will tell you whether two categorical variables are independent or not. More details:
http://www.ats.ucla.edu/stat/mult_pkg/whatstat/default.htm

and http://datascience.stackexchange.com/questions/893/how-to-get-correlation-between-two-categorical-variable-and-a-categorical-variable
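As a rough sketch of the chi-squared approach in base R: the example below uses the built-in HairEyeColor data, expanded into a data frame with two factor columns so that it looks like typical input. The choice of data set and the variable names are my assumptions. Cramér's V is not mentioned above; I added it because the test only tells you whether an association exists, while V gives a strength between 0 and 1.

```r
# Built-in data (an assumption for illustration): hair and eye colour counts,
# expanded to one row per person so it resembles a data frame of factors.
counts <- as.data.frame(margin.table(HairEyeColor, c(1, 2)))
df <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Hair", "Eye")]

# Contingency table of the two categorical variables.
tab <- table(df$Hair, df$Eye)

# Chi-squared test of independence: a small p-value suggests the two
# variables are associated (not independent).
test <- chisq.test(tab)
print(test)

# Cramér's V: strength of association, 0 (none) to 1 (perfect).
chi2 <- unname(test$statistic)
cramers_v <- sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
print(cramers_v)
```

With your own data, replace `df$Hair` and `df$Eye` with the two factor columns you want to compare. If chisq.test() warns that the approximation may be incorrect, some expected cell counts are small, and `fisher.test()` or `chisq.test(tab, simulate.p.value = TRUE)` may be more appropriate.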

3. Feature selection means picking the most important features out of the n original features using some criterion, such as the variable importance from a random forest; the selected features themselves are left unchanged. Dimensionality reduction means combining features (linearly, in the case of PCA) to get a smaller set of new, uncorrelated features. More details: http://www.analyticsvidhya.com/blog/2015/07/dimension-reduction-methods/
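Here is a small base R sketch of the difference, using the built-in mtcars data with mpg as the target (both are assumptions for illustration). For the feature selection part I used a simple correlation filter as a stand-in criterion so that no extra package is needed; with randomForest you would rank by its importance scores instead.

```r
# Built-in data (an assumption for illustration): predict mpg from the rest.
X <- mtcars[, setdiff(names(mtcars), "mpg")]
y <- mtcars$mpg

# Feature selection: rank the ORIGINAL columns by a criterion and keep a subset.
# Stand-in criterion: absolute correlation with the target.
scores <- abs(cor(X, y))[, 1]
selected <- names(sort(scores, decreasing = TRUE))[1:3]
X_selected <- X[, selected]
print(selected)

# Dimensionality reduction: PCA builds NEW columns, each a linear
# combination of all original (centred and scaled) features.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
X_reduced <- pca$x[, 1:3]

# Loadings: how much each original feature contributes to each component.
print(round(pca$rotation[, 1:3], 2))

# Cumulative share of variance captured by the first three components.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(var_explained)[1:3], 3))
```

Note that `X_selected` contains three of the original columns, while every column of `X_reduced` mixes all ten of them. That is why PCA does not directly give you a subset of important features; the loadings in `pca$rotation` only show which original features weigh most heavily in each component.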

Hope this helps!!