Basic questions

r
machine_learning
data_wrangling
regression

#1

Hi,
I have a couple of basic questions. I am a bit familiar with R.
1.) What is multicollinearity? How do I check whether there is any multicollinearity among the attributes? Is it a good thing to have among attributes, or is it a curse? Are there any functions/packages in R for checking it?
2.)Could you explain the difference between sparse data and dense data? My understanding is this:
dim(x1) ##1000 550 I think it is sparse data as we have many columns
dim(x2) ##4000 40 I think it is dense data as we have many rows
I may be terribly wrong in my understanding; please correct me.
3.) Say I have 10 attributes in my dataset: I look at str(data) and do the necessary data type conversions. What if I have 100 or 300 attributes? Do I need to change the data types manually by looking at each of them, or is there an R command or small code snippet to automate it?
Sorry if these questions are too naive.
Thanks


#2

Hi Chakravarty,

Do remember that every genius was once naive. Learning is the only way not to stay naive, and to learn one has to ask questions. It is good that you asked. I will try to answer all your questions to the best of my knowledge.

1- Multicollinearity is a phenomenon in statistics where one predictor variable in a multiple regression model is highly correlated with one or more of the others. In other words, one variable can be predicted fairly well from the others.
It is definitely a curse for us statisticians: the redundant variables add data without adding new information, and they make the coefficient estimates unstable and hard to interpret.
A first check uses correlation. If two variables are highly correlated, we can safely assume that multicollinearity exists in our dataset. Pairwise correlations can miss a variable that depends on several others jointly, which is what the variance inflation factor (VIF) picks up.
In R, start with the basic cor() function; kappa() (condition number) and VIF (for example, vif() in the car package) are also worth checking out.

VIF worth read article- https://onlinecourses.science.psu.edu/stat501/node/347
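To make the checks above concrete, here is a minimal sketch in base R on made-up data. The variables x1, x2, x3 and the simulated relationship between them are assumptions for illustration only, and the VIF is computed by hand from its definition, 1 / (1 - R^2), rather than with a package function.

```r
# Simulated predictors (hypothetical): x2 is almost a copy of x1, x3 is unrelated
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# 1) Pairwise correlations: look for entries close to +1 or -1
cor_mat <- cor(X)
print(round(cor_mat, 3))

# 2) VIF by hand: regress each predictor on the others, then 1 / (1 - R^2)
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  fit <- lm(reformulate(others, response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif_manual, 2))

# 3) Condition number of the standardized predictors (larger = more collinear)
cond_num <- kappa(scale(X), exact = TRUE)
print(cond_num)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a warning sign, but the cutoff is a convention, not a hard rule. Here x1 and x2 should stand out while x3 stays close to 1.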

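2- I will try to explain it using numerical analysis. Sparse data is data in which the majority of the values are zero, whereas dense data has mostly non-zero values.
So it is the proportion of zeros that matters, not the number of rows or columns. That said, datasets with very many features (read: columns) often turn out to be sparse in practice.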
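As a rough illustration of the sparse/dense distinction, the sketch below measures the share of zero entries in two small matrices of the same shape. The matrices and the helper zero_fraction() are hypothetical, made up for this example.

```r
# Two matrices with identical dimensions (4 x 5) but very different content
set.seed(1)
dense_x  <- matrix(rnorm(20), nrow = 4)      # essentially no zeros
sparse_x <- matrix(0, nrow = 4, ncol = 5)    # start with all zeros
sparse_x[cbind(c(1, 3), c(2, 5))] <- c(7, 3) # only two non-zero cells

# Hypothetical helper: fraction of entries that are exactly zero
zero_fraction <- function(m) mean(m == 0)

print(dim(dense_x))
print(dim(sparse_x))
print(zero_fraction(dense_x))   # 0
print(zero_fraction(sparse_x))  # 0.9
```

Both objects have the same dim(), so dim() alone cannot tell you whether data is sparse; the fraction of zeros can.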

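3- You do not need to go column by column; you can select columns by type and convert them all together. For example:
DF <- data.frame(x = letters[1:5], y = 1:5, z = LETTERS[1:5],
                 stringsAsFactors = FALSE)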
str(DF)

'data.frame': 5 obs. of 3 variables:
 $ x: chr "a" "b" "c" "d" ...
 $ y: int 1 2 3 4 5
 $ z: chr "A" "B" "C" "D" ...

The conversion (every character column becomes a factor):

DF[sapply(DF, is.character)] <- lapply(DF[sapply(DF, is.character)],
as.factor)
str(DF)

'data.frame': 5 obs. of 3 variables:
 $ x: Factor w/ 5 levels "a","b","c","d",...: 1 2 3 4 5
 $ y: int 1 2 3 4 5
 $ z: Factor w/ 5 levels "A","B","C","D",...: 1 2 3 4 5

To convert every column of the data frame to a single type, you can omit the sapply() selection, for example DF[] <- lapply(DF, as.factor).
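For a data frame with hundreds of columns, the same idea can be taken one step further by letting R guess the types first. The sketch below is one possible approach, not the only one; the data frame raw and its columns are made up for illustration. It uses type.convert() from the utils package, then turns whatever is still text into factors.

```r
# Hypothetical raw data where every column was read in as text
raw <- data.frame(
  id    = c("1", "2", "3", "4"),
  score = c("2.5", "3.1", "4.0", "1.8"),
  group = c("a", "b", "a", "b"),
  stringsAsFactors = FALSE
)

# Let R guess logical/integer/numeric columns; as.is = TRUE leaves text as character
typed <- type.convert(raw, as.is = TRUE)

# Convert the remaining character columns to factors in one step
is_chr <- vapply(typed, is.character, logical(1))
typed[is_chr] <- lapply(typed[is_chr], factor)

# Inspect the resulting class of each column
col_classes <- vapply(typed, function(col) class(col)[1], character(1))
print(col_classes)
```

It is still worth skimming str() afterwards, since automatic guessing can get things wrong, for example by treating numeric codes that are really categories as integers.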

Hope this answer helps. Please correct me if I am wrong.
Thanks for asking.


#3

Hi @NSS,
Thank you so much for replying. Your answer was helpful, and so was the link you posted.
The opening lines of your answer gave me a much-needed boost.
Thank you