Data Science Interview Questions



Hi All,

Following are some interview questions that I encountered.
I would really appreciate it if anyone could answer them with proper theory and proofs, as well as relevant R/Python code.

  • Based on the problem statement, what is the rule of thumb for choosing an ML algorithm?

  • Suppose two independent variables are highly correlated, but both are mandatory for building the ML model (e.g. linear / logistic regression). How can we build such a model without dropping either variable? Note: we are concerned about multicollinearity among the independent variables themselves, not their correlation with the target variable.

  • What is the cost function of the Random Forest algorithm?

  • How is the p-value calculated? What is the formula for it?

  • How can I control the leaf split length in a Decision Tree?

  • Is there any manual testing approach for testing a built ML model before it goes to production?

  • How do you build a topic model using deep learning?

  • Scenario: We are given 1000 rows of labeled (0/1) data with 7 independent features.
    Of these, 500 rows are 0’s and the remaining 500 are 1’s. We split the data 70:30 into train and test sets. We train logistic regression on the training data,
    but when predicting on the test data the model predicts 0’s more often than 1’s. What is this situation known as, and how do we deal with it?

Kindly answer them ASAP


For your last scenario, the answer is overfitting, and the solution is cross-validation.
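
To make the cross-validation suggestion concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the data is synthetic (the post gives no dataset), the helper names `make_data`, `fit_logistic`, `predict` and `k_fold_accuracy` are hypothetical, and the logistic regression is a bare-bones gradient-descent version, not a production implementation.

```python
import math
import random


def make_data(n_rows=1000, n_features=7, seed=0):
    """Synthetic balanced 0/1 data (assumption): class 1 has feature means shifted by +1."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_rows):
        label = i % 2  # exactly 500 zeros and 500 ones
        x = [rng.gauss(label, 1.0) for _ in range(n_features)]
        rows.append((x, label))
    rng.shuffle(rows)
    return rows


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, epochs=20, lr=0.05):
    """Plain stochastic gradient descent on the log loss."""
    w = [0.0] * len(rows[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in rows:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(model, x):
    w, b = model
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


def k_fold_accuracy(rows, k=5):
    """Train on k-1 folds, score on the held-out fold, repeat k times."""
    folds = [rows[i::k] for i in range(k)]  # simple, unstratified folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = fit_logistic(train)
        correct = sum(predict(model, x) == y for x, y in test)
        scores.append(correct / len(test))
    return scores


rows = make_data()
scores = k_fold_accuracy(rows)
print("fold accuracies:", [round(s, 3) for s in scores])
print("mean accuracy:", round(sum(scores) / len(scores), 3))
```

If the fold accuracies are close to each other and to the accuracy on the training data, the single 70:30 split was probably just an unlucky draw; a large gap between training and held-out accuracy points towards overfitting. In practice you would likely use scikit-learn's `cross_val_score` with `StratifiedKFold` rather than hand-rolled folds.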