I am facing issues with improving my models and scores

Tags: r, machine_learning, data_science, python
#1

Hello Everyone,

I am new to data science and I have enrolled in a course where I have learnt a lot of techniques like linear regression, logistic regression, trees, random forests, GBM, SVM, clustering, stacking, and parameter tuning. I am facing issues with choosing features, improving my score, deciding which model to choose, and deciding which models to stack. I need some suggestions so that I can get better at this. I have visited several websites but I haven't found any solution for these issues.

#2

Hi @Umairnsr87, it really depends on what your target variable is and how you are handling missing values and outliers.
Try changing the features you selected. The best thing you can do is open some basic problems like the Iris dataset or the Titanic dataset, look at some kernels on Kaggle, and try to understand what those Kaggle winners have done.
Refer to: https://www.kaggle.com/mgabrielkerr/visualizing-knn-svm-and-xgboost-on-iris-dataset
https://www.kaggle.com/ash316/ml-from-scratch-with-iris

Also, if you have any particular doubt you can mail me at palbhanazwale@gmail.com
If I know the answer, I will help you …

#3

Thanks @palbha. The response is really good and I appreciate it. I understand that it depends on the target variable, missing values, and outliers.

But how do I look at the target variable so that I know what to do? In one solution, someone changed the distribution from Pareto to normal using a Box-Cox transform. How will I know that I also need to change the distribution?

And feature engineering is about changing the features to be considered and providing them to the models.

Let's suppose I have 5 features (a, b, c, d, e); so feature engineering is about the selection of features which will give the best result, right?

Kindly correct me if I am wrong.

#4

Hi @Umairnsr87, we look at the target variable's distribution because a basic assumption when applying many models is that the distribution is normal. When the distribution of the target variable is skewed, we apply transformations because we want the output to be distributed normally. Feature selection is about selecting the variables which have a significant impact on our model, whereas feature engineering is about creating new variables based on our analysis; for example, from a date we can find the year, month, day of the week, and so on.
This particular kernel on Kaggle is really good when it comes to understanding the transformations:
https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard
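
If it helps, here is a minimal sketch of both ideas in Python. The data, the skewed `SalePrice` target and the `date` column are made-up placeholders, not from your problem:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up example data: a right-skewed target and a date column
df = pd.DataFrame({
    "SalePrice": np.random.lognormal(mean=11, sigma=0.5, size=1000),
    "date": pd.date_range("2015-01-01", periods=1000, freq="D"),
})

# Target transformation: log1p or Box-Cox pulls a right-skewed target
# towards a roughly normal shape
df["SalePrice_log"] = np.log1p(df["SalePrice"])
df["SalePrice_boxcox"], lam = stats.boxcox(df["SalePrice"])
print("skew before:", stats.skew(df["SalePrice"]))
print("skew after log1p:", stats.skew(df["SalePrice_log"]))

# Feature engineering: derive new features from the date column
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek
```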

I hope it helps. Have a nice day!! :grinning:

#5

Hi @Umairnsr87
I'll attempt to answer your questions while complementing the points made by @palbha.

  1. Feature Selection
    i. Correlation Matrix
    You can do this before building your model. Plotting a correlation matrix will help you see how much your target variable is affected by each individual feature. The matrix gives a range of 0.00 to 1, and the higher the value, the higher the correlation. You can then do your selection based on your judgement of the correlation.
    ii. Feature Importance
    This is done after building your model. It works in a similar way to a correlation matrix, but it shows how much each feature influenced the target variable while the model was being built (see the sketch after this list).

  2. Model Improvement
    The most common method of model improvement is parameter tuning. This involves changing the values of your model's parameters before building the model.
    Using a grid search, you can set a range of values for each parameter; during training, a model is built for every combination, and the search eventually gives you the parameter values of the best model (see the sketch after this list).

  3. Normalisation
    As @palbha explained, the distribution of variables needs to be normal for them to give the best result while training.
    To know whether a variable is normally distributed or not, there are a number of statistical methods you can use, such as plotting a distribution plot and calculating the skewness of the data.
    There are different normalisation methods, such as the Box-Cox transform you mentioned, and the choice of method depends on a number of factors.
    The link below will give you a good explanation of which method to use:
    http://desktop.arcgis.com/en/arcmap/latest/extensions/geostatistical-analyst/box-cox-arcsine-and-log-transformations.htm#GUID-E4D26783-5FF6-4B71-9FE1-B5A9F4E56AA6

  4. Feature Extraction
    This involves extracting features from other features; for example, you can extract the year, quarter, month or day from a date variable.
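
Here is a rough sketch of points 1 and 2 above. The DataFrame, the five features a–e and the `target` column are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Made-up example data with five features a..e and a target
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
df["target"] = 3 * df["a"] - 2 * df["b"] + rng.normal(size=200)

# 1.i Feature selection: correlation of each feature with the target
corr = df.corr()["target"].drop("target").sort_values(key=abs, ascending=False)
print(corr)

X, y = df.drop(columns="target"), df["target"]

# 2. Model improvement: grid search over a small hyperparameter grid
param_grid = {"n_estimators": [100, 300], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=5)
search.fit(X, y)
print("best params:", search.best_params_)

# 1.ii Feature importance, read off the fitted model
importances = pd.Series(search.best_estimator_.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```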

Reach out in case of any questions about the above or the code.

#6

Thank you @chriskinyua. Your answer is really good. I want to ask for more details about the features to select from the correlation matrix. Do you mean to say that if the correlation value of a feature is high, then I can ignore that feature? When this is clear I will move to the next point. Thank you for your patience and support.

#7

Hey Umairnsr87,

To improve your model there are 3 simple methods you need to apply.
I am giving you the list along with URLs so that you can read and understand them better.

  1. Dimension Reduction
    This technique helps in reducing the dimension, or number of variables, by giving you the list of variables which have the most variance or which affect the result the most.
    Here is a link for you to understand it better - Dimensionality Reduction
  2. K-Fold Cross Validation
    K-fold cross validation helps us understand the true accuracy of our model by dividing the dataset into batches (folds) and then measuring the accuracy of the model on each. A good model has high accuracy and a low deviation.
    For example, a model with an accuracy of 86±1 is better than one with 90±5 (see the sketch after this list).
    Here is a link that will help you understand it better - k-fold Cross Validation
  3. Grid Search
    Models and their accuracies can be improved by controlling the hyperparameters; these are parameters provided to the algorithm which tell the model how to perform its operations on the data.
    A URL - towardsdatascience.com/grid-search-for-model-tuning-3319b259367e
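
For the cross-validation point, a minimal sketch (the Iris data and the random forest are just placeholders): score the same model on k folds and report mean ± standard deviation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data and model, only to illustrate k-fold scoring
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0)

# 5-fold cross-validation: each fold is held out once for scoring
scores = cross_val_score(model, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```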

Do practice with simpler models first, to get a better understanding.

#8

Hi @Umairnsr87, you're welcome. Let me correct my earlier point.
The range should be -1 to +1 and not 0 to 1. Negative correlation can exist, and features with a negative correlation are still useful, just like in linear regression where variables can have a negative weight.

Regarding your question, a higher absolute value indicates a stronger linear relationship, so you should pick that feature rather than ignore it.
You can also plot a scatter plot of individual features against the target to show you the relationship, in case it is linear.
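
For example, a quick sketch of such a scatter plot (the data and column names are made up for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data: one feature with a roughly linear relation to the target
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200)})
df["target"] = 2 * df["a"] + rng.normal(size=200)

plt.scatter(df["a"], df["target"], alpha=0.5)
plt.xlabel("feature a")
plt.ylabel("target")
plt.title("Feature vs. target")
plt.show()
```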

#9

Thanks a lot for the patience @chriskinyua. Can you please explain the concept of normalizing the variables? I am pretty confused about this concept, about when to do it and when not to. It will be my pleasure, brother.

#10

Hi @Umairnsr87, apologies for the late reply.
First I’ll try to explain what a normal distribution is and why it is important.

A normal distribution of a variable takes the shape of a bell curve where most values tend towards the center of the distribution and the distribution’s mean, median and mode are equal.

This is important because a lot of machine learning algorithms assume that the data follows a normal distribution before fitting. This is also supported by the Central Limit Theorem, which states that the means of samples taken from a distribution will themselves follow a normal distribution, with a mean equal to that of the original distribution.

In order to determine if your distribution is normal, you can apply statistical methods such as plotting the probability density function of a variable and calculating its skewness.
A normal distribution is symmetric on either side of its mean/center. To describe that symmetry, skewness is used as a metric. A distribution can be left skewed or right skewed, and a normal distribution has a skewness of 0. Negative values indicate data that is left skewed, and positive values indicate data that is right skewed.
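
For example, here is a small sketch of those checks on a made-up, right-skewed variable:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Made-up right-skewed variable
rng = np.random.default_rng(0)
values = rng.lognormal(mean=0, sigma=1, size=1000)

# Skewness is ~0 for a normal distribution; clearly positive here
print("skewness:", stats.skew(values))

# Visual checks: histogram and normal probability (Q-Q) plot
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(values, bins=50)
axes[0].set_title("Histogram")
stats.probplot(values, dist="norm", plot=axes[1])
plt.show()
```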

Depending on the skewness and the type of data, you can apply one of the various normalization techniques. This will help transform your data into a normal distribution and help your algorithm fit the data better.
You do not need to normalize your data if it already follows a normal distribution.