I am currently studying the naive Bayes classifier, in which a Laplace argument takes care of additive smoothing, but I am not able to understand what additive smoothing is in the context of the classifier.
I'm not sure if I have covered all aspects, but additive smoothing is used primarily for scenarios where you expect to see attributes or data points that weren't present in the training data set.
This is especially common when classifying text data: you will often encounter words that weren't present in the training set.
E.g.: let's say you have three words A, B, C. They appear in the sentences you are classifying as having either positive or negative sentiment.
So your dependent variable is a class: Positive or Negative.
The predictors are the words (not all of which may be present in any given sentence).
P(Class = Positive | X1 = A, X2 = B, X3 = C) = P(X1 = A | Class = Positive) * P(X2 = B | Class = Positive) * P(X3 = C | Class = Positive) * P(Class = Positive) / (P(X1 = A) * P(X2 = B) * P(X3 = C))
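As a minimal sketch of that formula, here is how you might score a sentence from raw counts (the counts below are made up for illustration). Since the denominator P(X1 = A) * P(X2 = B) * P(X3 = C) is the same for both classes, comparing the numerators is enough to pick a class:

```python
from collections import Counter

# Hypothetical per-class word counts from a training set (made-up numbers).
positive_counts = Counter({"A": 3, "B": 2, "C": 1})
negative_counts = Counter({"A": 1, "B": 1, "C": 4})
prior = {"Positive": 0.5, "Negative": 0.5}

def likelihood(word, counts):
    """P(word | class) as a plain relative frequency (no smoothing yet)."""
    total = sum(counts.values())
    return counts[word] / total

def score(words, counts, class_prior):
    """Unnormalised numerator of the naive Bayes posterior."""
    p = class_prior
    for w in words:
        p *= likelihood(w, counts)
    return p

pos = score(["A", "B", "C"], positive_counts, prior["Positive"])
neg = score(["A", "B", "C"], negative_counts, prior["Negative"])
print("Positive" if pos > neg else "Negative")
```

Note that `likelihood` returns 0 for any word missing from `counts`, which is exactly the problem described next.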
Now let's say a test sentence contains another word, X4 = D, and unfortunately none of your training cases had D. Then
P(X4 = D | Class = Positive) = 0, right?
Since the likelihoods are multiplied together, this single zero makes the entire probability zero. To prevent that from happening, you add a smoothing, or additive smoothing, component.
Without smoothing: P(X1 = A | Class = Positive) = (number of cases of A when Class = Positive) / (total count of words when Class = Positive)
With smoothing: P(X1 = A | Class = Positive) = (number of cases of A when Class = Positive + alpha) / (total count of words when Class = Positive + alpha * V), where V is the number of distinct words in the vocabulary (here V = 4, counting D).
Here alpha is your smoothing parameter; alpha = 1 gives Laplace smoothing. Adding alpha * V in the denominator keeps the smoothed probabilities summing to 1 over the vocabulary.
With smoothing alpha = 1, you get the following for the unseen word D:
P(X4 = D | Class = Positive) = (0 + 1) / (N + V), where N is the total word count for the Positive class. This is small but nonzero, so the product no longer collapses to zero.
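A small sketch of that smoothed estimate, reusing the same hypothetical Positive-class counts as before (A appears 3 times, B twice, C once, so N = 6, and the vocabulary includes the unseen word D, so V = 4):

```python
from collections import Counter

# Hypothetical Positive-class word counts; D never appears in training.
positive_counts = Counter({"A": 3, "B": 2, "C": 1})
vocabulary = {"A", "B", "C", "D"}  # includes the unseen word D

def smoothed_likelihood(word, counts, vocab, alpha=1.0):
    """Additive (Laplace) smoothing: (count + alpha) / (N + alpha * |V|)."""
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * len(vocab))

p_d = smoothed_likelihood("D", positive_counts, vocabulary)
print(p_d)  # (0 + 1) / (6 + 1 * 4) = 0.1
```

With alpha = 1 the unseen word gets probability 1/10 instead of 0, and the seen words are shaved down slightly so that everything still sums to 1.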