Gradient Descent problem in Titanic dataset

#1

Hi Everyone,

I am trying to implement gradient descent without regularization on the Titanic dataset, using Andrew Ng's machine learning course as a reference.

Here are the formulas he gives for logistic regression (m = number of records):

J(theta) = -(1/m) * sum over all m records of [ y*log(h(x)) + (1 - y)*log(1 - h(x)) ]

theta_j := theta_j - (alpha/m) * sum over all records of (h(x) - y)*x_j   (for every feature j)

h(x) = 1 / (1 + e^(-theta^T x))
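In vectorized NumPy form the three formulas can be sketched as follows (assuming `X` is an m×n design matrix and `y` an m-vector; note that theta^T x is a dot product, so it needs `np.dot`, not the elementwise `np.multiply`):

```python
import numpy as np
from scipy.special import expit  # expit(z) = 1 / (1 + e^(-z)), numerically stable sigmoid

def cost_and_gradient(theta, X, y):
    m = len(y)
    h = expit(X.dot(theta))                               # h(x) for all m records at once
    cost = -(y * np.log(h) + (1 - y) * np.log(1 - h)).mean()
    grad = X.T.dot(h - y) / m                             # (1/m) * sum((h(x) - y) * x_j)
    return cost, grad
```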

Following is my implementation in python3 -

```python
from scipy.special import expit  # sigmoid

iterations = 1000
theta = np.random.uniform(0, 1, X.shape[1])  # random initial theta
errorLog = np.empty(iterations)
alpha = 0.0001
epsilon = 1e-7  # keeps the arguments of log() away from 0 and 1

for i in range(iterations):
    hx = expit(X.dot(theta))                  # h(x) = sigmoid(theta^T x) for all records
    hx = np.clip(hx, epsilon, 1 - epsilon)    # avoid divide-by-zero error in log
    costerror = -y * np.log(hx) - (1 - y) * np.log(1 - hx)
    errorLog[i] = costerror.mean()
    gradient = X.T.dot(hx - y) / len(y)       # (1/m) * sum((h(x) - y) * x_j)
    theta = theta - alpha * gradient          # update step
```

However, when I plot cost error against iteration, I don't get a consistent curve across multiple runs of this code. Sometimes the error increases and sometimes it decreases with each iteration. Below are some plots of the same:

Can anyone suggest what is going wrong here? It's supposed to decrease with every iteration. I tried different values of alpha (0.1, 0.01, 0.001, 0.0001, etc.) but it made no difference.

#2

I just noticed that when I initialize theta with random values in (0, 1), I get a consistent curve (though it does NOT converge), but when I widen the range beyond 1 the same issue returns. Any thoughts?

#3

You are right. In the actual algorithm, the curves will differ between runs, and the initial cost depends entirely on the random initialization. However, when implementing this in production, a random seed (say `random seed = 10`) is set so that the results remain consistent.

The results may also vary for GBM trees, since random number generation happens at the splits too. To get consistent results, set the random seed with `np.random.seed(13)`. Let us know what the results look like with this change. Thanks!
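For reference, a minimal sketch of how seeding makes the random initialization reproducible (the array size 5 is just an example):

```python
import numpy as np

np.random.seed(13)                    # fix the RNG state
theta_a = np.random.uniform(0, 1, 5)

np.random.seed(13)                    # re-seeding restores the same state
theta_b = np.random.uniform(0, 1, 5)

same = np.array_equal(theta_a, theta_b)  # identical initial theta on every run
```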

#4

Thanks @Shaz13, I had missed this point, and you brought me peace by pointing it out!
Now I see a consistent curve, but the cost function keeps increasing no matter what alpha or initial theta I choose.
Here are the columns of my titanicTrain_gradient_X:

```python
['Fare', 'Pclass__1', 'Pclass__2', 'Sex__female', 'SibSp__0', 'SibSp__1',
 'SibSp__2', 'SibSp__3', 'SibSp__4', 'Parch__0', 'Parch__1', 'Parch__2',
 'Parch__3', 'Embarked__C', 'Embarked__Q']
```

I also imputed missing values with the median and did outlier treatment. What could be going wrong here? Any suggestions?
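The median imputation mentioned above can be sketched like this (assuming a pandas DataFrame; the `Fare` values here are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Fare": [7.25, 71.28, np.nan, 8.05]})

# Fill missing Fare values with the column median (median of 7.25, 71.28, 8.05 is 8.05)
df["Fare"] = df["Fare"].fillna(df["Fare"].median())
filled = df["Fare"].tolist()
```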

#5