This is similar to linear regression, except now the variable we want to predict
can only take discrete values. E.g. this binary classification problem:
\(
y \in \{0, 1\}
\)
0: negative class (e.g. absence of something, tumor is benign)
1: positive class (e.g. presence of something, tumor is malignant)
Therefore the logistic regression algorithm must always return a value for the
hypothesis that is between 0 and 1.
Hypothesis in logistic regression
Again following Andrew Ng’s ML course, we want \(0 \le h_\theta(x) \le 1\), so
now:
\(h_\theta(x) = g(\theta^T x)\)
where
\(g(z) = \frac{1}{1 + e^{-z}}\)
and \(g(z)\) is called the sigmoid function or logistic function. So:
\(h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}\)
Let’s plot the sigmoid function below
In [123]:
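A minimal sketch of a cell that would produce this plot, using numpy and matplotlib (the helper name sigmoid is illustrative and is reused in the sketches below):

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    # The logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 200)
plt.plot(z, sigmoid(z))
plt.axhline(0.5, linestyle='--', color='grey')  # g(z) = 0.5 at z = 0
plt.xlabel('z')
plt.ylabel('g(z)')
plt.show()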
Note that \(0 < g(z) < 1\) for all \(z\) (with \(g(z) \ge 0.5\) when \(z \ge 0\)), so the hypothesis
\(h_\theta(x) = g(\theta^T x)\)
is always bounded between 0 and 1.
Interpretation of hypothesis output
\(h_\theta(x)\) is the probability that \(y=1\) given input \(x\) for a model
parameterised by \(\theta\), or:
\(h_\theta(x) = p(y=1\mid x; \theta) \)
So if the variable \(x\) represented tumor size and the hypothesis returned the
value \(h_\theta(x)=0.7\), then there is a 70% probability that the tumor is malignant.
The code below shows a function defining the hypothesis for logistic regression.
In [124]:
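For example (building on the sigmoid helper above; the name hypothesis and its argument order are illustrative):

def hypothesis(theta, X):
    # h_theta(x) = g(theta^T x), evaluated for every row of X
    return sigmoid(X.dot(theta))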
Data
I’ll use data from Andrew Ng’s course to illustrate logistic regression in
python. This data set is an applicant’s scores on two exams and the admissions
decision.
Applicants who were admitted appear as the upper-right points (in red), and
applicants who were not admitted appear as the lower-left points (in blue).
In [125]:
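A sketch of loading and plotting the data; the file name and column layout are assumptions based on the course materials:

# Assumed file name; columns: exam 1 score, exam 2 score, admitted (0 or 1)
data = np.loadtxt('ex2data1.txt', delimiter=',')
X, y = data[:, :2], data[:, 2]
print('Data set has {} feature(s) and {} data points'.format(X.shape[1], X.shape[0]))

admitted = y == 1
plt.scatter(X[admitted, 0], X[admitted, 1], c='red', label='admitted')
plt.scatter(X[~admitted, 0], X[~admitted, 1], c='blue', label='not admitted')
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.legend()
plt.show()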
Data set has 2 feature(s) and 100 data points
Cost function
We can’t simply plug our logistic regression hypothesis into the same cost
function we used for linear regression: the sigmoid function is highly
non-linear, so doing so would produce a non-convex cost function, and gradient
descent would not be guaranteed to converge to the global minimum.
We need a cost function that is a “bowl shape” (a convex function) of the
parameters \(\theta\), so that gradient descent is guaranteed to converge to the
global minimum.
The cost function for logistic regression will be:
\(J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\, y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]\)
Let’s plot what we expect this to look like as a function of \(h_\theta\) for
\(y=1\) and \(y=0\)
In [126]:
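A sketch of the plot, using the per-example costs \(-\log(h_\theta)\) for \(y=1\) and \(-\log(1 - h_\theta)\) for \(y=0\):

h = np.linspace(0.001, 0.999, 200)
plt.plot(h, -np.log(h), 'b', label='y = 1')        # cost when the true class is 1
plt.plot(h, -np.log(1.0 - h), 'r', label='y = 0')  # cost when the true class is 0
plt.xlabel(r'$h_\theta(x)$')
plt.ylabel('cost')
plt.legend()
plt.show()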
Focussing on the \(y=1\) (blue) curve: the cost \(=0\) when \(h_\theta=1\), and as
\(h_\theta \rightarrow 0\) the cost \(\rightarrow \infty\), which is the behavior we
want. The reverse is true for the \(y=0\) curve.
Gradient descent
The update rule for the parameters \(\theta\) in logistic regression,
\(\theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}\),
looks identical to that of linear regression, but because the hypothesis
\(h_\theta\) has changed it is not exactly the same thing.
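A sketch of the cost (and its gradient, which the optimiser below can use); the intercept column is appended last here only to match the ordering of the printed \(\theta\) further down, which is an assumption:

def cost_function(theta, X, y):
    # Cross-entropy cost J(theta) averaged over the m training examples
    m = len(y)
    h = hypothesis(theta, X)
    return -(1.0 / m) * (y.dot(np.log(h)) + (1.0 - y).dot(np.log(1.0 - h)))

def gradient(theta, X, y):
    # Partial derivatives of J(theta) with respect to each theta_j
    m = len(y)
    return (1.0 / m) * X.T.dot(hypothesis(theta, X) - y)

# Append the intercept column (placed last here; an assumption to match the printed theta)
X1 = np.column_stack([X, np.ones(len(y))])
theta0 = np.zeros(X1.shape[1])
print('On first iteration, value of cost function = {}'.format(cost_function(theta0, X1, y)))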
On first iteration, value of cost function = 0.69314718056
Finding minimum of cost function
I’m going to use a built-in function from scipy instead of my own gradient
descent (copying this stage of Ng’s course, where he uses Matlab’s fminunc),
though still using my own function for the cost.
In [128]:
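scipy.optimize.minimize can play the role of fminunc; the choice of the BFGS method here is illustrative:

from scipy import optimize

res = optimize.minimize(cost_function, theta0, args=(X1, y),
                        jac=gradient, method='BFGS')
theta_opt = res.x
print('Optimal theta = {}'.format(theta_opt))
print('Cost function at minimum = {}'.format(res.fun))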
Optimal theta = [ 0.20623159 0.20147149 -25.16131872]
Cost function at minimum = 0.203497701589
Predictions
To evaluate how well the statistical model does, compare the predicted class to
the actual class.
In [129]:
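A sketch of the prediction step, thresholding the hypothesis at 0.5:

def predict(theta, X):
    # Predict class 1 when h_theta(x) >= 0.5, otherwise class 0
    return (hypothesis(theta, X) >= 0.5).astype(int)

accuracy = 100.0 * np.mean(predict(theta_opt, X1) == y)
print('{} percent of classifications on the training data were correct'.format(accuracy))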
89.0 percent of classifications on the training data were correct
Non-linear decision boundary
This data set (from Andrew Ng’s course) is the outcome of two different tests on
microchips to determine if they are functioning correctly. As can be seen from
the plot of the data, the boundary between functioning and not functioning is
clearly not linear.
Applying logistic regression as we just did will not end up performing well on
this dataset because it will only be able to find a linear decision boundary.
In [130]:
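A sketch of loading and plotting this second data set, much as before (again, the file name and column layout are assumptions):

# Assumed file name; columns: test 1 result, test 2 result, functioning (0 or 1)
data2 = np.loadtxt('ex2data2.txt', delimiter=',')
X2, y2 = data2[:, :2], data2[:, 2]
print('Data set has {} feature(s) and {} data points'.format(X2.shape[1], X2.shape[0]))

ok = y2 == 1
plt.scatter(X2[ok, 0], X2[ok, 1], label='functioning')
plt.scatter(X2[~ok, 0], X2[~ok, 1], label='not functioning')
plt.xlabel('Microchip test 1')
plt.ylabel('Microchip test 2')
plt.legend()
plt.show()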
Data set has 2 feature(s) and 118 data points
Therefore we shall follow the exercise in Ng’s course and re-map the two
features onto new features which are all the possible polynomial terms of \(x_1\)
and \(x_2\) up to a degree of 6.
A logistic regression classifier trained on this higher-dimensional feature
vector will have a more complex decision boundary, which will appear non-linear
when drawn in the 2-D plot.
In [131]:
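A sketch of the feature mapping, generating every term \(x_1^i x_2^j\) with \(i + j \le 6\), including the bias term (which is why the count comes to 28):

def map_feature(x1, x2, degree=6):
    # All polynomial terms x1^(i-j) * x2^j for i = 0..degree, j = 0..i
    # (the i = 0 term is the column of ones, i.e. the bias)
    terms = [x1 ** (i - j) * x2 ** j for i in range(degree + 1) for j in range(i + 1)]
    return np.column_stack(terms)

X2_mapped = map_feature(X2[:, 0], X2[:, 1])
print('There are now {} features'.format(X2_mapped.shape[1]))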
There are now 28 features
Regularised cost function
Because of the high-order polynomial terms now relating \(x_1\) and \(x_2\) to \(y\),
we will need to avoid overfitting by using regularisation, which adds a penalty
term \(\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2\) to the cost function.
In [132]:
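A sketch of the regularised cost, adding the penalty from above to the cross-entropy term (the bias term is conventionally left out of the penalty):

def cost_function_reg(theta, X, y, lam):
    # Cross-entropy term plus (lam / 2m) * sum of theta_j^2, excluding the bias theta[0]
    m = len(y)
    h = hypothesis(theta, X)
    cross_entropy = -(1.0 / m) * (y.dot(np.log(h)) + (1.0 - y).dot(np.log(1.0 - h)))
    penalty = (lam / (2.0 * m)) * np.sum(theta[1:] ** 2)
    return cross_entropy + penalty

theta0 = np.zeros(X2_mapped.shape[1])
print('On first iteration, value of cost function = {}'.format(cost_function_reg(theta0, X2_mapped, y2, 1.0)))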
On first iteration, value of cost function = 0.69314718056
Now I repeat the fit with less regularisation (\(\lambda=0\), i.e. none) and more
(\(\lambda=100\)). The first case should cause over-fitting: it gets more training
classifications correct than \(\lambda=1\) does, but the decision boundary is more
complicated. The second case should cause under-fitting, where below you can
see that the model does not fit the training data well at all.
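A rough sketch of that comparison, refitting for each \(\lambda\) and checking the training accuracy (plotting the decision boundaries is left out here):

for lam in (0.0, 1.0, 100.0):
    res = optimize.minimize(cost_function_reg, np.zeros(X2_mapped.shape[1]),
                            args=(X2_mapped, y2, lam), method='BFGS')
    acc = 100.0 * np.mean(predict(res.x, X2_mapped) == y2)
    print('lambda = {}: {} percent of training classifications correct'.format(lam, acc))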