Linear regression can be used when you want to predict a single (continuous) value rather than
a class or category, and when a linear relationship between the features of the
model and the value to be predicted is a reasonable assumption. What follows closely tracks
the linear regression lectures from Andrew Ng’s Machine Learning course on
Coursera.
Gradient descent minimises the squared-error cost function

\(J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2,\)

where \(h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n\) is the hypothesis, by repeatedly applying the update rule

\(\theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)},\)

simultaneously updating every feature parameter \(j\). The index \(i\) refers to the
individual data points, of which there are \(m\).
Feature scaling
Imagine you want to predict life expectancy based upon, say, a person’s annual
salary and the number of children they have. The two features have very
different scales: salary could run from $0 to $1,000,000+, whereas the number of children could
be 0 to 10+.
Imagine the contours drawn by constant values of the cost function over
that 2D space: because the salary feature takes a huge range of values compared
to the children feature, the contours will be elongated in the salary dimension. If this
is the case, gradient descent will have trouble finding the minimum, as it will
oscillate around a lot on its way to the global minimum (a large step along the
number-of-children axis is equivalent to a tiny step along the salary axis).
To solve this problem we should put the features onto a similar scale, e.g. by
dividing each feature by its maximum value. A good aim is to get every feature
approximately into the range \(-1 \le x_i \le 1\); it’s ok if they are not exactly
in this range, as long as they are within roughly the same order of magnitude of it.
Possible ways to normalise:
divide by the maximum value of the feature (good if the minimum is zero)
mean normalisation: replace \(x_i\) by \((x_i - \mu_i)\) and divide by the range (maximum minus minimum)
mean normalisation: replace \(x_i\) by \((x_i - \mu_i)\) and divide by \(s_i\)
where \(\mu_i\) is the average value of the \(x_i\) feature in the training set, and
\(s_i\) is the standard deviation of \(x_i\).
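As a concrete illustration, here is a minimal NumPy sketch of mean normalisation (the function name `mean_normalise` and the toy numbers are just for illustration):

```python
import numpy as np

def mean_normalise(x):
    """Mean-normalise a 1D array of feature values: subtract the mean and
    divide by the standard deviation, so the result is roughly in -1..1."""
    mu = x.mean()
    s = x.std()
    return (x - mu) / s, mu, s

# Toy example: salaries in dollars and number of children
salary = np.array([25_000.0, 40_000.0, 90_000.0, 250_000.0])
children = np.array([0.0, 2.0, 3.0, 1.0])

salary_norm, _, _ = mean_normalise(salary)
children_norm, _, _ = mean_normalise(children)
print(salary_norm)     # both features now sit on a comparable scale
print(children_norm)
```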
Learning rate \(\alpha\)
It is helpful to plot the value of \(J(\Theta)\) as a function of the number of
iterations. If gradient descent is working properly, \(J(\Theta)\) should decrease
after every iteration. When \(J(\Theta)\) flattens off we can read off the number of
iterations it takes to converge (the cost function no longer decreases significantly
with further parameter updates). It is hard to tell in advance how many
iterations any particular application will need, but this can be experimented
with.
Automatic convergence tests are often used to declare convergence, e.g. \(\Delta
J(\Theta)<10^{-3}\), but these should be checked afterwards, for example by plotting
\(J(\Theta)\).
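As a sketch of such a check (the `has_converged` helper and the cost values below are illustrative; in practice the cost history is recorded during gradient descent):

```python
import matplotlib.pyplot as plt

def has_converged(cost_history, tol=1e-3):
    """Declare convergence when J decreases by less than tol between iterations."""
    return len(cost_history) >= 2 and (cost_history[-2] - cost_history[-1]) < tol

# Illustrative cost history; in practice record J(Theta) at every iteration
cost_history = [32.0, 12.0, 6.5, 5.1, 4.8, 4.79, 4.789]
print(has_converged(cost_history))   # True once the decrease drops below 1e-3

# Always sanity-check the automatic test by plotting J against iteration number
plt.plot(cost_history)
plt.xlabel("iteration")
plt.ylabel(r"$J(\Theta)$")
plt.show()
```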
If gradient descent does not appear to be working, try using a smaller value
for \(\alpha\). But if \(\alpha\) is too small, gradient descent will be slow to
converge, potentially prohibitively so.
Choosing features
Be careful when choosing the features of the model. For example, combining two
features that are closely related to what you are trying to predict into a single
feature (such as multiplying a plot’s frontage by its depth to use the area)
can result in a better model.
Polynomial regression
It’s easy to see how we can extend this to polynomial regression by rewriting
\(h_\theta(x)\), e.g.

\(h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3\)
Warning: feature scaling becomes even more important here!
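For instance, to fit the cubic above to a single measured quantity such as house size, one can simply build the powers of the feature and scale them; a minimal sketch:

```python
import numpy as np

size = np.array([1000.0, 1500.0, 2000.0, 3000.0])   # house sizes in square feet

# Build polynomial features: x, x^2, x^3
features = np.vstack([size, size**2, size**3])

# Feature scaling is essential here: x^3 is of order 1e9-1e10 while x is ~1e3
features = (features - features.mean(axis=1, keepdims=True)) \
           / features.std(axis=1, keepdims=True)

# The hypothesis is still linear in the parameters theta, so exactly the same
# gradient descent machinery applies to the scaled polynomial features.
```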
Normal equation method
Another method of solving for the minimum of the cost function, but this time
analytically; this replaces gradient descent.
We simply differentiate the cost function with respect to each parameter and set
the derivatives to zero.
Normal equation:
Place all the features into a matrix \(X\) of \(m\) rows (data points) by \(n+1\) columns
(the features, plus a leading column of ones for the intercept term), and place the
values to be predicted into an \(m\) rows (data points) by 1 vector \(y\) (because there
is just a single value to predict here):
\(\Theta = (X^TX)^{-1}X^Ty \)
Each training example gives a feature vector, and these are used to build the
“design matrix” \(X\): each column feature vector \(x\) is transposed and placed
into a row of \(X\). The equation above gives the value of \(\Theta\) at the
minimum of the cost function.
With this method feature scaling is not necessary. You also don’t need to choose
a learning rate and don’t have to iterate.
However, when the number of features \(n\) is large the normal equation becomes
very slow, because you have to invert an \(n \times n\) matrix. The cost
of inverting a matrix is \(O(n^3)\), so you should rethink using this method if \(n>10000\).
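A minimal NumPy sketch of the normal equation on toy data, using the \(m \times (n+1)\) design-matrix convention of this section (solving the linear system rather than forming the inverse explicitly, which is numerically preferable):

```python
import numpy as np

# Toy data: m = 5 examples, n = 1 feature, roughly y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix X: m rows, n+1 columns (leading column of ones for the intercept)
X = np.column_stack([np.ones_like(x), x])

# Theta = (X^T X)^{-1} X^T y, computed by solving the linear system
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # [intercept, slope] ~ [0.14, 1.96]
```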
Python implementation of linear regression
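The original notebook cell is not reproduced here; below is a minimal sketch of what such an implementation might look like, following the shape conventions used in the shape printouts later in this post (feature matrix of shape \((n+1, m)\), parameter vector of shape \((1, n+1)\)):

```python
import numpy as np

def compute_cost(theta, X, y):
    """Squared-error cost J(Theta).

    theta : (1, n+1) parameter row vector
    X     : (n+1, m) feature matrix, first row all ones
    y     : (1, m)   values to predict
    """
    m = y.shape[1]
    h = theta.dot(X)                      # (1, m) hypothesis
    return np.sum((h - y) ** 2) / (2 * m)

def gradient_descent(theta, X, y, alpha=0.01, num_iters=1500):
    """Batch gradient descent, updating all parameters simultaneously."""
    m = y.shape[1]
    cost_history = []
    for _ in range(num_iters):
        h = theta.dot(X)                  # (1, m)
        grad = (h - y).dot(X.T) / m       # (1, n+1)
        theta = theta - alpha * grad      # simultaneous update of every theta_j
        cost_history.append(compute_cost(theta, X, y))
    return theta, cost_history
```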
Test this implementation using the data sets supplied with exercise 1 of Andrew
Ng’s course. The data below is simply the population of a city versus the profit of a food truck in that city.
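A sketch of the loading and initial cost evaluation that produces the output below, assuming the `ex1data1.txt` file from the course exercise (comma-separated population, profit) and reusing `compute_cost` from the sketch above:

```python
import numpy as np

# Assumed file layout: one "population,profit" pair per line
data = np.loadtxt("ex1data1.txt", delimiter=",")
population, profit = data[:, 0], data[:, 1]
m = len(profit)

# Feature matrix (2, m): a row of ones for the intercept, then the feature
X = np.vstack([np.ones(m), population])
y = profit.reshape(1, m)
theta = np.zeros((1, 2))

print("Data set has %d feature(s) and %d data points" % (X.shape[0] - 1, m))
print("On first iteration, value of cost function =", compute_cost(theta, X, y))
```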
Data set has 1 feature(s) and 97 data points
On first iteration, value of cost function = 32.0727338775
Visualize the cost function
For a grid of likely \(\Theta\) parameters, calculate the values of the cost
function and plot them.
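A sketch of how this grid evaluation might look, reusing `compute_cost`, `X` and `y` from above (the grid ranges are guesses chosen to bracket the minimum):

```python
import numpy as np
import matplotlib.pyplot as plt

theta0_vals = np.linspace(-10, 10, 100)
theta1_vals = np.linspace(-1, 4, 110)
J_vals = np.zeros((len(theta0_vals), len(theta1_vals)))

# Evaluate the cost at every (theta_0, theta_1) pair on the grid
for i, t0 in enumerate(theta0_vals):
    for j, t1 in enumerate(theta1_vals):
        J_vals[i, j] = compute_cost(np.array([[t0, t1]]), X, y)

# Contour plot of the cost surface
plt.contour(theta0_vals, theta1_vals, J_vals.T, levels=np.logspace(-1, 3, 20))
plt.xlabel(r"$\Theta_0$")
plt.ylabel(r"$\Theta_1$")
plt.show()

# Grid point with the lowest cost
i_min, j_min = np.unravel_index(J_vals.argmin(), J_vals.shape)
print("At minimum of array: Theta_0 =", theta0_vals[i_min],
      "Theta_1 =", theta1_vals[j_min])
```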
At minimum of array: Theta_0 = -3.93939393939 Theta_1 = 1.20183486239
Do the regression
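A sketch of the fitting step that produces the output below, reusing `gradient_descent`, the data arrays, and the grid minimum found above, and assuming the learning rate and iteration count from the course exercise (\(\alpha = 0.01\), 1500 iterations):

```python
print("First check shapes of all inputs:")
print("Shape of feature matrix:", X.shape)
print("Shape of values to predict:", y.shape)
print("Shape of feature parameter vector:", theta.shape)
print("Shape of resulting hypothesis:", theta.dot(X).shape)

# Expected answer taken from the cost-surface grid minimum above
print("Expectation: Theta_0 (intercept) =", theta0_vals[i_min],
      "Theta_1 (slope) =", theta1_vals[j_min])

theta_fit, cost_history = gradient_descent(theta, X, y, alpha=0.01, num_iters=1500)
print("Result: Theta_0 (intercept) =", theta_fit[0, 0],
      "Theta_1 (slope) =", theta_fit[0, 1])
```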
First check shapes of all inputs:
Shape of feature matrix: (2, 97)
Shape of values to predict: (1, 97)
Shape of feature parameter vector: (1, 2)
Shape of resulting hypothesis: (1, 97)
Expectation: Theta_0 (intercept) = -3.93939393939 Theta_1 (slope) = 1.20183486239
Result: Theta_0 (intercept) = -3.6302914394 Theta_1 (slope) = 1.16636235034
Again we test this using a data set supplied with exercise 1 of Andrew Ng’s
course. The data below is a training set of housing prices in Portland, Oregon:
the size of the house (in square feet), the number of bedrooms, and the price of the
house. Because house sizes are generally in the 1000s of square feet while the number of
bedrooms is < 10, we will need to use feature normalisation here.
Data set has 2 features and 47 data points
Plot the multivariate data
Feature normalization
The size of the house is on a much larger scale than the number of bedrooms.
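A sketch of the normalisation and set-up for the multivariate case that produces the shapes below, assuming the `ex1data2.txt` file from the course exercise (comma-separated size, bedrooms, price):

```python
import numpy as np

data = np.loadtxt("ex1data2.txt", delimiter=",")
size, bedrooms, price = data[:, 0], data[:, 1], data[:, 2]
m = len(price)

# Mean-normalise each feature so they sit on comparable scales
size_norm = (size - size.mean()) / size.std()
bedrooms_norm = (bedrooms - bedrooms.mean()) / bedrooms.std()

# Feature matrix (3, m): ones for the intercept, then the two normalised features
X = np.vstack([np.ones(m), size_norm, bedrooms_norm])
y = price.reshape(1, m)
theta = np.zeros((1, 3))

print("Shape of feature matrix:", X.shape)
print("Shape of values to predict:", y.shape)
print("Shape of feature parameter vector:", theta.shape)
print("Shape of hypothesis:", theta.dot(X).shape)
```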
Shape of feature matrix: (3, 47)
Shape of values to predict: (1, 47)
Shape of feature parameter vector: (1, 3)
Shape of hypothesis: (1, 47)