And human learning too
In linear regression we are given $m$ training pairs $(x^{(i)},y^{(i)})$ for $i=1,\cdots,m$, where $x^{(i)}\in\mathbb R^n$ and $y^{(i)}\in\mathbb R$, and we want to predict $y$ from $x$ using a linear function. More precisely, we want to find weights $w\in\mathbb R^n$ that minimize the mean squared error between the predicted values $\hat y^{(i)}=w^T\cdot x^{(i)}$ and the actual values $y^{(i)}$ over all training pairs, that is, we want to find
\[\hat w:=\arg\min_w\frac 1m\sum_{i=1}^m(w^T\cdot x^{(i)}-y^{(i)})^2.\]In practice, we allow ourselves to consider affine functions instead of linear functions, that is, functions of the form $y=w^T\cdot x+b$ instead of $y=w^T\cdot x$. This can be achieved easily by adding an extra component with a constant value of 1 to each input $x^{(i)}$, so that $x\in\mathbb R^{n+1}$ and $w\in\mathbb R^{n+1}$ as well; the extra weight component then plays the role of $b$.
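To make the bias-column trick concrete, here is a minimal NumPy sketch; the helper name `add_bias_column` and the toy inputs are my own choices, purely for illustration.

```python
import numpy as np

# A minimal sketch of the bias-column trick (helper name is my own).
# Appending a constant 1 to every input lets the affine map w^T x + b
# be treated as a purely linear map in R^{n+1}: the last weight plays
# the role of b.
def add_bias_column(X):
    """Append a column of ones to an (m, n) matrix of inputs."""
    m = X.shape[0]
    return np.hstack([X, np.ones((m, 1))])

# Example: m = 3 inputs with n = 2 features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
print(add_bias_column(X))
# [[1. 2. 1.]
#  [3. 4. 1.]
#  [5. 6. 1.]]
```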
In order to compute $\hat w$ analytically, it will be useful to vectorize the setup above. Arrange the $x^{(i)}$ as rows in a matrix $X\in\mathbb R^{m\times (n+1)}$ i.e.
\[X=\begin{bmatrix} \rule[.5ex]{2.5ex}{0.5pt}{x^{(1)}}^T\rule[.5ex]{2.5ex}{0.5pt}\\ \vdots\\ \rule[.5ex]{2.5ex}{0.5pt}{x^{(m)}}^T\rule[.5ex]{2.5ex}{0.5pt}\\ \end{bmatrix}.\]Similarly, arrange $y^{(i)}$ into a column vector $y\in\mathbb R^{m\times 1}$ i.e.
\[y=\begin{bmatrix} y^{(1)}\\ \vdots\\ y^{(m)} \end{bmatrix}.\]Then the minimization problem becomes
\[\begin{align*} \hat w&=\arg\min_w\frac 1m\Vert X\cdot w-y\Vert_2^2\\ &=\arg\min_w\Vert X\cdot w-y\Vert_2^2. \end{align*}\]Define
\[f(w):=\Vert X\cdot w-y\Vert_2^2,\quad f:\mathbb R^{n+1}\to\mathbb R.\]Then computing $\hat w$ becomes a simple calculus problem, which we can solve by setting all the partial derivatives of $f$ to 0 and solving for $w$. Recall that we have already computed the Jacobian of $f$ (which contains exactly all the partial derivatives of $f$):
\[\begin{align*} \frac{d}{d w}f(w)&=2(X\cdot w-y)^T\cdot X. \end{align*}\]Setting the above equal to 0 and solving for $w$ gives
\[\begin{align*} 2(X\cdot w-y)^T\cdot X&=0\\ \implies(X\cdot w-y)^T\cdot X&=0\\ \implies X^T\cdot (X\cdot w-y)&=0\\ \implies X^T\cdot X\cdot w-X^T\cdot y&=0\\ \implies X^T\cdot X\cdot w&=X^T\cdot y\\ \implies w&=(X^TX)^{-1}X^Ty. \end{align*}\](The last step assumes that $X^TX$ is invertible, which holds exactly when $X$ has full column rank. Technically we should also check that this is indeed a minimum, not a saddle point nor a maximum, by looking at the eigenvalues of the Hessian matrix of $f$.)
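As a sanity check on the formula $\hat w=(X^TX)^{-1}X^Ty$, here is a small NumPy sketch; the function name and synthetic data are my own, and `np.linalg.solve` is used instead of explicitly forming the inverse.

```python
import numpy as np

# A sketch of the normal-equation solution w = (X^T X)^{-1} X^T y.
# Assumes X already includes the bias column and X^T X is invertible;
# solving the linear system is preferred over forming the inverse.
def fit_linear_regression(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy check: recover known weights from noiseless synthetic data.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(50, 3)), np.ones((50, 1))])  # bias column
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true
w_hat = fit_linear_regression(X, y)
print(np.allclose(w_hat, w_true))  # True (up to floating-point error)
```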
A common approach to preventing overfitting in a linear regression model is to modify the objective that we minimize over $w$ so that it instead becomes
\[\hat w:=\arg\min_w\frac 1m\left(\left(\sum_{i=1}^m (w^T\cdot x^{(i)}-y^{(i)})^2\right)+\lambda\Vert w\Vert_2^2\right),\]where $\lambda\in\mathbb R^+$ is a constant (in practice it’s a hyperparameter that needs to be tuned). The above setup is also called “ridge regression”.
The additional $\lambda\Vert w\Vert_2^2$ term penalizes large weights, driving down the magnitude of $w$ and shrinking all of its components toward zero (though, unlike an $\ell_1$ penalty, it rarely makes them exactly zero). This has a regularizing effect: intuitively, a model with smaller weights is less complex, since its output depends less strongly on any single component of the input.
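To see the shrinking effect concretely, the following sketch (my own toy data and arbitrary values of $\lambda$) evaluates the closed-form ridge solution $(X^T\cdot X+\lambda\mathbb I)^{-1}X^T\cdot y$ derived below for increasing $\lambda$ and prints $\Vert w\Vert_2$, which decreases as $\lambda$ grows.

```python
import numpy as np

# Illustration only: watch ||w||_2 shrink as lambda grows.
# Uses the ridge closed form (X^T X + lambda*I)^{-1} X^T y derived below.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 5)), np.ones((100, 1))])
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=100)

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    print(f"lambda = {lam:6.1f}   ||w||_2 = {np.linalg.norm(w):.4f}")
```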
In matrix form the minimization becomes
\[\begin{align*} \hat w&=\arg\min_w\frac 1m\left(\Vert X\cdot w-y\Vert_2^2+\lambda \Vert w\Vert_2^2\right)\\ &=\arg\min_w\left(\Vert X\cdot w-y\Vert_2^2+\lambda \Vert w\Vert_2^2\right). \end{align*}\]We now proceed in the same way as before. Define
\[f(w):=\Vert X\cdot w-y\Vert_2^2+\lambda \Vert w\Vert_2^2.\]Then
\[\begin{align*} \frac{d}{d w}\left(\Vert X\cdot w-y\Vert_2^2+\lambda \Vert w\Vert_2^2\right)&=\frac{d(\Vert X\cdot w-y\Vert_2^2)}{dw}+\frac{d(\lambda \Vert w\Vert_2^2)}{dw}\\ &=2(X\cdot w-y)^T\cdot X+2\lambda\cdot w^T. \end{align*}\]Setting the above equal to 0 and solving for $w$ gives
\[\begin{align*} 2(X\cdot w-y)^T\cdot X+2\lambda\cdot w^T&=0\\ \implies(X\cdot w-y)^T\cdot X+\lambda\cdot w^T&=0\\ \implies X^T(X\cdot w-y)+\lambda\cdot w&=0\\ \implies X^T\cdot X\cdot w-X^T\cdot y+\lambda\cdot w&=0\\ \implies (X^T\cdot X+\lambda\mathbb I)w&=X^T\cdot y\\ \implies w&=(X^TX+\lambda\mathbb I)^{-1}X^Ty. \end{align*}\]
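Note that for $\lambda>0$ the matrix $X^T\cdot X+\lambda\mathbb I$ is always invertible: $X^TX$ is positive semi-definite, so the eigenvalues of $X^T\cdot X+\lambda\mathbb I$ are at least $\lambda$, and unlike plain linear regression no full-rank assumption on $X$ is needed. Here is a minimal NumPy sketch of this ridge solution, with a numerical check that the gradient derived above vanishes at $\hat w$; the function name and toy data are my own.

```python
import numpy as np

# A sketch of the ridge solution w = (X^T X + lambda*I)^{-1} X^T y.
# For lambda > 0 the matrix X^T X + lambda*I is always invertible,
# so no assumption on the rank of X is needed.
def fit_ridge_regression(X, y, lam):
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Toy data (my own, for illustration), with a bias column appended.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(50, 3)), np.ones((50, 1))])
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.1 * rng.normal(size=50)

lam = 1.0
w_hat = fit_ridge_regression(X, y, lam)

# The gradient 2(Xw - y)^T X + 2*lambda*w^T should be (numerically) zero at w_hat.
grad = 2 * (X @ w_hat - y) @ X + 2 * lam * w_hat
print(np.allclose(grad, 0))  # True
```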