And human learning too
In linear regression we are given $m$ training pairs $(x^{(i)},y^{(i)})$ for $i=1,\cdots,m$, where $x^{(i)}\in\mathbb R^n$ and $y^{(i)}\in\mathbb R$, and we want to predict $y$ from $x$ using a linear function. More precisely, we want to find weights $w\in\mathbb R^n$ that minimize the mean squared error between the predicted values $\hat y^{(i)}=w^T\cdot x^{(i)}$ and the actual values $y^{(i)}$ over all training pairs, that is, we want to find
\[\hat w:=\arg\min_w\frac 1m\sum_{i=1}^m(w^T\cdot x^{(i)}-y^{(i)})^2.\]In practice, we allow ourselves to consider affine functions instead of linear functions, that is, functions of the form $y=w^T\cdot x+b$ instead of $y=w^T\cdot x$. This can be achieved easily by adding an extra component with a constant value of 1 to each input $x^{(i)}$, so that $x\in\mathbb R^{n+1}$ and $w\in\mathbb R^{n+1}$ as well; the extra weight component then plays the role of $b$.
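To make the bias-column trick concrete, here is a minimal NumPy sketch; the helper name `add_bias_column` and the toy inputs are my own choices, purely for illustration.

```python
import numpy as np

# A minimal sketch of the bias-column trick (helper name is my own).
# Appending a constant 1 to every input lets the affine map w^T x + b
# be treated as a purely linear map in R^{n+1}: the last weight plays
# the role of b.
def add_bias_column(X):
    """Append a column of ones to an (m, n) matrix of inputs."""
    m = X.shape[0]
    return np.hstack([X, np.ones((m, 1))])

# Example: m = 3 inputs with n = 2 features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
print(add_bias_column(X))
# [[1. 2. 1.]
#  [3. 4. 1.]
#  [5. 6. 1.]]
```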
In order to compute $\hat w$ analytically, it will be useful to vectorize the setup above. Arrange the $x^{(i)}$ as rows in a matrix $X\in\mathbb R^{m\times (n+1)}$ i.e.
\[X=\begin{bmatrix} \rule[.5ex]{2.5ex}{0.5pt}{x^{(1)}}^T\rule[.5ex]{2.5ex}{0.5pt}\\ \vdots\\ \rule[.5ex]{2.5ex}{0.5pt}{x^{(m)}}^T\rule[.5ex]{2.5ex}{0.5pt}\\ \end{bmatrix}.\]Similarly, arrange $y^{(i)}$ into a column vector $y\in\mathbb R^{m\times 1}$ i.e.
\[y=\begin{bmatrix} y^{(1)}\\ \vdots\\ y^{(m)} \end{bmatrix}.\]Then the minimization problem becomes
\[\begin{align*} \hat w&=\arg\min_w\frac 1m\Vert X\cdot w-y\Vert_2^2\\ &=\arg\min_w\Vert X\cdot w-y\Vert_2^2. \end{align*}\]Define
\[f(w):=\Vert X\cdot w-y\Vert_2^2,\quad f:\mathbb R^{n+1}\to\mathbb R.\]Then computing $\hat w$ becomes a simple calculus problem, which we can solve by setting all the partial derivatives of $f$ to 0 and solving for $w$. Recall that we have already computed the Jacobian of $f$ (which contains exactly all the partial derivatives of $f$):
\[\begin{align*} \frac{d}{d w}f(w)&=2(X\cdot w-y)^T\cdot X. \end{align*}\]Setting the above equal to 0 and solving for $w$ gives
\[\begin{align*} 2(X\cdot w-y)^T\cdot X&=0\\ \implies(X\cdot w-y)^T\cdot X&=0\\ \implies X^T\cdot (X\cdot w-y)&=0\\ \implies X^T\cdot X\cdot w-X^T\cdot y&=0\\ \implies X^T\cdot X\cdot w&=X^T\cdot y\\ \implies w&=(X^TX)^{-1}X^Ty. \end{align*}\](The last step assumes that $X^TX$ is invertible, which holds exactly when $X$ has full column rank. Technically we should also check that this is indeed a minimum, not a saddle point nor a maximum, by looking at the eigenvalues of the Hessian matrix of $f$.)
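As a sanity check on the formula $\hat w=(X^TX)^{-1}X^Ty$, here is a small NumPy sketch; the function name and synthetic data are my own, and `np.linalg.solve` is used instead of explicitly forming the inverse.

```python
import numpy as np

# A sketch of the normal-equation solution w = (X^T X)^{-1} X^T y.
# Assumes X already includes the bias column and X^T X is invertible;
# solving the linear system is preferred over forming the inverse.
def fit_linear_regression(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy check: recover known weights from noiseless synthetic data.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(50, 3)), np.ones((50, 1))])  # bias column
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true
w_hat = fit_linear_regression(X, y)
print(np.allclose(w_hat, w_true))  # True (up to floating-point error)
```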
A common approach to preventing overfitting in a linear regression model is to modify the objective that we minimize over $w$ so that it instead becomes
\[\hat w:=\arg\min_w\frac 1m\left(\left(\sum_{i=1}^m (w^T\cdot x^{(i)}-y^{(i)})^2\right)+\lambda\Vert w\Vert_2^2\right),\]where $\lambda\in\mathbb R^+$ is a constant (in practice it’s a hyperparameter that needs to be tuned). The above setup is also called “ridge regression”.
The additional $\lambda\Vert w\Vert_2^2$ term penalizes large weights, driving down the magnitude of $w$ and shrinking all of its components toward zero (though, unlike an $\ell_1$ penalty, it rarely makes them exactly zero). This has a regularizing effect: intuitively, a model with smaller weights is less complex, since its output depends less strongly on any single component of the input.
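To see the shrinking effect concretely, the following sketch (my own toy data and arbitrary values of $\lambda$) evaluates the closed-form ridge solution $(X^T\cdot X+\lambda\mathbb I)^{-1}X^T\cdot y$ derived below for increasing $\lambda$ and prints $\Vert w\Vert_2$, which decreases as $\lambda$ grows.

```python
import numpy as np

# Illustration only: watch ||w||_2 shrink as lambda grows.
# Uses the ridge closed form (X^T X + lambda*I)^{-1} X^T y derived below.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 5)), np.ones((100, 1))])
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=100)

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    print(f"lambda = {lam:6.1f}   ||w||_2 = {np.linalg.norm(w):.4f}")
```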
In matrix form the minimization becomes
\[\begin{align*} \hat w&=\arg\min_w\frac 1m\left(\Vert X\cdot w-y\Vert_2^2+\lambda \Vert w\Vert_2^2\right)\\ &=\arg\min_w\left(\Vert X\cdot w-y\Vert_2^2+\lambda \Vert w\Vert_2^2\right). \end{align*}\]We now proceed in the same way as before. Define
\[f(w):=\Vert X\cdot w-y\Vert_2^2+\lambda \Vert w\Vert_2^2.\]Then
\[\begin{align*} \frac{d}{d w}\left(\Vert X\cdot w-y\Vert_2^2+\lambda \Vert w\Vert_2^2\right)&=\frac{d(\Vert X\cdot w-y\Vert_2^2)}{dw}+\frac{d(\lambda \Vert w\Vert_2^2)}{dw}\\ &=2(X\cdot w-y)^T\cdot X+2\lambda\cdot w^T. \end{align*}\]Setting the above equal to 0 and solving for $w$ gives
\[\begin{align*} 2(X\cdot w-y)^T\cdot X+2\lambda\cdot w^T&=0\\ \implies(X\cdot w-y)^T\cdot X+\lambda\cdot w^T&=0\\ \implies X^T(X\cdot w-y)+\lambda\cdot w&=0\\ \implies X^T\cdot X\cdot w-X^T\cdot y+\lambda\cdot w&=0\\ \implies (X^T\cdot X+\lambda\mathbb I)w&=X^T\cdot y\\ \implies w&=(X^TX+\lambda\mathbb I)^{-1}X^Ty. \end{align*}\]
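Note that for $\lambda>0$ the matrix $X^T\cdot X+\lambda\mathbb I$ is always invertible: $X^TX$ is positive semi-definite, so the eigenvalues of $X^T\cdot X+\lambda\mathbb I$ are at least $\lambda$, and unlike plain linear regression no full-rank assumption on $X$ is needed. Here is a minimal NumPy sketch of this ridge solution, with a numerical check that the gradient derived above vanishes at $\hat w$; the function name and toy data are my own.

```python
import numpy as np

# A sketch of the ridge solution w = (X^T X + lambda*I)^{-1} X^T y.
# For lambda > 0 the matrix X^T X + lambda*I is always invertible,
# so no assumption on the rank of X is needed.
def fit_ridge_regression(X, y, lam):
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Toy data (my own, for illustration), with a bias column appended.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(50, 3)), np.ones((50, 1))])
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.1 * rng.normal(size=50)

lam = 1.0
w_hat = fit_ridge_regression(X, y, lam)

# The gradient 2(Xw - y)^T X + 2*lambda*w^T should be (numerically) zero at w_hat.
grad = 2 * (X @ w_hat - y) @ X + 2 * lam * w_hat
print(np.allclose(grad, 0))  # True
```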