We saw that a system of equations can be written as a matrix equation #A\vec{x}=\vec{b}#. If the rank of the coefficient matrix #A# is equal to the rank of the augmented matrix #(A\mid\vec{b})#, then the system has a solution #\vec{x}#.
The only alternative is that the rank of the coefficient matrix is smaller than the rank of the augmented matrix. In this case there is no solution. We now examine how we can find the best approximation to a solution. In other words, we seek a vector #\vec{\hat{x}}# for which the distance #\norm{A\vec{\hat{x}}-\vec{b}}# between #\vec{b}# and the image under #A# of the approximate solution #\vec{\hat{x}}# is minimal.
Let #A# be an #(m\times n)#-matrix and #\vec{b}# a vector of #\mathbb{R}^m#.
- The equation #A^{\top}A\vec{x}=A^{\top}\vec{b}# with unknown #\vec{x}# has a solution.
- Each solution #\vec{\hat{x}}# of this equation gives a best estimate for the equation #A\vec{x}=\vec{b}# in the sense that #\norm{A\vec{x}-\vec{b}}# is minimal for #\vec{x} = \vec{\hat{x}}# among all vectors #\vec{x}# of #\mathbb{R}^m#.
If the #(m\times n)#-matrix #A# has rank #n#, we can easily find the best approximation. After all, then the matrix #A^{\top}A# is invertible and we can follow the following procedure:
- Calculate #A^\top A# and #A^{\top}\vec{b}#.
- Compute the inverse of #A^\top A#.
- Now #\vec{\hat{x}}# is given by #(A^{\top} A)^{-1}\, (A^\top\vec{b})#.
- Now #\vec{\hat{b}}# is given by #A\vec{\hat{x}}#.
The fact that the condition on the rank is essential becomes clear when we look at the following example \[\matrix{1&1\\1&1}\, \cv{x\\y} = \cv{1\\0}\] The matrix #A^{\top}A# is given by \[\matrix{2&2\\2&2}\] This matrix is clearly not invertible, since its determinant is equal to #0#.
This statement is especially useful when the system #A\,\vec{x}=\vec{b}# does not have a solution. If this system has a solution, that solution is also a solution of #A^{\top}A\vec{x}=A^{\top}\vec{b}#, so the least squares procedure is unnecessarily cumbersome.
Consider the inconsistent system
\[\eqs{x+y&=&1\\
2x+3y&=&4\\
x-y&=&0}\] It is clear that this system does not have a solution. The last equation gives #x=y#, which, in combination with the first, leads to #x=y=\frac{1}{2}#. If we substitute this into the second equation, we get #\frac{5}{2}=4#. Since we can not find an exact solution, we will look for a best approximation. To this end, we first write the system as the matrix equation #A\vec{x} = \vec{b}# with
\[A = \matrix{1 & 1\\ 2&3\\1 & -1}\quad \text{ and } \quad \vec{b} = \matrix{1\\4\\0}
\] We calculate #A^{\top}A# and #A^{\top}\vec{b}#.
\[\begin{array}{rcl}
A^{\top}A&=&\displaystyle\matrix{1&3&-1\\1&2&1}\matrix{1&1\\2&3\\1&-1}=\matrix{6&11\\6&6} \\ A^{\top} \vec{b}&=&\displaystyle\matrix{1&3&-1\\1&2&1}\matrix{1\\4\\0}=\matrix{13\\9}\end{array}
\] The inverse of the matrix #A^{\top}A# is given by
\[(A^{\top} A)^{-1}=\matrix{-\frac{1}{5}&\frac{11}{30}\\\frac{1}{5}&-\frac{1}{5}}
\] Now we can calculate the vector #\vec{\hat{x}}#:
\[\vec{\hat{x}}=(A^{\top} A)^{-1}A^\top\vec{b}=\matrix{-\frac{1}{5}&\frac{11}{30}\\\frac{1}{5}&-\frac{1}{5}}\matrix{13\\9}=\matrix{\frac{7}{10}\\\frac{4}{5}}
\] The matrix #A# maps this vector onto the vector #\vec{\hat{b}}#, which is given by
\[\vec{\hat{b}}=A\vec{\hat{x}}=\matrix{1 & 1\\ 2&3\\1 & -1}\matrix{\frac{7}{10}\\\frac{4}{5}}=\matrix{\frac{2}{3}\\\frac{19}{5}\\-\frac{1}{10}}
\]
A classic application of the least squares method has a statistical character. Let #\rv{x_i,y_i}# for #i=1,\ldots,n# be points of the #x,y#-plane. The problem of linear regression asks for a line that best describes this collection of points. To this end, we write #y=a+b\cdot x# for an equation of the line, where #a# and #b# are yet to be determined real numbers. A best approximate solution #\rv{a,b}# of the following system is required.
\[\begin{array}{rcl}
a+bx_1&=&y_1\\
a+bx_2&=&y_2\\
\vdots &\quad& \vdots\\
a+bx_n&=&y_n
\end {array}\] In matrix form:
\[\matrix{1&x_1\\1&x_2\\\vdots&\vdots\\1&x_n}\matrix{a\\b}=\matrix{y_1\\y_2\\\vdots\\y_n}
\]The least squares method will provide it.
The same method as linear regression works for finding the best approximation of a set of points #\rv{x_i,y_i}# for #i=1,\ldots,n# in the #x,y#-plane by the graph of a polynomial function of higher degree. If we set #y:c_1+c_2x+\cdots+c_mx^m#, the corresponding matrix equation becomes \[\matrix{1&x_1&x_1^2&\ldots &x_1^m\\1&x_2&x_2^2&\ldots& x_2^m\\\vdots&\vdots&\vdots&\ddots&\vdots\\1&x_n&x_n^2&\cdots&x_n^m}\matrix{c_1\\\vdots\\c_m}=\matrix{y_1\\y_2\\\vdots\\y_n}\] In fact, this method is not limited to polynomials. For instance, exponential functions are often used in the modelling of growth of populations.
The equation #A^{\top}A\vec{x}=A^{\top}\vec{b}# is also called the normal equation of the matrix equation #A\vec{x}=\vec{b}#.
We will prove both parts, starting with the first.
1. Let #A# be an #(m\times n)#-matrix and #\vec{b}# a vector from #\mathbb{R}^m#. We claim #\ker{A} = \ker{A^\top A}#. Clearly #\ker{A} \subseteq \ker{A^\top A}#, so to prove the assertion we only have to verify that #\ker{A^\top A} \subseteq\ker{A} #. Suppose for that purpose that #\vec{x}# belongs to #\ker{A^\top A}#. In that case \[\dotprod{(A\vec{x})}{(A\vec{x})} = \dotprod{\vec{x}}{(A^\top A\vec{x})}= \dotprod{\vec{x}}{\vec{0}}=0\] Since the inner product is positive-definite, it follows that #A\vec{x} = \vec{0}#, that is to say: #\vec{x}# belongs to #\ker{A}#. This proves the assertion #\ker{A} = \ker{A^\top A}#.
We will now prove #\im{A^\top \, A} = \im{A^\top}#. Because #\im{A^\top \, A} \subseteq \im{A^\top}# is clear, it suffices for the proof to establish the inclusion # \im{A^\top}\subseteq\im{A^\top \, A}#. This follows from the following derivation: \[\begin{array}{rcl}\text{Suppose }& \vec{x}\in \im{A^\top A}^\perp\\ \text{Then we have }&\dotprod{\vec{x}}{(A^\top A\vec{v})}=0&\text{for all }\vec{v}\in \mathbb{R}^n\\ \text{Hence also }&\dotprod{(A\vec{x})}{( A\vec{v})}=0&\text{for all }\vec{v}\in \mathbb{R}^n\\ \text{In addition }&\dotprod{(A^\top A\vec{x})}{\vec{v}}=0&\text{for all }\vec{v}\in \mathbb{R}^n\\\text{This means that }&A^\top A\vec{x}=\vec{0}&\\ \text{By the above statement this implies }& A\vec{x}=\vec{0}&\\\text{For the inner product this gives }&\dotprod{(A\vec{x})}{\vec{v}}=0&\text{for all }\vec{v}\in \mathbb{R}^n\\ \text{Consequently, }&\dotprod{\vec{x}}{(A^\top \vec{v})}=0&\text{for all }\vec{v}\in \mathbb{R}^n\\ \text{In other words, }&\vec{x}\in\im{ A^\top}^{\perp}&\end{array}\] Thus, we have established the inclusion #\im{A^\top A}^\perp\subseteq\im{ A^\top}^\perp#. According to properties of the perpendicular space we then also have \[ \im{ A^\top}={\im{ A^\top}^\perp}^\perp\subseteq {\im{A^\top A} ^\perp}^\perp = \im{A^\top A} \] The vector #A^{\top}\vec{b}# belongs to #\im{A^\top}#. Thus, there is a vector #\vec{x}# such that #A^{\top}A\vec{x}=A^{\top}\vec{b}#. This proves the first statement.
2. Suppose that #\vec{x} = \vec{\hat{x}}# is a solution of the equation #A^{\top}A\vec{x}=A^{\top}\vec{b}#. We claim that then #\vec{\hat{b}}=A\vec{\hat{x}}# is the orthogonal projection of #\vec{b}# onto #\im{A}#. To see this, we compute \[\begin{array}{rcl}\dotprod{(\vec{\hat{b}}-\vec{b})}{(A\vec{y})}&=&\dotprod{(A\vec{\hat{x}}-\vec{b})}{(A\vec{y})}\\&=&\dotprod{(A\vec{\hat{x}})}{(A\vec{y})}-\dotprod{\vec{b}}{(A\vec{y})}\\&=&\dotprod{(A^\top A\vec{\hat{x}})}{\vec{y}}-\dotprod{(A^\top\vec{b})}{\vec{y}}\\&=&\dotprod{(A^\top A\vec{\hat{x}}-A^\top\vec{b})}{\vec{y}}\\&=&\dotprod{\vec{0}}{\vec{y}}\\&=&0\end{array}\] This shows that \(\vec{\hat{b}}-\vec{b}\) is perpendicular to the image of #A#. The vector #\vec{\hat{b}} =A\vec{\hat{x}}# belongs to #\im{A}# and thus is the orthogonal projection of #\vec{b}# onto #\im{A}#. This means that, among all images of vectors under #A#, the vector #\vec{\hat{b}} # is nearest to #\vec{b}#. Therefore, the solution #\vec{x} = \vec{\hat{x}}# gives the best approximation to a solution of the equation #A\vec{x}=\vec{b}#.
The method is called least squares method because of the application to linear regression in the plane, where the problem is to find, for a given set of points, a line that best describes the set as a part of the graph of a linear function. The solution is the line that minimizes the sum over this set of points of the squares of the vertical distances of the point to the line.
Find the least squares solution to the problem:
\[
\matrix{4 & -2 \\ -1 & -3 \\ -2 & -2} \,\vec{x} = \cv{-1\\-1\\5}
\]
\(\vec{x}={}\) \(\cv{-\frac{113}{178}\\-\frac{59}{178}}\)
To find the least squares solution \(\vec{x}\) to this problem, we need to solve \(A^{\top}A \vec{x}=A^{\top}\,\vec{b}\), where
\[
A=\matrix{4 & -2 \\ -1 & -3 \\ -2 & -2} \phantom{xx} \text{ and } \phantom{xx}
\vec{b}= \cv{-1\\-1\\5}
\] We calculate \(A^{\top}A\) and \(A^{\top}\,\vec{b}\):
\[\begin{aligned}
A^{\top}A &= \matrix{4 & -1 & -2 \\ -2 & -3 & -2}\matrix{4 & -2 \\ -1 & -3 \\ -2 & -2}\\
&= \matrix{4\cdot4-1\cdot-1-2\cdot-2 & 4\cdot-2-1\cdot-3-2\cdot-2 \\ 4\cdot-2-1\cdot-3-2\cdot-2 & -2\cdot-2-3\cdot-3-2\cdot-2}\\
&= \matrix{21 & -1 \\ -1 & 17}
\end{aligned}\] and
\[\begin{aligned}
A^{\top}\,\vec{b} &= \matrix{4 & -1 & -2 \\ -2 & -3 & -2} \cv{-1\\-1\\5}\\
&= \matrix{4\cdot-1-1\cdot-1+5\cdot-2 \\ -2\cdot-1-3\cdot-1-2\cdot5}\\
&= \matrix{-13 \\ -5}
\end{aligned}\] Therefore, the augmented matrix corresponding to \(A^{\top}A\vec{x}=A^{\top}\,\vec{b}\) is
\[\left(\begin{array}{cc|c}
21 & -1 & -13 \\
-1 & 17 & -5
\end{array}\right)\] Solving this with
row reduction, we find
\[\begin{array}[t]{ll}
\left(\begin{array}{cc|c}
21 & -1 & -13 \\
-1 & 17 & -5 \\
\end{array}\right)
&
\begin{array}[t]{ll} \sim\left(\begin{array}{rr|r} 1 & -{{1}\over{21}} & -{{13}\over{21}} \\ -1 & 17 &
-5 \end{array}\right) & \\\\ \sim\left(\begin{array}{rr|r} 1 & -{{1}\over{21}} & -{{13}\over{21}} \\ 0 & {{
356}\over{21}} & -{{118}\over{21}} \end{array}\right) & \\\\ \sim\left(\begin{array}{rr|r} 1 & -{{1}\over{21
}} & -{{13}\over{21}} \\ 0 & 1 & -{{59}\over{178}} \end{array}\right) & \\\\ \sim\left(\begin{array}{rr|r} 1
& 0 & -{{113}\over{178}} \\ 0 & 1 & -{{59}\over{178}} \end{array}\right) \end{array}
\end{array}\] Hence, \(\vec{x}=\cv{-\frac{113}{178}\\-\frac{59}{178}}\)
Because the matrix #A^{\top}A# is invertible, we also could have calculated the least squares solution by inverting #A^{\top}A# and performing the calculations as presented in the theory.