
Linear Regression

1. Notations

  • Let $p\in\mathbb{N}^*$ be the number of features
  • Let $n\in\mathbb{N}^*$ be the number of observations
  • Let $(X_1,Y_1),\dots ,(X_n,Y_n)\in\mathbb{R}^p\times \mathbb{R}$ be the observations
  • Let $X=\begin{pmatrix}1&X_1^T\\1&X_2^T\\\vdots&\vdots\\1&X_n^T\end{pmatrix},\quad Y=\begin{pmatrix}Y_1\\Y_2\\ \vdots \\Y_n\end{pmatrix}$

We want to find the best vector $\beta\in\mathbb{R}^{p+1}$ that minimizes the distance:

$$\lVert Y-X\beta \rVert_2$$

2. Value of $\beta$

2.1 General Case

Minimizing $\lVert Y-X\beta \rVert_2$ is the same as minimizing $\lVert Y-X\beta \rVert_2^2$, so we will work with the latter problem.

First of all, we search for the values of $\beta$ at which the derivative vanishes:

$$\begin{align*} \frac{\partial \lVert Y-X\beta \rVert_2^2}{\partial\beta}&=\frac{\partial \lVert Y-X\beta \rVert_2^2}{\partial (Y-X\beta)}\frac{\partial (Y-X\beta)}{\partial\beta}\\&=-2(Y-X\beta)^TX\\ \frac{\partial \lVert Y-X\beta \rVert_2^2}{\partial\beta}=0&\iff -2(Y-X\beta)^TX=0\\ &\iff X^T(Y-X\beta)=0\\ &\iff X^TX\beta=X^TY\\ &\implies \beta=(X^TX)^+X^TY \text{ is a solution, where } {}^+ \text{ is the pseudoinverse} \\ &\implies \beta=X^+Y \text{ is a solution} \end{align*}$$

$\lVert Y-X\beta \rVert_2^2$ is a convex quadratic function of $\beta$, so this value of $\beta$ minimizes the distance.

$$\boxed{\beta=(X^TX)^+X^TY}$$
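As a quick numerical check of the boxed formula (a minimal sketch with synthetic data; the variable names are illustrative, not from these notes), $(X^TX)^+X^TY$ can be compared against a standard least-squares solver:

```python
# Minimal sketch: verify beta = (X^T X)^+ X^T Y against NumPy's least-squares solver.
# Synthetic data; "beta_true" and the noise scale are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
features = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5, 3.0])          # [beta_0, beta_1, ..., beta_p]
X = np.hstack([np.ones((n, 1)), features])           # design matrix with a column of 1s
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta_pinv = np.linalg.pinv(X.T @ X) @ X.T @ Y        # (X^T X)^+ X^T Y
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)   # argmin ||Y - X beta||_2

print(np.allclose(beta_pinv, beta_lstsq))            # True
```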

2.2 Probabilistic Approach: $p=1$

For this case, we can even approach the problem probabilistically.

We will treat $y$ and $x$ as random variables.

  • Let $\beta_0,\beta_1\in\mathbb{R}/\quad y=\beta_0+\beta_1 x+\varepsilon$
  • We will assume that $\varepsilon \sim \mathcal{N}(0,\sigma^2)$ is independent of $x$

We have:

$$\begin{cases} \text{Cov}[x,y]=\beta_1\mathbb{V}[x]\\ \mathbb{E}[y]=\beta_0+\beta_1\mathbb{E}[x] \end{cases} \implies \begin{cases} \beta_1=\frac{\text{Cov}[x,y]}{\mathbb{V}[x]}\\ \beta_0=\mathbb{E}[y]-\beta_1\mathbb{E}[x] \end{cases}$$
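As a small illustration (synthetic data and illustrative names, not part of the notes), these two formulas can be estimated from samples and compared with `np.polyfit`:

```python
# Sketch: estimate beta_1 = Cov[x, y] / V[x] and beta_0 = E[y] - beta_1 E[x]
# from samples, then compare with an off-the-shelf simple-regression fit.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.7 + 2.0 * x + rng.normal(scale=0.3, size=500)     # y = beta_0 + beta_1 x + eps

beta_1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta_0 = y.mean() - beta_1 * x.mean()

b1_ref, b0_ref = np.polyfit(x, y, deg=1)                # least-squares slope and intercept
print(np.allclose([beta_0, beta_1], [b0_ref, b1_ref]))  # True
```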

This approach can be extended to $p>1$, as we do in the next subsection.

2.3 Probabilistic Approach: General Case

We will treat $y,x_1,\dots,x_p$ as random variables.

  • Let $\bold{x}=\begin{pmatrix}x_1 \\ \vdots \\ x_p\end{pmatrix}$
  • Let $\beta \in\mathbb{R}^p,\beta_0\in\mathbb{R}/\quad y=\langle\beta,\bold{x}\rangle+\beta_0 + \varepsilon$
  • Furthermore, we will assume that $\varepsilon \sim \mathcal{N}(0,\sigma^2)$ is independent of all $x_i$
  • Let $C=\mathbb{E}\left[\left(\bold{x}-\mathbb{E}[\bold{x}]\right)\left(\bold{x}-\mathbb{E}[\bold{x}]\right)^T\right]$ be the covariance matrix of $\bold{x}$
  • Let $w=\begin{pmatrix}\text{Cov}[x_1,y]\\ \vdots \\ \text{Cov}[x_p,y]\end{pmatrix}$ be the cross-covariance between $\bold{x}$ and $y$.

First of all, we will calculate $\beta$. Since $y=\langle\beta,\bold{x}\rangle+\beta_0+\varepsilon$ and $\varepsilon$ is independent of every $x_i$, taking the covariance of both sides with each $x_i$ gives:

$$\begin{align*} \forall i\in\{1,\dots,p\},\quad\text{Cov}[x_i,y]&=\sum_{j=1}^p\beta_j\text{Cov}[x_i,x_j]\\ \iff C\beta&=w \\ \implies \beta&=C^+w \text{ is a solution} \end{align*}$$

For $\beta_0$:

$$\beta_0=\mathbb{E}[y]-\langle\beta,\mathbb{E}[\bold{x}]\rangle$$

As a conclusion:

$$\boxed{\begin{cases}\beta=C^+w\\ \beta_0=\mathbb{E}[y]-\langle \beta,\mathbb{E}[\bold{x}]\rangle\end{cases}}$$

Now, the knowledge of $\beta_0,\beta$ requires the explicit knowledge of $C,w,\mathbb{E}[\bold{x}],\mathbb{E}[y]$, which is almost never available in practice.

So we will estimate $\beta_0,\beta$ by estimating those statistical parameters:

$$\boxed{\begin{cases}\hat{\beta}=\hat{C}^+\hat{w}\\ \hat{\beta}_0=\hat{\mu}(y)-\langle \hat{\beta},\hat{\mu}(\bold{x})\rangle\end{cases}}$$

Suppose we have $n$ independent samples $(\bold{x}_1,y_1),\dots,(\bold{x}_n,y_n)$ of $(\bold{x},y)$, treated as random variables.

If we use the appropriate empirical estimators (sample means and sample covariances), this formula reduces to the least-squares solution of Section 2.1.
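For concreteness, here is a sketch of that reduction (synthetic data, illustrative variable names): the plug-in estimators $\hat{C}^+\hat{w}$ and $\hat{\mu}(y)-\langle\hat{\beta},\hat{\mu}(\bold{x})\rangle$ are compared against the design-matrix solution of Section 2.1.

```python
# Sketch: beta_hat = C_hat^+ w_hat and beta0_hat = mean(y) - <beta_hat, mean(x)>,
# compared against the pseudoinverse solution on the design matrix with an intercept column.
import numpy as np

rng = np.random.default_rng(2)
n, p = 1000, 4
x = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = x @ beta_true + 0.25 + rng.normal(scale=0.2, size=n)

C_hat = np.cov(x, rowvar=False, ddof=1)                                  # sample covariance of x
w_hat = np.array([np.cov(x[:, j], y, ddof=1)[0, 1] for j in range(p)])   # sample Cov[x_j, y]
beta_hat = np.linalg.pinv(C_hat) @ w_hat
beta0_hat = y.mean() - beta_hat @ x.mean(axis=0)

X = np.hstack([np.ones((n, 1)), x])
beta_ls = np.linalg.pinv(X) @ y                                          # [beta_0, beta_1, ..., beta_p]
print(np.allclose(np.r_[beta0_hat, beta_hat], beta_ls))                  # True
```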

3. Significance of the model

Let's call $\mathcal{M}$ our linear model

Assumption: the relation between $y$ and $\bold{x}$ is linear

We will use the $\mathcal{F}$-test.

3.1 Null Hypothesis: $H_0:\beta_i=0\quad\forall i>0$

This null hypothesis implies that $y$ is a constant function of $\bold{x}$

We will statistically test this hypothesis using ANOVA

3.2 ANOVA

Theorem

If the null hypothesis is true then:

$$Z=\frac{\tfrac{(y^*- \bar{y})^T(y^*- \bar{y})}{p}}{\tfrac{(y- y^*)^T(y- y^*)}{n-p-1}}\sim\mathcal{F}(p,n-1-p)$$

where $y^*$ denotes the vector of fitted values, $y^*_i=\langle\beta,\bold{x}_i\rangle+\beta_0$.

Let:

$$\begin{cases}\text{FSS} = \sum_{i=1}^n(y^*_i-\bar{y})^2\\ \text{RSS} = \sum_{i=1}^n(y_i-y_i^*)^2\\ \text{TSS} = \sum_{i=1}^n(y_i-\bar{y})^2 = \text{FSS} +\text{RSS} \end{cases}$$

We say we reject the null hypothesis at confidence level $(1-\alpha)\%$, where:

$$\begin{cases} f=\dfrac{\text{FSS}/p}{\text{RSS}/(n-p-1)}\\ \alpha=\mathcal{P}(Z\ge f) \end{cases}$$

3.3 Significance

Assuming a linear dependence between the variables, rejecting $H_0$ suggests that, with confidence $(1-\alpha)\%$, $y$ is not a constant function of $\bold{x}$.
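A minimal sketch of this test on synthetic data (variable names are mine; `scipy.stats.f` is assumed available for the tail probability of the $\mathcal{F}$ distribution):

```python
# Sketch: fit the model, form FSS and RSS, compute f and the tail probability P(Z >= f).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 60, 2
x = rng.normal(size=(n, p))
y = 1.0 + 0.8 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.5, size=n)

X = np.hstack([np.ones((n, 1)), x])
beta_hat = np.linalg.pinv(X) @ y
y_star = X @ beta_hat                         # fitted values y*

FSS = np.sum((y_star - y.mean()) ** 2)
RSS = np.sum((y - y_star) ** 2)
f = (FSS / p) / (RSS / (n - p - 1))
alpha = stats.f.sf(f, p, n - p - 1)           # P(Z >= f) under H_0

print(f, alpha)                               # reject H_0 when alpha is small
```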

4. Confidence interval of the prediction

Assumption: the relation between $y$ and $\bold{x}$ is linear

We will use Student's $t$-test.

4.1 Confidence Interval of parameters

Let $\hat{\beta}$ be the least-squares estimator of $\beta$. We have:

$$\boxed{\forall i\in\{0,\dots,p\},\quad T_i=\frac{\hat{\beta}_i-\beta_i}{\hat{\sigma}_*\sqrt{\left(\left(X^TX\right)^{-1}\right)_{i,i}}}\sim \mathcal{T}_{n-1-p}}$$

Where $\hat{\sigma}^2_*$ is an unbiased estimator of $\sigma^2=\mathbb{V}[\varepsilon]=\mathbb{V}[y-y^*]$, with $y^*=\langle \beta,\bold{x}\rangle+\beta_0=y-\varepsilon$. It is equal to:

$$\boxed{\hat{\sigma}^2_*=\frac{\text{RSS}}{n-1-p}}$$
  1. For $i\in\{0,\dots,p\}$
  2. Let $t\in \mathbb{R}_+$
  3. Let $\gamma\in\mathbb{R}_+/\quad \mathcal{P}(\lvert T_i\rvert \ge t)=\gamma$

We say that $\beta_i=\hat{\beta}_i\pm t\,\hat{\sigma}_*\sqrt{\left(\left(X^TX\right)^{-1}\right)_{i,i}}$ with confidence $(1-\gamma)\%$.
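A short sketch of these intervals on synthetic data (illustrative names; `scipy.stats.t` is assumed for the Student quantile):

```python
# Sketch: t-based intervals beta_i = beta_hat_i +/- t * sigma_hat * sqrt(((X^T X)^{-1})_{ii}).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 80, 2
x = rng.normal(size=(n, p))
y = 0.5 + 1.5 * x[:, 0] - 2.0 * x[:, 1] + rng.normal(scale=0.4, size=n)

X = np.hstack([np.ones((n, 1)), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
RSS = np.sum((y - X @ beta_hat) ** 2)
sigma_hat = np.sqrt(RSS / (n - p - 1))        # sqrt of the unbiased variance estimate

gamma = 0.05                                  # 95% confidence
t = stats.t.ppf(1 - gamma / 2, df=n - p - 1)
half_width = t * sigma_hat * np.sqrt(np.diag(XtX_inv))
for i, (b, h) in enumerate(zip(beta_hat, half_width)):
    print(f"beta_{i}: {b:.3f} +/- {h:.3f}")
```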

4.2 Confidence Interval of prediction

$$\boxed{y^*= \hat{y}^*\pm t\,\hat{\sigma}_*\sqrt{\begin{pmatrix}1\\\bold{x}\end{pmatrix}^T\left(X^TX\right)^{-1}\begin{pmatrix}1\\\bold{x}\end{pmatrix}}=\langle\hat{\beta},\bold{x}\rangle+\hat{\beta}_0\pm t\,\hat{\sigma}_*\sqrt{\begin{pmatrix}1\\\bold{x}\end{pmatrix}^T\left(X^TX\right)^{-1}\begin{pmatrix}1\\\bold{x}\end{pmatrix}}}$$
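A small helper for this interval (the function and its arguments are my own naming, not from the notes), assuming $X$, $\hat{\beta}$ and $\hat{\sigma}_*$ have been computed as in the previous sketch:

```python
# Sketch: interval for y* = <beta, x> + beta_0 at a new point x_new.
import numpy as np
from scipy import stats

def mean_response_interval(X, beta_hat, sigma_hat, x_new, gamma=0.05):
    n, k = X.shape                                    # k = p + 1 (intercept + features)
    t = stats.t.ppf(1 - gamma / 2, df=n - k)          # two-sided Student quantile
    x_aug = np.r_[1.0, np.asarray(x_new)]             # (1, x)
    y_hat = x_aug @ beta_hat
    half = t * sigma_hat * np.sqrt(x_aug @ np.linalg.inv(X.T @ X) @ x_aug)
    return y_hat - half, y_hat + half
```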

4.3 Case of simple regression: $p=1$

  1. The confidence interval of $\beta_0$ is:

    $$\beta_0=\hat{\beta}_0\pm t\,\hat{\sigma}_*\sqrt{\frac{1}{n}+\frac{\bar{x}^2}{\text{ss}(x)}}$$
  2. The confidence interval of $\beta_1$ is:

    $$\beta_1=\hat{\beta}_1\pm t\,\frac{\hat{\sigma}_*}{\sqrt{\text{ss}(x)}}$$
  3. The confidence interval of the prediction $y^*$ at a new point $x_0$ is:

    $$y^*=\hat{\beta}_1x_0+\hat{\beta}_0\pm t\,\hat{\sigma}_*\sqrt{\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\text{ss}(x)}}$$

Here $\text{ss}(x)=\sum_{i=1}^n(x_i-\bar{x})^2$ and, as above, $\hat{\sigma}^2_*=\frac{\text{RSS}}{n-2}$ for $p=1$.
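A quick numerical sanity check (synthetic data, illustrative names) that these $p=1$ closed forms agree with the diagonal entries of $(X^TX)^{-1}$ used in the general formulas:

```python
# Sketch: compare the p = 1 closed-form standard errors with the matrix expressions.
import numpy as np

rng = np.random.default_rng(5)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
b0, b1 = XtX_inv @ X.T @ y
sigma_hat = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
ss_x = np.sum((x - x.mean()) ** 2)

se0_closed = sigma_hat * np.sqrt(1 / n + x.mean() ** 2 / ss_x)   # for beta_0
se1_closed = sigma_hat / np.sqrt(ss_x)                           # for beta_1
se0_matrix, se1_matrix = sigma_hat * np.sqrt(np.diag(XtX_inv))

x0 = 1.3                                                         # a new point
sep_closed = sigma_hat * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / ss_x)
sep_matrix = sigma_hat * np.sqrt(np.array([1.0, x0]) @ XtX_inv @ np.array([1.0, x0]))

print(np.allclose([se0_closed, se1_closed, sep_closed],
                  [se0_matrix, se1_matrix, sep_matrix]))         # True
```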