Logistic Regression

Logistic Regression is a kind of classifier. Unlike Naive Bayes, it is based on the discrimated model and use a sigmoid function $\sigma(x)=\frac{1}{1+e^{-x}}$ to deal with two classification problem. For a sample $x\in R^d$, it’s label is $y\in\{0, 1\}$, and $P(y|x,w)=\sigma(w^Tx)^y(1-\sigma(w^Tx))^{1-y}$ is the probability that $x$’s label is $y$, $w\in R^d$ is the parameter of this model.

Given a set of labeled data $D={(x_1, y_1), (x_2, y_2) … (x_n, y_n)}$ and $x_i\in R^d$, $y_i\in\{0,1\}$. So, $P(D|w)=\prod_{i=1}^{n}P(y_i|x_i,w)=\prod_{i=1}^{n}\alpha_i^{y_i}(1-\alpha_i)^{1-y_i}$, here $\alpha_i=\frac{1}{1+e^{-w^Tx_i}}$. Define $L(w)=-\log P(D|w) = \sum_{i=1}^{n}y_i\log\alpha_i+\sum_{i=1}^{n}(1-y_i)\log(1-\alpha_i)$ and Based on the MLE $w^*=argmax_w\log P(D|w)=argmin_wL(w)$. We will use Newton method to solve this problem.
$$\frac{\partial}{\partial w_j}\log \alpha_i= \frac{\partial}{\partial w_j}\{-\log(1+e^{-w^Tx_i})\}=\frac{x_{ij}e^{-w^Tx_i}}{1+e^{-w^Tx_i}}\\ \frac{\partial}{\partial w_j}\log(1-\alpha_i)=\frac{\partial}{\partial w_j}\log\frac{e^{-w^Tx_i}}{1+e^{-w^Tx_i}}=\frac{\partial}{\partial w_j}\{\log e^{-w^Tx_i}-\log(1+e^{-w^Tx_i})\}\\ =-x_{ij}+\frac{x_{ij}e^{-w^Tx_i}}{1+e^{-w^Tx_i}}=-\frac{x_{ij}}{1+e^{-w^Tx_i}}\\ \frac{\partial}{\partial w_j}L(w)=-\sum_{i=1}^{n}\{y_i\frac{x_{ij}e^{-w^Tx_i}}{1+e^{-w^Tx_i}}+(1-y_i)\frac{-x_{ij}}{1+e^{-w^Tx_i}}\}=-\sum_{i=1}^{n}\{x_{ij}\frac{y_i(1+e^{-w^Tx_i})-1}{1+e^{-w^Tx_i}}\}\\ =-\sum_{i=1}^{n}x_{ij}(y_i-\alpha_i)=\sum_{i=1}^{n}x_{ij}(\alpha_i-y_i)$$
Define matrix $A=[x_{ij}]$, $B=diag(\alpha_i(1-\alpha_i))$, $i\in[1…n]$, $j\in[1…d]$. So, we can have

$$\begin{matrix}A=\begin{bmatrix} x_{11} & … & x_{1d}\\ \vdots & \ddots &\vdots \\ x_{n1} & … & x_{nd}\end{bmatrix}& B=\begin{bmatrix} \alpha_1(1-\alpha_1) & … & 0\\ \vdots & \ddots &\vdots \\ 0 & … & \alpha_n(1-\alpha_n)\end{bmatrix}\end{matrix}$$

And $\triangledown_wL(w)=A^T(\alpha-Y)$, $\alpha=[\alpha_1 … \alpha_n]^T$ and $Y=[y_1 … y_n]^T$
$$\frac{\partial^2}{\partial w_j \partial w_k}L(w)=\frac{\partial}{\partial w_k}\sum_{i=1}^{n}x_{ij}(\alpha_i-y_i)=\sum_{i=1}^{n}x_{ij}\frac{\partial}{\partial w_k}\alpha_i\\ \frac{\partial}{\partial w_k}\alpha_i = \frac{\partial}{\partial w_k}\frac{1}{1+e^{-w^Tx_i}}=\frac{x_{ik}e^{-w^Tx_i}}{(1+e^{-w^Tx_i})^2}=x_{ik}\alpha_i(1-\alpha_i)\\ \frac{\partial^2}{\partial w_j \partial w_k}L(w)=\sum_{i=1}^{n}x_{ij}x_{ik}\alpha_i(1-\alpha_i)=A^TBA$$
As $\triangledown_w^2L(w)=A^TBA=A^TB^{\frac{1}{2}}B^{\frac{1}{2}}A=(B^{\frac{1}{2}}A)^T(B^{\frac{1}{2}}A)$, the $\triangledown_w^2L(w)$ is positive definite and it should have a local minimum value. We can get the $w$ by iteration.

$$w_{new} = w_{old} - (\triangledown_w^2L(w))^{-1}\triangledown_wL(w)$$

Directory