SVM

Machine Learning

SVM (Support Vector Machine) is a powerful classifier. We can use SVM to find a hyperplane that divides the sample space into two parts: the samples on one side are labeled positive (+) and the samples on the other side are labeled negative (-). SVM aims to maximize the distance between the two margin boundaries $L+$ and $L-$, the parallel planes passing through the closest positive and negative samples.

For the positive samples ($y=+1$) we have $w^Tx+b\geq 1$, and for the negative samples ($y=-1$) we have $w^Tx+b\leq -1$. So for all samples, $y(w^Tx+b)\geq 1$. The distance between $L+$ and $L-$ is $\frac{2}{\left\|w\right\|}$, so maximizing this distance is equivalent to minimizing $\left\|w\right\|^2$. In some situations it is hard to divide the positive and negative samples with a plane. In this case, we can introduce a slack variable $\xi_i$ for each sample (the soft margin). The objective function then becomes ($C$ is the penalty factor and $n$ is the number of samples):
$$\min_{w,b,\xi}~~\frac{1}{2}\left\|w\right\|^2 + C\sum_{i=1}^{n}\xi_i\\
s.t.~~~~y_i(w^Tx_i+b)\geq 1-\xi_i\\
\xi_i \geq 0\\
i=1,2,\dots,n$$
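As a quick aside, the optimal slack for a given $(w,b)$ is $\xi_i=\max(0,\,1-y_i(w^Tx_i+b))$, so the soft-margin objective can be evaluated directly for any candidate hyperplane. Below is a minimal sketch in Python (NumPy); the names `X`, `y`, `w`, `b`, `C` are illustrative and not part of the derivation.

```python
import numpy as np

def primal_objective(w, b, X, y, C):
    """Soft-margin objective 0.5*||w||^2 + C*sum(xi_i), where
    xi_i = max(0, 1 - y_i*(w^T x_i + b)) is the smallest feasible slack."""
    margins = y * (X @ w + b)            # y_i * (w^T x_i + b) for every sample
    xi = np.maximum(0.0, 1.0 - margins)  # slack needed to satisfy the constraint
    return 0.5 * np.dot(w, w) + C * np.sum(xi)

# toy usage
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(primal_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))
```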
Applying Lagrange multipliers, we have:
$$L(w,b,\xi,\alpha,\beta)=\frac{1}{2}\left\|w\right\|^2+\sum_{i=1}^{n}\alpha_i[1-\xi_i-y_i(w^Tx_i+b)]+C\sum_{i=1}^{n}\xi_i+\sum_{i=1}^{n}\beta_i(-\xi_i)\\
\frac{\partial}{\partial w}L(w,b,\xi,\alpha,\beta)=w-\sum_{i=1}^{n}\alpha_iy_ix_i=0~~~~\Rightarrow~~~~w=\sum_{i=1}^{n}\alpha_iy_ix_i\\
\frac{\partial}{\partial b}L(w,b,\xi,\alpha,\beta)=-\sum_{i=1}^{n}\alpha_iy_i=0\\
\frac{\partial}{\partial \xi_i}L(w,b,\xi,\alpha,\beta)=-\alpha_i-\beta_i+C=0,~~\because \beta_i\geq 0~~~~\Rightarrow~~~~0\leq\alpha_i\leq C\\
s.t.~~~~\alpha_i\geq 0,~~~~\beta_i\geq 0,~~~~\xi_i\geq 0\\
\alpha_i[1-\xi_i-y_i(w^Tx_i+b)]=0,~~~~\beta_i\xi_i=0$$
Substituting these conditions back into $L(w,b,\xi,\alpha,\beta)$, we obtain the dual problem:
$$L(w,b,\xi,\alpha,\beta)=\sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_jy_iy_jx_ix_j\\
s.t.~~~~0\leq\alpha_i\leq C,~~~\beta_i\geq 0~~~\xi_i\geq 0\\
\alpha_i[1-\xi_i-y_i(w^Tx_i+b)]=0\\
\beta_i\xi_i=0\\
\sum_{i=1}^{n}\alpha_iy_i=0\\
w=\sum_{i=1}^{n}\alpha_iy_ix_i\\
C=\alpha_i+\beta_i$$
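For reference, the dual objective and the recovery of $w$ from the multipliers take only a few lines; this is just a sketch for the linear kernel $K_{ij}=x_ix_j$, and the variable names are assumptions.

```python
import numpy as np

def dual_objective(alpha, y, K):
    """sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j K_ij"""
    v = alpha * y
    return np.sum(alpha) - 0.5 * v @ K @ v

def w_from_alpha(alpha, y, X):
    """w = sum_i alpha_i y_i x_i (only meaningful for the linear kernel)."""
    return (alpha * y) @ X

# toy usage; note that sum_i alpha_i y_i = 0 must hold for a feasible alpha
X = np.array([[2.0, 2.0], [-1.0, -1.5]])
y = np.array([1.0, -1.0])
alpha = np.array([0.1, 0.1])
K = X @ X.T
print(dual_objective(alpha, y, K), w_from_alpha(alpha, y, X))
```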
When $\alpha_i=0$, sample $i$ is not a support vector and $y_i(w^Tx_i+b)\geq 1$.
When $\alpha_i \in (0,C)$, $\beta_i>0$ and $\xi_i=0$, so $y_i(w^Tx_i+b)=1$: sample $i$ is a support vector lying exactly on the margin, and these samples determine the hyperplane.
When $\alpha_i=C$, $\beta_i=0$ and $\xi_i\geq 0$, so $y_i(w^Tx_i+b)\leq 1$: sample $i$ lies on or inside the margin (and may even be misclassified when $\xi_i>1$).
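These three cases can be checked mechanically once the multipliers are known. A small sketch (the array name `alpha` and the tolerance `tol` are assumptions):

```python
import numpy as np

def categorize_samples(alpha, C, tol=1e-8):
    """Split sample indices by the value of alpha_i:
    alpha_i == 0     -> not a support vector, y_i*(w^T x_i + b) >= 1
    0 < alpha_i < C  -> support vector exactly on the margin
    alpha_i == C     -> on or inside the margin (xi_i may be > 0)"""
    alpha = np.asarray(alpha, dtype=float)
    non_sv    = np.where(alpha <= tol)[0]
    margin_sv = np.where((alpha > tol) & (alpha < C - tol))[0]
    bound_sv  = np.where(alpha >= C - tol)[0]
    return non_sv, margin_sv, bound_sv

print(categorize_samples([0.0, 0.3, 1.0, 0.0], C=1.0))
```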

We can use SMO (Sequential Minimal Optimization) to solve this problem. Select a pair $\alpha_a$ and $\alpha_b$ and keep the other multipliers fixed. From $\alpha_ay_a+\alpha_by_b+\sum_{i\neq a,i\neq b}\alpha_i^*y_i = 0$ we get $\alpha_ay_a+\alpha_by_b = \alpha_a^*y_a+\alpha_b^*y_b = -\sum_{i\neq a,~ i\neq b}\alpha_i^*y_i = -\sum_{i\neq a,~ i\neq b}\alpha_iy_i=M$, where $\alpha_i^*$ denotes the value of $\alpha_i$ before the update.
$$L=\alpha_a+\alpha_b-\alpha_ay_ax_a\sum_{i\neq a,i\neq b}\alpha_iy_ix_i-\alpha_by_bx_b\sum_{i\neq a,i\neq b}\alpha_iy_ix_i-\alpha_ay_ax_a\alpha_by_bx_b\\
-\frac{1}{2}\alpha_a^2x_ax_a-\frac{1}{2}\alpha_b^2x_bx_b+const\\
=y_a(M-\alpha_by_b)+\alpha_b-(M-\alpha_by_b)x_a\sum_{i\neq a,i\neq b}\alpha_iy_ix_i-\alpha_by_bx_b\sum_{i\neq a,i\neq b}\alpha_iy_ix_i\\
-\alpha_b(M-\alpha_by_b)y_bx_ax_b-\frac{1}{2}(M-\alpha_by_b)^2x_ax_a-\frac{1}{2}\alpha_b^2x_bx_b+const$$
Set $x_ax_a=K_{aa}$, $x_ax_b=K_{ab}$, $x_bx_b=K_{bb}$ and $y_ay_b=S$. The terms independent of $\alpha_a$ and $\alpha_b$ are absorbed into $const$.
$$L=-S\alpha_b+\alpha_b+\alpha_by_bx_a\sum_{i\neq a,i\neq b}\alpha_iy_ix_i-\alpha_by_bx_b\sum_{i\neq a,i\neq b}\alpha_iy_ix_i-M\alpha_by_bx_ax_b\\
+\alpha_b^2x_ax_b+M\alpha_by_bx_ax_a-\frac{1}{2}\alpha_b^2x_ax_a-\frac{1}{2}\alpha_b^2x_bx_b+const\\
=-\frac{1}{2}(K_{aa}+K_{bb}-2K_{ab})\alpha_b^2+(1-S)\alpha_b+(\alpha_by_bx_a-\alpha_by_bx_b)\sum_{i\neq a,i\neq b}\alpha_iy_ix_i\\
-M\alpha_by_bK_{ab}+M\alpha_by_bK_{aa}+const$$
Because $M=\alpha_a^*y_a+\alpha_b^*y_b$ and $w=\sum_{i=1}^{n}\alpha_i^*y_ix_i=\alpha_a^*y_ax_a+\alpha_b^*y_bx_b+\sum_{i\neq a,i\neq b}\alpha_iy_ix_i$ (here $w$ is the weight vector before the update), we have $\sum_{i\neq a,i\neq b}\alpha_iy_ix_i=w-\alpha_a^*y_ax_a-\alpha_b^*y_bx_b$. Set $\eta=K_{aa}+K_{bb}-2K_{ab}$.
$$\frac{\partial}{\partial \alpha_b}L=-\eta\alpha_b+(1-S)+(y_bx_a-y_bx_b)(w-\alpha_a^*y_ax_a-\alpha_b^*y_bx_b)\\
-(\alpha_a^*y_a+\alpha_b^*y_b)(y_bK_{ab}-y_bK_{aa})=0\\
-\eta\alpha_b+(1-S)+y_b(x_a-x_b)w+\eta\alpha_b^*=0\\
\alpha_b = \alpha_b^*+\frac{(1-S)+y_b(x_a-x_b)w}{\eta}\\
(1-S)+y_b(x_a-x_b)w=y_b[y_b-y_a+(x_a-x_b)w]$$
Set $u_a=wx_a+b$ and $u_b=wx_b+b$, then $(x_a-x_b)w=u_a-u_b$
$$\alpha_b = \alpha_b^*+\frac{y_b(y_b-y_a+u_a-u_b)}{\eta}$$
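In code, the unclipped update only needs $\eta$ and the predictions $u_a$, $u_b$. A sketch assuming $\eta>0$ (the degenerate case $\eta\leq 0$ is handled separately in full SMO implementations); all names are illustrative:

```python
def compute_eta(K_aa, K_bb, K_ab):
    """eta = K_aa + K_bb - 2*K_ab."""
    return K_aa + K_bb - 2.0 * K_ab

def unclipped_alpha_b(alpha_b_old, y_a, y_b, u_a, u_b, eta):
    """alpha_b = alpha_b* + y_b * ((y_b - y_a) + (u_a - u_b)) / eta,
    i.e. alpha_b* + y_b * (E_a - E_b) / eta with E_i = u_i - y_i."""
    return alpha_b_old + y_b * ((y_b - y_a) + (u_a - u_b)) / eta
```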
Since $\alpha_a, \alpha_b \in [0,C]$ and $y_a\alpha_a+y_b\alpha_b=M$, we need to restrict the value of $\alpha_b$.
When $y_a=1$ and $y_b=-1$, $M=\alpha_a^*-\alpha_b^*$, so $\alpha_b^*-\alpha_a^*\leq \alpha_b \leq C+\alpha_b^*-\alpha_a^*$;
when $y_a=-1$ and $y_b=1$, $M=-\alpha_a^*+\alpha_b^*$, so $\alpha_b^*-\alpha_a^*\leq \alpha_b \leq C+\alpha_b^*-\alpha_a^*$;
when $y_a=1$ and $y_b=1$, $M=\alpha_a^*+\alpha_b^*$, so $\alpha_a^*+\alpha_b^*-C\leq \alpha_b \leq \alpha_a^*+\alpha_b^*$;
when $y_a=-1$ and $y_b=-1$, $M=-\alpha_a^*-\alpha_b^*$, so $\alpha_a^*+\alpha_b^*-C\leq \alpha_b \leq \alpha_a^*+\alpha_b^*$.
So when $y_a\neq y_b$, $Low=\max(0,\alpha_b^*-\alpha_a^*)$ and $High=\min(C,C+\alpha_b^*-\alpha_a^*)$;
when $y_a= y_b$, $Low=\max(0,\alpha_a^*+\alpha_b^*-C)$ and $High=\min(C,\alpha_a^*+\alpha_b^*)$. So
$$\alpha_b=\left\{\begin{matrix}
Low & \alpha_b \leq Low\\
\alpha_b & Low<\alpha_b < High\\
High & \alpha_b \geq High
\end{matrix}\right.$$
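The bounds and the clipping step translate directly into code; a minimal sketch with assumed names:

```python
def clip_bounds(alpha_a_old, alpha_b_old, y_a, y_b, C):
    """[Low, High] interval for alpha_b implied by 0 <= alpha_a, alpha_b <= C."""
    if y_a != y_b:
        low  = max(0.0, alpha_b_old - alpha_a_old)
        high = min(C, C + alpha_b_old - alpha_a_old)
    else:
        low  = max(0.0, alpha_a_old + alpha_b_old - C)
        high = min(C, alpha_a_old + alpha_b_old)
    return low, high

def clip(alpha_b, low, high):
    """Clip the unclipped alpha_b back into [Low, High]."""
    return min(max(alpha_b, low), high)
```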

After updating $\alpha_b$, update $\alpha_a$ through the relationship $\alpha_ay_a+\alpha_by_b = -\sum_{i\neq a,~ i\neq b}\alpha_i^*y_i$, and update the weight vector by $w_{new} = w+y_a(\alpha_a-\alpha_a^*)x_a+y_b(\alpha_b-\alpha_b^*)x_b$.
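These two updates are one line each; a sketch for the linear kernel (where $w$ is stored explicitly), with assumed names:

```python
def update_alpha_a(alpha_a_old, alpha_b_old, alpha_b_new, y_a, y_b):
    """Keep y_a*alpha_a + y_b*alpha_b constant:
    alpha_a = alpha_a* + y_a*y_b*(alpha_b* - alpha_b)."""
    return alpha_a_old + y_a * y_b * (alpha_b_old - alpha_b_new)

def update_w(w, x_a, x_b, y_a, y_b,
             alpha_a_old, alpha_a_new, alpha_b_old, alpha_b_new):
    """w_new = w + y_a*(alpha_a - alpha_a*)*x_a + y_b*(alpha_b - alpha_b*)*x_b."""
    return (w
            + y_a * (alpha_a_new - alpha_a_old) * x_a
            + y_b * (alpha_b_new - alpha_b_old) * x_b)
```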
Then we need to update $b$. Before the update, $b^* = u-wx = u-(\alpha_a^*y_ax_a+\alpha_b^*y_bx_b)x-\left(\sum_{i\neq a,i\neq b}\alpha_iy_ix_i\right)x$. After the update, a support vector $x$ with label $y$ satisfies $y(w_{new}x+b)=1$, so $b=y-w_{new}x=y-(\alpha_ay_ax_a+\alpha_by_bx_b)x-\left(\sum_{i\neq a,i\neq b}\alpha_iy_ix_i\right)x$. Subtracting the two expressions gives:
$$b=b^*+(y-u)+y_a(\alpha_a^*-\alpha_a)x_ax+y_b(\alpha_b^*-\alpha_b)x_bx\\
b_a=b^*+(y_a-u_a)+y_a(\alpha_a^*-\alpha_a)K_{aa}+y_b(\alpha_b^*-\alpha_b)K_{ab}\\
b_b=b^*+(y_b-u_b)+y_a(\alpha_a^*-\alpha_a)K_{ab}+y_b(\alpha_b^*-\alpha_b)K_{bb}$$
We select the $b$ computed from a multiplier that is not at the bound ($\alpha\in(0,C)$), because the corresponding sample lies exactly on the margin and satisfies the KKT conditions with equality. When both $\alpha_a$ and $\alpha_b$ are at the bound ($0$ or $C$), any $b$ between $b_a$ and $b_b$ is consistent, and we set $b_{new}=\frac{b_a+b_b}{2}$. So
$$b_{new}=\left\{\begin{matrix}
b_a & \alpha_a \in(0,C) \\
b_b & \alpha_b \in(0,C) \\
\frac{b_a+b_b}{2} & otherwise
\end{matrix}\right.$$
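Putting the threshold update together, again only a sketch with assumed names:

```python
def update_b(b_old, y_a, y_b, u_a, u_b,
             alpha_a_old, alpha_a_new, alpha_b_old, alpha_b_new,
             K_aa, K_ab, K_bb, C):
    """Compute b_a, b_b and pick b_new according to which multiplier is unbounded."""
    b_a = (b_old + (y_a - u_a)
           + y_a * (alpha_a_old - alpha_a_new) * K_aa
           + y_b * (alpha_b_old - alpha_b_new) * K_ab)
    b_b = (b_old + (y_b - u_b)
           + y_a * (alpha_a_old - alpha_a_new) * K_ab
           + y_b * (alpha_b_old - alpha_b_new) * K_bb)
    if 0.0 < alpha_a_new < C:
        return b_a
    if 0.0 < alpha_b_new < C:
        return b_b
    return 0.5 * (b_a + b_b)
```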
In practice, we choose an $\alpha$ that violates the KKT conditions to optimize:
When $y_iu_i < 1$, the $\alpha_i$ should be $C$;
When $y_iu_i > 1$, the $\alpha_i$ should be $0$;
When $y_iu_i = 1$, the $\alpha_i$ should be in $(0,C)$.
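A KKT violation check along these lines is commonly used to pick the next multiplier; a sketch (the tolerance `tol` is an assumption):

```python
def violates_kkt(alpha_i, y_i, u_i, C, tol=1e-3):
    """alpha_i violates the KKT conditions when
    y_i*u_i < 1 while alpha_i < C, or y_i*u_i > 1 while alpha_i > 0."""
    r = y_i * u_i - 1.0
    return (r < -tol and alpha_i < C) or (r > tol and alpha_i > 0.0)
```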
