Softmax regression is a multiclass classifier. Like Logistic Regression, it is a discriminative model, but it uses a softmax function instead of a sigmoid function. With the softmax function, the posterior probability of a sample $x$ having label $c\in[1 \dots K]$ ($K$ is the number of classes) is shown below.

$$P(C=c|x)=\frac{e^{w_c^Tx}}{\sum_{i=1}^{K}e^{w_i^Tx}}$$
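The posterior above can be computed directly. Below is a minimal NumPy sketch (the function name and the example weights are mine, not from the text); subtracting the max logit before exponentiating is a standard trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax(W, x):
    """Posterior P(C=c|x) for every class c, given weight matrix W (K x d)."""
    logits = W @ x                 # w_c^T x for each class c
    logits -= logits.max()         # numerical stability; result is unchanged
    exp = np.exp(logits)
    return exp / exp.sum()

# Hypothetical example: K=3 classes, d=2 features
W = np.array([[1.0, -1.0],
              [0.5,  0.5],
              [-1.0, 1.0]])
x = np.array([2.0, 1.0])
p = softmax(W, x)                  # a probability vector over the 3 classes
```

The output is always a valid probability distribution: every entry is positive and the entries sum to 1.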

Note that in Logistic Regression $w$ is a vector, but here it is a matrix (one weight vector $w_i$ per class). Given a set of training data $D=\{(x_1, c_1), (x_2, c_2), \dots, (x_n, c_n)\}$, $x_i\in R^d$, $c_i\in[1 \dots K]$, define the one-hot vector $y_i\in R^K$ as

$$y_{ik}=\begin{cases}
1 & \text{if } c_i = k\\
0 & \text{otherwise}
\end{cases}$$
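One-hot encoding is a one-liner; here is a small sketch (the helper name is mine) that follows the definition above with classes numbered $1 \dots K$:

```python
import numpy as np

def one_hot(c, K):
    """y_i in R^K with y_ik = 1 iff c_i = k (classes numbered 1..K)."""
    y = np.zeros(K)
    y[c - 1] = 1.0
    return y
```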

So we can have

$$P(D) = \prod_{i=1}^{n}P(C=c_i|x_i)\\

\log P(D) = \sum_{i=1}^{n}\log P(C=c_i|x_i)=\sum_{i=1}^{n}\log \prod_{k=1}^{K}P(C=k|x_i)^{y_{ik}}=\sum_{i=1}^{n}\sum_{k=1}^{K}y_{ik}\log P(C=k|x_i)$$
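The final double sum is the (negated) cross-entropy loss. A minimal NumPy sketch (function name and array layout are my own choices: rows of `X` are samples, rows of `Y` are one-hot labels):

```python
import numpy as np

def neg_log_likelihood(W, X, Y):
    """L(w) = -log P(D) = -sum_i sum_k y_ik log P(C=k|x_i)."""
    logits = X @ W.T                               # n x K matrix of w_k^T x_i
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)              # softmax row by row
    return -np.sum(Y * np.log(P))

# With W = 0 every posterior is uniform (1/K), so L(w) = n * log K
X = np.array([[1.0, 2.0], [3.0, 4.0]])
Y = np.eye(2)                                      # two samples, classes 1 and 2
loss = neg_log_likelihood(np.zeros((2, 2)), X, Y)
```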

Define $L(w) = -\log P(D)$, so $w^*=\arg\max_w P(D) = \arg\min_w L(w)$. We can use an optimization method similar to the one used in Logistic Regression, such as gradient descent, to optimize $w$ here.
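As a concrete sketch of that optimization, here is batch gradient descent in NumPy. The gradient of $L(w)$ with respect to $W$ works out to $(P - Y)^T X$, where $P$ holds the predicted posteriors; the dataset, learning rate, and step count below are illustrative assumptions, not values from the text.

```python
import numpy as np

def fit_softmax(X, Y, lr=0.1, steps=500):
    """Minimize L(w) by batch gradient descent; grad_W L = (P - Y)^T X."""
    n, d = X.shape
    K = Y.shape[1]
    W = np.zeros((K, d))
    for _ in range(steps):
        logits = X @ W.T
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)          # softmax posteriors, n x K
        W -= lr * (P - Y).T @ X / n                # averaged gradient step
    return W

# Tiny hypothetical dataset: three samples, one per class
X = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]])
Y = np.eye(3)
W = fit_softmax(X, Y)
pred = np.argmax(X @ W.T, axis=1)
```

After training, the predicted class for each training point should match its label.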