# Naive Bayes

Naive Bayes is a classigier based on a strong assumption that all the features of a document are independent. So for a document $X=\{x_1, x_2 … x_n\}$ with $n$ features we can have $P(X|C)=\prod_{i=1}^{n}P(x_i|C)$. Here $C$ denotes the category $X$ might belong to. Based on the Bayes theorem we can have $P(C|X) = \frac{P(X|C)P(C)}{P(X)} \propto P(X|C)P(C)$. And by chooseing the category $c$ that maximize $P(C|X)$ we can decide the category of $X$.

Before we use the Naive Bayes, we need to have a dictionary $\{k_1, k_2 … k_m\}$ with $m$ keywords that all the features in the document should belong to at least one keyword. So document $X$ can be represented as $X = \{Z_1, Z_2 … Z_m\}$, $Z_i$ is the number of occurrences of $i$th keywords in $X$. So $P(X|C=c) = \prod_{i=1}^{m}P(k_i|C=c)^{Z_i}$.

We use $\pi_c$ denotes $P(C=c)$ and $\theta_{ck}$ denotes $P(K=k|C=c)$. Given a set of training data $D = \{X_1, X_2 … X_N\}$ and their labels $\{Y_1, Y_2 … Y_N\}$. Then we can get: ($Z_{ij}$ is the number of occurrences of $j$th keywords in $X_i$)
$$P(D) = \prod_{i=1}^{N}P(X=X_i, C=Y_i) = \prod_{i=1}^{N}\pi_{Y_i}\prod_{j=1}^{m}\theta_{Y_iK_j}^{Z_{ij}}\\ \log P(D) = \sum_{i=1}^{N}[\log \pi_{Y_i}+\sum_{j=1}^{m}Z_{ij}\log \theta_{Y_iK_j}]$$
So, with the training data $D$, we can get the $\pi_c$ and $\theta_{ck}$ by maximize $P(D)$. Suppose there are $C$ categorise $\{c_1, c_2 … c_C \}$ and $\#_i$ is the number of document occurrences of $i$th category in $D$, $W_{ij}$ is the number of keyword occurrences of $j$th keywords in $i$th category of $D$.
$$(\pi_c^*, \theta_{ck}^*) = argmax\{\sum_{i=1}^{N}\log \pi_{Y_i}+\sum_{i=1}^{N}\sum_{j=1}^{m}Z_{ij}\log \theta_{Y_iK_j}\}\\ \sum_{i=1}^{N}\log \pi_{Y_i}=\sum_{i=1}^{C}\#_i\log \pi_{c_i}\\ \sum_{i=1}^{N}\sum_{j=1}^{m}Z_{ij}\log \theta_{Y_iK_j}=\sum_{i=1}^{C}\sum_{j=1}^{m}W_{ij}\log \theta_{c_iK_j}$$

#### Optimize $\pi_c$

As $\sum_{i=1}^{N}\sum_{j=1}^{m}Z_{ij}\log \theta_{Y_iK_j}$ is independent to $\pi_c$ and $\sum_{i=1}^{C}\pi_{c_i} = 1$. Applying the Lagrange multiplier we can get $L(\pi_c) = \sum_{i=1}^{C}\#_i\log \pi_{c_i} + \lambda (1-\sum_{i=1}^{C}\pi_{c_i})$. And the final $\pi_{c_i}$ is $\frac{\#_i}{N}$.

#### Optimize $\theta_{ck}$

As $\sum_{i=1}^{N}\sum_{j=1}^{m}Z_{ij}\log \theta_{Y_iK_j}$ is independent to $\pi_c$ and $\sum_{j=1}^{m}\theta_{c_iK_j} = 1$. Applying the Lagrange multiplier we can get $L(\theta_{ck}) = \sum_{i=1}^{C}\sum_{j=1}^{m}W_{ij}\log \theta_{c_iK_j} + \sum_{i=1}^{C}\lambda_i(1-\sum_{j=1}^{m}\theta_{c_iK_j})$. And the final $\theta_{c_ik_j}$ is $\frac{W_{ij}}{\sum_{j=1}^{m}W_{ij}}$.

#### Apply

When we use the Naive Bayes, given an unlabled document $x$. $Z_k$ the number of $k$th keywords in $x$. it’s category $y$ is
$$argmax_{c}P(C=c|x)=argmax_{c}[\log\pi_c+\sum_{k=1}^{m}Z_k\log\theta_{ck}]$$

#### For Two Classification

$$y_1 = \log\pi_1 + \sum_{k=1}^{m}Z_k\log\theta_{1k}\\ y_2 = \log\pi_2 + \sum_{k=1}^{m}Z_k\log\theta_{2k}\\ y_1-y_2 = \log \frac{\pi_1}{\pi_2} + \sum_{k=1}^{m}\log \frac{\theta_{1k}}{\theta_{2k}}Z_k=b+\sum_{k=1}^{m}A_kZ_k$$
Treat the document $x$ as a vector $[Z_1, Z_2 … Z_k]^T$ the two classes Naive Bayes can be treated as a linear classifier.