Logistic regression makes predictions in terms of probability, defined as \[p = \sigma(Wx)\] where \(\sigma(z) = \frac{1}{1+e^{-z}}\) is the sigmoid function.
Why sigmoid?
- The sigmoid function squashes any real-valued input into the range \((0, 1)\), so its output can be interpreted as a probability.
- It makes backpropagation easier, since the derivative of the sigmoid is simple to compute: \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\) (see the sketch after this list).
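A minimal sketch of both points in NumPy (the function names `sigmoid` and `sigmoid_derivative` are my own, not from the original):

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)
```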
Objective Function
Since the output \(\sigma(Wx)\) is the predicted probability of a class, one way to find the best parameters for the LR model, given a dataset, is to maximize the probability of the whole dataset under the model. We can therefore use Maximum Likelihood Estimation (MLE) to do the job.
The likelihood can be defined as: \[\begin{equation} L(W) = \prod_{i=1}^n \sigma(Wx_i)^{y_i}(1-\sigma(Wx_i))^{1-y_i} \end{equation}\]
We take the logarithm for easier computation: \[\begin{equation} \ln(L(W)) = \sum_{i=1}^n y_i\ln(\sigma(Wx_i)) + (1-y_i)\ln(1-\sigma(Wx_i)) \end{equation}\] What we want is to maximize \(\ln(L(W))\) and obtain the best \(W\): \[\begin{equation} W = \arg\max_W \ln(L(W)) \end{equation}\] Maximizing \(\ln(L(W))\) is equivalent to minimizing \(-\ln(L(W))\), so we define our loss function \(J(W)\) as: \[\begin{equation} J(W) = -\ln(L(W)) \end{equation}\]
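A minimal sketch of this loss in NumPy (the names `predict_proba` and `loss`, and the small `eps` added to avoid \(\ln(0)\), are my own assumptions, not part of the original derivation):

```python
import numpy as np

def predict_proba(W, X):
    """Predicted probability p = sigma(W x) for each row of X."""
    return 1.0 / (1.0 + np.exp(-X @ W))

def loss(W, X, y):
    """Negative log-likelihood J(W) = -ln(L(W)) over the dataset."""
    p = predict_proba(W, X)
    eps = 1e-12  # numerical safeguard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```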
Derivation
\[\begin{align} J'(W) &= \frac{\partial J}{\partial p} \cdot \frac{\partial p}{\partial W} \\ &= \frac{\partial (-\ln(L(W)))}{\partial \sigma(Wx_i)} \cdot \frac{\partial \sigma(Wx_i)}{\partial W} \\ &= -\sum_{i=1}^n\left(\frac{y_i}{\sigma(Wx_i)} - \frac{1 - y_i}{1-\sigma(Wx_i)}\right)\sigma(Wx_i)(1-\sigma(Wx_i))x_i \\ &= -\sum_{i=1}^n\left(y_i(1-\sigma(Wx_i)) - (1-y_i)\sigma(Wx_i)\right)x_i \\ &= \sum_{i=1}^n\left(\sigma(Wx_i)-y_i\right)x_i \end{align}\]
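With this gradient, \(W\) can be fit by plain gradient descent, \(W \leftarrow W - \eta \sum_{i=1}^n(\sigma(Wx_i)-y_i)x_i\). A minimal NumPy sketch (the learning rate, iteration count, and the function names `gradient`/`fit` are illustrative assumptions, not from the original):

```python
import numpy as np

def gradient(W, X, y):
    """Gradient of J(W): sum_i (sigma(W x_i) - y_i) x_i."""
    p = 1.0 / (1.0 + np.exp(-X @ W))
    return X.T @ (p - y)

def fit(X, y, lr=0.1, n_iter=1000):
    """Plain gradient descent on the negative log-likelihood J(W)."""
    W = np.zeros(X.shape[1])
    for _ in range(n_iter):
        W -= lr * gradient(W, X, y)
    return W
```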