Some Notes on Machine Learning

Created: 2021-12-12
Updated: 2021-12-14

Unsupervised Learning

Cocktail Party Algorithm

Support Vector Machine

Model Description

h_\theta(x) = \theta_0 + \theta_1 x

Cost Function

\min {1 \over 2n}\sum_{i = 1}^n [h_\theta(x_i) - y_i]^2

Here (x_i, y_i) are the training examples, n in total; the goal of training is to minimize the expression above.

The factor \displaystyle {1 \over n} says we are minimizing the average error, and the extra 2 in the denominator is only there so that it cancels when we take derivatives later, giving a simpler form. Minimizing a function is unaffected by a positive constant factor in front of it.

  • Hypothesis:
h_\theta(x) = \theta_0 + \theta_1 x
  • Parameters:
\theta_0, \theta_1
  • Cost function (loss function):
J(\theta_0, \theta_1) = {1 \over 2n} \sum_{i = 1}^n [h_\theta(x_i) - y_i]^2
  • Objective:
\min_{\theta_0,\, \theta_1} J(\theta_0, \theta_1)
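As a concrete illustration of the setup above, here is a minimal NumPy sketch of the hypothesis and the cost function (the variable names and the toy data are my own, not from the notes):

    import numpy as np

    def predict(theta0, theta1, x):
        # hypothesis h_theta(x) = theta0 + theta1 * x
        return theta0 + theta1 * x

    def cost(theta0, theta1, x, y):
        # J(theta0, theta1) = 1/(2n) * sum_i (h_theta(x_i) - y_i)^2
        n = len(x)
        residual = predict(theta0, theta1, x) - y
        return np.sum(residual ** 2) / (2 * n)

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 6.0])
    print(cost(0.0, 2.0, x, y))  # 0.0, since theta0 = 0, theta1 = 2 fits y = 2x exactly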

Gradient Descent

The algorithm: repeat the following update until convergence, updating \theta_0 and \theta_1 simultaneously:

\theta_j := \theta_j - \alpha {\partial J(\theta_0, \theta_1)\over \partial \theta_j} \quad (\text{for j = 0 and j = 1})
\begin{aligned} {\partial J(\theta_0, \theta_1)\over \partial \theta_0} =& {\displaystyle \partial \left({1 \over 2n} \sum_{i = 1}^n [\theta_0 + \theta_1 x_i - y_i]^2 \right) \over \partial \theta_0} \\ =& {1 \over 2n} \cdot {\displaystyle \partial \left(\sum_{i = 1}^n [\theta_0 + \theta_1 x_i - y_i]^2 \right) \over \partial \theta_0} \\ =& {1 \over 2n} \cdot {\displaystyle \partial \left(\sum_{i = 1}^n [\theta_0^2 + 2(\theta_1 x_i - y_i)\theta_0 + (\theta_1 x_i - y_i)^2 ] \right) \over \partial \theta_0} \\ =& {1 \over n} \sum_{i = 1}^n [\theta_0 + (\theta_1 x_i - y_i)] \\ =& {1 \over n} \sum_{i = 1}^n [h_\theta(x_i) - y_i] \\ \end{aligned}
\begin{aligned} {\partial J(\theta_0, \theta_1)\over \partial \theta_1} =& {\displaystyle \partial \left({1 \over 2n} \sum_{i = 1}^n [\theta_0 + \theta_1 x_i - y_i]^2 \right) \over \partial \theta_1} \\ =& {1 \over 2n} {\displaystyle \partial \left(\sum_{i = 1}^n [\theta_0 + \theta_1 x_i - y_i]^2 \right) \over \partial \theta_1} \\ =& {1 \over 2n} {\displaystyle \partial \left(\sum_{i = 1}^n [\theta_0^2 + 2(\theta_1 x_i - y_i)\theta_0 + (\theta_1 x_i - y_i)^2 ] \right) \over \partial \theta_1} \\ =& {1 \over 2n}\sum_{i = 1}^n [2\theta_0 x_i + 2(\theta_1 x_i - y_i) x_i ] \\ =& {1 \over n}\sum_{i = 1}^n [\theta_0 x_i + (\theta_1 x_i - y_i) x_i ] \\ =& {1 \over n}\sum_{i = 1}^n (\theta_0 + \theta_1 x_i - y_i) x_i \\ =& {1 \over n} \sum_{i = 1}^n [h_\theta(x_i) - y_i]x_i \\ \end{aligned}

Written out in full, repeat the following updates until convergence:

\begin{aligned} \theta_0 :=& \theta_0 - \alpha {1 \over n} \sum_{i = 1}^n [h_\theta(x_i) - y_i] \\ \theta_1 :=& \theta_1 - \alpha {1 \over n} \sum_{i = 1}^n [h_\theta(x_i) - y_i]x_i \end{aligned}
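These updates translate almost directly into code. A minimal sketch (the learning rate, iteration count, and toy data are arbitrary choices of mine):

    import numpy as np

    def gradient_descent(x, y, alpha=0.05, iters=5000):
        n = len(x)
        theta0, theta1 = 0.0, 0.0
        for _ in range(iters):
            error = theta0 + theta1 * x - y      # h_theta(x_i) - y_i
            grad0 = np.sum(error) / n            # dJ / dtheta0
            grad1 = np.sum(error * x) / n        # dJ / dtheta1
            # update both parameters simultaneously
            theta0 -= alpha * grad0
            theta1 -= alpha * grad1
        return theta0, theta1

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([3.0, 5.0, 7.0, 9.0])           # generated by y = 2x + 1
    print(gradient_descent(x, y))                # approaches (1.0, 2.0)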

Batch Gradient Descent

When computing the gradient, every step uses all of the data, \displaystyle \sum_{i = 1}^n [h_\theta(x_i) - y_i]. Each step is therefore more accurate, but it also places higher demands on the hardware.
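For contrast, a mini-batch / stochastic variant computes each step's gradient on only a random subset of the data, trading per-step accuracy for a much cheaper step. A minimal sketch (the function and its parameters are my own illustration):

    import numpy as np

    def minibatch_step(theta0, theta1, x, y, alpha=0.01, batch_size=32):
        # sample a random mini-batch instead of using all n examples
        idx = np.random.choice(len(x), size=min(batch_size, len(x)), replace=False)
        xb, yb = x[idx], y[idx]
        error = theta0 + theta1 * xb - yb
        return theta0 - alpha * np.mean(error), theta1 - alpha * np.mean(error * xb)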

Matrices and Vectors

Dropout

Randomly set a fraction of the hidden-layer outputs (activations) to 0 during training.
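A minimal sketch of (inverted) dropout applied to one layer's activations, assuming a keep probability p_keep (the interface is my own):

    import numpy as np

    def dropout(activations, p_keep=0.8, training=True):
        # at test time the layer is left unchanged
        if not training:
            return activations
        # randomly zero out a fraction (1 - p_keep) of the hidden-layer outputs,
        # and scale the survivors by 1 / p_keep so the expected value is unchanged
        mask = (np.random.rand(*activations.shape) < p_keep).astype(activations.dtype)
        return activations * mask / p_keep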

Optimization Algorithms

In the following, x is a vector. The second-order Taylor expansion of f is

f(x + \delta x) \approx f(x) + \nabla f(x) \cdot \delta x + {1 \over 2} \delta x^\top H \delta x

Keeping only the first-order term,

f(x + \delta x) \approx f(x) + \nabla f(x) \cdot \delta x

To make f decrease, we want

f(x + \delta x) - f(x) \approx \nabla f(x) \cdot \delta x \leqslant 0

Taking \delta x = -\eta \nabla f(x) with \eta > 0 gives \nabla f(x) \cdot \delta x = -\eta \|\nabla f(x)\|^2 \leqslant 0, so a small step in the negative-gradient direction decreases f.

Gradient Descent Method

To find a minimum of a function, it is enough to move along the negative-gradient direction at each iteration, i.e. subtract a multiple of the gradient. For example, take

f(x_1, x_2) = x_1^2 + 2 x_2^2 + x_1

and iterate

W_t = W_{t - 1} - \eta \cdot \Delta W

where \Delta W is the gradient of the objective with respect to W.
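As a numeric sketch of this update on the example function above, whose gradient is \nabla f = (2 x_1 + 1, 4 x_2) (the step size and starting point are my own choices):

    import numpy as np

    def grad_f(x):
        # gradient of f(x1, x2) = x1^2 + 2*x2^2 + x1
        return np.array([2 * x[0] + 1, 4 * x[1]])

    x = np.array([2.0, 1.0])
    eta = 0.1
    for _ in range(200):
        x = x - eta * grad_f(x)   # W_t = W_{t-1} - eta * gradient
    print(x)                      # approaches the minimizer (-0.5, 0.0)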

Momentum method (exponentially weighted moving average), where \beta is the momentum factor:

v_t =\beta v_{t - 1} + (1 - \beta) g_w
\hat{v}_t = {v_t \over 1 - \beta^t}
w_t = w_{t - 1} - \eta \cdot \hat{v}_t
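A minimal sketch of one step of this bias-corrected momentum update (the function signature is my own; t starts at 1):

    import numpy as np

    def momentum_step(w, v, grad, t, eta=0.01, beta=0.9):
        # exponentially weighted moving average of the gradient
        v = beta * v + (1 - beta) * grad
        # bias correction, which matters mainly for the first few steps
        v_hat = v / (1 - beta ** t)
        return w - eta * v_hat, v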

Nesterov's Algorithm

\Delta W_t = {\partial J(W_{t - 1} + \gamma V_{t - 1}) \over \partial W}
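Only the look-ahead gradient is written above; one common way to complete the update, consistent with evaluating the gradient at W_{t - 1} + \gamma V_{t - 1}, is the following (this completion is my addition, not from the original notes):

V_t = \gamma V_{t - 1} - \eta \cdot \Delta W_t
W_t = W_{t - 1} + V_t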

AdaGrad

W_t = W_{t - 1} - {\eta \over \sqrt{S_t} + \varepsilon} \cdot \Delta W_t
S_t = S_{t - 1} + \Delta W_t \cdot \Delta W_t
\Delta W_t = {\partial J(W_{t - 1}) \over \partial W}

Here the square \Delta W_t \cdot \Delta W_t, the square root, and the division are all taken element-wise, so each coordinate of W gets its own effective step size.
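A minimal sketch of one AdaGrad step with these element-wise operations (the function signature is my own):

    import numpy as np

    def adagrad_step(w, s, grad, eta=0.01, eps=1e-8):
        # accumulate element-wise squared gradients
        s = s + grad * grad
        # the per-coordinate step size shrinks as the accumulated squares grow
        return w - eta / (np.sqrt(s) + eps) * grad, s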

Adam
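For reference, the standard Adam update combines the bias-corrected momentum above with an AdaGrad/RMSProp-style element-wise scaling of the step (this sketch is my addition, not from the original notes; g_t is the gradient at step t):

v_t = \beta_1 v_{t - 1} + (1 - \beta_1) g_t, \quad \hat{v}_t = {v_t \over 1 - \beta_1^t}
s_t = \beta_2 s_{t - 1} + (1 - \beta_2) g_t \cdot g_t, \quad \hat{s}_t = {s_t \over 1 - \beta_2^t}
w_t = w_{t - 1} - {\eta \over \sqrt{\hat{s}_t} + \varepsilon} \cdot \hat{v}_t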

Backpropagation

  • Stochastic Gradient Descent (SGD)