Some Notes on Machine Learning

Created: 2021-12-12
Updated: 2021-12-14

Unsupervised Learning

Cocktail Party Algorithm

Support Vector Machine

Model Description

h_\theta(x) = \theta_0 + \theta_1 x

Cost Function

\min {1 \over 2n}\sum_{i = 1}^n [h_\theta(x_i) - y_i]^2

Here (x_i, y_i) are the training examples, n in total; the goal of training is to minimize the expression above.

The factor \displaystyle {1 \over n} says we are minimizing the average error, and the extra 2 in the denominator is only there so that it cancels when we take derivatives later, giving a simpler form. Minimizing a function is unaffected by a positive constant factor in front of it.

  • Hypothesis:
h_\theta(x) = \theta_0 + \theta_1 x
  • Parameters:
\theta_0, \theta_1
  • Cost function (loss function):
J(\theta_0, \theta_1) = {1 \over 2n} \sum_{i = 1}^n [h_\theta(x_i) - y_i]^2
  • Objective:
\min_{\theta_0,\, \theta_1} J(\theta_0, \theta_1)
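As a concrete illustration of the setup above, here is a minimal NumPy sketch of the hypothesis and the cost function (the variable names and the toy data are my own, not from the notes):

    import numpy as np

    def predict(theta0, theta1, x):
        # hypothesis h_theta(x) = theta0 + theta1 * x
        return theta0 + theta1 * x

    def cost(theta0, theta1, x, y):
        # J(theta0, theta1) = 1/(2n) * sum_i (h_theta(x_i) - y_i)^2
        n = len(x)
        residual = predict(theta0, theta1, x) - y
        return np.sum(residual ** 2) / (2 * n)

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 6.0])
    print(cost(0.0, 2.0, x, y))  # 0.0, since theta0 = 0, theta1 = 2 fits y = 2x exactly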

Gradient Descent

The algorithm: repeat the following update until convergence, updating \theta_0 and \theta_1 simultaneously:

\theta_j := \theta_j - \alpha {\partial J(\theta_0, \theta_1)\over \partial \theta_j} \quad (\text{for j = 0 and j = 1})
\begin{aligned} {\partial J(\theta_0, \theta_1)\over \partial \theta_0} =& {\displaystyle \partial \left({1 \over 2n} \sum_{i = 1}^n [\theta_0 + \theta_1 x_i - y_i]^2 \right) \over \partial \theta_0} \\ =& {1 \over 2n} \cdot {\displaystyle \partial \left(\sum_{i = 1}^n [\theta_0 + \theta_1 x_i - y_i]^2 \right) \over \partial \theta_0} \\ =& {1 \over 2n} \cdot {\displaystyle \partial \left(\sum_{i = 1}^n [\theta_0^2 + 2(\theta_1 x_i - y_i)\theta_0 + (\theta_1 x_i - y_i)^2 ] \right) \over \partial \theta_0} \\ =& {1 \over n} \sum_{i = 1}^n [\theta_0 + (\theta_1 x_i - y_i)] \\ =& {1 \over n} \sum_{i = 1}^n [h_\theta(x_i) - y_i] \\ \end{aligned}
\begin{aligned} {\partial J(\theta_0, \theta_1)\over \partial \theta_1} =& {\displaystyle \partial \left({1 \over 2n} \sum_{i = 1}^n [\theta_0 + \theta_1 x_i - y_i]^2 \right) \over \partial \theta_1} \\ =& {1 \over 2n} {\displaystyle \partial \left(\sum_{i = 1}^n [\theta_0 + \theta_1 x_i - y_i]^2 \right) \over \partial \theta_1} \\ =& {1 \over 2n} {\displaystyle \partial \left(\sum_{i = 1}^n [\theta_0^2 + 2(\theta_1 x_i - y_i)\theta_0 + (\theta_1 x_i - y_i)^2 ] \right) \over \partial \theta_1} \\ =& {1 \over 2n}\sum_{i = 1}^n [2\theta_0 x_i + 2(\theta_1 x_i - y_i) x_i ] \\ =& {1 \over n}\sum_{i = 1}^n [\theta_0 x_i + (\theta_1 x_i - y_i) x_i ] \\ =& {1 \over n}\sum_{i = 1}^n (\theta_0 + \theta_1 x_i - y_i) x_i \\ =& {1 \over n} \sum_{i = 1}^n [h_\theta(x_i) - y_i]x_i \\ \end{aligned}

Written out in full, repeat the following updates until convergence:

\begin{aligned} \theta_0 :=& \theta_0 - \alpha {1 \over n} \sum_{i = 1}^n [h_\theta(x_i) - y_i] \\ \theta_1 :=& \theta_1 - \alpha {1 \over n} \sum_{i = 1}^n [h_\theta(x_i) - y_i]x_i \end{aligned}
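These updates translate almost directly into code. A minimal sketch (the learning rate, iteration count, and toy data are arbitrary choices of mine):

    import numpy as np

    def gradient_descent(x, y, alpha=0.05, iters=5000):
        n = len(x)
        theta0, theta1 = 0.0, 0.0
        for _ in range(iters):
            error = theta0 + theta1 * x - y      # h_theta(x_i) - y_i
            grad0 = np.sum(error) / n            # dJ / dtheta0
            grad1 = np.sum(error * x) / n        # dJ / dtheta1
            # update both parameters simultaneously
            theta0 -= alpha * grad0
            theta1 -= alpha * grad1
        return theta0, theta1

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([3.0, 5.0, 7.0, 9.0])           # generated by y = 2x + 1
    print(gradient_descent(x, y))                # approaches (1.0, 2.0)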

Batch Gradient Descent

When computing the gradient, every step uses all of the data, \displaystyle \sum_{i = 1}^n [h_\theta(x_i) - y_i]. Each step is therefore more accurate, but it also places higher demands on the hardware.
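For contrast, a mini-batch / stochastic variant computes each step's gradient on only a random subset of the data, trading per-step accuracy for a much cheaper step. A minimal sketch (the function and its parameters are my own illustration):

    import numpy as np

    def minibatch_step(theta0, theta1, x, y, alpha=0.01, batch_size=32):
        # sample a random mini-batch instead of using all n examples
        idx = np.random.choice(len(x), size=min(batch_size, len(x)), replace=False)
        xb, yb = x[idx], y[idx]
        error = theta0 + theta1 * xb - yb
        return theta0 - alpha * np.mean(error), theta1 - alpha * np.mean(error * xb)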

Matrices and Vectors

Dropout

Randomly set a fraction of the hidden-layer outputs (activations) to 0 during training.
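A minimal sketch of (inverted) dropout applied to one layer's activations, assuming a keep probability p_keep (the interface is my own):

    import numpy as np

    def dropout(activations, p_keep=0.8, training=True):
        # at test time the layer is left unchanged
        if not training:
            return activations
        # randomly zero out a fraction (1 - p_keep) of the hidden-layer outputs,
        # and scale the survivors by 1 / p_keep so the expected value is unchanged
        mask = (np.random.rand(*activations.shape) < p_keep).astype(activations.dtype)
        return activations * mask / p_keep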

Optimization Algorithms

In the following, x is a vector. The second-order Taylor expansion of f is

f(x + \delta x) \approx f(x) + \nabla f(x) \cdot \delta x + {1 \over 2} \delta x^\top H \delta x

Keeping only the first-order term,

f(x + \delta x) \approx f(x) + \nabla f(x) \cdot \delta x

To make f decrease, we want

f(x + \delta x) - f(x) \approx \nabla f(x) \cdot \delta x \leqslant 0

Taking \delta x = -\eta \nabla f(x) with \eta > 0 gives \nabla f(x) \cdot \delta x = -\eta \|\nabla f(x)\|^2 \leqslant 0, so a small step in the negative-gradient direction decreases f.

Gradient Descent Method

To find a minimum of a function, it is enough to move along the negative-gradient direction at each iteration, i.e. subtract a multiple of the gradient. For example, take

f(x_1, x_2) = x_1^2 + 2 x_2^2 + x_1

and iterate

W_t = W_{t - 1} - \eta \cdot \Delta W

where \Delta W is the gradient of the objective with respect to W.
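As a numeric sketch of this update on the example function above, whose gradient is \nabla f = (2 x_1 + 1, 4 x_2) (the step size and starting point are my own choices):

    import numpy as np

    def grad_f(x):
        # gradient of f(x1, x2) = x1^2 + 2*x2^2 + x1
        return np.array([2 * x[0] + 1, 4 * x[1]])

    x = np.array([2.0, 1.0])
    eta = 0.1
    for _ in range(200):
        x = x - eta * grad_f(x)   # W_t = W_{t-1} - eta * gradient
    print(x)                      # approaches the minimizer (-0.5, 0.0)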

Momentum method (exponentially weighted moving average), where \beta is the momentum factor:

v_t =\beta v_{t - 1} + (1 - \beta) g_w
\hat{v}_t = {v_t \over 1 - \beta^t}
w_t = w_{t - 1} - \eta \cdot \hat{v}_t
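A minimal sketch of one step of this bias-corrected momentum update (the function signature is my own; t starts at 1):

    import numpy as np

    def momentum_step(w, v, grad, t, eta=0.01, beta=0.9):
        # exponentially weighted moving average of the gradient
        v = beta * v + (1 - beta) * grad
        # bias correction, which matters mainly for the first few steps
        v_hat = v / (1 - beta ** t)
        return w - eta * v_hat, v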

Nesterov's Algorithm

\Delta W_t = {\partial J(W_{t - 1} + \gamma V_{t - 1}) \over \partial W}
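Only the look-ahead gradient is written above; one common way to complete the update, consistent with evaluating the gradient at W_{t - 1} + \gamma V_{t - 1}, is the following (this completion is my addition, not from the original notes):

V_t = \gamma V_{t - 1} - \eta \cdot \Delta W_t
W_t = W_{t - 1} + V_t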

AdaGrad

W_t = W_{t - 1} - {\eta \over \sqrt{S_t} + \varepsilon} \cdot \Delta W_t
S_t = S_{t - 1} + \Delta W_t \cdot \Delta W_t
\Delta W_t = {\partial J(W_{t - 1}) \over \partial W}

Here the square \Delta W_t \cdot \Delta W_t, the square root, and the division are all taken element-wise, so each coordinate of W gets its own effective step size.
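A minimal sketch of one AdaGrad step with these element-wise operations (the function signature is my own):

    import numpy as np

    def adagrad_step(w, s, grad, eta=0.01, eps=1e-8):
        # accumulate element-wise squared gradients
        s = s + grad * grad
        # the per-coordinate step size shrinks as the accumulated squares grow
        return w - eta / (np.sqrt(s) + eps) * grad, s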

Adam
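For reference, the standard Adam update combines the bias-corrected momentum above with an AdaGrad/RMSProp-style element-wise scaling of the step (this sketch is my addition, not from the original notes; g_t is the gradient at step t):

v_t = \beta_1 v_{t - 1} + (1 - \beta_1) g_t, \quad \hat{v}_t = {v_t \over 1 - \beta_1^t}
s_t = \beta_2 s_{t - 1} + (1 - \beta_2) g_t \cdot g_t, \quad \hat{s}_t = {s_t \over 1 - \beta_2^t}
w_t = w_{t - 1} - {\eta \over \sqrt{\hat{s}_t} + \varepsilon} \cdot \hat{v}_t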

Backpropagation

  • Stochastic Gradient Descent (SGD)