Adam paper (Kingma & Ba, 2014): https://arxiv.org/pdf/1412.6980.pdf

Moving Average

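A minimal sketch of a simple (windowed) moving average; the window size and the example values are assumptions for illustration:

import numpy as np

def moving_average(values, window=3):
    # plain windowed moving average: mean of the last `window` values at each position
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(values, dtype=float), kernel, mode="valid")

print(moving_average([1, 2, 3, 4, 5]))   # [2. 3. 4.]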

Exponential Moving Average

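A minimal sketch of an exponential moving average; the decay factor beta (0.9 here) is an assumed value and controls how quickly old values are forgotten:

def exponential_moving_average(values, beta=0.9):
    # blend each new value into the running average;
    # a value seen k steps ago is down-weighted by beta ** k
    avg = 0.0
    out = []
    for v in values:
        avg = beta * avg + (1 - beta) * v
        out.append(avg)
    return out

print(exponential_moving_average([1, 2, 3, 4, 5]))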

SGD+Momentum

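A minimal sketch in the same loop style as the snippets below; rho is an assumed momentum coefficient (commonly around 0.9):

vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx          # velocity: decaying running sum of past gradients
    x -= learning_rate * vx     # step along the velocity rather than the raw gradient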

AdaGrad

import numpy as np

grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx          # accumulate squared gradients over all steps so far
    # per-parameter step, scaled down by the accumulated history (small constant avoids division by zero)
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-17)

What is happening with AdaGrad?

→ progress along “steep” directions is damped, and progress along “flat” directions is accelerated

What happens to the step size over long time?

→ Decays to zero
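A quick numerical sketch of that decay, assuming a constant gradient of 1.0 in one direction: grad_squared grows linearly with the step count t, so the effective step shrinks roughly like learning_rate / sqrt(t).

import numpy as np

learning_rate = 1e-2   # assumed value for illustration
grad_squared = 0.0
for t in range(1, 10001):
    dx = 1.0           # pretend the gradient in this direction stays constant
    grad_squared += dx * dx
    step = learning_rate * dx / (np.sqrt(grad_squared) + 1e-17)
    if t in (1, 100, 10000):
        print(t, step)  # roughly learning_rate / sqrt(t): 1e-2, 1e-3, 1e-4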

RMSProp

grad_squared = 0
while True:
    dx = compute_gradient(x)
    # exponentially decaying moving average of squared gradients
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-17)

The decaying moving average lets the algorithm forget early gradients and focus on the most recently observed gradients, as the sketch below illustrates.
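One way to see this: under the update above, a squared gradient observed k steps ago carries weight (1 - decay_rate) * decay_rate ** k in grad_squared, so old gradients fade geometrically instead of accumulating forever. A small sketch with decay_rate = 0.99 (assumed):

decay_rate = 0.99      # assumed value for illustration
for k in (0, 100, 1000):
    # weight that a squared gradient observed k steps ago carries in grad_squared
    print(k, (1 - decay_rate) * decay_rate ** k)
# 0 -> 0.01, 100 -> ~0.0037, 1000 -> ~4.3e-7: early gradients are effectively forgotten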