https://arxiv.org/pdf/1412.6980.pdf

# AdaGrad: accumulate a per-coordinate sum of squared gradients and divide
# each step by its square root.
import numpy as np

grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx                                   # running sum of squared gradients
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)  # small epsilon avoids division by zero
What is happening with AdaGrad?
→ progress along “steep” directions is damped, and progress along “flat” directions is accelerated
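A minimal numeric sketch of that effect (the gradient values here are made up for illustration): after one update, both coordinates move by roughly the same amount, so the steep coordinate is damped while the flat coordinate advances far more than it would under plain SGD.

import numpy as np

learning_rate = 0.1
dx = np.array([10.0, 0.1])   # hypothetical gradient: steep 1st coordinate, flat 2nd
grad_squared = dx * dx       # accumulated squared gradients after one step

step = learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
print(step)                  # ~[0.1, 0.1]: equal steps, vs. [1.0, 0.01] for plain SGD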
What happens to the step size over long time?
→ Decays to zero
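A minimal sketch of that decay, assuming for illustration a constant gradient g: the accumulated sum grows linearly with the step count t, so the effective step shrinks like 1/sqrt(t).

import numpy as np

learning_rate, g = 0.1, 1.0     # illustrative values
grad_squared = 0.0
for t in range(1, 10001):
    grad_squared += g * g
    step = learning_rate * g / (np.sqrt(grad_squared) + 1e-7)
    if t in (1, 100, 10000):
        print(t, step)          # 1 -> 0.1, 100 -> 0.01, 10000 -> 0.001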
# Leaky accumulation (RMSProp): decay the running sum so it becomes an
# exponential moving average of recent squared gradients.
grad_squared = 0
while True:
    dx = compute_gradient(x)
    # note '=' rather than '+=': decay the old average, then mix in the new squared gradient
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
Using a decaying moving average lets the algorithm forget early gradients and focus on the most recently observed gradients, so the effective step size no longer decays to zero the way AdaGrad's does.
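To make that concrete, here is a minimal sketch (again assuming a constant gradient g for illustration) contrasting the two accumulators: AdaGrad's running sum grows without bound, while the exponential moving average converges to g*g, so the RMSProp step stays roughly constant.

import numpy as np

g, decay_rate, learning_rate = 1.0, 0.99, 0.1      # illustrative values
adagrad_acc, ema_acc = 0.0, 0.0
for t in range(1000):
    adagrad_acc += g * g                                        # unbounded running sum
    ema_acc = decay_rate * ema_acc + (1 - decay_rate) * g * g   # bounded moving average

print(learning_rate * g / (np.sqrt(adagrad_acc) + 1e-7))    # ~0.003 and still shrinking
print(learning_rate * g / (np.sqrt(ema_acc) + 1e-7))        # ~0.1, roughly constant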