
→ Saddle points are much more common in high dimensions




import numpy as np

grad_squared = 0
while True:
    dx = compute_gradient(x)
    # accumulate the element-wise squared gradient over all iterations
    grad_squared += dx * dx
    # scale each coordinate's step by the inverse square root of its accumulated squared gradient
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-17)
What is happening with AdaGrad?
→ progress along “steep” directions is damped, and progress along “flat” directions is accelerated
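A minimal sketch of that per-direction scaling (the toy quadratic, its gradient, and all constants below are illustrative assumptions, not from the notes): the steep coordinate accumulates a large grad_squared and ends up with a tiny effective step, while the flat coordinate keeps a comparatively large one.

import numpy as np

# Toy loss f(x) = 0.5 * (100 * x[0]**2 + x[1]**2): coordinate 0 is steep, coordinate 1 is flat
def compute_gradient(x):
    return np.array([100.0, 1.0]) * x

x = np.array([1.0, 1.0])
learning_rate = 0.1
grad_squared = np.zeros_like(x)

for t in range(100):
    dx = compute_gradient(x)
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-17)

# Effective per-coordinate step scale, learning_rate / sqrt(grad_squared):
print(learning_rate / (np.sqrt(grad_squared) + 1e-17))
# Coordinate 0 (steep) gets a scale about 100x smaller than coordinate 1 (flat),
# so in this toy setup both coordinates make comparable progress, unlike plain SGD.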
What happens to the step size over a long time?
→ Decays to zero, since grad_squared only ever grows, so the denominator keeps increasing
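A quick way to see the decay (assuming, purely for illustration, a constant gradient g at every step): grad_squared grows like t * g**2, so the step is learning_rate * g / sqrt(t * g**2) = learning_rate / sqrt(t), which goes to zero as t grows.

import numpy as np

# Constant-gradient assumption, only to make the decay rate easy to read off
learning_rate = 0.1
g = 2.0
grad_squared = 0.0
for t in range(1, 10_001):
    grad_squared += g * g
    step = learning_rate * g / (np.sqrt(grad_squared) + 1e-17)
    if t in (1, 100, 10_000):
        print(t, step)  # shrinks like learning_rate / sqrt(t): 0.1, 0.01, 0.001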