How is Nesterov's Accelerated Gradient Descent implemented in Tensorflow?
Hire the world's top talent on demand or became one of them at Toptal: https://topt.al/25cXVn
and get $2,000 discount on your first invoice
--------------------------------------------------
Take control of your privacy with Proton's trusted, Swiss-based, secure services.
Choose what you need and safeguard your digital life:
Mail: https://go.getproton.me/SH1CU
VPN: https://go.getproton.me/SH1DI
Password Manager: https://go.getproton.me/SH1DJ
Drive: https://go.getproton.me/SH1CT
Music by Eric Matyas
https://www.soundimage.org
Track title: Breezy Bay
--
Chapters
00:00 How Is Nesterov'S Accelerated Gradient Descent Implemented In Tensorflow?
01:08 Answer 1 Score 3
03:15 Accepted Answer Score 7
05:19 Thank you
--
Full question
https://stackoverflow.com/questions/5077...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #tensorflow #machinelearning
#avk47
ACCEPTED ANSWER
Score 7
TL;DR
TF's implementation of Nesterov is indeed an approximation of the original formula, valid for high values of momentum.
Details
This is a great question. In the paper, the NAG update is defined as
vt+1 = μ.vt - λ.∇f(θt + μ.vt)
θt+1 = θt + vt+1
where f is our cost function, θt our parameters at time t, μ the momentum, λ the learning rate; vt is the NAG's internal accumulator.
The main difference with standard momentum is the use of the gradient at θt + μ.vt, not at θt. But as you said, tensorflow only uses gradient at θt. So what is the trick?
Part of the trick is actually mentioned in the part of the documentation you cited: the algorithm is tracking θt + μ.vt, not θt. The other part comes from an approximation valid for high value of momentum.
Let's make a slight change of notation from the paper for the accumulator to stick with tensorflow's definition. Let's define at = vt / λ. The update rules are changed slightly as
at+1 = μ.at - ∇f(θt + μ.λ.at)
θt+1 = θt + λ.at+1
(The motivation for this change in TF is that now a is a pure gradient momentum, independent of the learning rate. This makes the update process robust to changes in λ, a possibility common in practice but that the paper does not consider.)
If we note ψt = θt + μ.λ.at, then
at+1 = μ.at - ∇f(ψt)
ψt+1 = θt+1 + μ.λ.at+1
= θt + λ.at+1 + μ.λ.at+1
= ψt + λ.at+1 + μ.λ.(at+1 - at)
= ψt + λ.at+1 + μ.λ.[(μ-1)at - ∇f(ψt)]
≈ ψt + λ.at+1
This last approximation holds for strong values of momentum, where μ is close to 1, so that μ-1 is close to zero, and ∇f(ψt) is small compared to a — this last approximation is more debatable actually, and less valid for directions with frequent gradient switch.
We now have an update that uses the gradient of the current position, and the rules are pretty simple — they are in fact those of standard momentum.
However, we want θt, not ψt. This is the reason why we subtract μ.λ.at+1 to ψt+1 just before returning it — and to recover ψ it is added again first thing at the next call.
ANSWER 2
Score 3
I couldn't see any info on this online, and the linked paper certainly wasn't helpful, so I had a look at the unit tests for tf.train.MomentumOptimizer, from which I can see tests for the implementation of both classic momentum and NAG modes.
Summary
var = var + accum * learning_rate * momentum
accum = accum * momentum + g
var = var - learning_rate * accum
var = var - accum * learning_rate * momentum
where accum starts at 0 and is updated at every step. The above is a modified version of the formulation in the unit test, and I find it a bit confusing. Here is the same set of equations arranged with my interpretation of what each of the parameters represent (I could be wrong though):
average_grad_0 = accum # previous rolling average
average_grad_1 = accum * momentum + g # updated rolling average
grad_diff = average_grad_1 - average_grad_0
adjustment = -learning_rate * (grad_diff * momentum + average_grad_1)
var += adjustment
accum = average_grad_new
In other words, it seems to me like tensorflow's implementation attempts to guess the "adjusted gradient" in NAG by assuming that the new gradient will be esimated by the current average gradient plus the product of momentum and the change in the average gradient. I'd love to see a proof for this!
What follows is more detail on how the classic and nesterov modes are implemented in tensorflow as per the tests.
Classic Momentum mode
For use_nesterov=False, based on the doTestBasic function, we have the following initial parameters:
learning_rate = 2.0
momentum = 0.9
var_0 = 1.0 # at time 0
grad = 0.1
Actually, the above are just the first element of the grads_0 and vars_0 arrays, but I'll just focus on a single value. For the subsequent timesteps, we have
var_1 = 1.0 - (0.1 * 2.0)
var_2 = 1.0 - (0.1 * 2.0) - ((0.9 * 0.1 + 0.1) * 2.0)
which I'm going to interpret as meaning;
var_1 = var_0 - (grad * learning_rate)
var_2 = var_1 - ((momentum * grad + grad) * learning_rate)
If we assume that for the purposes of the unit tests grad_0 == grad_1 == grad then this makes sense as a formulation of classic momentum.
Nesterov's Accelerated Gradient (NAG) mode
For use_nesterov=True, I had a look at the _update_nesterov_momentum_numpy function and the testNesterovMomentum test case.
The _update_nesterov_momentum_numpy function has the following definition:
def _update_nesterov_momentum_numpy(self, var, accum, g, lr, momentum):
var = var + accum * lr * momentum
accum = accum * momentum + g
var = var - lr * accum
var = var - accum * lr * momentum
return var, accum
and it is called in the unit tests like this:
for t in range(1, 5):
opt_op.run()
var0_np, accum0_np = self._update_nesterov_momentum_numpy(
var0_np, accum0_np, var0_np * 10, 2.0, 0.9)