How is Nesterov's Accelerated Gradient Descent implemented in Tensorflow?

--------------------------------------------------
Hire the world's top talent on demand or became one of them at Toptal: https://topt.al/25cXVn
and get $2,000 discount on your first invoice
--------------------------------------------------

Take control of your privacy with Proton's trusted, Swiss-based, secure services.
Choose what you need and safeguard your digital life:
Mail: https://go.getproton.me/SH1CU
VPN: https://go.getproton.me/SH1DI
Password Manager: https://go.getproton.me/SH1DJ
Drive: https://go.getproton.me/SH1CT

Music by Eric Matyas
https://www.soundimage.org
Track title: Breezy Bay

--

Chapters
00:00 How Is Nesterov'S Accelerated Gradient Descent Implemented In Tensorflow?
01:08 Answer 1 Score 3
03:15 Accepted Answer Score 7
05:19 Thank you

--

Full question
https://stackoverflow.com/questions/5077...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #tensorflow #machinelearning

#avk47

ACCEPTED ANSWER

Score 7

TL;DR

TF's implementation of Nesterov is indeed an approximation of the original formula, valid for high values of momentum.

Details

This is a great question. In the paper, the NAG update is defined as

v_t+1 = μ.v_t - λ.∇f(θ_t + μ.v_t)
θ_t+1 = θ_t + v_t+1

where f is our cost function, θ_t our parameters at time t, μ the momentum, λ the learning rate; v_t is the NAG's internal accumulator.

The main difference with standard momentum is the use of the gradient at θ_t + μ.v_t, not at θ_t. But as you said, tensorflow only uses gradient at θ_t. So what is the trick?

Part of the trick is actually mentioned in the part of the documentation you cited: the algorithm is tracking θ_t + μ.v_t, not θ_t. The other part comes from an approximation valid for high value of momentum.

Let's make a slight change of notation from the paper for the accumulator to stick with tensorflow's definition. Let's define a_t = v_t / λ. The update rules are changed slightly as

a_t+1 = μ.a_t - ∇f(θ_t + μ.λ.a_t)
θ_t+1 = θ_t + λ.a_t+1

(The motivation for this change in TF is that now a is a pure gradient momentum, independent of the learning rate. This makes the update process robust to changes in λ, a possibility common in practice but that the paper does not consider.)

If we note ψ_t = θ_t + μ.λ.a_t, then

a_t+1 = μ.a_t - ∇f(ψ_t)
ψ_t+1 = θ_t+1 + μ.λ.a_t+1
    = θ_t + λ.a_t+1 + μ.λ.a_t+1
    = ψ_t + λ.a_t+1 + μ.λ.(a_t+1 - a_t)
    = ψ_t + λ.a_t+1 + μ.λ.[(μ-1)a_t - ∇f(ψ_t)]
    ≈ ψ_t + λ.a_t+1

This last approximation holds for strong values of momentum, where μ is close to 1, so that μ-1 is close to zero, and ∇f(ψ_t) is small compared to a — this last approximation is more debatable actually, and less valid for directions with frequent gradient switch.

We now have an update that uses the gradient of the current position, and the rules are pretty simple — they are in fact those of standard momentum.

However, we want θ_t, not ψ_t. This is the reason why we subtract μ.λ.a_t+1 to ψ_t+1 just before returning it — and to recover ψ it is added again first thing at the next call.

ANSWER 2

Score 3

I couldn't see any info on this online, and the linked paper certainly wasn't helpful, so I had a look at the unit tests for tf.train.MomentumOptimizer, from which I can see tests for the implementation of both classic momentum and NAG modes.

Summary

var = var + accum * learning_rate * momentum
accum = accum * momentum + g
var = var - learning_rate * accum
var = var - accum * learning_rate * momentum

where accum starts at 0 and is updated at every step. The above is a modified version of the formulation in the unit test, and I find it a bit confusing. Here is the same set of equations arranged with my interpretation of what each of the parameters represent (I could be wrong though):

average_grad_0 = accum # previous rolling average
average_grad_1 = accum * momentum + g # updated rolling average
grad_diff = average_grad_1 - average_grad_0
adjustment = -learning_rate * (grad_diff * momentum + average_grad_1)
var += adjustment
accum = average_grad_new

In other words, it seems to me like tensorflow's implementation attempts to guess the "adjusted gradient" in NAG by assuming that the new gradient will be esimated by the current average gradient plus the product of momentum and the change in the average gradient. I'd love to see a proof for this!

What follows is more detail on how the classic and nesterov modes are implemented in tensorflow as per the tests.

Classic Momentum mode

For use_nesterov=False, based on the doTestBasic function, we have the following initial parameters:

learning_rate = 2.0
momentum = 0.9
var_0 = 1.0 # at time 0
grad = 0.1

Actually, the above are just the first element of the grads_0 and vars_0 arrays, but I'll just focus on a single value. For the subsequent timesteps, we have

var_1 = 1.0 - (0.1 * 2.0)
var_2 = 1.0 - (0.1 * 2.0) - ((0.9 * 0.1 + 0.1) * 2.0)

which I'm going to interpret as meaning;

var_1 = var_0 - (grad * learning_rate)
var_2 = var_1 - ((momentum * grad + grad) * learning_rate)

If we assume that for the purposes of the unit tests grad_0 == grad_1 == grad then this makes sense as a formulation of classic momentum.

Nesterov's Accelerated Gradient (NAG) mode

For use_nesterov=True, I had a look at the _update_nesterov_momentum_numpy function and the testNesterovMomentum test case.

The _update_nesterov_momentum_numpy function has the following definition:

  def _update_nesterov_momentum_numpy(self, var, accum, g, lr, momentum):
    var = var + accum * lr * momentum
    accum = accum * momentum + g
    var = var - lr * accum
    var = var - accum * lr * momentum
    return var, accum

and it is called in the unit tests like this:

    for t in range(1, 5):
      opt_op.run()
      var0_np, accum0_np = self._update_nesterov_momentum_numpy(
          var0_np, accum0_np, var0_np * 10, 2.0, 0.9)