Stochastic Gradient Descent: Strong convergence guarantees – without parameter tuning

Rachel Ward, University of Texas at Austin
August 30th, 2018 at 11:00AM–12:00PM in Evans 732

Stochastic Gradient Descent is the basic optimization algorithm behind powerful deep learning architectures that are becoming increasingly prevalent in society. However, existing theoretical guarantees of convergence rely on knowing certain properties of the optimization problem, such as the maximal curvature and the noise level, which are not known a priori in practice. Consequently, hyperparameters of the algorithm, such as the step size, are tuned by hand before training, a process that can take days or weeks. In this talk, we discuss a modification of Stochastic Gradient Descent with an adaptive "on the fly" step size update, known as AdaGrad, which is used in practice but until now did not come with any theoretical guarantees. We provide the first such guarantees, showing that Stochastic Gradient Descent with AdaGrad converges to a near-stationary point of a smooth loss function, at a rate which nearly matches the "oracle" rate, as if the curvature of the loss function and the noise level on the stochastic gradients were known in advance. We also demonstrate its favorable empirical performance on deep learning problems compared to pre-tuned state-of-the-art algorithms.
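
For readers unfamiliar with the adaptive step size described above, the following is a minimal sketch (not the speaker's code) of SGD with a scalar AdaGrad-style update: the step size is a base learning rate divided by the running root of the accumulated squared gradient norms, so no curvature or noise level needs to be supplied in advance. All function and parameter names here are illustrative assumptions.

```python
import numpy as np


def adagrad_norm_sgd(stochastic_grad, x0, base_lr=1.0, b0=1e-2, num_steps=1000):
    """SGD with a single adaptive step size shared across all coordinates.

    stochastic_grad: callable returning an unbiased estimate of the gradient at x
    x0: initial iterate (NumPy array)
    base_lr, b0: scale and initialization of the adaptive denominator (assumed defaults)
    """
    x = np.array(x0, dtype=float)
    b_sq = b0 ** 2  # running sum of squared stochastic gradient norms
    for _ in range(num_steps):
        g = stochastic_grad(x)
        b_sq += np.dot(g, g)                      # b_{j+1}^2 = b_j^2 + ||g_j||^2
        x = x - (base_lr / np.sqrt(b_sq)) * g     # x_{j+1} = x_j - (eta / b_{j+1}) g_j
    return x


if __name__ == "__main__":
    # Toy usage: minimize a smooth quadratic f(x) = 0.5 ||x||^2 from noisy gradients.
    rng = np.random.default_rng(0)
    noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)
    x_final = adagrad_norm_sgd(noisy_grad, x0=np.ones(10))
    print("final iterate norm:", np.linalg.norm(x_final))
```

Note that because the denominator only grows as gradients accumulate, the step size shrinks automatically in noisy or high-curvature regions; the talk's results concern guarantees for this kind of "on the fly" schedule, not the particular defaults chosen in this sketch.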