University of Edinburgh
In this talk, I describe the problem of minimizing the average of a large number (n) of smooth convex loss functions. We propose a new method, S2GD (Semi-Stochastic Gradient Descent), which runs for one or several epochs, in each of which a single full gradient and a random number of stochastic gradients are computed, following a geometric law. The total work needed for the method to output an epsilon-accurate solution in expectation, measured in the number of passes over data, or equivalently, in units equivalent to the computation of a single gradient of the loss, is O(log(1/epsilon)). This is achieved by running the method for O(log(1/epsilon)) epochs, each with a single full gradient evaluation and O(kappa) stochastic gradient evaluations, where kappa is the condition number of the problem. To illustrate our theoretical results: S2GD needs only the workload equivalent to about 2.1 full gradient evaluations to find a 10^-6-accurate solution for a problem with n = 10^9 and kappa = 10^3.
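The epoch structure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's exact method: the inner-loop length is drawn here from a plain (capped) geometric distribution as a stand-in for the geometric law, and the step size, cap, and geometric parameter are arbitrary illustrative choices.

```python
import numpy as np

def s2gd(grad_full, grad_i, n, x0, epochs=20, stepsize=0.05,
         max_inner=200, p_geom=0.02, seed=0):
    """Simplified S2GD sketch: each epoch computes one full gradient plus a
    random (geometrically distributed) number of variance-reduced steps."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    for _ in range(epochs):
        mu = grad_full(x)                       # single full-gradient pass per epoch
        t = min(int(rng.geometric(p_geom)), max_inner)  # random inner-loop length
        y = x.copy()
        for _ in range(t):
            i = int(rng.integers(n))
            # variance-reduced stochastic gradient step (SVRG-style correction)
            y -= stepsize * (grad_i(y, i) - grad_i(x, i) + mu)
        x = y
    return x

# Toy problem: least-squares loss (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(42)
n, d = 100, 5
A = rng.normal(size=(n, d))
x_star = np.ones(d)
b = A @ x_star

grad_full = lambda x: A.T @ (A @ x - b) / n     # average gradient over all n terms
grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])  # gradient of a single loss term

x0 = np.zeros(d)
x_hat = s2gd(grad_full, grad_i, n, x0)
```

The key point the sketch captures is the cost accounting: each epoch pays for one pass over the data (the full gradient mu) plus a random number of cheap single-example gradients, and the correction term grad_i(y, i) - grad_i(x, i) + mu keeps the stochastic steps unbiased while shrinking their variance as x approaches the solution.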
In the second part of the talk, I will try to explain “what machine learners actually want”. In other words, I will provide insight into why the best optimization algorithm is not necessarily the best learning algorithm (in the large-scale setting), and how good S2GD is as a learning algorithm.
Jakub is a PhD student at the University of Edinburgh, under the supervision of Peter Richtárik. His main interest is large-scale optimization methods for machine learning. He is a recipient of the prestigious Google Europe Doctoral Fellowship in Optimization Algorithms. Jakub has a paper published in the Journal of Machine Learning Research, and two papers to appear in the SIAM Journal on Optimization. Prior to his PhD studies, Jakub obtained his Bachelor’s degree in Mathematics of Economy and Finance from Comenius University in Bratislava.