#### SGD without replacement: optimal rate analysis and more

Stochastic gradient descent (SGD) is ubiquitous in machine learning. Two fundamental versions of SGD exist: (i) one that samples stochastic gradients with replacement, and (ii) one that samples them without replacement. Ironically, version (ii) is the one used in practice, while version (i) is the one most theoretical works analyze. This mismatch is well known: without-replacement sampling yields stochastic gradients that are not independent across iterations, which makes the analysis hard.
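As a minimal sketch of the difference between the two sampling schemes, consider the following toy least-squares example (the objective, step size, and problem dimensions are illustrative choices, not taken from the talk):

```python
import numpy as np

# Toy finite-sum objective: f(x) = (1/n) * sum_i 0.5 * (a_i @ x - b_i)^2
rng = np.random.default_rng(0)
n, d = 8, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(x, i):
    """Gradient of the i-th component function."""
    return (A[i] @ x - b[i]) * A[i]

def sgd_with_replacement(x, lr=0.02, steps=8):
    # (i) Each step draws an index uniformly at random, independently
    # of all previous draws.
    for _ in range(steps):
        i = rng.integers(n)
        x = x - lr * grad_i(x, i)
    return x

def sgd_without_replacement(x, lr=0.02):
    # (ii) One epoch: visit each index exactly once, in random order.
    # Within an epoch the gradients are no longer independent,
    # which is what complicates the analysis.
    for i in rng.permutation(n):
        x = x - lr * grad_i(x, i)
    return x
```

Note that version (ii) is exactly what a standard training loop over shuffled mini-batches does: each data point is touched once per epoch.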

In this talk I will present recent progress on analyzing without-replacement SGD, focusing in particular on two key variants: RandomShuffle and SingleShuffle. I will summarize the best known convergence rates, as well as important refinements that follow under additional assumptions on the loss functions. The results presented remove drawbacks common to most previous work on this topic.
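The two variants differ only in how the data ordering is chosen across epochs. A minimal sketch, again on an illustrative least-squares objective of my own choosing (not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(x, i):
    """Gradient of the i-th component of f(x) = (1/n) * sum_i 0.5*(a_i @ x - b_i)^2."""
    return (A[i] @ x - b[i]) * A[i]

def random_shuffle_sgd(x, epochs=5, lr=0.02):
    # RandomShuffle: draw a fresh random permutation at every epoch.
    for _ in range(epochs):
        for i in rng.permutation(n):
            x = x - lr * grad_i(x, i)
    return x

def single_shuffle_sgd(x, epochs=5, lr=0.02):
    # SingleShuffle: shuffle once up front, then reuse the same
    # permutation in every subsequent epoch.
    perm = rng.permutation(n)
    for _ in range(epochs):
        for i in perm:
            x = x - lr * grad_i(x, i)
    return x
```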

Based on joint work with Chulhee Yun and Kwangjun Ahn.
