The most widely used optimization method in machine learning practice is the Perceptron Algorithm, also known as the Stochastic Gradient Method (SGM). This method has been used since the 1950s to build statistical estimators, iteratively improving models by correcting errors observed on single data points. SGM is not only scalable, robust, and simple to implement, but also achieves state-of-the-art performance in many different domains. Today, SGM powers enterprise analytics systems and is the workhorse tool for training complex pattern-recognition systems in speech and vision.
In this talk, I will explore why SGM has had such staying power, focusing on notions of stability and robustness. I will first discuss how SGM is robust to perturbations of the model and the updates. From a computing systems perspective, this robustness enables parallel implementations with minimal communication, no locking or synchronization, and strong spatial locality. I will then show how SGM is robust to perturbations of the data itself, and prove that any model trained with SGM in a reasonable amount of time attains small generalization error. Finally, I will offer a new interpretation of common practices in neural networks and provide a formal rationale for many popular techniques in training large, deep models.
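
For readers unfamiliar with the method, the idea of "correcting errors observed on single data points" can be illustrated with a minimal perceptron sketch. This is not taken from the talk: the function names `perceptron_sgm` and `predict`, the toy data, and the choices of step size, epoch count, and seed are all assumptions made for illustration.

```python
import random


def perceptron_sgm(points, labels, epochs=20, step=1.0, seed=0):
    """Fit a linear classifier using one-example-at-a-time perceptron updates.

    Labels are assumed to be +1 or -1. All parameter defaults are
    illustrative choices, not values from the talk.
    """
    rng = random.Random(seed)
    w = [0.0] * len(points[0])  # weight vector
    b = 0.0                     # bias term
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)      # visit the examples in random order
        mistakes = 0
        for i in order:
            x, y = points[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin <= 0:
                # The current model misclassifies this single point:
                # nudge the model toward classifying it correctly.
                w = [wj + step * y * xj for wj, xj in zip(w, x)]
                b += step * y
                mistakes += 1
        if mistakes == 0:
            break               # a full pass with no errors: stop early
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1


# Hypothetical, linearly separable toy data.
points = [(2.0, 1.0), (1.0, 3.0), (3.0, 2.5),
          (-1.0, -2.0), (-2.5, -1.0), (-3.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]

w, b = perceptron_sgm(points, labels)
print([predict(w, b, x) for x in points])
```

Each update reads only one example and the current model, which is part of what makes the method simple to implement and to scale.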