Open AccessPosted Content
On Lazy Training in Differentiable Programming.
219
TL;DR: In this article, the authors show that the lazy training phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.
read more
Abstract: In a series of recent theoretical works, it was shown that strongly over-parameterized neural networks trained with gradient-based methods could converge exponentially fast to zero training loss, with their parameters hardly varying. In this work, we show that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. Through a theoretical analysis, we exhibit various situations where this phenomenon arises in non-convex optimization and we provide bounds on the distance between the lazy and linearized optimization paths. Our numerical experiments bring a critical note, as we observe that the performance of commonly used non-linear deep convolutional neural networks in computer vision degrades when trained in the lazy regime. This makes it unlikely that "lazy training" is behind the many successes of neural networks in difficult high dimensional tasks.
read more
Chat with Paper
AI Agents for this Paper
Find similar papers on Google Scholar, PubMed and Arxiv
Write a critical review of this paper
Analyze citations of this paper to find unaddressed research gaps
Citations
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
Jaehoon Lee,Lechao Xiao,Samuel S. Schoenholz,Yasaman Bahri,Roman Novak,Jascha Sohl-Dickstein,Jeffrey Pennington +6 more
TL;DR: In this article, the authors show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
1K
•Posted Content
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
TL;DR: This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neuralNet with random labels leads to slower training, and a data-dependent complexity measure.
700
•Posted Content
Surprises in High-Dimensional Ridgeless Least Squares Interpolation.
TL;DR: This paper recovers---in a precise quantitative way---several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.
670
•Posted Content
Underspecification Presents Challenges for Credibility in Modern Machine Learning
Alexander D'Amour,Katherine Heller,Dan Moldovan,Ben Adlam,Babak Alipanahi,Alex Beutel,Christina Chen,Jonathan Deaton,Jacob Eisenstein,Matthew D. Hoffman,Farhad Hormozdiari,Neil Houlsby,Shaobo Hou,Ghassen Jerfel,Alan Karthikesalingam,Mario Lucic,Yi-An Ma,Cory Y. McLean,Diana Mincu,Akinori Mitani,Andrea Montanari,Zachary Nado,Vivek T. Natarajan,Christopher Nielson,Thomas F. Osborne,Rajiv Raman,Kim Ramasamy,Rory Sayres,Jessica Schrouff,Martin G. Seneviratne,Shannon Sequeira,Harini Suresh,Victor Veitch,Max Vladymyrov,Xuezhi Wang,Kellie Webster,Steve Yadlowsky,Taedong Yun,Xiaohua Zhai,D. Sculley +39 more
TL;DR: This work shows the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain, and shows that this problem appears in a wide variety of practical ML pipelines.
659
References
Deep Residual Learning for Image Recognition
Kaiming He,Xiangyu Zhang,Shaoqing Ren,Jian Sun +3 more
- 27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
•Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan,Andrew Zisserman +1 more
- 04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
102.6K
•Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan,Andrew Zisserman +1 more
- 01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
51.9K
•Dissertation
Learning Multiple Layers of Features from Tiny Images
Alex Krizhevsky
- 01 Jan 2009
TL;DR: In this paper, the authors describe how to train a multi-layer generative model of natural images, using a dataset of millions of tiny colour images, described in the next section.
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Kaiming He,Xiangyu Zhang,Shaoqing Ren,Jian Sun +3 more
- 07 Dec 2015
TL;DR: In this paper, a Parametric Rectified Linear Unit (PReLU) was proposed to improve model fitting with nearly zero extra computational cost and little overfitting risk, which achieved a 4.94% top-5 test error on ImageNet 2012 classification dataset.