About: Regularization (mathematics) is a research topic. Over the lifetime, 14674 publications have been published within this topic receiving 333252 citations. The topic is also known as: desensitization.
TL;DR: In comparative timings, the new algorithms are considerably faster than competing methods and can handle large problems and can also deal efficiently with sparse features.
Abstract: We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include l(1) (the lasso), l(2) (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.
TL;DR: This work proposes a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function, and provides empirical evidence that this modification substantially improves Adam's generalization performance.
Abstract: L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is \emph{not} the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ L$_2$ regularization (often calling it "weight decay" in what may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by \emph{decoupling} the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is available at this https URL
TL;DR: In this paper, the authors considered the problem of recovering a vector x ∈ R^m from incomplete and contaminated observations y = Ax ∈ e + e, where e is an error term.
Abstract: Suppose we wish to recover a vector x_0 Є R^m (e.g., a digital signal or image) from incomplete and contaminated observations y = Ax_0 + e; A is an n by m matrix with far fewer rows than columns (n « m) and e is an error term. Is it possible to recover x_0 accurately based on the data y?
To recover x_0, we consider the solution x^# to the l_(1-)regularization problem min ‖x‖l_1 subject to ‖Ax - y‖l(2) ≤ Є, where Є is the size of the error term e. We show that if A obeys a uniform uncertainty principle (with unit-normed columns) and if the vector x_0 is sufficiently sparse, then the solution is within the noise level ‖x^# - x_0‖l_2 ≤ C Є. As a first example, suppose that A is a Gaussian random matrix; then stable recovery occurs for almost all such A's provided that the number of nonzeros of x_0 is of about the same order as the number of observations. As a second instance, suppose one observes few Fourier samples of x_0; then stable recovery occurs for almost any set of n coefficients provided that the number of nonzeros is of the order of n/[log m]^6. In the case where the error term vanishes, the recovery is of course exact, and this work actually provides novel insights into the exact recovery phenomenon discussed in earlier papers. The methodology also explains why one can also very nearly recover approximately sparse signals.
TL;DR: Recently, this paper proposed a decoupled weight decay regularization that decouples the optimal weight decay factor from the setting of the learning rate for both standard SGD and Adam and substantially improves Adam's generalization performance.
Abstract: L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is \emph{not} the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ L$_2$ regularization (often calling it "weight decay" in what may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by \emph{decoupling} the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is available at this https URL
Abstract: Chapters 2–7 make up Part II of the book: artificial neural networks. After introducing the basic concepts of neurons and artificial neuron learning rules in Chapter 2, Chapter 3 describes a particular formalism, based on signal-plus-noise, for the learning problem in general. After presenting the basic neural network types this chapter reviews the principal algorithms for error function minimization/optimization and shows how these learning issues are addressed in various supervised models. Chapter 4 deals with issues in unsupervised learning networks, such as the Hebbian learning rule, principal component learning, and learning vector quantization. Various techniques and learning paradigms are covered in Chapters 3–6, and especially the properties and relative merits of the multilayer perceptron networks, radial basis function networks, self-organizing feature maps and reinforcement learning are discussed in the respective four chapters. Chapter 7 presents an in-depth examination of performance issues in supervised learning, such as accuracy, complexity, convergence, weight initialization, architecture selection, and active learning. Par III (Chapters 8–15) offers an extensive presentation of techniques and issues in evolutionary computing. Besides the introduction to the basic concepts in evolutionary computing, it elaborates on the more important and most frequently used techniques on evolutionary computing paradigm, such as genetic algorithms, genetic programming, evolutionary programming, evolutionary strategies, differential evolution, cultural evolution, and co-evolution, including design aspects, representation, operators and performance issues of each paradigm. The differences between evolutionary computing and classical optimization are also explained. Part IV (Chapters 16 and 17) introduces swarm intelligence. It provides a representative selection of recent literature on swarm intelligence in a coherent and readable form. It illustrates the similarities and differences between swarm optimization and evolutionary computing. Both particle swarm optimization and ant colonies optimization are discussed in the two chapters, which serve as a guide to bringing together existing work to enlighten the readers, and to lay a foundation for any further studies. Part V (Chapters 18–21) presents fuzzy systems, with topics ranging from fuzzy sets, fuzzy inference systems, fuzzy controllers, to rough sets. The basic terminology, underlying motivation and key mathematical models used in the field are covered to illustrate how these mathematical tools can be used to handle vagueness and uncertainty. This book is clearly written and it brings together the latest concepts in computational intelligence in a friendly and complete format for undergraduate/postgraduate students as well as professionals new to the field. With about 250 pages covering such a wide variety of topics, it would be impossible to handle everything at a great length. Nonetheless, this book is an excellent choice for readers who wish to familiarize themselves with computational intelligence techniques or for an overview/introductory course in the field of computational intelligence. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond—Bernhard Schölkopf and Alexander Smola, (MIT Press, Cambridge, MA, 2002, ISBN 0-262-19475-9). Reviewed by Amir F. Atiya.