TL;DR: This article showed that adaptive methods often find drastically different solutions than gradient descent or stochastic gradient descent (SGD) for simple overparameterized problems, and that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance.
Abstract: Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks Examples include AdaGrad, RMSProp, and Adam We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD) We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks
TL;DR: The architecture of the Jalapeno Adaptive Optimization System is presented, a system to support leading-edge virtual machine technology and enable ongoing research on online feedback-directed optimizations, based on a federation of threads with asynchronous communication.
Abstract: Future high-performance virtual machines will improve performance through sophisticated online feedback-directed optimizations. this paper presents the architecture of the Jalapeno Adaptive Optimization System, a system to support leading-edge virtual machine technology and enable ongoing research on online feedback-directed optimizations. We describe the extensible system architecture, based on a federation of threads with asynchronous communication. We present an implementation of the general architecture that supports adaptive multi-level optimization based purely on statistical sampling. We empirically demonstrate that this profiling technique has low overhead and can improve startup and steady-state performance, even without the presence of online feedback-directed optimizations. The paper also describes and evaluates an online feedback-directed inlining optimization based on statistical edge sampling. The system is written completely in Java, applying the described techniques not only to application code and standard libraries, but also to the virtual machine itself.
TL;DR: This paper surveys the evolution and current state of adaptive optimization technology in virtual machines and concludes that adaptive optimization has begun to mature as a widespread production-level technology.
Abstract: Virtual machines face significant performance challenges beyond those confronted by traditional static optimizers. First, portable program representations and dynamic language features, such as dynamic class loading, force the deferral of most optimizations until runtime, inducing runtime optimization overhead. Second, modular program representations preclude many forms of whole-program interprocedural optimization. Third, virtual machines incur additional costs for runtime services such as security guarantees and automatic memory management. To address these challenges, vendors have invested considerable resources into adaptive optimization systems in production virtual machines. Today, mainstream virtual machine implementations include substantial infrastructure for online monitoring and profiling, runtime compilation, and feedback-directed optimization. As a result, adaptive optimization has begun to mature as a widespread production-level technology. This paper surveys the evolution and current state of adaptive optimization technology in virtual machines.
TL;DR: A new relative bundle adjustment is derived which, instead of optimizing in a single Euclidean space, works in a metric space defined by a manifold, and it is shown experimentally that it is possible to solve for the full maximum-likelihood solution incrementally in constant time, even at loop closure.
Abstract: In this paper we describe a relative approach to simultaneous localization and mapping, based on the insight that a continuous relative representation can make the problem tractable at large scales. First, it is well known that bundle adjustment is the optimal non-linear least-squares formulation for this problem, in that its maximum-likelihood form matches the definition of the CramerâRao lower bound. Unfortunately, computing the maximum-likelihood solution is often prohibitively expensive: this is especially true during loop closures, which often necessitate adjusting all parameters in a loop. In this paper we note that it is precisely the choice of a single privileged coordinate frame that makes bundle adjustment costly, and that this expense can be avoided by adopting a completely relative approach. We derive a new relative bundle adjustment which, instead of optimizing in a single Euclidean space, works in a metric space defined by a manifold. Using an adaptive optimization strategy, we show experimentally that it is possible to solve for the full maximum-likelihood solution incrementally in constant time, even at loop closure. Our approach is, by definition, everywhere locally Euclidean, and we show that the local Euclidean estimate matches that of traditional bundle adjustment. Our system operates online in realtime using stereo data, with fast appearance-based loop closure detection. We show results on over 850,000 images that indicate the accuracy and scalability of the approach, and process over 330 GB of image data into a relative map covering 142 km of Southern England. To demonstrate a baseline sufficiency for navigation, we show that it is possible to find shortest paths in the relative maps we build, in terms of both time and distance. Query images from the web of popular landmarks around London, such as the London Eye or Trafalgar Square, are matched to the relative map to provide route planning goals.
TL;DR: This work proposes a new class of nonparametric adaptive data-driven policies for stochastic inventory control problems on the distribution-free newsvendor model with censored demands and obtains new results on the asymptotic consistency of the Kaplan-Meier estimator for discrete random variables that extend existing work in statistics.
Abstract: Using the well-known product-limit form of the Kaplan-Meier estimator from statistics, we propose a new class of nonparametric adaptive data-driven policies for stochastic inventory control problems. We focus on the distribution-free newsvendor model with censored demands. The assumption is that the demand distribution is not known and there are only sales data available. We study the theoretical performance of the new policies and show that for discrete demand distributions they converge almost surely to the set of optimal solutions. Computational experiments suggest that the new policies converge for general demand distributions, not necessarily discrete, and demonstrate that they are significantly more robust than previously known policies. As a by-product of the theoretical analysis, we obtain new results on the asymptotic consistency of the Kaplan-Meier estimator for discrete random variables that extend existing work in statistics. To the best of our knowledge, this is the first application of the Kaplan-Meier estimator within an adaptive optimization algorithm, in particular, the first application to stochastic inventory control models. We believe that this work will lead to additional applications in other domains.