Sample size selection in optimization methods for machine learning

Question

1. What are the contributions mentioned in the paper "Sample size selection in optimization methods for machine learning" ?

2. What is the premise of their dynamic sample size algorithm?

3. What is the subspace phase of the algorithm?

4. What is the role of the subspace phase in the Newton algorithm?

Accepted Answer

This paper presents a methodology for using varying sample sizes in batch-type optimization methods for large-scale machine learning problems. The first part of the paper deals with the delicate issue of dynamic sample selection in the evaluation of the function and gradient. The authors propose a criterion for increasing the sample size based on variance estimates obtained during the computation of a batch gradient. The second part of the paper describes a practical Newton method that uses a smaller sample to compute Hessian-vector products than to evaluate the function and the gradient, and that also employs a dynamic sampling technique. The focus shifts in the third part of the paper to L1-regularized problems designed to produce sparse solutions. The authors propose a Newton-like method that consists of two phases: a (minimalistic) gradient projection phase that identifies zero variables, and a subspace phase that applies a subsampled Hessian Newton iteration in the free variables.
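The variance-based criterion can be illustrated with a minimal sketch. It assumes the test compares the componentwise sample variance of the per-example gradients, divided by the sample size, against θ² times the squared norm of the batch gradient. The function name `variance_test`, the default `theta=0.5`, and the rule for the suggested size are illustrative choices, not taken from the answer above.

```python
import numpy as np

def variance_test(per_example_grads, theta=0.5):
    """Check whether the current sample is large enough (illustrative sketch).

    per_example_grads: array of shape (n_samples, n_params), one gradient per example.
    Returns (passes, suggested_size).
    """
    n = per_example_grads.shape[0]
    g = per_example_grads.mean(axis=0)               # batch gradient (sample mean)
    var = per_example_grads.var(axis=0, ddof=1)      # componentwise sample variance
    total_var = var.sum()                            # 1-norm of the variance vector
    lhs = total_var / n                              # estimated variance of the batch gradient
    rhs = theta ** 2 * float(g @ g)
    passes = bool(lhs <= rhs)
    if passes or rhs == 0.0:
        # Either the sample is adequate, or the formula gives no finite size
        # (zero gradient); the caller has to decide what to do in that case.
        suggested = n
    else:
        # Size at which the test would hold if variance and gradient stayed unchanged.
        suggested = int(np.ceil(total_var / rhs))
    return passes, suggested

noisy = np.array([[2.0, 0.0], [0.0, 0.0], [2.0, 0.0], [0.0, 0.0]])
print(variance_test(noisy))
```

When the test fails, the sketch proposes the smallest sample size that would satisfy the inequality under the assumption that the variance and gradient estimates do not change with the new sample.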

Accepted Answer

The premise of their dynamic sample size algorithm is to efficiently generate directions that decrease the target objective function, sufficiently often.

Accepted Answer

The subspace phase minimizes a quadratic model of the objective function over the variables that are not active, to determine a direction along which progress in the objective can be made.
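A rough sketch of such a subspace step follows, under several assumptions: the free (non-active) variables are taken to be the nonzero entries of `w`, the quadratic model is minimized approximately with a few conjugate gradient (CG) iterations, and the Hessian is available only through a hypothetical `hess_vec` callable. The names `subspace_step`, `grad_F`, and `hess_vec` are illustrative and do not come from the paper.

```python
import numpy as np

def subspace_step(w, grad_F, hess_vec, max_cg=10, tol=1e-10):
    """Approximately minimize a quadratic model over the free variables with CG.

    w        : current iterate (float array); zero entries are treated as active.
    grad_F   : gradient of the locally smooth objective at w.
    hess_vec : callable v -> H v on the full space (assumed helper).
    Returns a direction d that is zero on the active variables.
    """
    free = w != 0                      # assumption: free set = nonzero variables
    d = np.zeros_like(w)
    r = -grad_F * free                 # residual restricted to the free variables
    p = r.copy()
    rs = float(r @ r)
    for _ in range(max_cg):
        if rs <= tol:
            break
        Hp = hess_vec(p) * free        # Hessian-vector product restricted to free variables
        curv = float(p @ Hp)
        if curv <= 0:                  # stop on non-positive curvature
            break
        alpha = rs / curv
        d += alpha * p
        r -= alpha * Hp
        rs_new = float(r @ r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

H = np.diag([2.0, 4.0, 8.0])
w = np.array([1.0, 0.0, -1.0])
g = np.array([2.0, 5.0, -8.0])
print(subspace_step(w, g, lambda v: H @ v))
```

Masking with `free` keeps every CG iterate inside the subspace, so variables fixed at zero by the first phase stay at zero along the returned direction.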

Accepted Answer

The subspace phase plays the dual role of accelerating convergence toward the solution while promoting the fast generation of sparse solutions.

Accepted Answer

If the authors condition on w_k, the only random quantity in this expression is g_k, which is the sample mean of the gradients ∇ℓ(w; i), for i ∈ S_k.

Accepted Answer

The authors have omitted the term (1 − θ) used in (4.7) because it has no effect on the expected value analysis presented in this section (i.e., shortening the steplength by the factor (1 − θ) is only needed for the deterministic Theorem 6.1 that imposed (4.8) directly).

Accepted Answer

In [18], Homem-de-Mello presents a variable sample size method that utilizes only function values, within the framework of a pure random search method.

Accepted Answer

It is common in practice to add a regularization term of the form (1/2) γ wᵀw to the objective (2.2), which guarantees uniform convexity.
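The effect of this term on convexity can be checked numerically. The sketch below uses a binary logistic loss as a stand-in for objective (2.2), which is an assumption; the function name `regularized_logistic` and the toy data are illustrative. Adding (1/2) γ wᵀw adds γI to the Hessian, so its smallest eigenvalue is at least γ even when the data matrix is rank deficient.

```python
import numpy as np

def regularized_logistic(w, X, y, gamma):
    """Mean logistic loss plus (1/2)*gamma*w^T w; returns (value, gradient, Hessian).

    X: (n, d) features, y: labels in {0, 1}, gamma: regularization weight.
    """
    n, d = X.shape
    z = X @ w
    p = 1.0 / (1.0 + np.exp(-z))                       # predicted probabilities
    value = np.mean(np.logaddexp(0.0, z) - y * z) + 0.5 * gamma * (w @ w)
    grad = X.T @ (p - y) / n + gamma * w
    hess = (X.T * (p * (1.0 - p))) @ X / n + gamma * np.eye(d)
    return value, grad, hess

# Two identical columns make the unregularized Hessian singular.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)
for gamma in (0.0, 0.1):
    _, _, hess = regularized_logistic(w, X, y, gamma)
    print(gamma, np.linalg.eigvalsh(hess).min())
```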

Accepted Answer

These methods must, however, perform a vast number of iterations before an appreciable improvement in the objective is obtained, and due to the sequential nature of these iterations, it can be difficult to parallelize them; see [23, 1, 10] and the references therein.

Accepted Answer

The CG algorithm effectively computes the product ∇²J_{H_k}(w_k) d at every iteration, but computing the term ∇²J_{S_k}(w_k) d would defeat the purpose of the Hessian sub-sampling approach.
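To make the cost argument concrete, here is a minimal sketch of a subsampled Hessian-vector product. It assumes a binary logistic loss (not stated in the answer above) and a hypothetical index array `idx` playing the role of the Hessian sample H_k. The product is formed from two matrix-vector multiplications over the subsample, so its cost scales with the size of `idx` rather than with the full gradient sample S_k, and the Hessian matrix is never built.

```python
import numpy as np

def subsampled_hess_vec(w, X, idx, v):
    """Hessian-vector product of the mean logistic loss over the rows in idx.

    w: parameters, X: (n, d) features, idx: indices of the Hessian subsample,
    v: vector to multiply. The d-by-d Hessian is never formed explicitly.
    """
    Xh = X[idx]                                   # rows of the Hessian subsample
    p = 1.0 / (1.0 + np.exp(-(Xh @ w)))           # predicted probabilities
    weights = p * (1.0 - p)                       # per-example curvature
    return Xh.T @ (weights * (Xh @ v)) / len(idx)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w = rng.normal(size=5)
v = rng.normal(size=5)
idx = np.arange(10)                               # small subsample used for curvature
print(subsampled_hess_vec(w, X, idx, v))
```

A CG loop would call this function once per iteration with the current search direction in place of `v`.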

Accepted Answer

The alignment is motivated by the need to achieve global convergence properties, but their computational experience indicates that it slows down the iteration.

Accepted Answer

Sample selection also plays a crucial role in the incorporation of curvature information in a Hessian-free Newton method for machine learning [7, 21].

Accepted Answer

To illustrate the benefits of dynamic sampling, the authors report results of Algorithm 5.2 on the supervised learning speech recognition problem described in [7], which was modeled using a multi-class logistic function.

Accepted Answer

The strategy for increasing the sample is quite different from the one presented in this paper; it relies on function values and a system of sample size updates analogous to update rules for trust region sizes.

Most frequently asked questions

1. What are the contributions mentioned in the paper "Sample size selection in optimization methods for machine learning" ?

2. What is the premise of their dynamic sample size algorithm?

3. What is the subspace phase of the algorithm?

4. What is the role of the subspace phase in the Newton algorithm?

5. What is the sample mean of the gradients ∇ℓ(w; i)?

6. Why did the authors omit the term in (4.7)?

7. What is the main premise of the dynamic sample size algorithm?

8. What is the common practice of adding a regularization term to the objective?

9. Why do the methods perform a vast number of iterations before an appreciable improvement?

10. What is the purpose of the Hessian sub-sampling approach?

11. Why is the alignment motivated by the need to achieve global convergence properties?

12. What is the role of sample selection in the Hessian-free Newton method for machine learning?

13. What is the supervised learning speech recognition problem?

14. What is the strategy for increasing the sample?

Citations

TensorFlow: A system for large-scale machine learning

Optimization Methods for Large-Scale Machine Learning

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

A proximal stochastic gradient method with progressive variance reduction

References

A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems

A Stochastic Approximation Method

De-noising by soft-thresholding

Gradient Projection for Sparse Reconstruction: Application to Compressed Sensing and Other Inverse Problems

Acceleration of stochastic approximation by averaging
