Journal Article · DOI: 10.1109/72.279181
Learning long-term dependencies with gradient descent is difficult
TL;DR: This work shows why gradient-based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.
Abstract: Recurrent neural networks can be used to map input sequences to output sequences, such as for recognition, production or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient-based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These results expose a trade-off between efficient learning by gradient descent and latching on information for long periods. Based on an understanding of this problem, alternatives to standard gradient descent are considered.
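To make the abstract's claim concrete, here is a minimal numerical sketch (an illustration, not the paper's experiments; the hidden size, weight scales, and horizons are assumptions). In a simple tanh RNN, the gradient of a late hidden state with respect to an early one is a product of per-step Jacobians, so its norm tends to shrink or blow up geometrically as the temporal gap T grows:

```python
# Hedged sketch of vanishing/exploding gradients through time in a tanh RNN.
# All sizes and scales below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n = 16  # hidden size (arbitrary)

def grad_norm_through_time(W, T):
    """Norm of d h_T / d h_0 for h_t = tanh(W h_{t-1}): a product of T Jacobians."""
    h = 0.1 * rng.standard_normal(n)
    J = np.eye(n)
    for _ in range(T):
        h = np.tanh(W @ h)
        J = np.diag(1.0 - h ** 2) @ W @ J  # one-step Jacobian: diag(1 - h_t^2) @ W
    return np.linalg.norm(J)

for scale in (0.5, 1.0, 1.5):  # rough spectral scale of the recurrent weights
    W = scale * rng.standard_normal((n, n)) / np.sqrt(n)
    norms = [f"{grad_norm_through_time(W, T):.2e}" for T in (1, 10, 50, 100)]
    print(f"scale={scale}: {norms}")
```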
Citations
Evaluation of statistical and machine learning models for time series prediction: identifying the state-of-the-art and the best conditions for the use of each model
TL;DR: Presents one of the most extensive, impartial and comprehensive experimental evaluations ever done in the time series prediction field, showing that SARIMA is the only statistical method able to outperform the machine learning algorithms ANN, SVM, and kNN-TSPI, though without a statistically significant difference.
Cited by 263
Automated Depression Detection Using Deep Representation and Sequence Learning with EEG Signals
Betul Ay, Ozal Yildirim, Muhammed Talo, Ulas Baran Baloglu, Galip Aydin, Subha D. Puthankattil, U. Rajendra Acharya
TL;DR: Proposes a deep hybrid model, built from convolutional neural network (CNN) and long short-term memory (LSTM) architectures, to detect depression from EEG signals; the model could be employed in hospital psychiatry wards to detect depression accurately and thus aid psychiatrists.
Cited by 263
Proceedings Article
Full-capacity unitary recurrent neural networks
Scott Wisdom, Thomas Powers, John R. Hershey, Jonathan Le Roux, Les Atlas
- 05 Dec 2016
TL;DR: This work provides a theoretical argument to determine if a unitary parameterization has restricted capacity, and shows how a complete, full-capacity unitary recurrence matrix can be optimized over the differentiable manifold of unitary matrices.
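As a hedged sketch of the idea this TL;DR refers to (an illustration, not the optimization procedure of the paper): a unitary matrix has all singular values equal to 1, so repeatedly multiplying a backpropagated gradient by its conjugate transpose leaves the gradient's norm unchanged, avoiding the vanishing/exploding behavior of a generic recurrence. Sampling the unitary matrix via QR of a complex Gaussian is an assumed, standard construction:

```python
# Hedged illustration: a unitary recurrence preserves gradient norms exactly.
import numpy as np

rng = np.random.default_rng(1)
n = 32
Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
U, _ = np.linalg.qr(Z)  # the QR factor of a complex Gaussian matrix is unitary

g = rng.standard_normal(n) + 1j * rng.standard_normal(n)  # a backpropagated gradient
print(f"||g|| before: {np.linalg.norm(g):.6f}")
for _ in range(1000):        # push the gradient back through 1000 linear steps
    g = U.conj().T @ g       # each step multiplies by U^H, which is norm-preserving
print(f"||g|| after:  {np.linalg.norm(g):.6f}")  # identical up to float error
```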
An Experimental Review on Deep Learning Architectures for Time Series Forecasting
TL;DR: Among all studied models, long short-term memory networks and convolutional networks are the best alternatives: LSTMs obtain the most accurate forecasts, while CNNs achieve comparable performance with less variability across parameter configurations and greater efficiency.
Cited by 261
Posted Content
Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition
Xiangang Li, Xihong Wu
TL;DR: Alternative deep LSTM architectures are proposed and empirically evaluated on a large vocabulary conversational telephone speech recognition task; experimental results demonstrate that the deep LSTM networks benefit from depth and yield state-of-the-art performance on this task.
References
Optimization by Simulated Annealing
TL;DR: There is a deep and useful connection between statistical mechanics and multivariate or combinatorial optimization (finding the minimum of a given function depending on many parameters), and a detailed analogy with annealing in solids provides a framework for optimization of very large and complex systems.
Cited by 46.9K
Book
Learning internal representations by error propagation
David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams
- 03 Jan 1986
TL;DR: This chapter contains sections titled: The Problem, The Generalized Delta Rule, Simulation Results, Some Further Generalizations, Conclusion.
Cited by 16K
A learning algorithm for continually running fully recurrent neural networks
Ronald J. Williams, David Zipser
TL;DR: The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks.
Cited by 5K
Minimizing multimodal functions of continuous variables with the “simulated annealing” algorithm
TL;DR: A new global optimization algorithm for functions of continuous variables is presented, derived from the “Simulated Annealing” algorithm recently introduced in combinatorial optimization, which is quite costly in terms of function evaluations, but its cost can be predicted in advance, depending only slightly on the starting point.
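The two simulated annealing references above describe the method in prose; as a hedged illustration of the accept/cool loop they share (not the algorithm of either paper; the objective, step size, and cooling schedule below are arbitrary assumptions), a minimal continuous-variable version looks like this:

```python
# Minimal simulated annealing sketch: accept uphill moves with probability
# exp(-delta/T) and cool T geometrically. Toy objective and schedule assumed.
import math
import random

random.seed(0)

def f(x):
    return x * x + 10.0 * math.sin(3.0 * x)  # a multimodal 1-D test function

x, T = 4.0, 5.0
best = (f(x), x)
while T > 1e-3:
    cand = x + random.gauss(0.0, 0.5)         # local random move
    delta = f(cand) - f(x)
    if delta < 0 or random.random() < math.exp(-delta / T):
        x = cand                              # accept downhill, and some uphill, moves
        best = min(best, (f(x), x))
    T *= 0.999                                # geometric cooling
print("approx. minimum f(x*)=%.3f at x*=%.3f" % best)
```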
Related Papers (5)
Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
- 27 Jun 2016
[...]
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
- 01 Jan 2015