Open AccessPosted Content
Data Programming: Creating Large Training Sets, Quickly
TL;DR: A paradigm for the programmatic creation of training sets called data programming is proposed in which users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict.
read more
Abstract: Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict. We show that by explicitly representing this training set labeling process as a generative model, we can "denoise" the generated training set, and establish theoretically that we can recover the parameters of these generative models in a handful of settings. We then show how to modify a discriminative loss function to make it noise-aware, and demonstrate our method over a range of discriminative models including logistic regression and LSTMs. Experimentally, on the 2014 TAC-KBP Slot Filling challenge, we show that data programming would have led to a new winning score, and also show that applying data programming to an LSTM model leads to a TAC-KBP score almost 6 F1 points over a state-of-the-art LSTM baseline (and into second place in the competition). Additionally, in initial user studies we observed that data programming may be an easier way for non-experts to create machine learning models when training data is limited or unavailable.
read more
Chat with Paper
AI Agents for this Paper
Find similar papers on Google Scholar, PubMed and Arxiv
Write a critical review of this paper
Analyze citations of this paper to find unaddressed research gaps
Citations
Opportunities and obstacles for deep learning in biology and medicine.
Travers Ching,Daniel Himmelstein,Brett K. Beaulieu-Jones,Alexandr A. Kalinin,Brian T. Do,Gregory P. Way,Enrico Ferrero,Paul-Michael Agapow,Michael Zietz,Michael M. Hoffman,Michael M. Hoffman,Wei Xie,Gail L. Rosen,Benjamin J. Lengerich,Johnny Israeli,Jack Lanchantin,Stephen Woloszynek,Anne E. Carpenter,Avanti Shrikumar,Jinbo Xu,Evan M. Cofer,Evan M. Cofer,Christopher A. Lavender,Srinivas C. Turaga,Amr Alexandari,Zhiyong Lu,David J. Harris,Dave DeCaprio,Yanjun Qi,Anshul Kundaje,Yifan Peng,Laura K. Wiley,Marwin H. S. Segler,Simina M. Boca,S. Joshua Swamidass,Austin Huang,Anthony Gitter,Anthony Gitter,Casey S. Greene +38 more
TL;DR: It is found that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art.
2K
A State-of-the-Art Survey on Deep Learning Theory and Architectures
Zahangir Alom,Tarek M. Taha,Chris Yakopcic,Stefan Westberg,Paheding Sidike,Mst Shamima Nasrin,Mahmudul Hasan,Brian Van Essen,Abdul A. S. Awwal,Vijayan K. Asari +9 more
TL;DR: This survey presents a brief survey on the advances that have occurred in the area of Deep Learning (DL), starting with the Deep Neural Network and goes on to cover Convolutional Neural Network, Recurrent Neural Network (RNN), and Deep Reinforcement Learning (DRL).
1.4K
A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective
TL;DR: This survey performs a comprehensive study of data collection from a data management point of view, providing a research landscape of these operations, guidelines on which technique to use when, and identify interesting research challenges.
858
•Posted Content
The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches.
Md. Zahangir Alom,Tarek M. Taha,Christopher Yakopcic,Stefan Westberg,Paheding Sidike,Mst Shamima Nasrin,Brian Van Essen,Abdul A. S. Awwal,Vijayan K. Asari +8 more
TL;DR: This report presents a brief survey on development of DL approaches, including Deep Neural Network (DNN), Convolutional neural network (CNN), Recurrent Neural network (RNN) including Long Short Term Memory (LSTM) and Gated Recurrent Units (GRU), Auto-Encoder (AE), Deep Belief Network (DBN), Generative Adversarial Network (GAN), and Deep Reinforcement Learning (DRL).
722
Snorkel: Rapid Training Data Creation with Weak Supervision
TL;DR: Snorkel is a first-of-its-kind system that enables users to train state- of- the-art models without hand labeling any training data and proposes an optimizer for automating tradeoff decisions that gives up to 1.8× speedup per pipeline execution.
References
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
99K
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Ranjay Krishna,Yuke Zhu,Oliver Groth,Justin Johnson,Kenji Hata,Joshua Kravitz,Stephanie Chen,Yannis Kalantidis,Li-Jia Li,David A. Shamma,Michael S. Bernstein,Li Fei-Fei +11 more
TL;DR: The Visual Genome dataset as mentioned in this paper contains over 108k images where each image has an average of $35$35 objects, $26$26 attributes, and $21$21 pairwise relationships between objects.
Combining labeled and unlabeled data with co-training
Avrim Blum,Tom M. Mitchell +1 more
- 24 Jul 1998
TL;DR: A PAC-style analysis is provided for a problem setting motivated by the task of learning to classify web pages, in which the description of each example can be partitioned into two distinct views, to allow inexpensive unlabeled data to augment, a much smaller set of labeled examples.
6.4K
Distant supervision for relation extraction without labeled data
Mike D. Mintz,Steven Bills,Rion Snow,Dan Jurafsky +3 more
- 02 Aug 2009
TL;DR: This work investigates an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size.
Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm
A. P. Dawid,A. M. Skene +1 more
TL;DR: The EM algorithm is shown to provide a slow but sure way of obtaining maximum likelihood estimates of the parameters of interest in compiling a patient record.
1.9K
Related Papers (5)
Jeffrey Pennington,Richard Socher,Christopher D. Manning +2 more
- 01 Oct 2014
Avrim Blum,Tom M. Mitchell +1 more
- 24 Jul 1998
Sebastian Riedel,Limin Yao,Andrew McCallum +2 more
- 20 Sep 2010