Data Programming: Creating Large Training Sets, Quickly

Open Access•Posted Content

- 25 May 2016

446

TL;DR: A paradigm for the programmatic creation of training sets called data programming is proposed in which users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict.

Abstract: Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict. We show that by explicitly representing this training set labeling process as a generative model, we can "denoise" the generated training set, and establish theoretically that we can recover the parameters of these generative models in a handful of settings. We then show how to modify a discriminative loss function to make it noise-aware, and demonstrate our method over a range of discriminative models including logistic regression and LSTMs. Experimentally, on the 2014 TAC-KBP Slot Filling challenge, we show that data programming would have led to a new winning score, and also show that applying data programming to an LSTM model leads to a TAC-KBP score almost 6 F1 points over a state-of-the-art LSTM baseline (and into second place in the competition). Additionally, in initial user studies we observed that data programming may be an easier way for non-experts to create machine learning models when training data is limited or unavailable.
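The pipeline the abstract describes (labeling functions, a generative model over their noisy votes, and a noise-aware discriminative loss) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen for illustration: the toy spouse-relation task, the three labeling functions, and the fixed accuracy of 0.8 per function. In particular, the paper learns the generative model's parameters from data, whereas this sketch simply assumes the accuracies are given and that labeling functions are conditionally independent given the true label.

```python
import math

# Label convention assumed for this sketch: +1 / -1 are votes, 0 means abstain.
ABSTAIN, NEG, POS = 0, -1, 1

# Hypothetical labeling functions for a toy "does this sentence
# describe a spouse relation?" task. Each labels only a subset of inputs.
def lf_married(text):
    return POS if "married" in text else ABSTAIN

def lf_sibling(text):
    return NEG if "brother" in text or "sister" in text else ABSTAIN

def lf_wife_husband(text):
    return POS if "wife" in text or "husband" in text else ABSTAIN

LFS = [lf_married, lf_sibling, lf_wife_husband]

def apply_lfs(texts, lfs=LFS):
    """Build the label matrix: one row per example, one column per LF."""
    return [[lf(t) for lf in lfs] for t in texts]

def posterior_pos(votes, accuracies, prior=0.5):
    """P(y = +1 | votes) under a simplified generative model.

    Assumptions (not taken from the paper): LFs are conditionally
    independent given y, each non-abstaining vote is correct with the
    given accuracy, and abstaining does not depend on y.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for vote, acc in zip(votes, accuracies):
        if vote != ABSTAIN:
            log_odds += vote * math.log(acc / (1.0 - acc))
    return 1.0 / (1.0 + math.exp(-log_odds))

def noise_aware_logistic_loss(score, p_pos):
    """Expected logistic loss of a model score under the soft label p_pos."""
    loss_if_pos = math.log1p(math.exp(-score))
    loss_if_neg = math.log1p(math.exp(score))
    return p_pos * loss_if_pos + (1.0 - p_pos) * loss_if_neg

texts = [
    "they married in 1990",
    "her brother and she",
    "his wife, who married him",
    "nothing here",
]
label_matrix = apply_lfs(texts)
assumed_accuracies = [0.8, 0.8, 0.8]  # fixed by assumption, not learned
soft_labels = [posterior_pos(row, assumed_accuracies) for row in label_matrix]
```

The soft labels replace hard labels when training the discriminative model: an example on which two labeling functions agree receives a posterior closer to 1 than an example with a single vote, and an example on which every function abstains stays at the prior and contributes no preference for either class.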

Citations

•Journal Article•10.1098/RSIF.2017.0387

Opportunities and obstacles for deep learning in biology and medicine.

Travers Ching, +38 more

- 01 Apr 2018

- Journal of the Royal Society Interface

TL;DR: It is found that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art.

2K

•Journal Article•10.3390/ELECTRONICS8030292

A State-of-the-Art Survey on Deep Learning Theory and Architectures

Zahangir Alom, +9 more

- 05 Mar 2019

- Electronics

TL;DR: This survey briefly reviews advances in Deep Learning (DL), starting with the Deep Neural Network and going on to cover the Convolutional Neural Network, Recurrent Neural Network (RNN), and Deep Reinforcement Learning (DRL).

1.4K

•Journal Article•10.1109/TKDE.2019.2946162

A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective

Yuji Roh, +2 more

- 01 Apr 2021

- IEEE Transactions on Knowledge and Data ...

TL;DR: This survey performs a comprehensive study of data collection from a data management point of view, providing a research landscape of these operations and guidelines on which technique to use when, and identifying interesting research challenges.

858

•Posted Content

The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches.

Md. Zahangir Alom, +8 more

- 03 Mar 2018

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This report presents a brief survey on development of DL approaches, including Deep Neural Network (DNN), Convolutional neural network (CNN), Recurrent Neural network (RNN) including Long Short Term Memory (LSTM) and Gated Recurrent Units (GRU), Auto-Encoder (AE), Deep Belief Network (DBN), Generative Adversarial Network (GAN), and Deep Reinforcement Learning (DRL).

722

•Journal Article•10.14778/3157794.3157797

Snorkel: Rapid Training Data Creation with Weak Supervision

Alexander Ratner, +5 more

- 28 Nov 2017

- arXiv: Learning

TL;DR: Snorkel is a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data and proposes an optimizer for automating tradeoff decisions that gives up to 1.8× speedup per pipeline execution.

623

References

•Journal Article•10.1162/NECO.1997.9.8.1735

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997

- Neural Computation

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

99K

•Journal Article•10.1007/S11263-016-0981-7

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Ranjay Krishna, +11 more

- 01 May 2017

- International Journal of Computer Vision

TL;DR: The Visual Genome dataset contains over 108k images, where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects.

6.6K

•Proceedings Article•10.1145/279943.279962

Combining labeled and unlabeled data with co-training

Avrim Blum, +1 more

- 24 Jul 1998

TL;DR: A PAC-style analysis is provided for a problem setting motivated by the task of learning to classify web pages, in which the description of each example can be partitioned into two distinct views, allowing inexpensive unlabeled data to augment a much smaller set of labeled examples.

6.4K

•Proceedings Article•10.3115/1690219.1690287

Distant supervision for relation extraction without labeled data

Mike D. Mintz, +3 more

- 02 Aug 2009

TL;DR: This work investigates an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size.

3.6K

•Journal Article•10.2307/2346806

Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm

A. P. Dawid, +1 more

- 01 Mar 1979

- Journal of The Royal Statistical Society...

TL;DR: The EM algorithm is shown to provide a slow but sure way of obtaining maximum likelihood estimates of the parameters of interest in compiling a patient record.

1.9K

Related Papers (5)

Distant supervision for relation extraction without labeled data

Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm

Glove: Global Vectors for Word Representation

Combining labeled and unlabeled data with co-training

Modeling relations and their mentions without labeled text