TL;DR: An algorithm for improving the accuracy of algorithms for learning binary concepts by combining a large number of hypotheses, each of which is generated by training the given learning algorithm on a different set of examples, is presented.
Abstract: We present an algorithm for improving the accuracy of algorithms for learning binary concepts. The improvement is achieved by combining a large number of hypotheses, each of which is generated by training the given learning algorithm on a different set of examples. Our algorithm is based on ideas presented by Schapire and represents an improvement over his results. The analysis of our algorithm provides general upper bounds on the resources required for learning in Valiant's polynomial PAC learning framework, which are the best general upper bounds known today. We show that the number of hypotheses that are combined by our algorithm is the smallest number possible. Other outcomes of our analysis are results regarding the representational power of threshold circuits, the relation between learnability and compression, and a method for parallelizing PAC learning algorithms. We provide extensions of our algorithms to cases in which the concepts are not binary and to the case where the accuracy of the learning algorithm depends on the distribution of the instances.
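The idea of combining many hypotheses, each trained on a different set of examples, can be illustrated with a minimal sketch. It assumes a hypothetical weak learner (`train_stump`, a one-dimensional threshold rule) and a plain unweighted majority vote over random subsets; it does not reproduce the example-selection scheme or the bounds analyzed in the paper.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]           # (instance, binary label in {0, 1})
Hypothesis = Callable[[float], int]


def train_stump(sample: Sequence[Example]) -> Hypothesis:
    """Hypothetical weak learner: the threshold rule 'x >= t' with fewest sample errors."""
    best_t, best_err = 0.0, len(sample) + 1
    for t, _ in sample:
        err = sum(int(x >= t) != y for x, y in sample)
        if err < best_err:
            best_t, best_err = t, err
    return lambda x, t=best_t: int(x >= t)


def majority_combine(hypotheses: List[Hypothesis]) -> Hypothesis:
    """Combine binary hypotheses by an unweighted majority vote (ties go to 0)."""
    def combined(x: float) -> int:
        votes = Counter(h(x) for h in hypotheses)
        return int(votes[1] > votes[0])
    return combined


def train_ensemble(data: List[Example], rounds: int, subset_size: int,
                   seed: int = 0) -> Hypothesis:
    """Train one hypothesis per random subset of the data, then take the majority."""
    rng = random.Random(seed)
    hypotheses = [train_stump(rng.sample(data, subset_size)) for _ in range(rounds)]
    return majority_combine(hypotheses)
```

Using an odd number of rounds avoids ties in the vote.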
TL;DR: It is proved that the algorithm proposed can efficiently learn distributions generated by the subclass of APFAs it considers, and it is shown that the KL-divergence between the distribution generated by the target source and the distribution generated by the authors' hypothesis can be made arbitrarily small with high confidence in polynomial time.
Abstract: We propose and analyze a distribution learning algorithm for a subclass of acyclic probabilistic finite automata (APFA). This subclass is characterized by a certain distinguishability property of the automata's states. Though hardness results are known for learning distributions generated by general APFAs, we prove that our algorithm can efficiently learn distributions generated by the subclass of APFAs we consider. In particular, we show that the KL-divergence between the distribution generated by the target source and the distribution generated by our hypothesis can be made arbitrarily small with high confidence in polynomial time. We present two applications of our algorithm. In the first, we show how to model cursively written letters. The resulting models are part of a complete cursive handwriting recognition system. In the second application we demonstrate how APFAs can be used to build multiple-pronunciation models for spoken words. We evaluate the APFA-based pronunciation models on labeled speech data. The good performance (in terms of the log-likelihood obtained on test data) achieved by the APFAs and the little time needed for learning suggest that the learning algorithm of APFAs might be a powerful alternative to commonly used probabilistic models.
TL;DR: A general scheme for extending the VC-dimension to the case n > 1 is presented, which defines a wide variety of notions of dimension in which all these variants of the VC-dimension, previously introduced in the context of learning, appear as special cases.
TL;DR: In this article, the authors investigate the learnability of grammars in Optimality Theory and discuss the special nature of the learning problem in that theory, and present a simple and efficient algorithm for solving this problem, assuming a given set of hypothesized underlying forms.
Abstract: If Optimality Theory (Prince & Smolensky 1991, 1993) is correct, Universal Grammar provides a set of universal constraints which are highly general, inherently conflicting, and consequently rampantly violated in the surface forms of languages. A language’s grammar ranks the universal constraints in a dominance hierarchy, higher-ranked constraints taking absolute priority over lower-ranked constraints, so that violations of a constraint occur in well-formed structures when, and only when, they are necessary to prevent violation of higher-ranked constraints. Languages differ principally in how they rank the universal constraints in their language-specific dominance hierarchies. The surface forms of a given language are structural descriptions of inputs which are optimal in the following sense: they satisfy the universal constraints, or, when these constraints are brought into conflict by an input, they satisfy the highest-ranked constraints possible. This notion of optimality is partly language-specific, since the ranking of constraints is language-particular, and partly universal, since the constraints which evaluate well-formedness are (at least to a considerable extent) universal. In many respects, ranking of universal constraints in Optimality Theory plays a role analogous to parameter-setting in principles-and-parameters theory. Evidence in favor of this Optimality-Theoretic characterization of Universal Grammar is provided elsewhere; most work to date addresses phonology: see Prince & Smolensky 1993 (henceforth, ‘P&S’) and the several dozen works cited therein, notably McCarthy & Prince 1993; initial work addressing syntax includes Grimshaw 1993 and Legendre, Raymond & Smolensky 1993. Here, we investigate the learnability of grammars in Optimality Theory. 
Under the assumption of innate knowledge of the universal constraints, the primary task of the learner is the determination of the dominance ranking of these constraints which is particular to the target language. We will present a simple and efficient algorithm for solving this problem, assuming a given set of hypothesized underlying forms. (Concerning the problem of acquiring underlying forms, see the discussion of ‘optimality in the lexicon’ in P&S 1993:§9). The fact that surface forms are optimal means that every positive example entails a great number of implicit negative examples: for a given input, every candidate output other than the correct form is ill-formed. As a consequence, even a single positive example can greatly constrain the possible grammars for a target language, as we will see explicitly. In §1 we present the relevant principles of Optimality Theory and discuss the special nature of the learning problem in that theory. Readers familiar with the theory may wish to proceed directly to §1.3. In §2 we present the first version of our learning algorithm, initially through a concrete example; we also consider its (low) computational complexity. Formal specification of the first version of the algorithm and proof of its correctness are taken up in the Appendix. In §3 we generalize the algorithm, identifying a more general core called Constraint Demotion (‘CD’) and then a family of CD algorithms which differ in how they apply this core to the acquisition data. We sketch a proof of the correctness and convergence of the CD algorithms, and of a bound on the number of examples needed to complete learning. In §4 we briefly consider the issue of ties in the ranking of constraints and the case of inconsistent data. Finally, we observe that the CD algorithm entails a Superset Principle for acquisition: as the learner refines the grammar, the set of well-formed structures shrinks.
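The core demotion step described above can be sketched as follows. This is a minimal illustration, not the paper's formal specification: it assumes each datum is already reduced to a winner/loser pair with shared violations cancelled, represents the ranking as a stratum index per constraint, and adds a guard for inconsistent data. The function name and the guard are assumptions made for this sketch.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# One winner/loser pair after cancelling shared violations:
# (constraints violated only by the loser, constraints violated only by the winner)
Pair = Tuple[FrozenSet[str], FrozenSet[str]]


def constraint_demotion(constraints: Iterable[str], pairs: List[Pair]) -> Dict[str, int]:
    """Return a stratum index per constraint (0 is the highest-ranked stratum)."""
    stratum = {c: 0 for c in constraints}   # start with every constraint tied at the top
    limit = len(stratum)
    changed = True
    while changed:
        changed = False
        for loser_marks, winner_marks in pairs:
            if not loser_marks:
                raise ValueError("no constraint prefers the winner in this pair")
            # Highest-ranked constraint violated by the loser.
            top = min(stratum[c] for c in loser_marks)
            # Every winner violation not already dominated by it is demoted
            # to the stratum immediately below.
            for c in winner_marks:
                if stratum[c] <= top:
                    stratum[c] = top + 1
                    changed = True
                    if stratum[c] >= limit:
                        # Guard added for this sketch: consistent data never
                        # need more strata than there are constraints.
                        raise ValueError("data are inconsistent with any ranking")
    return stratum
```

Constraints are only ever demoted, never promoted, which is why the loop settles on a fixed ranking whenever the pairs are consistent with some ranking.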
TL;DR: In this article, the authors consider the influence of various monotonicity constraints on the process of learning uniformly recursive languages and provide a thorough study of the influence of several parameters on learnability.
Abstract: The present paper deals with the learnability of indexed families of uniformly recursive languages from positive data as well as from both positive and negative data. We consider the influence of various monotonicity constraints on the learning process, and provide a thorough study concerning the influence of several parameters. In particular, we present examples pointing to typical problems and solutions in the field. Then we provide a unifying framework for learning. Furthermore, we survey results concerning learnability in dependence on the hypothesis space, and concerning order independence. Moreover, new results dealing with the efficiency of learning are provided. First, we investigate the power of iterative learning algorithms. The second measure of efficiency studied is the number of mind changes a learning algorithm is allowed to perform. In this setting we consider the problem whether or not the monotonicity constraints introduced do influence the efficiency of learning algorithms.
TL;DR: A complexity theoretic accounting of memory utilization by learning machines is introduced and, in this new model, memory is measured in bits as a function of the size of the input.
Abstract: People tend not to have perfect memories when it comes to learning, or to anything else for that matter. Most formal studies of learning, however, assume a perfect memory. Some approaches have restricted the number of items that could be retained. We introduce a complexity theoretic accounting of memory utilization by learning machines. In our new model, memory is measured in bits as a function of the size of the input. There is a hierarchy of learnability based on increasing memory allotment. The lower bound results are proved using an unusual combination of pumping and mutual recursion theorem arguments. For technical reasons, it was necessary to consider two types of memory: long and short term.
TL;DR: This paper presents a learning algorithm that learns any SDA M in the limit from positive data, satisfying the properties that (i) the time for updating a conjecture is at most O(lm), and (ii) the number of implicit prediction errors is at most O(ln), where l is the maximum length of all positive data provided, m is the alphabet size of M, and n is the size of M.
Abstract: This paper deals with the polynomial-time learnability of a language class in the limit from positive data, and discusses the learning problem of a subclass of deterministic finite automata (DFAs), called strictly deterministic automata (SDAs), in the framework of learning in the limit from positive data. We first discuss the difficulty of Pitt's definition in the framework of learning in the limit from positive data, by showing that any class of languages with an infinite descending chain property is not polynomial-time learnable in the limit from positive data. We then propose new definitions for polynomial-time learnability in the limit from positive data. We show in our new definitions that the class of SDAs is iteratively, consistently polynomial-time learnable in the limit from positive data. In particular, we present a learning algorithm that learns any SDA M in the limit from positive data, satisfying the properties that (i) the time for updating a conjecture is at most O(lm), (ii) the number of implicit prediction errors is at most O(ln), where l is the maximum length of all positive data provided, m is the alphabet size of M and n is the size of M, (iii) each conjecture is computed from only the previous conjecture and the current example, and (iv) at any stage the conjecture is consistent with the sample set seen so far. This is in marked contrast to the fact that the class of DFAs is neither learnable in the limit from positive data nor polynomial-time learnable in the limit.
TL;DR: A general framework for studying the effects of the Maturation Hypothesis on the problem of language learning, parametrically conceived, and a method for finding all maturational solutions for any parametric hypothesis space and any learning algorithm that differs from Gibson and Wexler's TLA only in the number of parameters that can be reset at each step is presented.
Abstract: Recent work in parametric language learning has shown that even very small systems of linguistically plausible parameters pose very serious problems for error-driven and conservative learning algorithms. It has been argued that such problems may be solved by considering that different parameters may become available for resetting at different times, as an effect of biological maturation. This article presents a general framework for studying the effects of the Maturation Hypothesis on the problem of language learning, parametrically conceived, and offers a general method for finding all maturational solutions (where some exist) for any parametric hypothesis space and any learning algorithm that differs from Gibson and Wexler's TLA only in the number of parameters that can be reset at each step. Implications for research in natural language acquisition are discussed in the concluding section.
TL;DR: In this paper, lower bounds on the number of samples and computational resources required to learn several classes of boolean circuits under the uniform distribution are investigated.
TL;DR: A model of learning by distances is presented, and insight gained from it is applied to show that every class of subsets C that has a finite VC-dimension is PAC-learnable with respect to any fixed distribution.
Abstract: A model of learning by distances is presented. In this model a concept is a point in a metric space. At each step of the learning process the student guesses a hypothesis and receives from the teacher an approximation of its distance to the target. A notion of a distance measuring the proximity of a hypothesis to the correct answer is common to many models of learnability. By focusing on this fundamental aspect we discover some general and simple tools for the analysis of learnability tasks. As a corollary we present new learning algorithms for Valiant's PAC scenario with any given distribution. These algorithms can learn any PAC-learnable class and, in some cases, settle for significantly less information than the usual labeled examples. Insight gained by the new model is applied to show that every class of subsets C that has a finite VC-dimension is PAC-learnable with respect to any fixed distribution. Previously known results of this nature were subject to complicated measurability constraints.
TL;DR: A polynomial-time algorithm using membership and equivalence queries that finds the minimum obdd for the target respecting a given ordering is given.
Abstract: This note studies the learnability of ordered binary decision diagrams (obdds). We give a polynomial-time algorithm using membership and equivalence queries that finds the minimum obdd for the target respecting a given ordering. We also prove that both types of queries and the restriction to a given ordering are necessary if we want minimality in the output, unless P=NP. If learning has to occur with respect to the optimal variable ordering, polynomial-time learnability implies the approximability of two NP-hard optimization problems: the problem of finding the optimal variable ordering for a given obdd and the Optimal Linear Arrangement problem on graphs.
TL;DR: This work proposes a discount method for evaluation of an interactive system's learnability based on automated logging of user actions, detection of user mental chunks, and observation of chunk size as it grows over time with experience.
Abstract: Learnability evaluation has traditionally required expensive and time-consuming techniques. Practitioners have refrained from performing extended learnability evaluation due to its prohibitive costs. We propose a discount method for evaluation of an interactive system's learnability. Our method is based on automated logging of user actions, detection of user mental chunks, and observation of chunk size as it grows over time with experience. We introduce a model for chunk detection, and present experimental results validating the use of chunk size as an indicator of learnability.
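One simple way to picture chunk detection from a log of user actions is to split the log wherever the pause between consecutive actions exceeds a threshold. The sketch below illustrates only that idea: the fixed `pause_threshold`, the function names, and the use of mean chunk size as the tracked quantity are assumptions made here, not the chunk-detection model introduced in the paper.

```python
from typing import List, Sequence


def detect_chunks(timestamps: Sequence[float], pause_threshold: float) -> List[List[int]]:
    """Group action indices into chunks, starting a new chunk after each long pause."""
    chunks: List[List[int]] = []
    for i, t in enumerate(timestamps):
        if i == 0 or t - timestamps[i - 1] > pause_threshold:
            chunks.append([])          # a long pause opens a new chunk
        chunks[-1].append(i)
    return chunks


def mean_chunk_size(chunks: List[List[int]]) -> float:
    """Average number of actions per chunk (0.0 for an empty log)."""
    return sum(len(c) for c in chunks) / len(chunks) if chunks else 0.0
```

Computing `mean_chunk_size` per session and comparing sessions over time would give a rough picture of whether chunks grow with experience.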
TL;DR: In this article, the authors investigate the learnability of nested differences of intersection-closed classes in the presence of malicious noise and present an online algorithm whose mistake bound is optimal in the sense that there are concept classes for which each learning algorithm (using nested differences as hypotheses) can be forced to make at least that many mistakes.
Abstract: We investigate the learnability of nested differences of intersection-closed classes in the presence of malicious noise. Examples of intersection-closed classes include axis-parallel rectangles, monomials, linear sub-spaces, and so forth. We present an on-line algorithm whose mistake bound is optimal in the sense that there are concept classes for which each learning algorithm (using nested differences as hypotheses) can be forced to make at least that many mistakes. We also present an algorithm for learning in the PAC model with malicious noise. Surprisingly enough, the noise rate tolerable by these algorithms does not depend on the complexity of the target class but depends only on the complexity of the underlying intersection-closed class.
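For intuition about intersection-closed classes, the sketch below shows the standard noise-free closure algorithm for axis-parallel rectangles: the hypothesis is always the smallest rectangle containing the positive examples seen so far. It is a simplified illustration under the assumption of clean labels; it handles neither nested differences nor malicious noise, and the class name `ClosureLearner` is hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


class ClosureLearner:
    """On-line learner whose hypothesis is the closure of the positive examples."""

    def __init__(self) -> None:
        self.lo: Optional[List[float]] = None   # lower corner of the rectangle
        self.hi: Optional[List[float]] = None   # upper corner of the rectangle
        self.mistakes = 0

    def predict(self, x: Point) -> bool:
        if self.lo is None or self.hi is None:
            return False                        # empty hypothesis before any positive
        return all(l <= v <= h for l, v, h in zip(self.lo, x, self.hi))

    def update(self, x: Point, label: bool) -> None:
        if self.predict(x) != label:
            self.mistakes += 1
        if label:
            if self.lo is None or self.hi is None:
                self.lo, self.hi = list(x), list(x)
            else:
                # Grow the rectangle just enough to cover the new positive point.
                self.lo = [min(l, v) for l, v in zip(self.lo, x)]
                self.hi = [max(h, v) for h, v in zip(self.hi, x)]
```

With clean labels the hypothesis is always contained in the target rectangle, so this learner errs only on positive examples.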
TL;DR: This work focuses on the learnability and use of visualization systems and on the perceptual and cognitive processes involved in viewing visualizations; effective visualization systems support a broad range of user tasks and abilities, are easy to learn, and provide powerful and flexible output formatting.
Abstract: Recent software provides new tools for visualizing multivariate data that facilitate data analysis. We focus on (1) the learnability and use of visualization systems, and (2) the perceptual and cognitive processes involved in viewing visualizations. Effective visualization systems support a broad range of user tasks and abilities, are easy to learn, and provide powerful and flexible output formatting. Effective visualizations incorporate Gestalt and other perceptual and cognitive principles that encourage more rapid, automatic processing, and less slow, controlled processing.
TL;DR: A novel "solid learnability" notion is presented that indicates when the class in question can be successfully learned by the most straightforward algorithms, namely, any consistent algorithm.
Abstract: We present a systematic framework for classifying, comparing, and defining models of PAC learnability. Apart from the obvious "uniformity" parameters, we present a novel "solid learnability" notion that indicates when the class in question can be successfully learned by the most straightforward algorithms, namely, any consistent algorithm. We analyze known models in terms of our new parameterization scheme and investigate the relative strength of notions of learnability that correspond to different parameter values. In addition, we consider "proximity" between concept classes. We define notions of "covering" one class by another and show that, with respect to learnability, they play a role similar to the role of reductions in computational complexity; the learnability of a class implies the learnability of any class it covers. We apply the covering technique to resolve some open questions raised by Benedek and Itai (1991, Theoret. Comput. Sci. 86, 377-389; 1989, Inform. and Comput. 82, 247-261) and Linial et al. (1991, Inform. and Comput. 90, 33-49). The notions we discuss are information-theoretic: we concentrate on the question of learnability rather than the computational complexity of the learning process.
TL;DR: This paper considers computer-based support for the development of computer skills in the workplace and offers a number of collaborative learnability design principles, including some principles of collaborative visibility and the importance of the demonstration in the sharing of skills.
Abstract: This paper considers computer-based support for the development of computer skills in the workplace. We suggest that computer systems should be designed to support collaborative learnability; to this end, we offer a number of collaborative learnability design principles. In particular, we emphasize that the prime objective should be user participation. We suggest some principles of collaborative visibility and highlight the importance of the demonstration in the sharing of skills. The various design principles are incorporated into a generic model for collaborative user support called MutualAid. A specific system based on this model is also described. This system uses multimedia demonstrations recorded by end users to support an interactive problem-solving forum and the development of a local database of computer-related practice.
TL;DR: This paper gives a general learnability result for typed pattern languages, and shows that if a class of types has finite elasticity then the typed pattern language is identifiable in the limit from positive data.
Abstract: In this paper, we extend patterns, introduced by Angluin [Ang80b], to typed patterns by introducing types into variables. A type is a recursive language and a variable of the type is substituted only with an element in the recursive language. This extension enhances the expressive power of patterns while preserving their good properties. First, we give a general learnability result for typed pattern languages. We show that if a class of types has finite elasticity then the typed pattern language is identifiable in the limit from positive data. Next, we give a useful tool to show the conservative learnability of typed pattern languages. That is, if an indexed family \({\cal L}\) of recursive languages has recursive finite thickness and the equivalence problem for \({\cal L}\) is decidable, then \({\cal L}\) is conservatively learnable from positive data. Using this tool, we consider the following classes of types: (1) the class of all strings over subsets of the alphabet, (2) the class of all untyped pattern languages, and (3) a class of k-bounded regular languages. We show that each of these typed pattern languages is conservatively learnable from positive data.
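To make the notion of a typed pattern concrete, the following sketch tests whether a string belongs to the language of a typed pattern by backtracking. It assumes non-empty substitutions, represents each type as a membership predicate (a stand-in for a recursive language), and requires repeated occurrences of a variable to receive the same string. The names `Var` and `matches` are hypothetical, and the sketch covers membership only, not the learning results.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence, Union


@dataclass(frozen=True)
class Var:
    name: str


Symbol = Union[str, Var]                 # a constant string or a typed variable
TypePredicate = Callable[[str], bool]    # decides membership in a variable's type


def matches(pattern: Sequence[Symbol], s: str, types: Dict[str, TypePredicate],
            binding: Optional[Dict[str, str]] = None) -> bool:
    """Return True if s is obtained from pattern by a type-respecting substitution."""
    binding = binding or {}
    if not pattern:
        return s == ""
    head, rest = pattern[0], pattern[1:]
    if isinstance(head, str):            # constant: must appear literally
        return s.startswith(head) and matches(rest, s[len(head):], types, binding)
    if head.name in binding:             # repeated variable: reuse its value
        value = binding[head.name]
        return s.startswith(value) and matches(rest, s[len(value):], types, binding)
    for cut in range(1, len(s) + 1):     # try every non-empty prefix of the right type
        value = s[:cut]
        if types[head.name](value) and matches(
                rest, s[cut:], types, {**binding, head.name: value}):
            return True
    return False
```

For example, with the pattern `["a", Var("x"), "b", Var("x")]` and `x` restricted to binary strings, `"a01b01"` is accepted while `"a01b10"` is not.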
TL;DR: It is shown that, when exactness is not required, prudence, consistency and responsiveness, even together, do not restrict the power of conservative learners.
TL;DR: The learnability of the class of letter-counts of regular languages (semilinear sets) and other related classes of subsets of N^d or Z^d with respect to the distribution-free learning model of Valiant (PAC learning model) is characterized using the notion of reducibility among learning problems due to Pitt and Warmuth.
Abstract: The learnability of the class of letter-counts of regular languages (semilinear sets) and other related classes of subsets of N^d or Z^d with respect to the distribution-free learning model of Valiant (PAC learning model) is characterized. Using the notion of reducibility among learning problems due to Pitt and Warmuth called "prediction preserving reducibility," and a special case thereof, a number of positive and partially negative results are obtained. On the positive side the class of semilinear sets of dimension 1 or 2 is shown to be learnable when the integers are encoded in unary. On the neutral to negative side it is shown that when the integers are encoded in binary the learning problem for semilinear sets as well as for a class of subsets of Z^d much simpler than semilinear sets is as hard as learning DNF, a central open problem in the field. A number of hardness results for related learning problems are also given.
TL;DR: A general framework for the construction of efficient learning algorithms in noise tolerant variants of Valiant's PAC learning model is described, which provides a unified and intuitive framework for noise tolerant learning that allows the algorithm designer to achieve efficient, and often optimal, fault tolerant learning.
Abstract: Learning systems are often provided with imperfect or noisy data. Therefore, researchers have formalized various models of learning with noisy data, and have attempted to delineate the boundaries of learnability in these models. In this thesis, we describe a general framework for the construction of efficient learning algorithms in noise tolerant variants of Valiant's PAC learning model. By applying this framework, we also obtain many new results for specific learning problems in various settings with faulty data.
The central tool used in this thesis is the specification of learning algorithms in Kearns' Statistical Query (SQ) learning model, in which statistics, as opposed to labelled examples, are requested by the learner. These SQ learning algorithms are then converted into PAC algorithms which tolerate various types of faulty data.
We develop this framework in three major parts: (1) We design automatic compilations of SQ algorithms into PAC algorithms which tolerate various types of data errors. These results include improvements to Kearns' classification noise compilation, and the first such compilations for malicious errors, attribute noise and new classes of "hybrid" noise composed of multiple noise types. (2) We prove nearly tight bounds on the required complexity of SQ algorithms. The upper bounds are based on a constructive technique which allows one to achieve this complexity even when it is not initially achieved by a given SQ algorithm. (3) We define and employ an improved model of SQ learning which yields noise tolerant PAC algorithms that are more efficient than those derived from standard SQ algorithms. Together, these results provide a unified and intuitive framework for noise tolerant learning that allows the algorithm designer to achieve efficient, and often optimal, fault tolerant learning.
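The basic idea of answering a statistical query from labelled examples can be sketched in the noise-free case: draw enough examples that the empirical frequency of the queried predicate is within the tolerance with high probability, using Hoeffding's inequality to choose the sample size. The function names and the `draw_example` callback are assumptions for this sketch; the noise-tolerant compilations developed in the thesis are not reproduced here.

```python
import math
from typing import Callable, Sequence, Tuple

Example = Tuple[Sequence[int], int]              # (instance, label)
Query = Callable[[Sequence[int], int], bool]     # predicate over labelled examples


def sq_sample_size(tau: float, delta: float) -> int:
    """Examples sufficient for an estimate within tau with probability >= 1 - delta.

    Hoeffding's inequality gives m >= ln(2 / delta) / (2 * tau**2).
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * tau ** 2))


def answer_query(chi: Query, draw_example: Callable[[], Example],
                 tau: float, delta: float) -> float:
    """Estimate Pr[chi(x, label) is true] from freshly drawn noise-free examples."""
    m = sq_sample_size(tau, delta)
    hits = sum(bool(chi(*draw_example())) for _ in range(m))
    return hits / m
```

The quadratic dependence on 1/tau in `sq_sample_size` shows why the tolerance requested by an SQ algorithm drives the sample cost of its simulation.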
TL;DR: This paper explores techniques for creatively integrating language and interface constructs within programmable applications and demonstrates how an interface and language can combine symbolically and thereby provide powerful modes of expression within applications.
Abstract: Programmable applications are software systems that seek to combine the learnability and accessibility of direct manipulation interfaces with the expressive power and range of programming languages. In this paper we explore techniques for creatively integrating language and interface constructs within programmable applications. Using SchemePaint—a programmable graphics application—as a source of examples, we demonstrate how an interface and language can combine symbolically and thereby provide powerful modes of expression within applications.
TL;DR: This paper provides a survey of recent advances in the field of “grammatical inference” with a particular emphasis on the results concerning the learnability of target classes represented by deterministic finite automata, context-free grammars, hidden Markov models, stochastic context-free grammars, simple recurrent neural networks, and case-based representations.
Abstract: In this paper, we provide a survey of recent advances in the field of “grammatical inference” with a particular emphasis on the results concerning the learnability of target classes represented by deterministic finite automata, context-free grammars, hidden Markov models, stochastic context-free grammars, simple recurrent neural networks, and case-based representations.
TL;DR: The case in which the circuit expressions are of low (time-bounded) Kolmogorov complexity is studied, showing that these are polynomial-time learnable from membership queries in the presence of an NP oracle.
Abstract: Circuit expressions were introduced to provide a natural link between Computational Learning and certain aspects of Structural Complexity. Upper and lower bounds on the learnability of circuit expressions are known. We study here the case in which the circuit expressions are of low (time-bounded) Kolmogorov complexity. We show that these are polynomial-time learnable from membership queries in the presence of an NP oracle. We also exactly characterize the sets that have such circuit expressions, and precisely identify the subclass whose circuit expressions can be learned from membership queries alone. The extension of the results to various Kolmogorov complexity bounds is discussed.
TL;DR: It is shown that allowing anomalies does not increase the learning power as long as inference from positive and negative data is considered, and every learnable indexed family L may be even inferred with respect to the hypothesis space L itself.
Abstract: The present paper deals with the learnability of indexed families of uniformly recursive languages by single inductive inference machines (abbr. IIM) and teams of IIMs from positive and both positive and negative data. We study the learning power of single IIMs in dependence on the hypothesis space and the number of allowed anomalies the synthesized language may have. Our results are fourfold. First, we show that allowing anomalies does not increase the learning power as long as inference from positive and negative data is considered. Second, we establish an infinite hierarchy in the number of allowed anomalies for learning from positive data. Third, we prove that every learnable indexed family L may be even inferred with respect to the hypothesis space L itself. Fourth, we characterize learning with anomalies from positive data. Finally, we investigate the error correcting power of team learners, and relate the inference capabilities of teams in dependence on their size to one another. Again, an infinite hierarchy is established and the learnability is characterized in terms of recursively generable families of finite and non-empty sets.
TL;DR: It is proved that a relaxation of the relevant (dual) monotonicity requirement may result in an arbitrarily large speed-up; however, whether or not such a speed-up may be achieved crucially depends on the set of allowed hypothesis spaces as well as on the (dual) monotonicity demands involved.
Abstract: The present paper deals with the learnability of indexed families G of uniformly recursive languages from positive data. We consider the influence of three monotonicity demands and their dual counterparts on the efficiency of the learning process. The efficiency of learning is measured in dependence on the number of mind changes a learning algorithm is allowed to perform. The three notions of (dual) monotonicity reflect different formalizations of the requirement that the learner has to produce better and better (specializations) generalizations when fed more and more data on the target concept. We distinguish between exact learnability (G has to be inferred with respect to G), class preserving learning (G has to be inferred with respect to some suitably chosen enumeration of all the languages from G), and class comprising inference (G has to be learned with respect to some suitably chosen enumeration of uniformly recursive languages containing at least all the languages from G). In particular, we prove that a relaxation of the relevant (dual) monotonicity requirement may result in an arbitrarily large speed-up. However, whether or not such a speed-up may be achieved crucially depends on the set of allowed hypothesis spaces as well as on the (dual) monotonicity demands involved.
TL;DR: The influence of three monotonicity demands on the efficiency of the learning process of uniformly recursive languages from positive data is considered.
Abstract: The present paper deals with the learnability of indexed families \({\cal L}\) of uniformly recursive languages from positive data. We consider the influence of three monotonicity demands on the efficiency of the learning process. The efficiency of learning is measured in dependence on the number of mind changes a learning algorithm is allowed to perform. The three notions of monotonicity reflect different formalizations of the requirement that the learner has to produce better and better generalizations when fed more and more data on the target concept.
TL;DR: Although CORECLASSIC is not PAC-learnable under the usual complexity assumptions (as correctly shown in Theorem 4 of [1] and expanded upon further by Frazier and Pitt [2]), the question of the representation-independent learnability of CORECLASSIC remains an open problem.
Abstract: The proof of the representation-independent hardness of learning CORECLASSIC in Theorem 3 of [1] is in error. In particular, although the constructed functions f and g satisfy the "if" direction of w ∈ L(A) iff g(A) ~ f(w), they do not satisfy the "only if" direction. We thank Lenny Pitt for pointing out the mistake. Although CORECLASSIC is not PAC-learnable under the usual complexity assumptions (as correctly shown in Theorem 4 of [1] and expanded upon further by Frazier and Pitt [2]), the question of the representation-independent learnability of CORECLASSIC remains an open problem.