Machine translation has wide applications in daily life. In mission-critical applications such as translating official documents, incorrect translation can have unpleasant or sometimes catastrophic consequences. This motivates recent research on testing methodologies for machine translation systems. Existing methodologies mostly rely on metamorphic relations designed at the textual level (e.g., Levenshtein distance) or syntactic level (e.g., the distance between grammar structures) to determine the correctness of translation results. However, these metamorphic relations do not consider whether the original and translated sentences have the same meaning (i.e., semantic similarity). Therefore, in this paper, we propose SemMT, an automatic testing approach for machine translation systems based on semantic similarity checking. SemMT applies round-trip translation and measures the semantic similarity between the original and translated sentences. Our insight is that the semantics expressed by the logic and numeric constraints in sentences can be captured using regular expressions (or deterministic finite automata), for which efficient equivalence/similarity checking algorithms are available. Leveraging this insight, we propose three semantic similarity metrics and implement them in SemMT. The experimental results show that SemMT is more effective than state-of-the-art works, achieving increases of 21% and 23% in accuracy and F-score, respectively. We also explore the potential improvements that can be achieved when proper combinations of metrics are adopted. Finally, we discuss a solution for locating the suspicious trip in round-trip translation, which may shed light on further exploration.

SemMT: A Semantic-based Testing Approach for Machine Translation Systems.
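The abstract above relies on the fact that equivalence of deterministic finite automata can be checked efficiently. As a minimal sketch of that idea (not SemMT's implementation or its similarity metrics), the Python below explores the product of two complete DFAs and reports whether any reachable state pair disagrees on acceptance. The tuple encoding of a DFA and the toy automata are assumptions made for illustration.

```python
from collections import deque


def dfa_equivalent(dfa_a, dfa_b, alphabet):
    """Return True when two complete DFAs accept the same language.

    Each DFA is a (start, accepting_states, transitions) triple, where
    transitions[state][symbol] is the next state. This encoding is an
    assumption of this sketch, not taken from the paper.
    """
    start_a, accept_a, delta_a = dfa_a
    start_b, accept_b, delta_b = dfa_b
    seen = {(start_a, start_b)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        # A reachable pair that disagrees on acceptance is a counterexample.
        if (p in accept_a) != (q in accept_b):
            return False
        for symbol in alphabet:
            nxt = (delta_a[p][symbol], delta_b[q][symbol])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


# Hypothetical toy automata over {0, 1}.
PARITY = {"e": {"0": "e", "1": "o"}, "o": {"0": "o", "1": "e"}}
EVEN_ONES = ("e", {"e"}, PARITY)   # even number of 1s
ODD_ONES = ("e", {"o"}, PARITY)    # odd number of 1s
# The same "even number of 1s" language with two redundant states.
EVEN_ONES_4 = (
    0,
    {0, 2},
    {
        0: {"0": 2, "1": 1},
        1: {"0": 3, "1": 0},
        2: {"0": 0, "1": 3},
        3: {"0": 1, "1": 2},
    },
)

print(dfa_equivalent(EVEN_ONES, EVEN_ONES_4, "01"))
print(dfa_equivalent(EVEN_ONES, ODD_ONES, "01"))
```

The search visits at most one pair per combination of states, so its cost is bounded by the product of the two state counts times the alphabet size.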

Sketch-Driven Regular Expression Generation from Natural Language and Examples

We study the generalization abilities of language models when translating natural language into formal specifications with complex semantics. In particular, we fine-tune language models on three datasets consisting of English sentences and their corresponding formal representation: 1) regular expressions (regex), frequently used in programming and search; 2) first-order logic (FOL), commonly used in software verification and theorem proving; and 3) linear-time temporal logic (LTL), which forms the basis for industrial hardware specification languages. Our experiments show that, in these diverse domains, the language models carry over their pre-trained knowledge of natural language and generalize, e.g., to new variable names or operator descriptions. Additionally, they achieve competitive performance, and even outperform the state-of-the-art for translating into regular expressions, with the benefits of being easy to access, efficient to fine-tune, and without a particular need for domain-specific reasoning.

Formal Specifications from Natural Language

Recent systems for converting natural language descriptions into regular expressions (regexes) have achieved some success, but typically deal with short, formulaic text and can only produce simple regexes. Real-world regexes are complex, hard to describe with brief sentences, and sometimes require examples to fully convey the user's intent. We present a framework for regex synthesis in this setting where both natural language (NL) and examples are available. First, a semantic parser (either grammar-based or neural) maps the natural language description into an intermediate sketch, which is an incomplete regex containing holes to denote missing components. Then a program synthesizer searches over the regex space defined by the sketch and finds a regex that is consistent with the given string examples. Our semantic parser can be trained purely from weak supervision based on correctness of the synthesized regex, or it can leverage heuristically-derived sketches. We evaluate on two prior datasets (Kushman and Barzilay, 2013; Locascio et al., 2016) and a real-world dataset from Stack Overflow. Our system achieves state-of-the-art performance on the prior datasets and solves 57% of the real-world dataset, on which existing neural systems fail completely.
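The second stage described above, searching the space defined by a sketch for a regex consistent with the examples, can be illustrated with a small brute-force enumeration. This is only a sketch under stated assumptions: the `<?>` hole marker, the flat list of candidate components, and the function name `fill_sketch` are hypothetical, and the paper's synthesizer is not a plain enumeration like this one.

```python
import itertools
import re

HOLE = "<?>"  # hypothetical hole marker, not the paper's sketch syntax


def fill_sketch(sketch, components, positives, negatives):
    """Return the first completion of `sketch` consistent with the examples.

    Every hole is filled with one entry of `components`; a completion is
    accepted when it fully matches all positives and none of the negatives.
    """
    parts = sketch.split(HOLE)
    for choice in itertools.product(components, repeat=len(parts) - 1):
        # Interleave the fixed sketch fragments with the chosen components.
        candidate = parts[0] + "".join(c + p for c, p in zip(choice, parts[1:]))
        pattern = re.compile(candidate)
        if all(pattern.fullmatch(s) for s in positives) and not any(
            pattern.fullmatch(s) for s in negatives
        ):
            return candidate
    return None


result = fill_sketch(
    "<?>+-<?>+",
    ["[a-z]", "[0-9]", "[A-Z]"],
    positives=["abc-123", "x-7"],
    negatives=["123-abc", "abc-xyz"],
)
print(result)
```

The sketch prunes the search: only the holes are enumerated, so the number of candidates grows with the number of holes rather than with the size of the full regex.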

Graph Neural Networks (GNN) are a promising technique for bridging differential programming and combinatorial domains. GNNs employ trainable modules which can be assembled in different configurations that reflect the relational structure of each problem instance. In this thesis, we propose a new formulation for GNNs, which employs the concept of “types” to partition the objects in the problem domain into many distinct classes, yielding the Typed Graph Networks (TGN) model and a Python / TensorFlow library for prototyping TGNs. This thesis is also concerned with the application of GNNs to the Traveling Salesperson Problem (TSP). We show that GNNs can learn to solve, with very little supervision, the decision variant of the TSP, a highly relevant NP-Complete problem. Our model is trained to function as an effective message-passing algorithm on a graph in which edges from the input graph communicate with vertices from the input graph for a number of iterations, after which the model is asked to decide whether a route with cost < C ∈ R0 exists. We show that such a network can be trained with sets of dual examples: given the optimal tour cost C∗, we produce one decision instance with target cost C that is x% smaller than C∗ and one with target cost x% larger than C∗. We were able to obtain 80% accuracy training with −2%,+2% deviations, and the same trained model can generalize to more relaxed deviations with increasing performance. We also show that the model is capable of generalizing to larger problem sizes. Finally, we provide a method for predicting the optimal route cost within 1.5% relative deviation from the ground truth. In summary, our work shows that Graph Neural Networks are powerful enough to solve NP-Complete problems which combine symbolic and numeric data, in addition to proposing a modern reformulation of the meta-model.

Learning to solve NP-complete problems
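The "dual examples" construction in the abstract above can be made concrete with a few lines of Python. This is a minimal sketch, assuming Euclidean instances and a brute-force solver that is only usable for tiny inputs; the function names and the unit-square instance are hypothetical and not taken from the thesis.

```python
import itertools
import math


def optimal_tour_cost(points):
    """Brute-force the optimal closed tour cost (tiny instances only)."""
    first, rest = points[0], points[1:]
    best = math.inf
    for order in itertools.permutations(rest):
        tour = [first, *order, first]
        cost = sum(math.dist(a, b) for a, b in zip(tour, tour[1:]))
        best = min(best, cost)
    return best


def dual_decision_instances(optimal_cost, deviation=0.02):
    """Return (target_cost, answer) pairs on either side of the optimum.

    The decision question is "does a route with cost < target exist?".
    """
    lower = optimal_cost * (1.0 - deviation)  # below the optimum: answer is no
    upper = optimal_cost * (1.0 + deviation)  # above the optimum: answer is yes
    return [(lower, False), (upper, True)]


# Hypothetical instance: the four corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
c_star = optimal_tour_cost(square)
instances = dual_decision_instances(c_star, deviation=0.02)
print(c_star, instances)
```

Each graph therefore yields one negative and one positive training example, so the labels are balanced by construction.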

We continue the study of generating semantically correct regular expressions from natural language descriptions (NL). The current state-of-the-art model, SemRegex, produces regular expressions from NLs using a reinforcement-learning reward based on the semantic (rather than syntactic) equivalence between two regular expressions. Since the regular expression equivalence problem is PSPACE-complete, we introduce the EQ_Reg model for computing the similarity of two regular expressions using deep neural networks. Our EQ_Reg model essentially softens the equivalence of two regular expressions when used as a reward function. We then propose a new regex generation model, SoftRegex, using the EQ_Reg model, and empirically demonstrate that SoftRegex substantially reduces the training time (by a factor of at least 3.6) and produces state-of-the-art results on three benchmark datasets.

Softregex: Generating regex from natural language descriptions using softened regex equivalence

In natural language processing, a common task is to compute the probability of a phrase appearing in a document or to calculate the probability of all phrases matching a given pattern. For instance, one computes affix (prefix, suffix, infix, etc.) probabilities of a string or a set of strings with respect to a probability distribution of patterns. The problem of computing infix probabilities of strings when the pattern distribution is given by a probabilistic context-free grammar or by a probabilistic finite automaton is already solved, yet it was open how to compute the infix probabilities in an incremental manner. The incremental computation is crucial when a new query is built from a previous query. We tackle this problem and suggest a method that computes infix probabilities incrementally for probabilistic finite automata by representing all the probabilities of matching strings as a series of transition matrix calculations. We show that the proposed approach is theoretically faster than the previous method and, using real-world data, demonstrate that our approach has vastly better performance in practice.

Incremental computation of infix probabilities for probabilistic finite automata
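The abstract above expresses probabilities in a probabilistic finite automaton (PFA) as a series of transition matrix calculations. The sketch below shows only the simplest case of that view: the probability that a PFA generates one exact string, computed as an initial vector times one matrix per symbol times a stopping vector. It is not the paper's infix algorithm, and the two-state automaton with its numbers is a made-up assumption. The forward vector can be reused when the string is extended, which is the kind of saving an incremental method builds on.

```python
# Hypothetical two-state PFA over {a, b}; all numbers are illustrative.
INITIAL = [1.0, 0.0]   # start in state 0
FINAL = [0.2, 0.5]     # probability of stopping in each state
TRANSITIONS = {
    # TRANSITIONS[symbol][i][j]: emit `symbol` while moving from i to j.
    "a": [[0.5, 0.0], [0.0, 0.4]],
    "b": [[0.0, 0.3], [0.1, 0.0]],
}


def extend(forward, symbol):
    """Advance a forward vector by one symbol (vector-matrix product)."""
    matrix = TRANSITIONS[symbol]
    n = len(forward)
    return [sum(forward[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def string_probability(text):
    """Probability that the PFA emits exactly `text` and then stops."""
    forward = INITIAL
    for symbol in text:
        forward = extend(forward, symbol)
    return sum(f * stop for f, stop in zip(forward, FINAL))


# Reusing the forward vector: extending "a" to "ab" costs one product.
forward_a = extend(INITIAL, "a")
forward_ab = extend(forward_a, "b")
print(string_probability("ab"))
```

In each state the outgoing emission probabilities plus the stopping probability sum to one, which is what makes the automaton a probability distribution over strings.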

We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords. The technique is available in popular tokenization libraries but has not been subjected to rigorous scientific scrutiny. While the removal of rare subwords is suggested as best practice in machine translation implementations, both as a means to reduce model size and for improving model performance through robustness, our experiments indicate that, across a large space of hyperparameter settings, vocabulary trimming fails to improve performance, and is even prone to incurring heavy degradation.

An Analysis of BPE Vocabulary Trimming in Neural Machine Translation
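Threshold vocabulary trimming, as described in the abstract above, replaces rare subwords with their component subwords. A minimal sketch of that post-processing step follows; the merge table, the counts, and the function names are hypothetical, and real tokenization libraries may differ in details such as how counts are gathered.

```python
def trim_token(token, merges, counts, threshold):
    """Split `token` into its merge components while it is rarer than `threshold`.

    `merges` maps a merged subword to the (left, right) pair it was built
    from; base symbols have no entry and are always kept.
    """
    if token not in merges or counts.get(token, 0) >= threshold:
        return [token]
    left, right = merges[token]
    return trim_token(left, merges, counts, threshold) + trim_token(
        right, merges, counts, threshold
    )


def trim_sequence(tokens, merges, counts, threshold):
    """Apply trimming to every token of an already tokenized sequence."""
    trimmed = []
    for token in tokens:
        trimmed.extend(trim_token(token, merges, counts, threshold))
    return trimmed


# Hypothetical merge table and corpus counts.
merges = {"lo": ("l", "o"), "low": ("lo", "w"), "er": ("e", "r"), "lower": ("low", "er")}
counts = {"lo": 50, "low": 40, "er": 60, "lower": 3}

print(trim_sequence(["lower", "er"], merges, counts, threshold=10))
print(trim_sequence(["lower"], merges, counts, threshold=45))
```

The recursion matters because a component can itself fall below the threshold, in which case it is split again until only frequent subwords or base symbols remain.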

The present invention provides an automata-based device and method for incrementally calculating infix probabilities. The device obtains a character string as a regular expression using a DFA model based on deterministic finite automata, converts it into a clear regular expression, uses probabilistic finite automata to accurately calculate the probability of occurrence of a character string expressed by the clear regular expression, and updates that probability incrementally as the string represented by the regular expression grows, so that the probability of occurrence is easily obtained.

Apparatus and method for computing incrementally infix probabilities based on automata

Character-level language modeling has been shown empirically to perform well on highly agglutinative or morphologically rich languages while using only a small fraction of the parameters required by (sub)word models. Korean fits nicely into this framework, except that, like other CJK languages, it has a very large character vocabulary of 11,172 unique syllables. However, unlike Japanese Kanji and Chinese Hanzi, each Korean syllable can be uniquely factored into a small set of subcharacters, called jamo. We explore a “three-hot” scheme, where we exploit the decomposability of Korean characters to model at the syllable level but using only jamo-level representations. We find that our three-hot embedding and decoding scheme alleviates the two major issues with prior syllable- and jamo-level models. Namely, it requires fewer than 1% of the embedding parameters of a syllable model, and it does not require tripling the sequence length, as with jamo models. In addition, it addresses a theoretical flaw in a prior three-hot modeling scheme. Our experiments show that, even when reducing the number of embedding parameters by 99.6% (from 11.4M to just 36k), our model suffers no loss in translation quality compared to the baseline syllable model.

Parameter-Efficient Korean Character-Level Language Modeling
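The unique factoring of a Korean syllable into jamo mentioned above follows directly from Unicode's arithmetic layout of the 11,172 precomposed Hangul syllables (19 initial consonants × 21 vowels × 28 final slots, where final index 0 means no final consonant). The sketch below recovers the three indices and builds a vector with exactly three ones. The concatenated 68-dimensional layout and the function names are assumptions for illustration; the paper's exact embedding and decoding scheme may differ.

```python
HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable
NUM_LEADS, NUM_VOWELS, NUM_TAILS = 19, 21, 28  # 19 * 21 * 28 = 11,172


def syllable_to_jamo_indices(syllable):
    """Return (lead, vowel, tail) indices for one precomposed Hangul syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_LEADS * NUM_VOWELS * NUM_TAILS:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    lead, rest = divmod(offset, NUM_VOWELS * NUM_TAILS)
    vowel, tail = divmod(rest, NUM_TAILS)
    return lead, vowel, tail


def three_hot(syllable):
    """Encode a syllable as a 68-dimensional vector with exactly three ones.

    Concatenating the three one-hot blocks is an assumption of this sketch.
    """
    lead, vowel, tail = syllable_to_jamo_indices(syllable)
    vector = [0] * (NUM_LEADS + NUM_VOWELS + NUM_TAILS)
    vector[lead] = 1
    vector[NUM_LEADS + vowel] = 1
    vector[NUM_LEADS + NUM_VOWELS + tail] = 1
    return vector


print(syllable_to_jamo_indices("한"))  # lead ㅎ, vowel ㅏ, final ㄴ
print(sum(three_hot("한")), len(three_hot("한")))
```

A 68-way input in place of an 11,172-way one is where the parameter saving in the embedding layer comes from, while the sequence length stays at one position per syllable.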


Marco Cognetta
