About: Legacy code is a research topic. Over the lifetime, 810 publications have been published within this topic receiving 12671 citations. The topic is also known as: bad code.
TL;DR: Analysis reveals that PLBART learns program syntax, style, logical flow, and style that are crucial to program semantics and thus excels even with limited annotations, and outperforms or rivals state-of-the-art models.
Abstract: Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on code summarization in the English language, code generation, and code translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection, demonstrate PLBART’s effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming convention), logical flow (e.g., “if“ block inside an “else“ block is equivalent to “else if“ block) that are crucial to program semantics and thus excels even with limited annotations.
TL;DR: In this paper, the authors show that the current RDBMS code lines, while attempting to be a "one size fits all" solution, in fact excel at nothing, and they are 25 year old legacy code lines that should be retired in favor of a collection of "from scratch" specialized engines.
Abstract: In previous papers [SC05, SBC+07], some of us predicted the end of "one size fits all" as a commercial relational DBMS paradigm. These papers presented reasons and experimental evidence that showed that the major RDBMS vendors can be outperformed by 1--2 orders of magnitude by specialized engines in the data warehouse, stream processing, text, and scientific database markets.
Assuming that specialized engines dominate these markets over time, the current relational DBMS code lines will be left with the business data processing (OLTP) market and hybrid markets where more than one kind of capability is required. In this paper we show that current RDBMSs can be beaten by nearly two orders of magnitude in the OLTP market as well. The experimental evidence comes from comparing a new OLTP prototype, H-Store, which we have built at M.I.T. to a popular RDBMS on the standard transactional benchmark, TPC-C.
We conclude that the current RDBMS code lines, while attempting to be a "one size fits all" solution, in fact, excel at nothing. Hence, they are 25 year old legacy code lines that should be retired in favor of a collection of "from scratch" specialized engines. The DBMS vendors (and the research community) should start with a clean sheet of paper and design systems for tomorrow's requirements, not continue to push code lines and architectures designed for yesterday's needs.
TL;DR: This contribution proposes a set of criteria that distinguish a grand challenge in science or engineering from the many other kinds of short-term or long-term research problems that engage the interest of scientists and engineers.
Abstract: This contribution proposes a set of criteria that distinguish a grand challenge in science or engineering from the many other kinds of short-term or long-term research problems that engage the interest of scientists and engineers. As an example drawn from Computer Science, it revives an old challenge: the construction and application of a verifying compiler that guarantees correctness of a program before running it.
TL;DR: An algorithmic framework to construct a lattice of concepts from a program, where each concept represents a potential module in legacy code is presented.
Abstract: Describes a general technique for identifying modules in legacy code. The method is based on concept analysis-a branch of lattice theory that can be used to identify similarities among a set of objects based on their attributes. We discuss how concept analysis can identify potential modules using both lpositiver and lnegativer information. We present an algorithmic framework to construct a lattice of concepts from a program, where each concept represents a potential module
TL;DR: The first practical clone detection algorithm for binary executables is described, which extends an existing tree similarity framework based on clustering of characteristic vectors of labeled trees with novel techniques to normalize assembly instructions and to accurately and compactly model their structural information.
Abstract: Large software projects contain significant code duplication, mainly due to copying and pasting code. Many techniques have been developed to identify duplicated code to enable applications such as refactoring, detecting bugs, and protecting intellectual property. Because source code is often unavailable, especially for third-party software, finding duplicated code in binaries becomes particularly important. However, existing techniques operate primarily on source code, and no effective tool exists for binaries.In this paper, we describe the first practical clone detection algorithm for binary executables. Our algorithm extends an existing tree similarity framework based on clustering of characteristic vectors of labeled trees with novel techniques to normalize assembly instructions and to accurately and compactly model their structural information. We have implemented our technique and evaluated it on Windows XP system binaries totaling over 50 million assembly instructions. Results show that it is both scalable and precise: it analyzed Windows XP system binaries in a few hours and produced few false positives. We believe our technique is a practical, enabling technology for many applications dealing with binary code.