TL;DR: The authors propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback, where agents verbally reflect on task feedback signals and maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials.
Abstract: Large language models (LLMs) have been increasingly used to interact with external environments (e.g., games, compilers, APIs) as goal-driven agents. However, it remains challenging for these language agents to quickly and efficiently learn from trial-and-error as traditional reinforcement learning methods require extensive training samples and expensive model fine-tuning. We propose Reflexion, a novel framework to reinforce language agents not by updating weights, but instead through linguistic feedback. Concretely, Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials. Reflexion is flexible enough to incorporate various types (scalar values or free-form language) and sources (external or internally simulated) of feedback signals, and obtains significant improvements over a baseline agent across diverse tasks (sequential decision-making, coding, language reasoning). For example, Reflexion achieves a 91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%. We also conduct ablation and analysis studies using different feedback signals, feedback incorporation methods, and agent types, and provide insights into how they affect performance.
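The trial-and-reflect loop described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `act`, `evaluate`, and `reflect` are hypothetical stand-ins for the LLM actor, the feedback source, and the self-reflection step, and the toy versions below exist only so the sketch runs on its own.

```python
from typing import Callable, List, Tuple


def reflexion_loop(
    act: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    reflect: Callable[[str, str], str],
    task: str,
    max_trials: int = 3,
) -> Tuple[str, List[str]]:
    """Run trials until success, storing verbal reflections in memory."""
    memory: List[str] = []  # episodic memory buffer of reflective text
    attempt = ""
    for _ in range(max_trials):
        attempt = act(task, memory)            # condition on past reflections
        success, feedback = evaluate(attempt)  # external or simulated feedback
        if success:
            break
        # Learning happens by appending text, not by updating weights.
        memory.append(reflect(attempt, feedback))
    return attempt, memory


# Toy stand-ins (assumptions for illustration only).
def toy_act(task: str, memory: List[str]) -> str:
    # The "agent" changes its answer once any reflection is available.
    return "x - y" if not memory else "x + y"


def toy_evaluate(attempt: str) -> Tuple[bool, str]:
    ok = eval(attempt, {"x": 2, "y": 3}) == 5
    return ok, "" if ok else "expected 5 for x=2, y=3"


def toy_reflect(attempt: str, feedback: str) -> str:
    return f"'{attempt}' failed ({feedback}); I should add, not subtract."


answer, memory = reflexion_loop(toy_act, toy_evaluate, toy_reflect, "add x and y")
```

In this toy run the first attempt fails, one reflection is stored, and the second attempt succeeds, so the memory buffer ends with a single entry.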
TL;DR: QASMBench is a low-level, easy-to-use benchmark suite based on the OpenQASM assembly representation that consolidates commonly used quantum routines and kernels from domains including chemistry, simulation, linear algebra, searching, optimization, arithmetic, machine learning, fault tolerance, and cryptography.
Abstract: The rapid development of quantum computing (QC) in the NISQ era urgently demands a low-level benchmark suite and insightful evaluation metrics for characterizing the properties of prototype NISQ devices, the efficiency of QC programming compilers, schedulers and assemblers, and the capability of quantum system simulators in a classical computer. In this work, we fill this gap by proposing a low-level, easy-to-use benchmark suite called QASMBench based on the OpenQASM assembly representation. It consolidates commonly used quantum routines and kernels from a variety of domains including chemistry, simulation, linear algebra, searching, optimization, arithmetic, machine learning, fault tolerance, cryptography, and so on, trading-off between generality and usability. To analyze these kernels in terms of NISQ device execution, in addition to circuit width and depth, we propose four circuit metrics including gate density, retention lifespan, measurement density, and entanglement variance, to extract more insights about the execution efficiency, the susceptibility to NISQ error, and the potential gain from machine-specific optimizations. Applications in QASMBench can be launched and verified on several NISQ platforms, including IBM-Q, Rigetti, IonQ and Quantinuum. For evaluation, we measure the execution fidelity of a subset of QASMBench applications on 12 IBM-Q machines through density matrix state tomography, comprising 25K circuit evaluations. We also compare the fidelity of executions among the IBM-Q machines, the IonQ QPU and the Rigetti Aspen M-1 system. QASMBench is released at: http://github.com/pnnl/QASMBench .
TL;DR: The authors fine-tune pre-trained LLMs on Verilog datasets collected from GitHub and Verilog textbooks, and evaluate the syntactic and functional correctness of the code generated in response to problems of varying difficulty.
Abstract: Automating hardware design could obviate a significant amount of human error from the engineering process and lead to fewer errors. Verilog is a popular hardware description language to model and design digital systems, thus generating Verilog code is a critical first step. Emerging large language models (LLMs) are able to write high-quality code in other programming languages. In this paper, we characterize the ability of LLMs to generate useful Verilog. For this, we fine-tune pre-trained LLMs on Verilog datasets collected from GitHub and Verilog textbooks. We construct an evaluation framework comprising test-benches for functional analysis and a flow to test the syntax of Verilog code generated in response to problems of varying difficulty. Our findings show that across our problem scenarios, the fine-tuning results in LLMs more capable of producing syntactically correct code (25.9% overall). Further, when analyzing functional correctness, a fine-tuned open-source CodeGen LLM can outperform the state-of-the-art commercial Codex LLM (6.5% overall). We release our training/evaluation scripts and LLM checkpoints as open source contributions.
Or Honovich, Uri Shaham, Samuel R. Bowman, Omer Levy
1 Jan 2023
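TL;DR: Language models can infer a task from a few demonstrations and describe it as a natural language instruction; on an execution-based metric, InstructGPT reaches 65.7% of human performance, while the original GPT-3 reaches only 9.8%.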
Abstract: Large language models are able to perform a task by conditioning on a few input-output demonstrations, a paradigm known as in-context learning. We show that language models can explicitly infer an underlying task from a few demonstrations by prompting them to generate a natural language instruction that fits the examples. To explore this ability, we introduce the instruction induction challenge, compile a dataset consisting of 24 tasks, and define a novel evaluation metric based on executing the generated instruction. We discover that, to a large extent, the ability to generate instructions does indeed emerge when using a model that is both large enough and aligned to follow instructions; InstructGPT achieves 65.7% of human performance in our execution-based metric, while the original GPT-3 model reaches only 9.8% of human performance. This surprising result suggests that instruction induction might be a viable learning paradigm in and of itself, where instead of fitting a set of latent continuous parameters to the data, one searches for the best description in the natural language hypothesis space.
TL;DR: The authors present a SYCL implementation, available as part of hipSYCL, with a single-source, single compiler pass (SSCP) design and a unified code representation across backends, which allows a single compiler invocation to generate a binary that can execute kernels on all supported devices, dramatically reducing both compile times and the user effort required to generate such universal binaries.
Abstract: Current SYCL implementations rely on multiple compiler invocations to generate code for host and device, and typically even employ one compiler invocation per required backend code format such as SPIR-V, PTX or amdgcn. This makes generating “universal” binaries that can run on all devices supported by a SYCL implementation very time-consuming, or outright impractical. The ability to generate such universal binaries is however important e.g. when a software vendor wishes to distribute binaries to users that rely on unknown hardware configurations. To address this issue, we present the very first SYCL implementation with a single-source, single compiler pass (SSCP) design and a unified code representation across backends. This allows a single compiler invocation to generate a binary that can execute kernels on all supported devices, dramatically reducing both compile times as well as the user effort required to generate such universal binaries. Our work is publicly available as part of the hipSYCL implementation of SYCL, and supports Intel GPUs through SPIR-V, NVIDIA GPUs through CUDA PTX and AMD GPUs through ROCm amdgcn code. Our new compiler operates in two phases: At compile time, during the regular host compilation pass, it extracts the LLVM IR of kernels. This IR is then stored in a backend-independent fashion in the host binary. At runtime, the embedded LLVM IR is then lowered to the format required by backend drivers (e.g. PTX, SPIR-V, amdgcn). This approach enables portability of a single code representation even if backends do not support a common code format, while still allowing interoperability with vendor-specific optimized libraries. We find that our new compiler can generate highly portable binaries that run on any NVIDIA, Intel or AMD ROCm GPU with only 20% additional compilation time compared to a regular clang host compilation. 
On our test system, this is roughly 2.2× faster than compiling with the existing hipSYCL compiler for just three AMD GPUs. We also show that the cost of the additional runtime compilation steps can be expected to be approximately comparable to the cost of runtime compilation that backend drivers already perform today, e.g. to lower SPIR-V to machine code. Lastly, we present early performance results on four different GPUs from three vendors. We find that performance is usually within 10% of current multipass SYCL compiler techniques, with the maximum deviations ranging from a performance regression of 13% to a speedup of 27%. This implies that compared to current SYCL compilation techniques, our new compiler achieves similar performance while substantially decreasing compile times, and increasing the portability of generated binaries.
TL;DR: Predicting good quantum circuit compilation options using supervised machine learning techniques to automate the decision-making process for end-users.
Abstract: Any potential application of quantum computing, once encoded as a quantum circuit, needs to be compiled in order to be executed on a quantum computer. Deciding which qubit technology, which device, which compiler, and which corresponding settings are best for the considered problem—according to a measure of goodness—requires expert knowledge and is overwhelming for end-users from different domains trying to use quantum computing to their advantage. In this work, we treat the problem as a statistical classification task and explore the utilization of supervised machine learning techniques to optimize the compilation of quantum circuits. Based on that, we propose a framework that, given a quantum circuit, predicts the best combination of these options and, therefore, automatically makes these decisions for end-users. Experimental evaluations show that, considering a prototypical setting with 3000 quantum circuits, the proposed framework yields promising results: for more than three quarters of all unseen test circuits, the best combination of compilation options is determined. Moreover, for more than 95% of the circuits, a combination of compilation options within the top-three is determined—while the median compilation time is reduced by more than one order of magnitude. Furthermore, the resulting methodology not only provides end-users with a prediction of the best compilation options, but also provides means to extract explicit knowledge from the machine learning technique. This knowledge helps in two ways: it lays the foundation for further applications of machine learning in this domain and, also, allows one to quickly verify whether a machine learning algorithm is reasonably trained. The corresponding framework and the pre-trained classifier are publicly available on GitHub (https://github.com/cda-tum/MQTPredictor) as part of the Munich Quantum Toolkit (MQT).
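The "best combination" and "within the top-three" figures in this abstract are top-k accuracies over test circuits. A minimal sketch of that metric is shown below; the label names and the ranked predictions are invented for illustration and do not come from the paper or from the MQT Predictor code.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    best_options: Sequence[str],
    k: int,
) -> float:
    """Fraction of circuits whose true best option is among the top-k predictions."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, best_options)
        if truth in ranked[:k]
    )
    return hits / len(best_options)


# Hypothetical ranked predictions for four circuits (most likely option first).
ranked = [
    ["ibm_qiskit", "ionq_tket", "ibm_tket"],
    ["ionq_tket", "ibm_qiskit", "ibm_tket"],
    ["ibm_tket", "ionq_tket", "ibm_qiskit"],
    ["ibm_qiskit", "ibm_tket", "ionq_tket"],
]
# Hypothetical ground truth: the best compilation option per circuit.
best = ["ibm_qiskit", "ibm_qiskit", "ionq_tket", "ionq_tket"]

top1 = top_k_accuracy(ranked, best, k=1)
top3 = top_k_accuracy(ranked, best, k=3)
```

With this toy data only the first circuit is predicted exactly, while every circuit has its best option somewhere in the top three.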
TL;DR: VerilogEval benchmarks LLM performance in Verilog code generation and provides a dataset for evaluation.
Abstract: The increasing popularity of large language models (LLMs) has paved the way for their application in diverse domains. This paper proposes a benchmarking framework tailored specifically for evaluating LLM performance in the context of Verilog code generation for hardware design and verification. We present a comprehensive evaluation dataset consisting of 156 problems from the Verilog instructional website HDLBits. The evaluation set consists of a diverse set of Verilog code generation tasks, ranging from simple combinational circuits to complex finite state machines. The Verilog code completions can be automatically tested for functional correctness by comparing the transient simulation outputs of the generated design with a golden solution. We also demonstrate that the Verilog code generation capability of pretrained language models could be improved with supervised fine-tuning by bootstrapping with LLM generated synthetic problem-code pairs.
TL;DR: The authors propose a parametric ILP formulation for compiling quantum circuits on distributed architectures, where the parameter is a time horizon and the objective function counts the entanglement resources used; a binary search over the parameter minimizes the running time.
Abstract: Practical distributed quantum computing requires the development of efficient compilers, able to make quantum circuits compatible with some given hardware constraints. This problem is known to be tough, even for local computing. Here, we address it on distributed architectures. As generally assumed in this scenario, telegates represent the fundamental remote (inter-processor) operations. Each telegate consists of several tasks: i) entanglement generation and distribution, ii) local operations, and iii) classical communications. Entanglement generation and distribution is an expensive resource, as it is time-consuming and fault-prone. To mitigate its impact, we model an optimization problem that combines running-time minimization with the usage of that resource. Specifically, we provide a parametric ILP formulation, where the parameter denotes a time horizon (or time availability); the objective function counts the number of used resources. To minimize the time, a binary search solves the subject ILP by iterating over the parameter. Ultimately, to enhance the solution space, we extend the formulation, by introducing a predicate that manipulates the circuit given in input and parallelizes telegates' tasks.
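The binary search over the time-horizon parameter might look roughly like the sketch below. It assumes a hypothetical oracle `solve_ilp(horizon)` that returns the minimal resource count when the ILP is feasible for that horizon and `None` otherwise, and it assumes feasibility is monotone in the horizon; neither assumption is stated in the abstract, and the toy solver is a placeholder for a real ILP solver.

```python
from typing import Callable, Optional, Tuple


def min_feasible_horizon(
    solve_ilp: Callable[[int], Optional[int]],
    lo: int,
    hi: int,
) -> Optional[Tuple[int, int]]:
    """Find the smallest horizon in [lo, hi] for which the ILP is feasible.

    Returns (horizon, resource_count), or None if no horizon in range works.
    Assumes that if a horizon is feasible, every longer horizon is too.
    """
    best: Optional[Tuple[int, int]] = None
    while lo <= hi:
        mid = (lo + hi) // 2
        cost = solve_ilp(mid)
        if cost is None:
            lo = mid + 1          # too little time: search longer horizons
        else:
            best = (mid, cost)    # feasible: remember it, try shorter horizons
            hi = mid - 1
    return best


def toy_solver(horizon: int) -> Optional[int]:
    # Hypothetical instance: infeasible below 7 time steps, 4 resources needed.
    return None if horizon < 7 else 4


result = min_feasible_horizon(toy_solver, lo=1, hi=20)
```

Each probe costs one ILP solve, so the search needs only a logarithmic number of solves in the size of the horizon range.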
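TL;DR: The authors present a major reimplementation of YARPGen, an open-source generative compiler fuzzer, that stress-tests loop optimizers and has found 122 bugs in compilers for data-parallel languages, such as the Intel Implicit SPMD Program Compiler and the Intel oneAPI DPC++ compiler, and in C++ compilers such as GCC and Clang/LLVM.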
Abstract: Compilers are part of the foundation upon which software systems are built; they need to be as correct as possible. This paper is about stress-testing loop optimizers; it presents a major reimplementation of Yet Another Random Program Generator (YARPGen), an open-source generative compiler fuzzer. This new version has found 122 bugs, both in compilers for data-parallel languages, such as the Intel® Implicit SPMD Program Compiler and the Intel® oneAPI DPC++ compiler, and in C++ compilers such as GCC and Clang/LLVM. The first main contribution of our work is a novel method for statically avoiding undefined behavior when generating loops; the resulting programs conform to the relevant language standard, enabling automated testing. The second main contribution is a collection of mechanisms for increasing the diversity of generated loop code; in our evaluation, we demonstrate that these make it possible to trigger loop optimizations significantly more often, providing opportunities to discover bugs in the optimizers.
TL;DR: The authors present the first fuzzer that focuses on JIT compiler vulnerabilities in JavaScript engines, built around a dedicated intermediate representation (IR); over a six-month evaluation of their prototype they discovered 17 confirmed security vulnerabilities.
Abstract: JavaScript has become an essential part of the Internet infrastructure, and today’s interactive web applications would be inconceivable without this programming language. On the downside, this interactivity implies that web applications rely on an ever-increasing amount of computationally intensive JavaScript code, which burdens the JavaScript engine responsible for efficiently executing the code. To meet these rising performance demands, modern JavaScript engines ship with sophisticated just-in-time (JIT) compilers. However, JIT compilers are a complex technology and, consequently, provide a broad attack surface for potential faults that might even be security-critical. Previous work on discovering software faults in JavaScript engines found many vulnerabilities, often using fuzz testing. Unfortunately, these fuzzing approaches are not designed to generate source code that actually triggers JIT semantics. Consequently, JIT vulnerabilities are unlikely to be discovered by existing methods. In this paper, we close this gap and present the first fuzzer that focuses on JIT vulnerabilities. More specifically, we present the design and implementation of an intermediate representation (IR) that focuses on discovering JIT compiler vulnerabilities. We implemented a complete prototype of the proposed approach and evaluated our fuzzer over a period of six months. In total, we discovered 17 confirmed security vulnerabilities. Our results show that targeted JIT fuzzing is possible and a dangerously neglected gap in fuzzing coverage for JavaScript engines.
TL;DR: The authors define a low-level neural machine code and a programming framework for reservoir computers, use them to program a reservoir to solve complex equations and store chaotic dynamical systems as random-access memory, and demonstrate a fully distributed neural implementation of software virtualization and logical circuits.
Abstract: From logical reasoning to mental simulation, biological and artificial neural systems possess an incredible capacity for computation. Such neural computers offer a fundamentally novel computing paradigm by representing data continuously and processing information in a natively parallel and distributed manner. To harness this computation, prior work has developed extensive training techniques to understand existing neural networks. However, the lack of a concrete and low-level machine code for neural networks precludes us from taking full advantage of a neural computing framework. Here we provide such a machine code along with a programming framework by using a recurrent neural network (a reservoir computer) to decompile, code and compile analogue computations. By decompiling the reservoir’s internal representation and dynamics into an analytic basis of its inputs, we define a low-level neural machine code that we use to program the reservoir to solve complex equations and store chaotic dynamical systems as random-access memory. We further provide a fully distributed neural implementation of software virtualization and logical circuits, and even program a playable game of pong inside of a reservoir computer. Importantly, all of these functions are programmed without requiring any example data or sampling of state space. Finally, we demonstrate that we can accurately decompile the analytic, internal representations of a full-rank reservoir computer that has been conventionally trained using data. Taken together, we define an implementation of neural computation that can both decompile computations from existing neural connectivity and compile distributed programs as new connections.
TL;DR: Codon is a domain-extensible compiler and DSL framework for high-performance DSLs with Python's syntax and semantics, which leverages a novel intermediate representation to easily incorporate domain-specific optimizations and analyses.
Abstract: Domain-specific languages (DSLs) are able to provide intuitive high-level abstractions that are easy to work with while attaining better performance than general-purpose languages. Yet, implementing new DSLs is a burdensome task. As a result, new DSLs are usually embedded in general-purpose languages. While low-level languages like C or C++ often provide better performance as a host than high-level languages like Python, high-level languages are becoming more prevalent in many domains due to their ease and flexibility. Here, we present Codon, a domain-extensible compiler and DSL framework for high-performance DSLs with Python's syntax and semantics. Codon builds on previous work on ahead-of-time type checking and compilation of Python programs and leverages a novel intermediate representation to easily incorporate domain-specific optimizations and analyses. We showcase and evaluate several compiler extensions and DSLs for Codon targeting various domains, including bioinformatics, secure multi-party computation, block-based data compression and parallel programming, showing that Codon DSLs can provide benefits of familiar high-level languages and achieve performance typically only seen with low-level languages, thus bridging the gap between performance and usability.
TL;DR: The authors study the effectiveness and limitations of existing techniques for automatically translating unsafe raw pointers (in Rust programs translated from C) into safe Rust references via ownership and lifetime inference.
Abstract: The Rust language was created to provide safe low-level systems programming. There is both industrial and academic interest in the problem of (semi-)automatically translating C code to Rust in order to exploit Rust's safety guarantees. We study the effectiveness and limitations of existing techniques for automatically translating unsafe raw pointers (in Rust programs translated from C) into safe Rust references via ownership and lifetime inference. Our novel evaluation methodology enables our study to extend beyond prior studies, and to discover new information contradicting the conclusions of prior studies. We find that existing translation methods are severely limited by a lack of precision in the Rust compiler's safety checker, causing many safe pointer manipulations to be labeled as potentially unsafe. Leveraging this information, we propose methods for improving translation, based on encoding the results of a more precise analysis in a manner that is understandable to an unmodified Rust compiler. We implement one of our proposed methods, increasing the number of pointers that can be translated to safe Rust references by 75% over the baseline (from 12% to 21% of all pointers).
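The two figures in the last sentence of this abstract are consistent with each other: going from 12% to 21% of all pointers is a gain of 9 percentage points, which is a 75% relative increase over the baseline. A quick check, using only the numbers quoted in the abstract:

```python
baseline_pct = 12  # share of pointers translated to safe references (baseline)
improved_pct = 21  # share after applying the proposed method

# Absolute gain in percentage points, and the gain relative to the baseline.
absolute_gain = improved_pct - baseline_pct
relative_increase_pct = absolute_gain / baseline_pct * 100
```

The distinction matters when reading the result: the method nearly doubles the baseline, yet about four in five pointers remain untranslated.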
TL;DR: A reinforcement learning framework for optimizing quantum circuit compilation flows significantly outperforms individual compilers in terms of expected fidelity.
Abstract: Any quantum computing application, once encoded as a quantum circuit, must be compiled before being executable on a quantum computer. Similar to classical compilation, quantum compilation is a sequential process with many compilation steps and numerous possible optimization passes. Despite the similarities, the development of compilers for quantum computing is still in its infancy—lacking mutual consolidation on the best sequence of passes, compatibility, adaptability, and flexibility. In this work, we take advantage of decades of classical compiler optimization and propose a reinforcement learning framework for developing optimized quantum circuit compilation flows. Through distinct constraints and a unifying interface, the framework supports the combination of techniques from different compilers and optimization tools in a single compilation flow. Experimental evaluations show that the proposed framework—set up with a selection of compilation passes from IBM’s Qiskit and Quantinuum’s TKET—significantly outperforms both individual compilers in 73% of cases regarding the expected fidelity. The framework is available on GitHub (https://github.com/cda-tum/MQTPredictor) as part of the Munich Quantum Toolkit (MQT).
TL;DR: The authors introduce a generalized parity mapping for quantum optimization that solves optimization problems consisting of arbitrary $k$-body interactions and side conditions using planar quantum chip architectures.
Abstract: We introduce parity quantum optimization with the aim of solving optimization problems consisting of arbitrary $k$-body interactions and side conditions using planar quantum chip architectures. The method introduces a decomposition of the problem graph with arbitrary $k$-body terms using generalized closed cycles of a hypergraph. Side conditions of the optimization problem in the form of hard constraints can be included as open cycles containing the terms involved in the side conditions. The generalized parity mapping thus circumvents the need to translate optimization problems to a quadratic unconstrained binary optimization problem (QUBO) and allows for the direct encoding of higher-order constrained binary optimization problems (HCBO) on a square lattice and full parallelizability of gates.
TL;DR: A modular quantum compilation framework for distributed quantum computing that considers network and device constraints and characteristics. It optimizes EPR pair consumption and local transformation for distributed quantum algorithms.
Abstract: For most practical applications, quantum algorithms require large resources in terms of qubit number, much larger than those available with current NISQ processors. With the network and communication functionalities provided by the Quantum Internet, Distributed Quantum Computing (DQC) is considered as a scalable approach for increasing the number of available qubits for computational tasks. For DQC to be effective and efficient, a quantum compiler must find the best partitioning for the quantum algorithm and then perform smart remote operation scheduling to optimize EPR pair consumption. At the same time, the quantum compiler should also find the best local transformation for each partition. In this paper we present a modular quantum compilation framework for DQC that takes into account both network and device constraints and characteristics. We implemented and tested a quantum compiler based on the proposed framework with some circuits of interest, such as the VQE and QFT ones, considering different network topologies, with quantum processors characterized by heavy-hexagon coupling maps. We also devised a strategy for remote scheduling that can exploit both TeleGate and TeleData operations and tested the impact of using either only TeleGates or both. The evaluation results show that TeleData operations can have a positive impact on the number of consumed EPR pairs, depending on the characteristics of the compiled circuit. Meanwhile, choosing a more connected network topology helps reduce the number of layers dedicated to remote operations.
TL;DR: The authors curate PIE, a large-scale dataset of Performance-Improving Edits, and use it to evaluate and improve the capacity of large language models (LLMs) to suggest functionally correct, performance-improving code edits.
Abstract: The waning of Moore's Law has shifted the focus of the tech industry towards alternative methods for continued performance gains. While optimizing compilers are a standard tool to help increase program efficiency, programmers continue to shoulder much responsibility in crafting and refactoring code with better performance characteristics. In this paper, we investigate the ability of large language models (LLMs) to suggest functionally correct, performance-improving code edits. We hypothesize that language models can suggest such edits in ways that would be impractical for static analysis alone. We investigate these questions by curating a large-scale dataset of Performance-Improving Edits, PIE. PIE contains trajectories of programs, where a programmer begins with an initial, slower version and iteratively makes changes to improve the program's performance. We use PIE to evaluate and improve the capacity of large language models. Specifically, we use examples from PIE to fine-tune multiple variants of CODEGEN, a billion-scale Transformer-decoder model. Additionally, we use examples from PIE to prompt OpenAI's CODEX using few-shot prompting. By leveraging PIE, we find that both CODEX and CODEGEN can generate performance-improving edits, with speedups of more than 2.5x for over 25% of the programs, for C++ and Python, even after the C++ programs were compiled using the O3 optimization level. Crucially, we show that PIE allows CODEGEN, an open-sourced and 10x smaller model than CODEX, to match the performance of CODEX on this challenging task. Overall, this work opens new doors for creating systems and methods that can help programmers write efficient code.
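A headline figure such as "speedups of more than 2.5x for over 25% of the programs" can be computed from per-program timings as in the sketch below. The timing pairs are made up for illustration, and the sketch assumes speedup is defined as original runtime divided by edited runtime; it is not the paper's evaluation harness.

```python
from typing import List, Sequence, Tuple


def speedups(timings: Sequence[Tuple[float, float]]) -> List[float]:
    """Speedup per program, given (original_seconds, edited_seconds) pairs."""
    return [original / edited for original, edited in timings]


def fraction_above(values: Sequence[float], threshold: float) -> float:
    """Fraction of programs whose speedup strictly exceeds the threshold."""
    return sum(1 for v in values if v > threshold) / len(values)


# Hypothetical (original, edited) runtimes in seconds for four programs.
toy_timings = [(10.0, 2.0), (8.0, 4.0), (6.0, 6.0), (9.0, 3.0)]

toy_speedups = speedups(toy_timings)
share_fast = fraction_above(toy_speedups, threshold=2.5)
```

Here two of the four toy programs exceed the 2.5x threshold, so `share_fast` is one half.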
TL;DR: Trex is a transfer-learning-based framework that detects semantically similar binary functions by learning approximate execution semantics explicitly from functions' traces collected via forced execution (i.e., by violating the control flow semantics).
Abstract: Detecting semantically similar binary functions – a crucial capability with broad security usages including vulnerability detection, malware analysis, and forensics – requires understanding function behaviors and intentions. This task is challenging as semantically similar functions can be compiled to run on different architectures and with diverse compiler optimizations or obfuscations. Most existing approaches match functions based on syntactic features without understanding the functions’ execution semantics. We present Trex, a transfer-learning-based framework, to automate learning approximate execution semantics explicitly from functions’ traces collected via forced-execution (i.e., by violating the control flow semantics) and transfer the learned knowledge to match semantically similar functions. While it is known that forced-execution traces are too imprecise to be directly used to detect semantic similarity, our key insight is that these traces can instead be used to teach an ML model approximate execution semantics of diverse instructions and their compositions. We thus design a pretraining task, which trains the model to learn approximate execution semantics from the two modalities (i.e., forced-executed code and traces) of the function. We then finetune the pretrained model to match semantically similar functions. We evaluate Trex on 1,472,066 functions from 13 popular software projects, compiled to run on 4 architectures (x86, x64, ARM, and MIPS), and with 4 optimizations (O0-O3) and 5 obfuscations. Trex outperforms the state-of-the-art solutions by 7.8%, 7.2%, and 14.3% in cross-architecture, optimization, and obfuscation function matching, respectively, while running 8× faster. Ablation studies suggest that the pretraining significantly boosts the function matching performance, underscoring the importance of learning execution semantics.
Our case studies demonstrate the practical use-cases of Trex – on 180 real-world firmware images, Trex uncovers 14 vulnerabilities not disclosed by previous studies. We release the code and dataset of Trex at https://github.com/CUMLSec/trex.
TL;DR: SynShine is a machine-learning-based tool for repairing novices' syntax errors that substantially improves on the state of the art by learning to use compiler diagnostics, employing a very large neural model that leverages unsupervised pre-training, and relying on multi-label classification rather than autoregressive synthesis to generate the (repaired) output.
Abstract: Novice programmers struggle with the complex syntax of modern programming languages like Java, and make a lot of syntax errors. The diagnostic syntax error messages from compilers and IDEs are sometimes useful, but often the messages are cryptic and puzzling. Novices could be helped, and instructors’ time saved, by automated repair suggestions when dealing with syntax errors. Large samples of novice errors and fixes are now available, offering the possibility of data-driven machine-learning approaches to help novices fix syntax errors. Current machine-learning approaches do a reasonable job fixing syntax errors in shorter programs, but don't work as well even for moderately longer programs. We introduce SynShine, a machine-learning based tool that substantially improves on the state-of-the-art, by learning to use compiler diagnostics, employing a very large neural model that leverages unsupervised pre-training, and relying on multi-label classification rather than autoregressive synthesis to generate the (repaired) output. We describe SynShine's architecture in detail, and provide a detailed evaluation. We have built SynShine into a free, open-source version of Visual Studio Code (VSCode); we make all our source code and models freely available.
TL;DR: Li et al. conducted a large-scale empirical study of inline assembly in more than 7.6 million open-source Ethereum smart contracts from three aspects (source code, bytecode, and transactions), after designing new approaches to tackle several technical challenges.
Abstract: Being the most popular programming language for developing Ethereum smart contracts, Solidity allows using inline assembly to gain fine-grained control. Although many empirical studies on smart contracts have been conducted, to the best of our knowledge, none has examined inline assembly in smart contracts. To fill the gap, in this paper, we conduct the first large-scale empirical study of inline assembly on more than 7.6 million open-source Ethereum smart contracts from three aspects, namely, source code, bytecode, and transactions after designing new approaches to tackle several technical challenges. Through a thorough quantitative and qualitative analysis of the collected data, we obtain many new observations and insights. Moreover, by conducting a questionnaire survey on using inline assembly in smart contracts, we draw new insights from the valuable feedback. This work sheds light on the development of smart contracts as well as the evolution of Solidity and its compilers.
TL;DR: In this paper, the authors propose a new technique for finding zero-knowledge proof (ZKP) bugs caused by underconstrained polynomial equations over finite fields.
Abstract: As zero-knowledge proofs gain increasing adoption, the cryptography community has designed domain-specific languages (DSLs) that facilitate the construction of zero-knowledge proofs (ZKPs). Many of these DSLs, such as Circom, facilitate the construction of arithmetic circuits, which are essentially polynomial equations over a finite field. In particular, given a program in a zero-knowledge proof DSL, the compiler automatically produces the corresponding arithmetic circuit. However, a common and serious problem is that the generated circuit may be underconstrained, either due to a bug in the program or a bug in the compiler itself. Underconstrained circuits admit multiple witnesses for a given input, so a malicious party can generate bogus witnesses, thereby causing the verifier to accept a proof that it should not. Because of the increasing prevalence of such arithmetic circuits in blockchain applications, several million dollars worth of cryptocurrency have been stolen due to underconstrained arithmetic circuits. Motivated by this problem, we propose a new technique for finding ZKP bugs caused by underconstrained polynomial equations over finite fields. Our method performs semantic reasoning over the finite field equations generated by the compiler to prove whether or not each signal is uniquely determined by the input. Our proposed approach combines SMT solving with lightweight uniqueness inference to effectively reason about underconstrained circuits. We have implemented our proposed approach in a tool called QED2 and evaluate it on 163 Circom circuits. Our evaluation shows that QED2 can successfully solve 70% of these benchmarks, meaning that it either verifies the uniqueness of the output signals or finds a pair of witnesses that demonstrate non-uniqueness of the circuit. Furthermore, QED2 has found 8 previously unknown vulnerabilities in widely-used circuits.
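To make "underconstrained" concrete, the following minimal sketch brute-forces a toy "is zero" gadget over a small prime field and reports whether its output signal is uniquely determined by the input. This is not QED2's method, which combines SMT solving with uniqueness inference. The field size, the constraint functions, and the helper name `output_counterexample` are all assumptions made for illustration.

```python
from itertools import product

P = 7  # tiny prime field so exhaustive search is feasible (illustrative assumption)

def output_counterexample(constraints, p=P):
    """Return (x, out1, out2) if some input x admits two different outputs
    that satisfy every constraint; return None if the output is unique.
    Signals are (x, inv, out); inv is an auxiliary (non-output) signal."""
    for x in range(p):
        outs = sorted({out for inv, out in product(range(p), repeat=2)
                       if all(c(x, inv, out) for c in constraints)})
        if len(outs) > 1:
            return x, outs[0], outs[1]
    return None

def c_def(x, inv, out):
    # out = 1 - x * inv  (mod P)
    return out == (1 - x * inv) % P

def c_zero(x, inv, out):
    # x * out = 0  (mod P)
    return (x * out) % P == 0

sound = [c_def, c_zero]  # out is forced to 1 if x == 0, else 0
buggy = [c_def]          # missing constraint: inv can steer out freely

print(output_counterexample(sound))  # None
print(output_counterexample(buggy))  # (1, 0, 1)
```

In the buggy variant, dropping `x * out = 0` lets a prover pick `inv` so that `out` takes any value for a non-zero input, which is the kind of witness pair the abstract describes as demonstrating non-uniqueness.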
TL;DR: In this paper, a FLOP-efficient Obara-Saika-based recursive evaluation scheme is proposed to improve the efficiency of Gaussian integral evaluation on modern accelerated architectures, leveraging register memory for reduced memory footprint and direct compile-time generation of optimized kernels.
Abstract: To improve the efficiency of Gaussian integral evaluation on modern accelerated architectures, FLOP-efficient Obara-Saika-based recursive evaluation schemes are optimized for the memory footprint. For the 3-center 2-particle integrals that are key for the evaluation of Coulomb and other 2-particle interactions in the density-fitting approximation, the use of multiquantal recurrences (in which multiple quanta are created or transferred at once) is shown to produce significant memory savings. Other innovations include leveraging register memory for reduced memory footprint and direct compile-time generation of optimized kernels (instead of custom code generation) with compile-time features of modern C++/CUDA. Performance of conventional and CUDA-based implementations of the proposed schemes is illustrated for both the individual batches of integrals involving up to Gaussians with low and high angular momenta (up to L = 6) and contraction degrees, as well as for the density-fitting-based evaluation of the Coulomb potential. The computer implementation is available in the open-source LibintX library.
TL;DR: Binary reverse engineering, used to understand and analyse programs for which the source code is unavailable, is difficult and costly, involving considerable effort in labelling code with helpful summaries.
Abstract: Binary reverse engineering is used to understand and analyse programs for which the source code is unavailable. Decompilers can help, transforming opaque binaries into a more readable source code-like representation. Still, reverse engineering is difficult and costly, involving considerable effort in labelling code with helpful summaries. While the automated summarisation of decompiled code can help reverse engineers understand and analyse binaries, current work mainly focuses on summarising source code, and no suitable dataset exists for this task. In this work, we extend large pre-trained language models of source code to summarise decompiled binary functions. Furthermore, we investigate the impact of input and data properties on the performance of such models. Our approach consists of two main components: the data and the model. We first build CAPYBARA, a dataset of 214K decompiled function-documentation pairs across various compiler optimisations. We extend CAPYBARA further by removing identifiers and deduplicating the data. Next, we fine-tune the CodeT5 base model with CAPYBARA to create BinT5. BinT5 achieves state-of-the-art BLEU-4 scores of 60.83, 58.82, and 44.21 for summarising source, decompiled, and obfuscated decompiled code, respectively. This indicates that these models can be extended to decompiled binaries successfully. Finally, we found that the performance of BinT5 is not heavily dependent on the dataset size and compiler optimisation level. We recommend future research to further investigate transferring knowledge when working with less expressive input formats such as stripped binaries.
TL;DR: Ccoft is a framework to detect bugs in C++ compiler front-ends by transforming C++ grammars into a flexible structured format and then employing an equal-chance selection (ECS) strategy to conduct structure-aware grammar mutation to generate diverse C++ programs.
Abstract: C++ is a widely used programming language and the C++ front-end is a critical part of a C++ compiler. Although many techniques have been proposed to test compilers, few studies are devoted to detecting bugs in C++ compilers. In this study, we take the first step to detect bugs in C++ compiler front-ends. To do so, two main challenges need to be addressed, namely, the acquisition of test programs that are more likely to trigger bugs in compiler front-ends and the bug identification from complicated compiler outputs. In this article, we propose a novel framework named Ccoft to detect bugs in C++ compiler front-ends. To address the first challenge, Ccoft implements a practical program generator. The generator first transforms C++ grammars into a flexible structured format and then utilizes an equal-chance selection (ECS) strategy to conduct structure-aware grammar mutation to generate diverse C++ programs. Next, Ccoft employs a set of differential testing strategies to identify various kinds of bugs in C++ compiler front-ends by comparing complex outputs emitted by C++ compilers, thus tackling the second challenge. Empirical evaluation results over two mainstream compilers (i.e., GCC and Clang) show that Ccoft greatly improves on two state-of-the-art approaches (i.e., Dharma and Grammarinator) by 135% and 111% in terms of the numbers of detected bugs, respectively. By running Ccoft for three months, we have successfully reported 136 bugs for two C++ compilers, of which 78 (57 confirmed, assigned, or fixed) are for GCC and 58 (10 confirmed or fixed) are for Clang.
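The differential-testing idea in the abstract above can be sketched in a few lines: feed the same program to several front-ends and flag any disagreement in their verdicts. The sketch below is a simplified illustration, not Ccoft's implementation. The two "front-ends" are hypothetical Python stand-ins rather than GCC or Clang, and the accept/reject verdict is a deliberately crude substitute for the complex compiler outputs the paper compares.

```python
def differential_test(program, frontends):
    """Run `program` through every front-end and return the per-front-end
    verdicts if they disagree, or None if all front-ends agree."""
    verdicts = {name: run(program) for name, run in frontends.items()}
    if len(set(verdicts.values())) > 1:
        return verdicts
    return None

# Hypothetical stand-ins for real compiler front-ends.
def frontend_a(src):
    # Rejects programs with unbalanced braces.
    return "accept" if src.count("{") == src.count("}") else "reject"

def frontend_b(src):
    # Deliberately buggy: never checks brace balance.
    return "accept"

frontends = {"A": frontend_a, "B": frontend_b}

print(differential_test("int main() { return 0; }", frontends))
print(differential_test("int main() { return 0;", frontends))
```

A disagreement does not say which front-end is wrong. It only marks the program as worth inspecting, which is why the abstract treats bug identification from compiler outputs as a separate challenge.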
TL;DR: GraphAGILE is a domain-specific FPGA-based overlay accelerator for graph neural network (GNN) inference, which can execute various computation kernels of GNNs.
Abstract: This paper presents GraphAGILE, a domain-specific FPGA-based overlay accelerator for graph neural network (GNN) inference. GraphAGILE consists of (1) a novel unified architecture design with an instruction set, and (2) a compiler built upon the instruction set that can quickly generate optimized code. Due to the proposed instruction set architecture (ISA) and the compiler, GraphAGILE does not require any FPGA reconfiguration when performing inference on various GNN models and input graphs. For the architecture design, we propose a novel hardware module named Adaptive Computation Kernel (ACK), that can execute various computation kernels of GNNs, including general matrix multiplication (GEMM), sparse-dense matrix multiplication (SpDMM), and sampled dense-dense matrix multiplication (SDDMM). The compiler takes the specifications of a GNN model and the graph metadata (e.g., the number of vertices and edges) as input, and generates a sequence of instructions for inference execution. We develop the following compiler optimizations to reduce inference latency: (1) computation order optimization that automatically reorders the computation graph to reduce the total computation complexity, (2) layer fusion that merges adjacent layers to reduce data communication volume, (3) data partitioning with a partition-centric execution scheme that partitions the input graph to fit the available on-chip memory of FPGA, (4) kernel mapping that automatically selects execution mode for ACK, and performs task scheduling to overlap computation with data communication and achieves dynamic load balance. We implement GraphAGILE on a state-of-the-art FPGA platform, Xilinx Alveo U250. GraphAGILE can execute widely used GNN models, including GCN, GAT, GIN, GraphSAGE, SGC and other GNN models supported by GraphGym.
Experimental results show that GraphAGILE achieves up to $47.1\times$ ($3.9\times$) reduction in end-to-end latency, including the latency of compilation and hardware execution, compared with the state-of-the-art implementations on CPU (GPU), and achieves up to $2.9\times$ reduction in hardware execution latency compared with the state-of-the-art FPGA accelerators.
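As a rough illustration of why computation order matters for a GNN layer of the form A·X·W (sparse adjacency A, dense features X, dense weights W), the sketch below compares multiply-accumulate counts for the two association orders. The cost model (non-zeros times dense columns for the sparse product, n·f_in·f_out for the dense product) and the function names are assumptions for illustration. The abstract does not describe GraphAGILE's actual compiler heuristics.

```python
def layer_cost(n, nnz, f_in, f_out):
    """Multiply-accumulate counts for one layer computing A @ X @ W.

    n     : number of vertices (A is n x n with nnz non-zeros)
    f_in  : input feature width  (X is n x f_in)
    f_out : output feature width (W is f_in x f_out)
    """
    aggregate_first = nnz * f_in + n * f_in * f_out  # (A @ X) @ W
    update_first = n * f_in * f_out + nnz * f_out    # A @ (X @ W)
    return aggregate_first, update_first

def choose_order(n, nnz, f_in, f_out):
    """Pick the association order with the smaller estimated cost."""
    agg, upd = layer_cost(n, nnz, f_in, f_out)
    return "aggregate_first" if agg <= upd else "update_first"

# A layer that shrinks features (512 -> 16) favours applying W first.
print(layer_cost(1000, 10_000, 512, 16))    # (13312000, 8352000)
print(choose_order(1000, 10_000, 512, 16))  # update_first
```

Under this model the dense product costs the same either way, so the choice reduces to running the sparse aggregation on whichever feature width is narrower.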
TL;DR: Zhang et al. proposed MapZero, an architecture-aware compiler for coarse-grained reconfigurable architectures (CGRAs) based on reinforcement learning and Monte-Carlo tree search (MCTS).
Abstract: Coarse-grained reconfigurable architecture (CGRA) has become a promising candidate for data-intensive computing due to its flexibility and high energy efficiency. CGRA compilers map data flow graphs (DFGs) extracted from applications onto CGRAs, playing a fundamental role in fully exploiting hardware resources for acceleration. Yet the existing compilers are time-demanding and cannot guarantee optimal results due to the traversal search of enormous search spaces brought about by the spatio-temporal flexibility of CGRA structures and the complexity of DFGs. Inspired by the amazing progress in reinforcement learning (RL) and Monte-Carlo tree search (MCTS) for real-world problems, we consider constructing a compiler that can learn from past experiences and comprehensively understand the target DFG and CGRA. In this paper, we propose an architecture-aware compiler for CGRAs based on RL and MCTS, called MapZero - a framework to automatically extract the characteristics of DFG and CGRA hardware and map operations onto varied CGRA fabrics. We apply Graph Attention Network to generate an adaptive embedding for DFGs and also model the functionality and interconnection status of the CGRA, aiming at training an RL agent to perform placement and routing intelligently. Experimental results show that MapZero can generate superior-quality mappings and reduce compilation time hundreds of times compared to state-of-the-art methods. MapZero can find high-quality mappings very quickly when the feasible solution space is rather small and all other compilers fail. We also demonstrate the scalability and broad applicability of our framework.
TL;DR: GLSLsmith is a program generation tool built on program reconditioning, a method that allows differential testing and test-case reduction to be applied even when the programming language of interest features undefined behaviour (UB) and no tools exist to detect and avoid this UB.
Abstract: We introduce program reconditioning, a method for allowing program generation and differential testing to be used to find miscompilation bugs, and test-case reduction to be used to simplify bug-triggering programs, even when (a) the programming language of interest features undefined behaviour (UB) and (b) no tools exist to detect and avoid this UB. We present two program generation tools based on our reconditioning idea: GLSLsmith for the OpenGL Shading Language (GLSL), a widely-used language for graphics programming, and WGSLsmith for the WebGPU Shading Language (WGSL), a new language for web-based graphics rendering. GLSL features many UBs, but unlike for languages such as C and C++ no tools exist to detect them automatically. While the WGSL language specification features very limited UB, early WGSL implementations do exhibit UB, for reasons of initial implementation simplicity, making it challenging to test them to quickly detect and eliminate unrelated miscompilation bugs. Thanks to reconditioning, we show that GLSLsmith and WGSLsmith allow differential testing and test-case reduction to be applied to compilers for GLSL and WGSL for the first time, despite the unavailability of UB detection techniques for these languages. Through a large testing campaign, we have found 24 and 33 bugs in GLSL and WGSL compilers, respectively. We present experiments showing that when reconditioning is disabled, compiler testing leads to a high rate of test programs that appear to trigger miscompilation bugs, but actually just feature UB. We also present a novel approach to managing floating-point roundoff error using reconditioning, implemented for both GLSL and WGSL.
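One way to picture reconditioning is as a source-to-source rewrite that wraps potentially unsafe operations so that every generated program has defined behaviour. The sketch below is an analogy in Python, not the GLSLsmith or WGSLsmith implementation. Python raises an exception on division by zero rather than exhibiting UB, so the exception stands in for UB here, and the helper `safe_floordiv` and the class `Recondition` are hypothetical names introduced for illustration.

```python
import ast

def safe_floordiv(a, b):
    """Total version of a // b: a zero divisor is replaced by 1."""
    return a // (b if b != 0 else 1)

class Recondition(ast.NodeTransformer):
    """Rewrite every `a // b` into `safe_floordiv(a, b)`."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # recondition nested operands first
        if isinstance(node.op, ast.FloorDiv):
            call = ast.Call(
                func=ast.Name(id="safe_floordiv", ctx=ast.Load()),
                args=[node.left, node.right],
                keywords=[],
            )
            return ast.copy_location(call, node)
        return node

def recondition(expr_src):
    """Parse an expression, wrap its divisions, and compile the result."""
    tree = ast.parse(expr_src, mode="eval")
    tree = ast.fix_missing_locations(Recondition().visit(tree))
    return compile(tree, "<reconditioned>", "eval")

code = recondition("10 // (x - 3) + 1")
env = {"safe_floordiv": safe_floordiv}
print(eval(code, {**env, "x": 3}))  # 11 (divisor 0 replaced by 1)
print(eval(code, {**env, "x": 5}))  # 6  (unchanged behaviour)
```

Because the rewrite is applied after generation, it can be re-applied after each test-case reduction step, so a reduced program stays well defined.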
TL;DR: A quantum circuit compiler for a shuttling-based trapped-ion quantum computer that reduces gate counts by factors up to 5.1 compared to standard Pytket and up to 2.2 compared to standard Qiskit compilation.
Abstract: The increasing capabilities of quantum computing hardware and the challenge of realizing deep quantum circuits require fully automated and efficient tools for compiling quantum circuits. To express arbitrary circuits in a sequence of native gates specific to the quantum computer architecture, it is necessary to make algorithms portable across the landscape of quantum hardware providers. In this work, we present a compiler capable of transforming and optimizing a quantum circuit targeting a shuttling-based trapped-ion quantum processor. It consists of custom algorithms set on top of the quantum circuit framework Pytket. The performance was evaluated for a wide range of quantum circuits and the results show that the gate counts can be reduced by factors up to 5.1 compared to standard Pytket and up to 2.2 compared to standard Qiskit compilation.
TL;DR: In this paper, the authors adopt and extend state-of-the-art research in query compilers to propose an efficient query engine embedded in Python, evaluated on the entire set of TPC-H queries.
Abstract: The simplicity of Python and its rich set of libraries have made it the most popular language for data science. Moreover, the interpreted nature of Python offers an easy debugging experience for developers. However, this comes at the price of poor performance compared to compiled code. In this paper, we adopt and extend state-of-the-art research in query compilers to propose an efficient query engine embedded in Python. Our open-sourced framework enables developers to debug in Python, while being able to easily build a compiled version of the code for deployment. Our benchmark results on the entire set of TPC-H queries show that our approach covers different types of relational workloads and is competitive with state-of-the-art in-memory engines in both single- and multi-threaded settings.
TL;DR: It is demonstrated that it is practical to prove knowledge of real exploits for real-world processor architectures without the need for source code and without limiting the consideration to narrow vulnerability classes.
Abstract: We consider the problem of proving in zero-knowledge the existence of vulnerabilities in executables compiled to run on real-world processors. We demonstrate that it is practical to prove knowledge of real exploits for real-world processor architectures without the need for source code and without limiting our consideration to narrow vulnerability classes. To achieve this, we devise a novel circuit compiler and a toolchain that produces highly optimized, non-interactive zero-knowledge proofs for programs executed on the MSP430, an ISA commonly used in embedded hardware. Our toolchain employs a highly optimized circuit compiler and a number of novel optimizations to construct efficient proofs for program binaries. To demonstrate the capability of our system, we test our toolchain by constructing proofs for challenges in the Microcorruption capture the flag exercises.