Open Access • Posted Content
PyTorrent: A Python Library Corpus for Large-scale Language Models.
Mehdi Bahrami,N. C. Shrikanth,Shade Ruangwan,Lei Liu,Yuji Mizobuchi,Masahiro Fukuyori,Chen Wei-Peng,Kazuki Munakata,Tim Menzies
TL;DR: PyTorrent is a large-scale collection of both semantic and natural language resources intended to support active Software Engineering research areas such as code reuse and code comprehensibility; it contains 218,814 Python package libraries from the PyPI and Anaconda environments.
Abstract: A large-scale collection of both semantic and natural language resources is essential to leverage active Software Engineering research areas such as code reuse and code comprehensibility. Existing machine learning models ingest data from open source repositories (such as GitHub projects) and forum discussions (such as Stackoverflow.com). In this showcase, we instead took a step backward to orchestrate a corpus titled PyTorrent that contains 218,814 Python package libraries from the PyPI and Anaconda environments. We did so because earlier studies have shown that much of the code is redundant, and that Python packages from these environments are better in quality and well documented. PyTorrent enables users (such as data scientists and students) to build off-the-shelf machine learning models directly without spending months of effort on large infrastructure. The dataset, schema, and a pretrained language model are available at: this https URL
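As an illustration of what pairing "semantic and natural language resources" can look like in practice, the sketch below extracts (docstring, code) pairs from Python source using the standard `ast` module. It is a minimal sketch under stated assumptions, not PyTorrent's actual pipeline or schema: the function name `extract_pairs`, the tuple layout, and the sample source are all hypothetical.

```python
import ast
import textwrap


def extract_pairs(source: str):
    """Return (name, docstring, code) for each documented function in `source`.

    Hypothetical helper for illustration; the real PyTorrent schema may differ.
    """
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)  # natural language side
            if doc:
                # Code side: the exact source text of the function (Python 3.8+).
                code = ast.get_source_segment(source, node)
                pairs.append((node.name, doc, code))
    return pairs


# Hypothetical sample input: one documented and one undocumented function.
SAMPLE = textwrap.dedent('''
    def add(a, b):
        """Return the sum of a and b."""
        return a + b

    def helper(x):
        return x * 2
''')

pairs = extract_pairs(SAMPLE)
```

Only `add` yields a pair here, because `helper` has no docstring. Skipping undocumented functions is a design choice of this sketch, not something the abstract specifies.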
Citations
How Important Are Good Method Names in Neural Code Generation? A Model Robustness Perspective
Guang Yang,Yu Zhou,Wei Yang,Tao Yue,Xiang Chen,Taolue Chen
TL;DR: The importance of good method names in neural code generation is demonstrated. A novel approach, RADAR, is proposed to enhance the robustness of PCGMs against adversarial method name attacks.
6
•Posted Content
AugmentedCode: Examining the Effects of Natural Language Resources in Code Retrieval Models.
Mehdi Bahrami,N. C. Shrikanth,Yuji Mizobuchi,Lei Liu,Masahiro Fukuyori,Chen Wei-Peng,Kazuki Munakata
TL;DR: The authors introduce augmented code (AugmentedCode) retrieval, which takes advantage of existing information within the code and constructs an augmented programming language to improve the performance of code retrieval models.
1
Constructing Temporal Networks of OSS Programming Language Ecosystems
01 Mar 2023
TL;DR: In this paper, the authors and projects of OSS ecosystems are represented as nodes in a collaboration graph, which enables various forms of social network analysis at the scale of language ecosystems; the evolution of each ecosystem is captured by slicing its network into 30 historical snapshots.
References
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin,Ming-Wei Chang,Kenton Lee,Kristina Toutanova
- 11 Oct 2018
TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can then be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
24.6K
•Proceedings Article
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke,Sam Gross,Francisco Massa,Adam Lerer,James Bradbury,Gregory Chanan,Trevor Killeen,Zeming Lin,Natalia Gimelshein,Luca Antiga,Alban Desmaison,Andreas Kopf,Edward Z. Yang,Zachary DeVito,Martin Raison,Alykhan Tejani,Sasank Chilamkurthy,Benoit Steiner,Lu Fang,Junjie Bai,Soumith Chintala
- 01 Jan 2019
TL;DR: This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.
•Posted Content
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TL;DR: This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation, and cosine-distance losses.
7.3K
A Complexity Measure
TL;DR: Several properties of the graph-theoretic complexity are proved which show, for example, that complexity is independent of physical size and complexity depends only on the decision structure of a program.
6K
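The two properties named in the summary of "A Complexity Measure" (independence from physical size, dependence only on decision structure) can be illustrated with a small sketch. The graph-theoretic measure, commonly known as cyclomatic complexity, is defined on a control-flow graph as V(G) = E - N + 2P; for structured code this equals the number of decision points plus one. The function below approximates that count for Python source by walking the syntax tree. It is an assumption-laden simplification that covers only a subset of Python constructs, and the name `approx_cyclomatic_complexity` is hypothetical.

```python
import ast

# Constructs counted as one decision point each (a deliberately small subset).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def approx_cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity as (decision points + 1)."""
    decisions = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra branches, one per extra operand.
            decisions += len(node.values) - 1
    return decisions + 1


# Two hypothetical functions with the same decision structure but different length.
SHORT = "def f(x):\n    if x > 0:\n        return 1\n    return 0\n"
LONG = (
    "def g(x):\n"
    "    a = 1\n"
    "    b = 2\n"
    "    c = a + b\n"
    "    if x > 0:\n"
    "        return c\n"
    "    return 0\n"
)

short_score = approx_cyclomatic_complexity(SHORT)
long_score = approx_cyclomatic_complexity(LONG)
```

Both functions receive the same score because the extra straight-line statements in `LONG` add no decision points.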