AIMMX: Artificial Intelligence Model Metadata Extractor

doi:10.1145/3379597.3387448

Proceedings Article

Jason Tsay, +4 more

- 29 Jun 2020

- pp 81-92

21

TL;DR: This paper presents AIMMX, a library that simplifies AI Model Metadata eXtraction from software repositories, along with an exploratory analysis of data and method reproducibility over the models in the evaluation dataset and a catalog tool for discovering and managing models.

Abstract: Despite all of the power that machine learning and artificial intelligence (AI) models bring to applications, much of AI development is currently a fairly ad hoc process. Software engineering and AI development share many of the same languages and tools, but AI development as an engineering practice is still in early stages. Mining software repositories of AI models enables insight into the current state of AI development. However, much of the relevant metadata around models are not easily extractable directly from repositories and require deduction or domain knowledge. This paper presents a library called AIMMX that enables simplified AI Model Metadata eXtraction from software repositories. The extractors have five modules for extracting AI model-specific metadata: model name, associated datasets, references, AI frameworks used, and model domain. We evaluated AIMMX against 7,998 open-source models from three sources: model zoos, arXiv AI papers, and state-of-the-art AI papers. Our platform extracted metadata with 87% precision and 83% recall. As preliminary examples of how AI model metadata extraction enables studies and tools to advance engineering support for AI development, this paper presents an exploratory analysis for data and method reproducibility over the models in the evaluation dataset and a catalog tool for discovering and managing models. Our analysis suggests that while data reproducibility may be relatively poor with 42% of models in our sample citing their datasets, method reproducibility is more common at 72% of models in our sample, particularly state-of-the-art models. Our collected models are searchable in a catalog that uses existing metadata to enable advanced discovery features for efficiently finding models.
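The abstract names five kinds of extracted metadata and reports precision and recall. As a rough illustration only, the following Python sketch shows one way such a record and score could be represented. It is not the AIMMX API: the names `ModelMetadata`, `KNOWN_FRAMEWORKS`, `detect_frameworks`, and `precision_recall` are hypothetical, and detecting frameworks from import statements is an assumed heuristic, not the method described in the paper.

```python
import ast
from dataclasses import dataclass, field

# Hypothetical mapping from top-level import names to framework labels.
KNOWN_FRAMEWORKS = {
    "tensorflow": "TensorFlow",
    "torch": "PyTorch",
    "keras": "Keras",
    "mxnet": "MXNet",
}


@dataclass
class ModelMetadata:
    """Hypothetical record for the five metadata kinds named in the abstract."""
    name: str = ""
    datasets: list[str] = field(default_factory=list)
    references: list[str] = field(default_factory=list)
    frameworks: list[str] = field(default_factory=list)
    domain: str = ""


def detect_frameworks(source: str) -> list[str]:
    """Return sorted framework labels for absolute imports found in `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules = [node.module]
        else:
            continue
        for module in modules:
            # Only the top-level package name identifies the framework.
            label = KNOWN_FRAMEWORKS.get(module.split(".")[0])
            if label:
                found.add(label)
    return sorted(found)


def precision_recall(extracted: set[str], expected: set[str]) -> tuple[float, float]:
    """Precision = TP / |extracted|, recall = TP / |expected| (0.0 if empty)."""
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall
```

A record could then be filled in with `ModelMetadata(name="demo", frameworks=detect_frameworks(source_text))`, and extracted values compared against hand-labeled ones with `precision_recall`.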

Citations

•Journal Article

Data Mining Practical Machine Learning Tools and Techniques

อนิรุธ สืบสิงห์

- 01 Jan 2014

- Journal of management science

13.6K

•Journal Article•10.1109/access.2023.3287195

A Survey of Privacy Risks and Mitigation Strategies in the Artificial Intelligence Life Cycle

01 Jan 2023

- IEEE Access

TL;DR: The authors examine privacy risks in different phases of the AI life cycle and review existing privacy-enhancing solutions, including technologies, requirements, and process solutions, to counter these risks.

25

•Journal Article•10.1145/3643916.3644412

How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study

Federica Pepe, +5 more

- 15 Apr 2024

TL;DR: Pre-trained transformer models hosted by Hugging Face lack transparency in terms of datasets, bias, and licenses. There is a need for further research and potential legislation to improve the transparency of ML models.

14

•Proceedings Article•10.1145/3524842.3528467

Complex Python Features in the Wild

Yi Yang, +2 more

- 01 May 2022

TL;DR: The findings show that usage of dynamic features that pose a threat to static analysis is infrequent, and the authors present a list of Python features that any “minimal syntax” ought to handle in order to capture developers' use of the Python language.

13

•Proceedings Article•10.1145/3540250.3560881

Discrepancies among pre-trained deep neural networks: a new threat to model zoo reliability

Diego Montes, +5 more

- 07 Nov 2022

TL;DR: This paper investigates the reliability of pre-trained deep neural networks (DNNs) from model zoos. The authors measure the accuracy, latency, and architecture of 36 DNNs across four model zoos and find differences of 1.23%-2.62% in accuracy and 9%-131% in latency.

12

References

•Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

- arXiv: Computation and Language

TL;DR: BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. It can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

81.7K

•Journal Article

Data Mining Practical Machine Learning Tools and Techniques

อนิรุธ สืบสิงห์

- 01 Jan 2014

- Journal of management science

13.6K

•Proceedings Article

Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

Christian Szegedy, +3 more

- 23 Feb 2016

TL;DR: In this paper, the authors show that training with residual connections accelerates the training of Inception networks significantly, and they also present several new streamlined architectures for both residual and non-residual Inception Networks.

11K

•Proceedings Article

Hidden technical debt in Machine learning systems

D. Sculley, +9 more

- 07 Dec 2015

TL;DR: The authors find that it is common to incur massive ongoing maintenance costs in real-world ML systems, and they explore several ML-specific risk factors to account for in system design.

1.1K

•Proceedings Article•10.21437/INTERSPEECH.2014-564

One billion word benchmark for measuring progress in statistical language modeling.

Ciprian Chelba, +6 more

- 14 Sep 2014

TL;DR: The authors propose a new benchmark corpus for measuring progress in statistical language modeling, with almost one billion words of training data. It is useful for quickly evaluating novel language modeling techniques and for comparing their contribution when combined with other advanced techniques.

1K
