BigDataScript: a scripting language for data pipelines
TL;DR: By abstracting pipeline concepts at programming language level, BDS simplifies implementation, execution and management of complex bioinformatics pipelines, resulting in reduced development and debugging cycles as well as cleaner code.
Abstract: Motivation: The analysis of large biological datasets often requires complex processing pipelines that run for a long time on large computational infrastructures. We designed and implemented a simple script-like programming language with a clean and minimalist syntax to develop and manage pipeline execution and provide robustness to various types of software and hardware failures as well as portability.
Results: We introduce the BigDataScript (BDS) programming language for data processing pipelines, which improves abstraction from hardware resources and assists with robustness. Hardware abstraction allows BDS pipelines to run without modification on a wide range of computer architectures, from a small laptop to multi-core servers, server farms, clusters and clouds. BDS achieves robustness by incorporating the concepts of absolute serialization and lazy processing, thus allowing pipelines to recover from errors. By abstracting pipeline concepts at programming language level, BDS simplifies implementation, execution and management of complex bioinformatics pipelines, resulting in reduced development and debugging cycles as well as cleaner code.
Availability and implementation: BigDataScript is available under open-source license at http://pcingola.github.io/BigDataScript
Contact: mocliamg@inalogniceolbap
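The lazy processing idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not BDS code and not the authors' implementation: it assumes that "lazy" means a step is skipped when all of its outputs already exist and are at least as new as its inputs, judged by file modification times. The names `needs_update` and `lazy_step` are hypothetical and introduced only for this illustration.

```python
import os


def needs_update(outputs, inputs):
    """Return True if any output is missing or older than the newest input.

    Assumption for this sketch: freshness is judged by file modification time.
    """
    if not all(os.path.exists(path) for path in outputs):
        return True
    # With no inputs or no outputs there is nothing to compare; treat as up to date.
    newest_input = max((os.path.getmtime(p) for p in inputs), default=0.0)
    oldest_output = min((os.path.getmtime(p) for p in outputs), default=newest_input)
    return oldest_output < newest_input


def lazy_step(outputs, inputs, action):
    """Run `action` only when the outputs are stale; report whether it ran."""
    if needs_update(outputs, inputs):
        action()
        return True
    return False
```

Under this assumption, re-running a pipeline after a failure repeats only the steps whose outputs are missing or out of date, which is one way a pipeline can resume without redoing finished work.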
Citations
Sustainable data analysis with Snakemake.
Felix Mölder, Kim Philipp Jablonski, Brice Letcher, Michael B Hall, Christopher Tomkins-Tinch, Vanessa Sochat, Jan Forster, Soohyun Lee, Sven Twardziok, Alexander Kanitz, Andreas Wilm, Manuel Holtgrewe, Sven Rahmann, Sven Nahnsen, Johannes Köster
TL;DR: It is shown how the popular workflow management system Snakemake can be used to guarantee reproducibility, and how it enables an ergonomic, combined, unified representation of all steps involved in data analysis, ranging from raw data processing, to quality control and fine-grained, interactive exploration and plotting of final results.
A review of bioinformatic pipeline frameworks
TL;DR: The design philosophies of several current pipeline frameworks are surveyed and compared and practical recommendations are provided based on analysis requirements and the user base.
GenPipes: an open-source framework for distributed and scalable genomic analyses
Mathieu Bourgey, Rola Dali, Robert Eveleigh, Kuang Chung Chen, Louis Letourneau, Joel Fillon, Marc Michaud, Maxime Caron, Johanna Sandoval, Francois Lefebvre, Gary Leveque, Eloi Mercier, David Bujold, Pascale Marquis, Patrick Tran Van, David Anderson de Lima Morais, Julien Tremblay, Xiaojian Shao, Edouard Henrion, Emmanuel Gonzalez, Pierre-Olivier Quirion, B. Caron, Guillaume Bourque
TL;DR: GenPipes is a flexible Python-based framework that facilitates the development and deployment of multi-step workflows optimized for high-performance computing clusters and the cloud, and offers genomics researchers a simple method to analyze different types of data.
Deciphering microbial mechanisms underlying soil organic carbon storage in a wheat-maize rotation system.
Xingjie Wu, Pengfei Liu, Carl-Eric Wegner, Yu Luo, Ke-Qing Xiao, Zhenling Cui, Fusuo Zhang, Werner Liesack, Jingjing Peng
TL;DR: A link between microbial life history strategies and soil organic carbon storage in agroecosystems is presumed but largely unexplored at the gene level. The authors aimed to elucidate whether and how differential organic material amendments (manure versus peat-vermiculite) affect this link, relative to sole chemical fertilizer application, in a wheat-maize rotation field experiment.
NextflowWorkbench: Reproducible and Reusable Workflows for Beginners and Experts
TL;DR: The NextflowWorkbench is presented. Designed for both beginners and experts, it blurs the distinction between user interface and scripting language, and it extends and reuses the popular Nextflow workflow description language while sharing its advantages.
References
Fast and accurate short read alignment with Burrows–Wheeler transform
Heng Li, Richard Durbin
TL;DR: Burrows-Wheeler Alignment tool (BWA) is implemented, a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data
Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran V. Garimella, David Altshuler, Stacey Gabriel, Mark J. Daly, Mark A. DePristo
TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3
Pablo Cingolani, Adrian E. Platts, Le Lily Wang, M. Coon, Tung T. Nguyen, Luan Wang, Susan Land, Xiangyi Lu, Douglas M. Ruden
TL;DR: It appears that the 5′ and 3′ UTRs are reservoirs for genetic variations that change the termini of proteins during evolution of the Drosophila genus.
From FastQ Data to High‐Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline
Géraldine A. Van der Auwera, Mauricio O. Carneiro, Christopher Hartl, Ryan Poplin, Guillermo del Angel, Ami Levy-Moonshine, Tadeusz Jordan, Khalid Shakir, David Roazen, Joel Thibault, Eric Banks, Kiran V. Garimella, David Altshuler, Stacey Gabriel, Mark A. DePristo
TL;DR: This unit describes how to use BWA and the Genome Analysis Toolkit to map genome sequencing data to a reference and produce high‐quality variant calls that can be used in downstream analyses.
Snakemake--a scalable bioinformatics workflow engine.
Johannes Köster, Sven Rahmann
TL;DR: Snakemake is a workflow engine that provides a readable Python-based workflow definition language and a powerful execution environment that scales from single-core workstations to compute clusters without modifying the workflow.
Related Papers (5)
Mark A. DePristo, Eric Banks, Ryan Poplin, Kiran V. Garimella, Jared Maguire, Christopher Hartl, Anthony A. Philippakis, Guillermo del Angel, Manuel A. Rivas, Matt Hanna, Aaron McKenna, Timothy Fennell, Andrew Kernytsky, Andrey Sivachenko, Kristian Cibulskis, Stacey Gabriel, David Altshuler, Mark J. Daly