Proceedings Article, DOI: 10.1109/CLOUD.2017.83
Efficient Distributed Smith-Waterman Algorithm Based on Apache Spark
Bo Xu, Changlong Li, Hang Zhuang, Jiali Wang, Qingfeng Wang, Xuehai Zhou
- 25 Jun 2017
- pp 608-615
17
TL;DR: This paper presents CloudSW, an efficient distributed Smith-Waterman algorithm that uses Apache Spark and SIMD instructions for acceleration. It scales well and achieves up to 529 giga cell updates per second (GCUPS) in protein database search with 50 nodes in Aliyun Cloud.
Abstract: The Smith-Waterman algorithm, which produces the optimal local alignment between a pair of sequences, is widely used as a key component in bioinformatics. It is more sensitive than heuristic approaches, but also more time-consuming. To speed up the algorithm, Single-Instruction Multiple-Data (SIMD) instructions have been used to parallelize it with a data-parallel strategy. However, SIMD-based Smith-Waterman (SW) algorithms show limited scalability. Moreover, recent next-generation sequencing machines generate sequences at an unprecedented rate, so faster implementations of sequence alignment algorithms are needed to keep pace. In this paper, we present CloudSW, an efficient distributed Smith-Waterman algorithm which leverages Apache Spark and SIMD instructions to accelerate the algorithm. To facilitate easy integration of the distributed Smith-Waterman algorithm into third-party software, we provide an application programming interface (API) service in the cloud. The experimental results demonstrate that (1) CloudSW achieves up to a 3.29-fold speedup over DSW and a 621-fold speedup over SparkSW, and (2) CloudSW scales well, achieving up to 529 giga cell updates per second (GCUPS) in protein database search with 50 nodes in Aliyun Cloud, which is, to our knowledge, the highest performance reported.
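For readers unfamiliar with the terms in the abstract, the following is a minimal sketch of the Smith-Waterman scoring recurrence and of the GCUPS metric. It is not CloudSW's implementation, which the abstract describes as SIMD-accelerated and distributed with Apache Spark. The function names, the linear gap penalty, and the match/mismatch scores are illustrative assumptions; protein searches normally use a substitution matrix instead of fixed match/mismatch scores.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, number of DP cells updated).

    Scoring values are illustrative assumptions, not taken from the paper.
    Only the previous row of the DP matrix is kept, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # Clamping at 0 is what makes the alignment local:
            # a negative-scoring prefix is dropped and the alignment restarts.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best, len(a) * len(b)


def gcups(cells, seconds):
    """Giga cell updates per second: DP cells computed / elapsed time / 1e9."""
    return cells / seconds / 1e9


if __name__ == "__main__":
    score, cells = smith_waterman_score("TTACGTTT", "GGACGGG")
    print(score, cells)
```

Every cell depends only on its left, upper, and upper-left neighbours, which is why the computation lends itself to data-parallel (SIMD) execution within one alignment, while a database search can additionally be split across machines because each query-versus-subject alignment is independent.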
Citations
Development of Scalable On-Line Anomaly Detection System for Autonomous and Adaptive Manufacturing Processes
TL;DR: An architecture framework and implementation method are proposed for the Scalable On-line Anomaly Detection System (SOADS), which detects process anomalies through real-time processing and analyzes large amounts of process execution data in autonomous and adaptive manufacturing processes. The system succeeded in large-scale data processing and analysis.
11
Dissertation
Identifying Polymorphic Malware Variants Using Biosequence Analysis Techniques
Vijay Naidu
- 01 Jan 2018
6
High throughput BLAST algorithm using spark and cassandra
TL;DR: A new implementation of the Basic Local Alignment Search Tool (BLAST) algorithm, named Sparky-Blast, is presented. It uses the distributed resources of a Big-Data cluster to process queries in parallel, improving both response time and system throughput.
5
Comparing SARS-CoV-2 Sequences using a Commercial Cloud with a Spot Instance Based Dynamic Scheduler
Luan Teylo, Alan L. Nunes, Alba Cristina Magalhaes Alves de Melo, Cristina Boeres, Lúcia Maria de A. Drummond, Natália F. Martins
- 10 May 2021
TL;DR: In this paper, the authors compared SARS-CoV-2 sequences with MASA-OpenMP in the Amazon Elastic Compute Cloud (Amazon EC2), using both spot and on-demand instances.
5
Efficient Execution of Dynamic Programming Algorithms on Apache Spark
Mohammad Mahdi Javanmard, Zafar Ahmad, Jaroslaw Zola, Louis-Noël Pouchet, Rezaul Chowdhury, Robert W. Harrison
- 01 Sep 2020
TL;DR: This work designs and implements well-decomposable and tunable dynamic programming algorithms from the Gaussian Elimination Paradigm, such as Floyd-Warshall's all-pairs shortest path and Gaussian elimination without pivoting, for execution on Apache Spark based on parametric multi-way recursive divide-&-conquer algorithms.
5
References
The Sequence Alignment/Map format and SAMtools
Heng Li, Bob Handsaker, Alec Wysoker, T. J. Fennell, Jue Ruan, Nils Homer, Gabor T. Marth, Gonçalo R. Abecasis, Richard Durbin
TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, a variant caller and an alignment viewer, and thus provides universal tools for processing read alignments.
Fast and accurate short read alignment with Burrows–Wheeler transform
Heng Li, Richard Durbin
TL;DR: The Burrows-Wheeler Alignment tool (BWA) is a new read alignment package based on backward search with the Burrows–Wheeler Transform (BWT). It efficiently aligns short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
Fast gapped-read alignment with Bowtie 2
TL;DR: Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.
Fast and sensitive protein alignment using DIAMOND
TL;DR: DIAMOND is introduced, an open-source algorithm based on double indexing that is 20,000 times faster than BLASTX on short reads and has a similar degree of sensitivity.
11.6K
Identification of common molecular subsequences.
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).
11.3K