An improved search algorithm for optimal multiple-sequence alignment

doi:10.1613/JAIR.1534

Open Access•Journal Article•10.1613/JAIR.1534

Stefan Schroedl

- 01 Jan 2005

- Journal of Artificial Intelligence Resea...

- Vol. 23, Iss: 1, pp 587-623

36

TL;DR: An algorithm is presented that outperforms one of the currently most successful algorithms for optimal multiple sequence alignments, Partial Expansion A*, both in time and memory and is able to calculate for the first time the optimal alignment for almost all of the problems in Reference 1 of the benchmark database BAliBASE.

Abstract: Multiple sequence alignment (MSA) is a ubiquitous problem in computational biology. Although it is NP-hard to find an optimal solution for an arbitrary number of sequences, due to the importance of this problem researchers are trying to push the limits of exact algorithms further. Since MSA can be cast as a classical path finding problem, it is attracting a growing number of AI researchers interested in heuristic search algorithms as a challenge with actual practical relevance. In this paper, we first review two previous, complementary lines of research. Based on Hirschberg's algorithm, Dynamic Programming needs $O(kN^{k-1})$ space to store both the search frontier and the nodes needed to reconstruct the solution path, for k sequences of length N. Best first search, on the other hand, has the advantage of bounding the search space that has to be explored using a heuristic. However, it is necessary to maintain all explored nodes up to the final solution in order to prevent the search from re-expanding them at higher cost. Earlier approaches to reduce the Closed list are either incompatible with pruning methods for the Open list, or must retain at least the boundary of the Closed list. In this article, we present an algorithm that attempts to combine the respective advantages: like A*, it uses a heuristic for pruning the search space, but it reduces both the maximum Open and Closed size to $O(kN^{k-1})$, as in Dynamic Programming. The underlying idea is to conduct a series of searches with successively increasing upper bounds, but using the DP ordering as the key for the Open priority queue. With a suitable choice of thresholds, in practice, a running time below four times that of A* can be expected. In our experiments we show that our algorithm outperforms one of the currently most successful algorithms for optimal multiple sequence alignments, Partial Expansion A*, both in time and memory.
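As a rough illustration of the idea of repeated searches under increasing upper bounds, with nodes processed in DP order, the following Python sketch handles only two sequences and returns only the alignment cost, not the alignment itself. It is not the paper's algorithm: the function names `bounded_dp_alignment` and `iterative_bound_alignment`, the cost values (`mismatch=1`, `gap=2`), the fixed `step`, and the simple length-difference heuristic are all assumptions made for this example.

```python
from math import inf


def bounded_dp_alignment(a, b, bound, mismatch=1, gap=2):
    """Return the optimal alignment cost of a and b, or None if it exceeds `bound`.

    Nodes (i, j) of the alignment lattice are visited in row-major (DP) order.
    Only two rows are kept in memory, and any node whose g + h exceeds the
    bound is pruned. Costs and heuristic are illustrative assumptions.
    """
    n, m = len(a), len(b)

    def h(i, j):
        # Admissible lower bound: the remaining length difference
        # has to be covered by gaps.
        return gap * abs((n - i) - (m - j))

    prev = None
    for i in range(n + 1):
        cur = [inf] * (m + 1)
        for j in range(m + 1):
            if i == 0 and j == 0:
                g = 0
            else:
                g = inf
                if i > 0 and j > 0:  # align a[i-1] with b[j-1]
                    g = min(g, prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch))
                if i > 0:  # a[i-1] against a gap
                    g = min(g, prev[j] + gap)
                if j > 0:  # b[j-1] against a gap
                    g = min(g, cur[j - 1] + gap)
            # Prune nodes that cannot lie on a path within the bound.
            cur[j] = g if g + h(i, j) <= bound else inf
        prev = cur
    return prev[m] if prev[m] < inf else None


def iterative_bound_alignment(a, b, step=2, mismatch=1, gap=2):
    """Repeat the bounded search with a growing upper bound until it succeeds."""
    bound = gap * abs(len(a) - len(b))  # heuristic value of the start node
    while True:
        cost = bounded_dp_alignment(a, b, bound, mismatch, gap)
        if cost is not None:
            return cost, bound
        bound += step


if __name__ == "__main__":
    print(iterative_bound_alignment("ACGT", "AGT"))
```

With these assumed costs, aligning "ACGT" with "AGT" succeeds on the first pass, because the starting bound (one gap) already equals the optimal cost. Larger steps mean fewer passes but weaker pruning in the final pass, which is the trade-off behind the choice of thresholds.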
Moreover, we apply a refined heuristic based on optimal alignments not only of pairs of sequences, but of larger subsets. This idea is not new; however, to make it practically relevant we show that it is equally important to bound the heuristic computation appropriately, or the overhead can obliterate any possible gain. Furthermore, we discuss a number of improvements in time and space efficiency with regard to practical implementations. Our algorithm, used in conjunction with higher-dimensional heuristics, is able to calculate for the first time the optimal alignment for almost all of the problems in Reference 1 of the benchmark database BAliBASE.
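The baseline that the refined heuristic builds on, a lower bound assembled from optimal pairwise alignments, can be sketched as follows. This is a minimal example under stated assumptions: it assumes a sum-of-pairs cost with linear gap penalties, and the names `suffix_cost_table` and `pairwise_heuristic` as well as the cost values are hypothetical. It covers pairs only, not the larger subsets the abstract refers to.

```python
from itertools import combinations


def suffix_cost_table(a, b, mismatch=1, gap=2):
    """t[i][j] = optimal alignment cost of the suffixes a[i:] and b[j:]."""
    n, m = len(a), len(b)
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue  # both suffixes empty: cost 0
            best = float("inf")
            if i < n and j < m:  # align a[i] with b[j]
                best = min(best, t[i + 1][j + 1] + (0 if a[i] == b[j] else mismatch))
            if i < n:  # a[i] against a gap
                best = min(best, t[i + 1][j] + gap)
            if j < m:  # b[j] against a gap
                best = min(best, t[i][j + 1] + gap)
            t[i][j] = best
    return t


def pairwise_heuristic(seqs, **costs):
    """Build h(pos): the sum of optimal pairwise suffix alignment costs.

    Under the assumed sum-of-pairs objective, projecting any multiple
    alignment onto a pair costs at least that pair's optimal alignment,
    so the sum never overestimates the remaining cost.
    """
    tables = {
        (p, q): suffix_cost_table(seqs[p], seqs[q], **costs)
        for p, q in combinations(range(len(seqs)), 2)
    }

    def h(pos):
        return sum(t[pos[p]][pos[q]] for (p, q), t in tables.items())

    return h


if __name__ == "__main__":
    h = pairwise_heuristic(["ACGT", "AGT", "ACT"])
    print(h((0, 0, 0)))
```

All pairwise tables are computed up front here, which is cheap for pairs but grows quickly for triples and larger subsets. That growth is why the abstract stresses bounding the heuristic computation.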

Citations

Journal Article•10.1093/BIOINFORMATICS/BTM076

COBALT: constraint-based alignment tool for multiple protein sequences

Jason S. Papadopoulos, +1 more

- 01 May 2007

- Bioinformatics

TL;DR: It is shown that using constraints derived from the conserved domain database (CDD) and PROSITE protein-motif database improves COBALT's alignment quality and has reasonable runtime performance and alignment accuracy comparable to or exceeding that of other tools for a broad range of problems.

1K

Proceedings Article•10.1117/12.2006228

Combining multiple thresholding binarization values to improve OCR output

William B. Lund, +2 more

- 04 Feb 2013

TL;DR: This novel approach combines the OCR outputs from multiple thresholded images by aligning the text output and producing a lattice of word alternatives from which a lattice word error rate (LWER) is calculated.

52

Book Chapter•10.1007/978-3-540-74128-2_3

Automated Creation of Pattern Database Search Heuristics

Stefan Edelkamp

- 01 Feb 2007

TL;DR: Experiments in heuristic search planning indicate that the total search efforts can be reduced significantly, and genetic algorithms to optimize its output are proposed.

47

Proceedings Article•10.1145/1555400.1555437

Improving optical character recognition through efficient multiple system alignment

William B. Lund, +1 more

- 15 Jun 2009

TL;DR: An innovative admissible heuristic for the A* algorithm is developed, which results in a significant reduction in state space exploration to identify all optimal alignments of the OCR text output, a necessary step toward the construction of the word hypothesis lattice.

39

Journal Article•10.1613/JAIR.1940

Multiple-goal heuristic search

Dmitry Davidov, +1 more

- 01 May 2006

- Journal of Artificial Intelligence Resea...

TL;DR: A new framework for anytime heuristic search where the task is to achieve as many goals as possible within the allocated resources is presented and the marginal-utility heuristic is introduced, which estimates the cost and the benefit of exploring a subtree below a search node.

20

References

Journal Article•10.1016/S0022-2836(05)80360-2

Basic Local Alignment Search Tool

Stephen F. Altschul, +4 more

- 01 Oct 1990

- Journal of Molecular Biology

TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

98.8K

Journal Article•10.1007/BF01386390

A note on two problems in connexion with graphs

Edsger W. Dijkstra

- 01 Dec 1959

- Numerische Mathematik

TL;DR: A tree is a graph with one and only one path between every two nodes, where at least one path exists between any two nodes and the length of each branch is given.

25K

Journal Article•10.1093/BIOINFORMATICS/8.3.275

The rapid generation of mutation data matrices from protein sequences

David T. Jones, +2 more

- 01 Jun 1992

- Bioinformatics

TL;DR: An efficient means for generating mutation data matrices from large numbers of protein sequences is presented, by means of an approximate peptide-based sequence comparison algorithm, which is fast enough to process the entire SWISS-PROT databank in 20 h on a Sun SPARCstation 1, and is fast enough to generate a matrix from a specific family or class of proteins in minutes.

7.2K

Journal Article•10.1016/0004-3702(85)90084-0

Depth-first iterative-deepening: an optimal admissible tree search

Richard E. Korf

- 01 Sep 1985

- Artificial Intelligence

TL;DR: This heuristic depth-first iterative-deepening algorithm is the only known algorithm that is capable of finding optimal solutions to randomly generated instances of the Fifteen Puzzle within practical resource limits.

1.7K

Journal Article•10.1145/360825.360861

A linear space algorithm for computing maximal common subsequences

Daniel S. Hirschberg

- 01 Jun 1975

- Communications of The ACM

TL;DR: The problem of finding a longest common subsequence of two strings has been solved in quadratic time and space, and an algorithm is presented which will solve this problem in quadratic time and in linear space.

1.2K

Related Papers (5)

Frontier search

Externalizing the Multiple Sequence Alignment Problem with Affine Gap Costs

A Formal Basis for the Heuristic Determination of Minimum Cost Paths

Sweep A*: space-efficient heuristic search in partially ordered graphs

An iterative method for faster sum-of-pairs multiple sequence alignment.