Top 16 papers presented at Algorithm Engineering and Experimentation in 2010

Proceedings Article•

StreamKM++: a clustering algorithm for data streams

Marcel R. Ackermann¹, Christiane Lammersen², Marcus Märtens¹, Christoph Raupach¹, Christian Sohler², Kamil Swierkot¹•Institutions (2)

University of Paderborn¹, Technical University of Dortmund²

16 Jan 2010

TL;DR: A new k-means clustering algorithm for data streams of points from a Euclidean space that provides a good alternative to BIRCH and StreamLS, in particular, if the number of cluster centers is large.

Abstract: We develop a new k-means clustering algorithm for data streams, which we call StreamKM++. Our algorithm computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm [1]. To compute the small sample, we propose two new techniques. First, we use a non-uniform sampling approach similar to the k-means++ seeding procedure to obtain small coresets from the data stream. This construction is rather easy to implement and, unlike other coreset constructions, its running time has only a low dependency on the dimensionality of the data. Second, we propose a new data structure which we call a coreset tree. The use of these coreset trees significantly speeds up the time necessary for the non-uniform sampling during our coreset construction. We compare our algorithm experimentally with two well-known streaming implementations (BIRCH [16] and StreamLS [4, 9]). In terms of quality (sum of squared errors), our algorithm is comparable with StreamLS and significantly better than BIRCH (up to a factor of 2). In terms of running time, our algorithm is slower than BIRCH. Comparing the running time with StreamLS, it turns out that our algorithm scales much better with increasing number of centers. We conclude that, if the first priority is the quality of the clustering, then our algorithm provides a good alternative to BIRCH and StreamLS, in particular, if the number of cluster centers is large. We also give a theoretical justification of our approach by proving that our sample set is a small coreset in low dimensional spaces.
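The non-uniform sampling mentioned in the abstract is described as similar to the k-means++ seeding procedure. As a rough illustration of that seeding step only (not of the coreset construction or the coreset tree, whose details are not given here), a minimal sketch might look as follows. The function names `sq_dist` and `kmeanspp_seeding` are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seeding(points, k, rng=None):
    """Pick k centers: the first uniformly at random, each later one with
    probability proportional to its squared distance to the nearest
    center chosen so far (D^2 sampling)."""
    rng = rng or random.Random()
    centers = [rng.choice(points)]
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        total = sum(d2)
        if total == 0:
            # Every point coincides with a chosen center: fall back to uniform.
            centers.append(rng.choice(points))
            continue
        r = rng.random() * total
        acc = 0.0
        chosen = len(points) - 1  # guard against floating-point round-off
        for i, w in enumerate(d2):
            acc += w
            if r < acc:
                chosen = i
                break
        centers.append(points[chosen])
        # Update each point's squared distance to its nearest center.
        d2 = [min(old, sq_dist(p, points[chosen])) for old, p in zip(d2, points)]
    return centers
```

Points far from every current center are more likely to be picked, which is what makes the sample non-uniform.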

285 citations

Proceedings Article•

Route planning with flexible objective functions

Robert Geisberger¹, Moritz Kobitzsch¹, Peter Sanders¹•Institutions (1)

Karlsruhe Institute of Technology¹

16 Jan 2010

TL;DR: The first fast route planning algorithm that answers shortest-path queries for a customizable linear combination of two different metrics, e.g. travel time and energy cost, on large-scale road networks.

Abstract: We present the first fast route planning algorithm that answers shortest-path queries for a customizable linear combination of two different metrics, e.g. travel time and energy cost, on large-scale road networks. The precomputation receives as input a directed graph, two edge weight functions t(e) and c(e), and a discrete interval [L, U]. The resulting flexible query algorithm finds, for a parameter p ∈ [L, U], an exact shortest path for the edge weight t(e) + p·c(e). This allows for different tradeoffs between the two edge weight functions at query time. We apply precomputation based on node contraction, which adds all necessary shortcuts for any parameter choice efficiently. To improve the node ordering, we developed the new concept of gradual parameter interval splitting. Additionally, we improve performance by combining node contraction and a goal-directed technique in our flexible scenario.
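To make the query semantics concrete, the sketch below runs plain Dijkstra on the combined edge weight t(e) + p·c(e). This is only the unaccelerated baseline that defines what a correct answer is; it is not the contraction-based method of the paper. The graph encoding and the name `flexible_dijkstra` are assumptions made for illustration.

```python
import heapq

def flexible_dijkstra(graph, source, target, p):
    """Shortest-path cost from source to target under weight t + p * c.

    graph maps each node to a list of (neighbor, t, c) triples, where t and c
    are the two non-negative edge metrics (e.g. travel time and energy cost).
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, t, c in graph.get(u, ()):
            nd = d + t + p * c
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

# Two routes from A to D: a fast but costly one via B, a slow but cheap one via C.
roads = {
    "A": [("B", 1, 10), ("C", 5, 1)],
    "B": [("D", 1, 10)],
    "C": [("D", 5, 1)],
}
```

With p = 0 only travel time counts and the route via B wins; with p = 1 the cost term dominates and the route via C becomes optimal, which is the kind of query-time tradeoff the abstract describes.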

71 citations

Proceedings Article•

Exact solutions and bounds for general art gallery problems

Tobias Baumgartner¹, Sándor P. Fekete¹, Alexander Kröller¹, Christiane Schmidt¹•Institutions (1)

Braunschweig University of Technology¹

16 Jan 2010

TL;DR: A primal-dual algorithm based on linear programming that provides lower bounds on the necessary number of guards in every step and, in case of convergence and integrality, ends with an optimal solution to the classical Art Gallery Problem.

Abstract: The classical Art Gallery Problem asks for the minimum number of guards that achieve visibility coverage of a given polygon. This problem is known to be NP-hard, even for very restricted and discrete special cases. For the case of vertex guards and simple orthogonal polygons, Couto et al. have recently developed an exact method that is based on a set cover approach. For the general problem (in which both the set of possible guard positions and the point set to be guarded are uncountable), neither constant-factor approximation algorithms nor exact solution methods are known. We present a primal-dual algorithm based on linear programming that provides lower bounds on the necessary number of guards in every step and, in case of convergence and integrality, ends with an optimal solution. We describe our implementation and give results for an assortment of polygons, including non-orthogonal polygons with holes.

48 citations

Proceedings Article•

Fast local search for Steiner trees in graphs

Eduardo Uchoa¹, Renato F. Werneck²•Institutions (2)

Federal Fluminense University¹, Microsoft²

16 Jan 2010

TL;DR: Efficient algorithms that implement four local searches for the Steiner problem in graphs: vertex insertion, vertex elimination, key-path exchange, and key-vertex elimination are presented.

Abstract: We present efficient algorithms that implement four local searches for the Steiner problem in graphs: vertex insertion, vertex elimination, key-path exchange, and key-vertex elimination. In each case, we show how to find an improving solution (or prove that none exists in the neighborhood) in O(m log n) time on graphs with n vertices and m edges. Many of the techniques and data structures we use are relevant in the study of dynamic graphs in general, beyond Steiner trees. Besides the theoretical interest, our results have practical impact: these local searches have been shown to find good-quality solutions in practice, but high running times limited their applicability.

28 citations

Proceedings Article•

Simple and fast nearest neighbor search

Marcel Birn¹, Manuel Holtgrewe¹, Peter Sanders¹, Johannes Singler¹•Institutions (1)

Karlsruhe Institute of Technology¹

16 Jan 2010

TL;DR: A simple randomized data structure for two-dimensional point sets that allows fast nearest neighbor queries in many cases is presented and an implementation outperforms several previous implementations for commonly used benchmarks.

Abstract: We present a simple randomized data structure for two-dimensional point sets that allows fast nearest neighbor queries in many cases. An implementation outperforms several previous implementations for commonly used benchmarks.

25 citations

Proceedings Article•

Tabulation based 5-universal hashing and linear probing

Mikkel Thorup¹, Yin Zhang²•Institutions (2)

AT&T Labs¹, University of Texas at Austin²

16 Jan 2010

TL;DR: If the pre-computed tables are made 5-universal, then the hash value becomes 5-universal without any other change to the computation, which leads to even bigger gains since the direct methods for 5-universal hashing use degree 4 polynomials.

Abstract: Previously [SODA'04] we devised the fastest known algorithm for 4-universal hashing. The hashing was based on small pre-computed 4-universal tables. This led to a five-fold improvement in speed over direct methods based on degree 3 polynomials. In this paper, we show that if the pre-computed tables are made 5-universal, then the hash value becomes 5-universal without any other change to the computation. Relatively this leads to even bigger gains since the direct methods for 5-universal hashing use degree 4 polynomials. Experimentally, we find that our method can gain up to an order of magnitude in speed over direct 5-universal hashing. Some of the most popular randomized algorithms have been proved to have the desired expected running time using 5-universal hashing, e.g., a non-recursive variant of quicksort takes O(n log n) expected time [Karloff Raghavan JACM'93], and linear probing does updates and searches in O(1) expected time [Pagh et al. SICOMP'09]. In contrast, inputs have been constructed leading to much worse expected performance with some of the classic primality based 2-universal hashing schemes. In the context of linear probing, we compare our new fast 5-universal hashing experimentally with the fastest known plain universal hashing. We know that any reasonable hashing scheme will work on random input, but from Pagh et al., we know that 5-universal hashing leads to good expected performance on all input. We use a dense interval as an example of a structured yet realistic input, wanting to see if this could push the fastest multiplication-shift based plain universal hashing into bad performance. Even though our 5-universal hashing itself is slower than the fast plain universal hashing, it makes linear probing much more robust.
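The "direct method" that the abstract uses as its baseline is a random polynomial of degree 4 over a prime field. A minimal sketch of that baseline, not of the tabulation-based scheme itself, is given below. The choice of the Mersenne prime 2^61 − 1 and the name `make_poly_hash` are assumptions for illustration.

```python
import random

P = (1 << 61) - 1  # Mersenne prime 2^61 - 1, assumed here as the field size

def make_poly_hash(k=5, rng=None):
    """Return h(x) = a_0 + a_1 x + ... + a_{k-1} x^{k-1} mod P with random
    coefficients. For keys in [0, P) a degree-(k-1) polynomial of this kind
    is k-universal, so k=5 gives the degree-4 direct method."""
    rng = rng or random.Random()
    coeffs = [rng.randrange(P) for _ in range(k)]

    def h(x):
        acc = 0
        for a in reversed(coeffs):  # Horner's rule, highest degree first
            acc = (acc * x + a) % P
        return acc

    return h
```

Each evaluation costs four modular multiplications for k = 5, which is the per-key work that table lookups are meant to avoid.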

23 citations

Proceedings Article•

Implementation and parallelization of a reverse-search algorithm for Minkowski sums

Christophe Weibel¹•Institutions (1)

McGill University¹

16 Jan 2010

TL;DR: An implementation of a reverse-search algorithm of Fukuda for computing Minkowski sums of polytopes efficiently, which uses GMP exact arithmetic to ensure robustness of the program and exactness of the results.

Abstract: We present an implementation of a reverse-search algorithm of Fukuda for computing Minkowski sums of polytopes efficiently. The algorithm allows summing any number of polytopes in any dimension, and is complete in the sense that it does not assume general position. Its running time depends linearly on the size of the output. To the best of our knowledge, this is the only existing implementation that can efficiently compute Minkowski sums in higher dimensions. The implementation uses the exact arithmetic GMP, which ensures robustness of the program and exactness of the results. We furthermore present a parallel version of our implementation to demonstrate the simplicity and efficiency of performing the reverse search in parallel. The results of the performance tests show a near-linear acceleration of our parallel implementation.

17 citations

Proceedings Article•

Employing (1 − ε) dominating set partitions as backbones in wireless sensor networks

Dhia Mahjoub¹, David W. Matula¹•Institutions (1)

Southern Methodist University¹

16 Jan 2010

TL;DR: This paper introduces an efficient algorithm for selecting (δ + 1) backbones with disjoint node sets that are each independent (1 − ε) dominating sets of G, and provides an efficient topologically based centralized algorithm for determining the backbones.

Abstract: For a random geometric graph G(n, r) of minimum degree δ, we introduce an efficient algorithm for selecting (δ + 1) backbones with disjoint node sets that are each independent (1 − ε) dominating sets of G. The backbone node sets are determined by a graph coloring algorithm employing only the topology (not the geometry) of G(n, r), and the backbone links are selected with link lengths in a narrow window between r and 2r and further to form a planar graph backbone. For large vertex sets (n = 1600, 3200) the resulting backbones are shown to each cover typically over 99% of the vertices of G (i.e. ε < 0.01), with about 30% being fully dominating, which is consistent with the ¼ constant approximation factor algorithm proposed recently in [23] for the domatic partition problem in Unit Disk Graphs. We establish experimentally by measures of node degrees, link lengths, and interior triangular face counts that each individual backbone has most of the coverage behavior and routing convenience of the triangular "perfect packing" lattice. We further show for each sample G(n, r) that the relatively few vertices not covered by all (δ + 1) backbones are covered by most of the backbones. Hence backbone rotation in a wireless sensor network would reach all sensors (vertices) sufficiently frequently. Our novel backbone generation algorithm confirms experimentally the existence of these (δ + 1) backbones in a random geometric graph, and provides an efficient topologically based centralized algorithm for determining the backbones. We also point out that our novel backbone construction method is flexible such that any efficient coloring algorithm can be plugged into it. In this paper, we experiment with several coloring algorithms: Smallest Last, Largest First, Lexicographic, Radial Sweep and Random, and we compare their respective performance. Our emphasis is, however, on Smallest Last (SL) since it offers robust properties and interesting expected behavior. We also experiment with several random node distributions (uniform, skewed and normal) on both the unit square and the disk, and discuss the results.
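Since the backbone node sets come from a graph coloring, and the abstract emphasizes Smallest Last, a small sketch of that ordering followed by greedy coloring may help. It uses only the adjacency structure, and each resulting color class is an independent set. It does not cover the domination check or the selection of backbone links, and the function names are hypothetical. This version is quadratic for simplicity.

```python
def smallest_last_order(adj):
    """Repeatedly remove a vertex of minimum remaining degree; return the
    vertices in reverse order of removal. adj maps vertex -> set of neighbors."""
    degree = {v: len(ns) for v, ns in adj.items()}
    removed = set()
    removal = []
    while len(removed) < len(adj):
        v = min((u for u in adj if u not in removed), key=lambda u: degree[u])
        removed.add(v)
        removal.append(v)
        for w in adj[v]:
            if w not in removed:
                degree[w] -= 1
    return removal[::-1]

def greedy_coloring(adj, order):
    """Give each vertex, in the given order, the smallest color not already
    used by one of its colored neighbors."""
    color = {}
    for v in order:
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

# A path on four vertices needs only two colors.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
coloring = greedy_coloring(path, smallest_last_order(path))
```

Grouping vertices by color then yields disjoint independent sets, which is the property the backbone node sets rely on.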

15 citations

Book Chapter•10.1137/1.9781611972900.13•

Navigation in real-world complex networks through embedding in latent spaces

Xiaomeng Ban¹, Jie Gao¹, Arnout van de Rijt¹•Institutions (1)

Stony Brook University¹

16 Jan 2010

TL;DR: Algorithmic methods are applied to embed nodes in some latent space and employ greedy routing to deliver packages to empirically investigate the navigability of five real-world complex networks from diverse contexts and of varying topology.

Abstract: Small-world experiments in which packages reach addressees unknown to the original sender through a forwarding chain confirm that acquaintance networks have short paths, a property that was later also discovered in many other networks. They further show that people can find these paths by passing the package on to the acquaintance most socially proximate to the target. This has led researchers to conjecture that perhaps also in many other networks some proximity-based algorithm can be used to find short paths, provided that nodes are given appropriate coordinates. Although potential applications are numerous, ranging from decentralized search to recommendation-based trust to disease control, this conjecture has remained largely unverified. In this paper we apply algorithmic methods to embed nodes in some latent space and employ greedy routing to deliver packages. Using these methods we empirically investigate the navigability of five real-world complex networks from diverse contexts and of varying topology. In each network, we deliver a majority of packages in fewer than six hops.

12 citations

Proceedings Article•

Algorithm engineering: an attempt at a definition using sorting as an example

Peter Sanders¹•Institutions (1)

Karlsruhe Institute of Technology¹

16 Jan 2010

TL;DR: Algorithm engineering (AE) is described as a methodology for algorithmic research where design, analysis, implementation and experimental evaluation form a feedback cycle driving the development of efficient algorithms.

Abstract: The talk describes algorithm engineering (AE) as a methodology for algorithmic research where design, analysis, implementation and experimental evaluation form a feedback cycle driving the development of efficient algorithms. Additional important components of the methodology include realistic models, algorithm libraries, and collections of realistic benchmark instances. Examples are given for the fundamental problem of sorting with particular emphasis on huge data sets, advanced hardware, and energy efficiency.

7 citations

Proceedings Article•

A polynomial delay algorithm for enumerating approximate solutions to the interval constrained coloring problem

Stefan Canzar¹, Khaled Elbassioni², Julián Mestre²•Institutions (2)

Centrum Wiskunde & Informatica¹, Max Planck Society²

16 Jan 2010

TL;DR: This work studies the interval constrained coloring problem, a combinatorial problem arising in the interpretation of data on protein structure emanating from experiments based on hydrogen/deuterium exchange and mass spectrometry, and proposes a polynomial-delay polynomial-space algorithm for enumerating all exact solutions plus further approximate solutions, whose components are guaranteed to be within an absolute error of one of the optimum.

Abstract: We study the interval constrained coloring problem, a combinatorial problem arising in the interpretation of data on protein structure emanating from experiments based on hydrogen/deuterium exchange and mass spectrometry. The problem captures the challenging task of increasing the spatial resolution of experimental data in order to get a better picture of the protein structure. Since solutions proposed by any algorithmic framework have to ultimately be verified by biochemists, it is important to provide not just a single solution, but a valuable set of candidate solutions. Our contribution is a polynomial-delay polynomial-space algorithm for enumerating all exact solutions plus further approximate solutions, whose components are guaranteed to be within an absolute error of one of the optimum. Our experiments indicate that these approximate solutions are reasonably close to the optimal ones, in terms of the accumulative error. In addition, the experiments also confirm the effectiveness of the method in reducing the delay between two consecutive solutions considerably, compared to what it takes an integer programming solver to produce the next exact solution.

Proceedings Article•

Conjunctive filter: breaking the entropy barrier

Daisuke Okanohara¹, Yuichi Yoshida²•Institutions (2)

University of Tokyo¹, Kyoto University²

16 Jan 2010

TL;DR: The objective is to break this entropy bound and construct more space-efficient data structures and show that many problems can be solved by using a conjunctive filter such as full-text search and database join queries.

Abstract: We consider the problem of storing a map that associates a key with a set of values. Storing n values from a universe of size m requires log2 (m choose n) bits of space, which can be approximated as n (1.44 + log2(m/n)) bits when n ≪ m. If we allow an ε fraction of errors in the output, we can store it with roughly n log2(1/ε) bits, which matches the entropy bound. The Bloom filter is a well-known example of such a data structure. Our objective is to break this entropy bound and construct more space-efficient data structures. In this paper, we propose a novel data structure called a conjunctive filter, which supports conjunctive queries on k distinct keys for fixed k. Although a conjunctive filter cannot return the set of values associated with a queried key, it can perform conjunctive queries with an O(1/√m) fraction of errors. Also, the consumed space is (n/k) log2 m bits, which is significantly smaller than the entropy bound (n/2) log2 m when k ≥ 3. We will show that many problems, such as full-text search and database join queries, can be solved by using a conjunctive filter. We also conducted experiments using a real-world data set, and show that a conjunctive filter answers conjunctive queries almost correctly using about 1/2 to 1/4 of the space of the entropy bound.
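The space figures in this abstract can be checked numerically. The sketch below compares the exact bound log2 (m choose n) with the standard approximation n (log2(m/n) + log2 e), where log2 e ≈ 1.44, and evaluates the approximate-membership bound n log2(1/ε). The function names and the sample values of m, n and ε are assumptions chosen for illustration.

```python
import math

def log2_binomial(m, n):
    """Exact information-theoretic bound log2 C(m, n), computed via lgamma."""
    ln = math.lgamma(m + 1) - math.lgamma(n + 1) - math.lgamma(m - n + 1)
    return ln / math.log(2)

def approx_set_bits(m, n):
    """Approximation n * (log2(m/n) + log2(e)), valid when n is much smaller than m."""
    return n * (math.log2(m / n) + math.log2(math.e))

def approx_filter_bits(n, eps):
    """Roughly n * log2(1/eps) bits when an eps fraction of errors is allowed."""
    return n * math.log2(1 / eps)

exact = log2_binomial(10**6, 1000)
approx = approx_set_bits(10**6, 1000)
relative_gap = abs(exact - approx) / exact
```

For these sample values the two quantities agree to well under one percent, which illustrates why the approximation is reasonable when n ≪ m.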

Proceedings Article•

Untangling the braid: finding outliers in a set of streams

Chiranjeeb Buragohain¹, Luca Foschini², Subhash Suri²•Institutions (2)

Amazon.com¹, University of California, Santa Barbara²

16 Jan 2010

TL;DR: This paper investigates the space complexity of one-pass algorithms for approximating outliers of this kind, proves lower bounds using multi-party communication complexity, and proposes small-memory heuristic algorithms that perform quite well for a variety of synthetic data.

Abstract: Monitoring the performance of large shared computing systems such as the cloud computing infrastructure raises many challenging algorithmic problems. One common problem is to track users with the largest deviation from the norm (outliers), for some measure of performance. Taking a stream-computing perspective, we can think of each user's performance profile as a stream of numbers (such as response times), and the aggregate performance profile of the shared infrastructure as a "braid" of these intermixed streams. The monitoring system's goal then is to untangle this braid sufficiently to track the top k outliers. This paper investigates the space complexity of one-pass algorithms for approximating outliers of this kind, proves lower bounds using multi-party communication complexity, and proposes small-memory heuristic algorithms. On one hand, stream outliers are easily tracked for simple measures, such as max or min, but our theoretical results rule out even good approximations for most of the natural measures such as average, median, or the quantiles. On the other hand, we show through simulation that our proposed heuristics perform quite well for a variety of synthetic data.

Proceedings Article•

Budgeted maximum coverage with overlapping costs: monitoring the emerging infections network

Donald E. Curtis¹, Sriram V. Pemmaraju¹, Philip M. Polgreen¹•Institutions (1)

University of Iowa¹

16 Jan 2010

TL;DR: This work models the problem of monitoring a listserv, such as the EIN, as a type of budgeted maximum coverage problem called Budgeted Maximization with Overlapping Costs (BMOC), and identifies small sets of "bellwether" users who are good predictors of important discussions.

Abstract: The Emerging Infections Network (EIN) (http://ein.idsociety.org/) is a CDC supported "sentinel" network of over 1400 members (currently), designed to connect clinical infectious disease specialists and public health officials. Members primarily communicate through an EIN managed listserv and discuss disease outbreaks, treatment protocols, effectiveness of vaccinations and other disease-control and prevention mechanisms, etc. Recently, researchers at Google and Yahoo! Research have used search engine query logs to tap into the online "wisdom of crowds" and produce disease outbreak trends for flu. Following this work, there is now interest in trying to monitor EIN discussions more carefully to disseminate timely and accurate information on clinical events of possible interest to health officials. We model the problem of monitoring a listserv, such as the EIN, as a type of budgeted maximum coverage problem that we call Budgeted Maximization with Overlapping Costs (BMOC). Even though BMOC seems superficially similar to the budgeted maximum coverage problem considered by Khuller et al. (Inf. Process. Lett., 1999), our problem is fundamentally different from an algorithmic point of view, due to its cost structure. We observe that the greedy algorithm that provides a constant-factor approximation to the budgeted maximum coverage problem can be arbitrarily bad for BMOC. We also present a reduction to BMOC from the k-densest subgraph problem that provides evidence indicating that obtaining a constant-factor approximation for our problem might be quite challenging. Nevertheless, experimental runs of the greedy algorithm on the EIN data show that greedy performs remarkably well relative to OPT. We identify a feature of our EIN data, that we call the overlap condition, and show that the greedy algorithm does indeed yield a constant-factor approximation guarantee if the overlap condition is satisfied. 
Using an implementation of the greedy algorithm for BMOC on the EIN data, we identify small sets of "bellwether" users who are good predictors of important discussions. We provide evidence to show that tracking just these users reduces the cost of monitoring the EIN significantly without causing any important discussions to be missed.
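For orientation, the sketch below shows a plain ratio-based greedy rule for the classical budgeted maximum coverage problem with additive costs. It is not the BMOC variant: the overlapping cost structure is not specified in this abstract, so the additive cost model, the data layout and the name `greedy_budgeted_coverage` are all assumptions. No approximation guarantee is claimed for this simplified rule.

```python
def greedy_budgeted_coverage(sets, costs, weights, budget):
    """Repeatedly pick the affordable set with the best ratio of newly
    covered weight to cost, until nothing affordable adds coverage.

    sets: name -> set of elements; costs: name -> positive cost;
    weights: element -> weight; budget: total cost allowed.
    """
    chosen, covered, spent = [], set(), 0.0
    remaining = set(sets)
    while remaining:
        best, best_ratio = None, 0.0
        for name in sorted(remaining):  # sorted for deterministic tie-breaking
            if spent + costs[name] > budget:
                continue
            gain = sum(weights[e] for e in sets[name] - covered)
            ratio = gain / costs[name]
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            break
        chosen.append(best)
        covered |= sets[best]
        spent += costs[best]
        remaining.discard(best)
    return chosen, covered

# Toy instance: three candidate sets, unit element weights, budget 3.
example_sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}}
example_costs = {"a": 2, "b": 1, "c": 3}
example_weights = {e: 1 for e in range(1, 6)}
picked, reached = greedy_budgeted_coverage(example_sets, example_costs, example_weights, 3)
```

On the toy instance the rule first takes the cheap set "b", then "a", and stops because "c" no longer fits the budget.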

Proceedings Article•

Implementing streaming simplification for large labeled meshes

Catalin Constantin¹, Shawn Brown¹, Jack Snoeyink¹•Institutions (1)

University of North Carolina at Chapel Hill¹

16 Jan 2010

TL;DR: Garland and Heckbert's quadric error metric in conjunction with edge contraction gives a greedy approach to simplify a mesh that can fit in memory, and is applied to streaming meshes, suggested by Isenburg.

Abstract: Data capture technologies like airborne LIDAR produce extremely large models of digital terrain, which must be simplified to be useful. Garland and Heckbert's quadric error metric in conjunction with edge contraction gives a greedy approach to simplify a mesh that can fit in memory; we adapt it to work with boundaries and labels (e.g., object ID, ground vs. building, or some discrimination between parts of the mesh that is to be preserved during simplification). More importantly, we apply it to streaming meshes, suggested by Isenburg, which are represented as an intermixed sequence of vertices, triangles, and finalization tags indicating the last use of any vertex. These tags essentially document spatial locality in the stream. We discuss the engineering decisions that allow our algorithm to achieve fast, high-quality simplification of gigabyte datasets using a small memory footprint.

Proceedings Article•

Succinct trees in practice

Diego Arroyuelo¹, Rodrigo Cánovas², Gonzalo Navarro², Kunihiko Sadakane³•Institutions (3)

Yahoo!¹, University of Chile², National Institute of Informatics³

16 Jan 2010

TL;DR: A recent proposal by Sadakane and Navarro (SODA'10) stands out as an excellent practical combination of space occupancy, time performance, and functionality, whereas other techniques, particularly LOUDS, are still interesting in some limited-functionality niches.

Abstract: We implement and compare the major current techniques for representing general trees in succinct form. This is important because a general tree of n nodes is usually represented in pointer form, requiring O(n log n) bits, whereas the succinct representations we study require just 2n + o(n) bits and carry out many sophisticated operations in constant time. Yet, there is no exhaustive study in the literature comparing the practical magnitudes of the o(n)-space and the O(1)-time terms. The techniques can be classified into three broad trends: those based on BP (balanced parentheses in preorder), those based on DFUDS (depth-first unary degree sequence), and those based on LOUDS (level-ordered unary degree sequence). BP and DFUDS require a balanced parentheses representation that supports the core operations findopen, findclose, and enclose, for which we implement and compare three major algorithmic proposals. All the tree representations require also core operations rank and select on bitmaps, which are already well studied in the literature. We show how to predict the time and space performance of most variants via combining these core operations, and also study some tree operations for which specialized implementations exist. This is especially relevant for a recent proposal (K. Sadakane and G. Navarro, SODA'10) which, although belonging to class BP, deviates from the main techniques in some cases in order to achieve constant time for the widest range of operations. We experiment over various types of real-life trees and of traversals, and conclude that the latter technique stands out as an excellent practical combination of space occupancy, time performance, and functionality, whereas others, particularly LOUDS, are still interesting in some limited-functionality niches.
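The three core operations named in the abstract can be illustrated on a balanced parentheses string, where each node is an opening parenthesis matched by the closing one that ends its subtree. The sketch below uses simple linear scans purely to show what the operations return. The succinct structures compared in the paper answer them far faster using o(n) extra bits, and none of that machinery is reproduced here.

```python
def findclose(bp, i):
    """Index of the ')' matching the '(' at position i."""
    depth = 0
    for j in range(i, len(bp)):
        depth += 1 if bp[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")

def findopen(bp, j):
    """Index of the '(' matching the ')' at position j."""
    depth = 0
    for i in range(j, -1, -1):
        depth += 1 if bp[i] == ")" else -1
        if depth == 0:
            return i
    raise ValueError("unbalanced sequence")

def enclose(bp, i):
    """Index of the '(' of the parent of the node opening at i, or -1 for the root."""
    depth = 0
    for j in range(i - 1, -1, -1):
        if bp[j] == ")":
            depth += 1      # entering an earlier sibling's subtree
        elif depth > 0:
            depth -= 1      # leaving that sibling's subtree
        else:
            return j        # first unmatched '(' to the left is the parent
    return -1

# Preorder encoding of a root with two children, the first of which has two leaves.
bp = "((()())())"
subtree_size = (findclose(bp, 1) - 1 + 1) // 2  # nodes in the subtree opening at 1
```

Operations such as parent, subtree size and next sibling reduce to these primitives, which is why their practical cost matters for the BP and DFUDS representations.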
