TL;DR: This paper presents a set of algorithms, implemented in the PT-Scotch software package, which allows one to order large graphs in parallel, yielding orderings the quality of which is only slightly worse than the one of state-of-the-art sequential algorithms.
Abstract: The parallel ordering of large graphs is a difficult problem, because on the one hand minimum degree algorithms do not parallelize well, and on the other hand the obtainment of high quality orderings with the nested dissection algorithm requires efficient graph bipartitioning heuristics, the best sequential implementations of which are also hard to parallelize. This paper presents a set of algorithms, implemented in the PT-Scotch software package, which allows one to order large graphs in parallel, yielding orderings the quality of which is only slightly worse than the one of state-of-the-art sequential algorithms. Our implementation uses the classical nested dissection approach but relies on several novel features to solve the parallel graph bipartitioning problem. Thanks to these improvements, PT-Scotch produces consistently better orderings than ParMeTiS on large numbers of processors.
TL;DR: A reconfigurable computing architecture using Field Programmable Gate Arrays (FPGAs) to accelerate two computationally intensive kernels of a Quantum Monte Carlo (QMC) application applied to N-body systems is presented.
Abstract: Quantum Monte Carlo methods enable us to determine the ground-state properties of atomic or molecular clusters. Here, we present a reconfigurable computing architecture using Field Programmable Gate Arrays (FPGAs) to accelerate two computationally intensive kernels of a Quantum Monte Carlo (QMC) application applied to N-body systems. We focus on two key kernels of the QMC application: acceleration of potential energy and wave function calculations. We compare the performance of our application on two reconfigurable platforms. Firstly, we use a dual-processor 2.4GHz Intel Xeon augmented with two reconfigurable development boards consisting of Xilinx Virtex-II Pro FPGAs. Using this platform, we achieve a speedup of 3x over a software-only implementation. Following this, the chemistry application is ported to the Cray XD1 supercomputer equipped with Xilinx Virtex-II Pro and Virtex-4 FPGAs. The hardware-accelerated application on one node of the high performance system equipped with a single Virtex-4 FPGA yields a speedup of approximately 25x over the serial reference code running on one node of the dual-processor dual-core 2.2GHz AMD Opteron. This speedup is mainly attributed to the use of pipelining, the use of fixed-point arithmetic for all calculations and the fine-grained parallelism using FPGAs. We can further enhance the performance by operating multiple instances of our design in parallel.
TL;DR: An exhaustive quantitative and comparative analysis of several available and new parallel hyperspectral imaging algorithms by comparing their efficiency on both a fully heterogeneous network of workstations and a massively parallel homogeneous cluster at NASA's Goddard Space Flight Center in Maryland.
Abstract: Imaging spectroscopy, also known as hyperspectral imaging, is a new technique that has gained tremendous popularity in many research areas, including satellite imaging and aerial reconnaissance. In particular, NASA is continuously gathering high-dimensional image data from the surface of the earth with hyperspectral sensors such as the Jet Propulsion Laboratory's Airborne Visible-Infrared Imaging Spectrometer (AVIRIS) or the Hyperion hyperspectral imager aboard NASA's Earth Observing-1 (EO-1) spacecraft. Despite the massive volume of scientific data commonly involved in hyperspectral imaging applications, very few parallel strategies for hyperspectral analysis are currently available, and most of them have been designed in the context of homogeneous computing platforms. However, heterogeneous networks of workstations represent a very promising cost-effective solution that is expected to play a major role in the design of high-performance computing platforms for many on-going and planned remote sensing missions. Our main goal in this paper is to understand parallel performance of hyperspectral imaging algorithms comprising the standard hyperspectral data processing chain (which includes pre-processing, selection of pure spectral components and linear spectral unmixing) in the context of fully heterogeneous computing platforms. For that purpose, we develop an exhaustive quantitative and comparative analysis of several available and new parallel hyperspectral imaging algorithms by comparing their efficiency on both a fully heterogeneous network of workstations and a massively parallel homogeneous cluster at NASA's Goddard Space Flight Center in Maryland.
TL;DR: Challenges are explored and new techniques in automated instrumentation, runtime measurement, and visualization of RC application behavior are presented, along with ideas for integration with conventional performance analysis tools to create a unified tool for RC applications.
Abstract: Reconfigurable computing (RC) applications employing both microprocessors and FPGAs have potential for large speedup when compared with traditional (software) parallel applications. However, this potential is marred by the additional complexity of these dual-paradigm systems, making it difficult to identify performance bottlenecks and achieve desired performance. Performance analysis concepts and tools are well researched and widely available for traditional parallel applications but are lacking in RC, despite being of great importance due to the applications' increased complexity. In this paper, we explore challenges and present new techniques in automated instrumentation, runtime measurement, and visualization of RC application behavior. We also present ideas for integration with conventional performance analysis tools to create a unified tool for RC applications as well as our initial framework for FPGA instrumentation and measurement. Results from a case study are provided using a prototype of this new tool.
TL;DR: It is observed that proteins with high betweenness centrality, but low connectivity are abundant in the human PIN, and an efficient and portable parallel implementation is designed for the calculation of this compute-intensive centrality metric.
Abstract: Due to fundamental physical limitations and power constraints, we are witnessing a paradigm shift in commodity microprocessor architecture to multicore designs. Continued performance now requires the exploitation of concurrency at the algorithm level. In this article, we demonstrate the application of high performance computing techniques in systems biology and present multicore algorithms for the important research problem of protein-interaction network (PIN) analysis. PINs play an important role in understanding the functional and organizational principles of biological processes. Promising computational techniques for key systems biology research problems such as identification of signaling pathways, novel protein function prediction, and the study of disease mechanisms, are based on topological characteristics of the protein interactome. Several complex network models have been proposed to explain the evolution of protein networks, and these models primarily try to reproduce the topological features observed in yeast, the model eukaryote interactome. In this article, we study the structural properties of a high-confidence human interaction network, constructed by assimilating recent experimentally derived interaction data. We identify topological properties common to the yeast and human protein networks. Betweenness is a quantitative measure of centrality of an entity in a complex network, and is based on computing all-pairs shortest paths in the graph. A novel contribution of our work is the analysis of the degree-betweenness centrality correlation in the human PIN. Jeong et al. empirically showed that betweenness is positively correlated with the essentiality and evolutionary age of a protein. We observe that proteins with high betweenness, but low degree (or connectivity) are abundant in the human PIN. 
We have designed efficient and portable parallel implementations for the exact calculation of betweenness and other compute-intensive centrality metrics relevant to interactome analysis. For example, on the Sun Fire T2000 server with the UltraSparc T1 (Niagara) processor, we achieve a relative speedup of about 16 using 32 threads for a typical instance of betweenness centrality on a PIN, reducing the running time from nearly 3.5 min to 13 s.
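The betweenness computation described above can be illustrated with a short sequential sketch. It uses the well-known Brandes algorithm for unweighted, undirected graphs, which is an assumption here: the abstract does not say which algorithm the authors parallelize. The function name `betweenness` and the toy graph are hypothetical. The outer loop over source vertices has independent iterations, which is one natural place to introduce threads.

```python
from collections import deque


def betweenness(adj):
    """Exact betweenness for an undirected, unweighted graph.

    adj maps each vertex to an iterable of its neighbours.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:  # iterations are independent: a candidate for parallelism
        sigma = {v: 0 for v in adj}  # number of shortest s-v paths
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        order = []  # vertices in non-decreasing distance from s
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest vertices back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2.0 for v, b in bc.items()}


# Toy graph: two triangles joined through a degree-2 bridge vertex "x".
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "x"],
    "x": ["c", "d"],
    "d": ["x", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
scores = betweenness(graph)
```

In this toy graph the bridge vertex "x" has the lowest degree together with the highest betweenness, a small-scale analogue of the high-betweenness, low-degree proteins the abstract reports.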
TL;DR: A high level of FPGA-specific design including cell lists, systematically determined interpolation and precision, handling of exclusion, and support for molecular dynamics simulations of up to 256K particles are combined for a substantial speed-up over a highly tuned production MD code.
Abstract: FPGA-based acceleration of molecular dynamics simulations (MD) has been the subject of several recent studies. The short-range force computation, which dominates the execution time, is the primary focus. Here we combine: a high level of FPGA-specific design including cell-lists, systematically determined interpolation and precision, handling of exclusion, and support for MD simulations of up to 256K particles. The target system consists of a standard PC with a 2004-era COTS FPGA board. There are several innovations: new microarchitectures for several major components, including the cell-list processor and the off-chip memory controller; and a novel arithmetic mode. Extensive experimentation was required to optimize precision, interpolation order, interpolation mode, table sizes, and simulation quality. We obtain a substantial speed-up over a highly tuned production MD code.
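The cell-list technique mentioned above can be sketched in software. This is only an illustration of the general idea (bin particles into cells no smaller than the cutoff, then test only neighbouring cells), not the FPGA microarchitecture of the paper. The cubic periodic box, the function names, and the parameters `box` and `cutoff` are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product


def min_image_dist2(p, q, box):
    """Squared distance under the minimum-image convention (cubic box)."""
    total = 0.0
    for a, b in zip(p, q):
        d = a - b
        d -= box * round(d / box)
        total += d * d
    return total


def neighbour_pairs(positions, box, cutoff):
    """Return the set of pairs (i, j), i < j, closer than cutoff.

    Assumes cutoff <= box / 2 so the minimum-image convention is valid.
    """
    n_cells = max(1, int(box // cutoff))  # cells are at least cutoff wide
    size = box / n_cells
    cells = defaultdict(list)
    for i, p in enumerate(positions):
        key = tuple(int(c // size) % n_cells for c in p)
        cells[key].append(i)

    pairs = set()
    cutoff2 = cutoff * cutoff
    for cell, members in cells.items():
        # Only the 27 surrounding cells (with periodic wrap) can hold partners.
        for offset in product((-1, 0, 1), repeat=3):
            other = tuple((c + o) % n_cells for c, o in zip(cell, offset))
            for i in members:
                for j in cells.get(other, ()):
                    if i < j and min_image_dist2(positions[i], positions[j], box) < cutoff2:
                        pairs.add((i, j))
    return pairs


points = [(0.5, 0.5, 0.5), (1.0, 0.5, 0.5), (9.8, 0.5, 0.5), (5.0, 5.0, 5.0)]
close = neighbour_pairs(points, box=10.0, cutoff=1.0)
```

With roughly uniform density the number of distance tests grows linearly with the particle count instead of quadratically, which is why the short-range force computation is usually organized this way.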
TL;DR: This paper presents a meta-modelling architecture suitable for modeling and simulation of complex EMM systems and some examples show how this architecture can be modified for EMM-like systems.
Abstract: During the most recent decades modern equation-based object-oriented modeling and simulation languages, such as Modelica, have become available. This has made it easier to build complex and more de ...
TL;DR: Numerical results are presented showing the performance of the proposed parallel dot product algorithm, which is designed to achieve results as if computed in K-fold working precision while retaining the speed of the original sequential algorithms.
Abstract: Parallel algorithms for accurate summation and dot product are proposed. They are parallelized versions of the fast and accurate algorithms for calculating sums and dot products using error-free transformations that were recently proposed by Ogita et al. [T. Ogita, S.M. Rump, S. Oishi, Accurate sum and dot product, SIAM J. Sci. Comput. 26 (6) (2005) 1955-1988]. Ogita et al. have shown that their algorithms are fast in terms of measured computing time. However, due to the strong data dependence in these algorithms, it is difficult to parallelize them. Like the original algorithms, the parallel algorithms proposed in this paper are designed to achieve results as if computed in K-fold working precision while retaining the speed of the originals. Numerical results are presented showing the performance of the proposed parallel dot product algorithm.
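To make the idea of an error-free transformation concrete, the sketch below shows Knuth's TwoSum and the compensated summation built on it, which gives a result as if computed in twofold working precision (the K = 2 case). The chunked variant is only an assumed illustration of how partial results might be combined across workers; it is not the parallel algorithm of the paper, and the names `sum2` and `sum2_chunked` are chosen for this example.

```python
def two_sum(a, b):
    """Error-free transformation: returns (s, e) with a + b == s + e exactly."""
    s = a + b
    z = s - a
    e = (a - (s - z)) + (b - z)
    return s, e


def sum2_parts(values):
    """Compensated summation; returns (floating-point sum, accumulated error)."""
    s = 0.0
    err = 0.0
    for v in values:
        s, e = two_sum(s, v)
        err += e
    return s, err


def sum2(values):
    """Sum as if computed in twofold working precision."""
    s, err = sum2_parts(values)
    return s + err


def sum2_chunked(chunks):
    """Hypothetical combination step: each chunk could be summed by a
    separate worker, then the partial sums are merged with two_sum."""
    partials = [sum2_parts(chunk) for chunk in chunks]  # independent work
    s = 0.0
    err = 0.0
    for part_sum, part_err in partials:
        s, e = two_sum(s, part_sum)
        err += e + part_err
    return s + err


data = [1e16, 1.0, -1e16]
accurate = sum2(data)
accurate_chunked = sum2_chunked([[1e16, 1.0], [-1e16]])
```

A plain left-to-right loop over `data` loses the 1.0 entirely, because 1e16 + 1.0 rounds back to 1e16, whereas both compensated versions recover it.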
TL;DR: The robustness of the preconditioners is illustrated on a set of linear systems arising from the finite element discretization of elliptic PDEs through extensive parallel experiments on up to 1000 processors.
Abstract: In this paper we study the parallel scalability of variants of additive Schwarz preconditioners for three dimensional non-overlapping domain decomposition methods. To alleviate the computational cost, both in terms of memory and floating-point complexity, we investigate variants based on a sparse approximation or on mixed 32- and 64-bit calculation. The robustness of the preconditioners is illustrated on a set of linear systems arising from the finite element discretization of elliptic PDEs through extensive parallel experiments on up to 1000 processors. Their efficiency is studied from both a numerical and a parallel performance viewpoint.
TL;DR: In this paper, a machine/programming model for the era of multi-core chips is described, which replaces sequential composition with concurrent composition at all levels in the program except at the level where the compiler is able to make deterministic decisions on scheduling instructions.
Abstract: This paper describes a machine/programming model for the era of multi-core chips. It is derived from the sequential model but replaces sequential composition with concurrent composition at all levels in the program except at the level where the compiler is able to make deterministic decisions on scheduling instructions. These residual sequences of instructions are called microthreads and they are small code fragments that have blocking semantics. Dependencies that would normally be captured by sequential programming are captured in this model using dataflow synchronisation on variables in the contexts of these microthreads. The resulting model provides a foundation for significant advances in computer architecture as well as operating systems and compiler development. The paper takes a high-level perspective on the field of asynchronous distributed systems and comes to the conclusion that dynamic and concurrent models are the only viable solution but that these should not necessarily be visible to the users of the system.
TL;DR: A new reconfigurable architecture is presented to accelerate the two HMMer database search procedures hmmsearch and hmmpfam and it is described how this leads to significant runtime savings on off-the-shelf field-programmable gate arrays (FPGAs).
Abstract: HMMer is a commonly used package for biological sequence database searching with profile hidden Markov models (HMMs). It allows researchers to compare HMMs to sequence databases or sequences to HMM databases. However, such searches often take many hours on traditional computer architectures. These runtime requirements are likely to become even more severe due to the rapid growth in size of both sequence and model databases. We present a new reconfigurable architecture to accelerate the two HMMer database search procedures hmmsearch and hmmpfam. We describe how this leads to significant runtime savings on off-the-shelf field-programmable gate arrays (FPGAs).
TL;DR: The decentralized architecture of DSSM avoids the hierarchical or centralized approaches of traditional Grid architectures, eliminates large-scale flat flooding of unstructured P2P systems, and provides an interoperable, seamless, and infinite storage pool in a Grid environment.
Abstract: Most currently deployed Grid systems employ hierarchical or centralized approaches to simplify system management. However, these approaches cannot satisfy the requirements of complex Grid applications which involve hundreds or thousands of geographically distributed nodes. This paper proposes a Dynamic and Scalable Storage Management (DSSM) architecture for Grid Oriented Storage (GOS) devices. Since large-scale data intensive applications frequently involve a high degree of data access locality, the DSSM divides GOS nodes into multiple geographically distributed domains to facilitate the locality and simplify the intra-domain storage management. Dynamic GOS agents selected from the domains are organized as a virtual agent domain in a Peer-to-Peer (P2P) manner to coordinate multiple domains. As only the domain agents participate in the inter-domain communication, system wide information dissemination can be done far more efficiently than flat flooding. Grid service based storage resources are adopted to stack simple modular services piece by piece as demand grows. The decentralized architecture of DSSM avoids the hierarchical or centralized approaches of traditional Grid architectures, eliminates large-scale flat flooding of unstructured P2P systems, and provides an interoperable, seamless, and infinite storage pool in a Grid environment. The DSSM architecture is validated by a proof-of-concept prototype system.
TL;DR: This paper proposes three efficient schemes to improve the performance of first-fit allocation strategies commonly used in practice and performs rigorous simulation experiments based on practical workloads to quantify all the proposed schemes and compare them against standard schemes existing in the literature.
Abstract: In this paper, we consider the problem of processor allocation on mesh-based multiprocessor systems. We employ the idea of using migration to minimize fragmentation and the overall processing time of the tasks. In our schemes, we consider the use of task migration whenever required to improve the problem of fragmentation. To this end, we propose three efficient schemes to improve the performance of first-fit allocation strategies commonly used in practice. The first scheme, called the first-fit mesh-bifurcation (FFMB) scheme, attempts to start the search for a free submesh from either the bottom-left corner or the top-left corner of the mesh so as to reduce the amount of fragmentation in the mesh. The next two schemes, called the online dynamic compaction-single corner (ODC-SC) and online dynamic compaction-four corners (ODC-FC) schemes, use task migration to improve the performance of existing submesh allocation strategies. We perform rigorous simulation experiments based on practical workloads as reported in the literature to quantify all our proposed schemes and compare them against standard schemes existing in the literature. Based on the results, we make clear recommendations on the choice of the strategies.
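As a point of reference for the schemes above, the following sketch shows plain first-fit submesh allocation on a 2D mesh: scan base positions from the bottom-left corner and take the first free submesh of the requested size. It illustrates only the baseline strategy that the abstract says is commonly used. It does not implement FFMB, ODC-SC, or ODC-FC, and the grid representation and function names are assumptions made for the example.

```python
def first_fit(busy, w, h):
    """Return the (x, y) base corner of the first free w-by-h submesh.

    busy[y][x] is True when processor (x, y) is allocated; row 0 is taken
    to be the bottom row. Returns None when no free submesh exists.
    """
    rows, cols = len(busy), len(busy[0])
    for y in range(rows - h + 1):
        for x in range(cols - w + 1):
            if all(not busy[y + dy][x + dx]
                   for dy in range(h) for dx in range(w)):
                return x, y
    return None


def allocate(busy, w, h):
    """Mark the first free w-by-h submesh as busy and return its corner."""
    pos = first_fit(busy, w, h)
    if pos is not None:
        x, y = pos
        for dy in range(h):
            for dx in range(w):
                busy[y + dy][x + dx] = True
    return pos


mesh = [[False] * 4 for _ in range(4)]  # 4 x 4 mesh, all processors free
first = allocate(mesh, 2, 2)
second = allocate(mesh, 3, 2)
third = allocate(mesh, 2, 2)
fourth = allocate(mesh, 2, 2)  # fails: only a 1 x 2 strip remains free
```

A request fails whenever no contiguous free rectangle of the requested shape exists, even if enough processors are free in total. That fragmentation is the problem the migration-based schemes in the abstract aim to reduce.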
TL;DR: The paper reviews the current approach used by several high-level language compilers to integrate IP within their tool and proposes the CoreLib approach for standardizing this IP integration.
Abstract: This paper begins by summarizing the goals of the OpenFPGA CoreLib Working Group to facilitate the interoperability of FPGA circuit cores within a variety of FPGA design tools, including high-level programming tools targeting FPGA architectures. This effort is contrasted with other IP reuse efforts. The paper reviews the current approach used by several high-level language compilers to integrate IP within their tool. The CoreLib approach for standardizing this IP integration is proposed followed by an example that demonstrates its utility. Finally, the current state of the effort and future plans are presented.
TL;DR: The current state of practice for application development on hybrid systems is reviewed, and a vision is presented of the application development languages and tools that would greatly benefit the process of designing, implementing, and deploying applications on hybrid systems.
Abstract: Hybrid computing systems (incorporating FPGAs, GPUs, etc.) have received considerable attention recently as an approach to significant performance gains in many problem domains. Deploying applications on these systems, however, has proven to be difficult and very labor intensive. In this paper we review the current state of practice for application development on hybrid systems. We also present our vision of the application development languages and tools that we believe would greatly benefit the process of designing, implementing, and deploying applications on hybrid systems.
TL;DR: This work uses large real-life problems to show that it can significantly reduce the core memory usage in parallel, with a time performance comparable to that of a parallel in-core solver.
Abstract: The memory usage of sparse direct solvers can be the bottleneck to solve large sparse systems of linear equations of the form Ax=b. In order to solve large problems, we have designed a robust out-of-core solver, in which computed factors are stored on disk. We use large real-life problems (up to several million equations and several hundred million nonzeros) to show that we can significantly reduce the core memory usage in parallel (on up to 128 processors), with a time performance comparable to that of a parallel in-core solver. A careful study shows how the low-level I/O mechanisms impact the performance. We describe a low-level I/O layer that avoids the perturbations introduced by system buffers and allows consistently good performance results. To go significantly further in the memory reduction, it is interesting to also store the intermediate working memory on disk. In this paper we describe algorithmic models to address this issue, and study their potential in terms of both memory requirements and I/O volume. The out-of-core solver discussed in this paper is publicly available and already used by several academic and industrial groups. The results of the algorithmic modelling will be the basis to design a new version of this solver; this work may also be a useful reference for other developers of sparse out-of-core solvers.
TL;DR: A parallel implementation is investigated with the primary goals of the parallel execution of multiple instances of modtran as opaque third-party software, the consistency of numeric results in a heterogeneous compute environment, the potential to make use of otherwise idle computing resources available to researchers located at multiple institutions, and acceptable total turnaround time.
Abstract: A critical step in the product generation of satellite or airborne earth observation data is the correction of atmospheric features. Due to the complexity of the underlying physical model and the amount of coordinated effort required to provide, verify and maintain baseline atmospheric observations, one particular scientific modelling program, modtran, whose ancestor was first released in 1972, has become a de facto basis for such processing. While this provides the basis of per-pixel physical modelling, higher-level algorithms, which rely on the output of potentially thousands of runs of modtran are required for the processing of an entire scene. The widely-used atcor family of atmospheric correction software employs the commonly-used strategy of pre-computing a large look up table (lut) of values, representing modtran input parameter variation in multiple dimensions, to allow for reasonable running times in operation. The computation of this pre-computed look up table has previously taken weeks to produce a dvd (about 4GB) of output. The motivation for quicker turnaround was introduced when researchers at multiple institutions began collaboration on extending atcor features into more specialized applications. In this setting, a parallel implementation is investigated with the primary goals of: the parallel execution of multiple instances of modtran as opaque third-party software, the consistency of numeric results in a heterogeneous compute environment, the potential to make use of otherwise idle computing resources available to researchers located at multiple institutions, and acceptable total turnaround time. In both grid and cluster environments, parallel generation of a numerically consistent lut is shown to be possible and reduce ten days of computation time on a single, high-end processor to under two days of processing time with as little as eight commodity CPUs. 
Runs on up to 64 processors are investigated and the advantages and disadvantages of clusters and grids are briefly explored in reference to their evaluation in a medium-sized collaborative project.
TL;DR: This paper presents parallel algorithms for matrix-matrix multiplication which are built up from several algorithms in a multi-level structure and can lead to significant performance gains on several parallel platforms.
Abstract: This paper presents parallel algorithms for matrix-matrix multiplication which are built up from several algorithms in a multi-level structure. The upper level consists of Strassen's algorithm which is performed for a predefined number of recursions. The number of recursions can be adapted to the specific execution platform. The intermediate level is performed by a parallel non-hierarchical algorithm and the lower level uses efficient one-processor implementations of matrix-matrix multiplication like BLAS or ATLAS. Both the number of recursions of Strassen's algorithm and the specific algorithms of the intermediate and lower level can be chosen so that a variety of different multi-level algorithms results. Each level of the multi-level algorithms is associated with a hierarchical partition of the set of available processors into disjoint subsets so that deeper levels of the algorithm employ smaller groups of processors in parallel. The algorithms are expressed in the multiprocessor task programming model and are coded with the runtime library Tlib. Performance experiments on several parallel platforms show that the multi-level algorithms can lead to significant performance gains.
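The multi-level structure can be sketched sequentially: Strassen's recursion runs for a fixed number of levels, after which a conventional multiplication takes over. In this sketch the plain triple-loop `matmul` stands in for both the intermediate parallel algorithm and the BLAS or ATLAS kernel, which is a simplification. The function names and the `depth` parameter are assumptions, and the Tlib task model is not used.

```python
def matmul(a, b):
    """Conventional matrix product on lists of rows (stand-in base case)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split(m):
    """Split a square matrix of even order into four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a, b, depth):
    """Multiply square matrices with `depth` levels of Strassen recursion."""
    n = len(a)
    if depth == 0 or n % 2 != 0:
        return matmul(a, b)  # lower level: conventional multiplication
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    # The seven products are mutually independent, so each could be handed
    # to a separate group of processors.
    m1 = strassen(add(a11, a22), add(b11, b22), depth - 1)
    m2 = strassen(add(a21, a22), b11, depth - 1)
    m3 = strassen(a11, sub(b12, b22), depth - 1)
    m4 = strassen(a22, sub(b21, b11), depth - 1)
    m5 = strassen(add(a11, a12), b22, depth - 1)
    m6 = strassen(sub(a21, a11), add(b11, b12), depth - 1)
    m7 = strassen(sub(a12, a22), add(b21, b22), depth - 1)
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom


small = strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]], 1)
```

Choosing `depth` trades fewer multiplications against more additions and temporary storage, which is consistent with the abstract's remark that the number of recursions can be adapted to the execution platform.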
TL;DR: Some features of the solver designed during the current GREMLINS (GRid Efficient Method for LINear Systems) project, which is based on a coarse grained iterative method called multisplitting method, are presented.
Abstract: Traditional large sparse linear solvers are not well suited to a grid computing environment, as they require a large amount of synchronization and communication, which penalizes performance on this architecture. This paper presents some features of the solver designed during the current GREMLINS (GRid Efficient Method for LINear Systems) project. The GREMLINS solver limits the amount of communication as it is based on a coarse grained iterative method called the multisplitting method. Moreover, the solver can be executed in either a synchronous or an asynchronous mode. In the latter case, iterations are desynchronized and there is no synchronization at all. This may result in a faster execution time compared to the synchronous case. Some experiments presented in this paper with the GRID'5000 architecture, a nationwide experimental grid in France, allowed us to highlight interesting features of this solver.
TL;DR: This paper presents RNAVLab (RNA Virtual Laboratory), a virtual laboratory for studying RNA secondary structures including pseudoknots, which allows scientists to address the challenge of computationally predicting secondary structures with thermodynamic methods when the RNA molecules have long nucleotide sequences and include complex motifs such as pseudoknots.
Abstract: As ribonucleic acid (RNA) molecules play important roles in many biological processes including gene expression and regulation, their secondary structures have been the focus of many recent studies. Despite the computing power of supercomputers, computationally predicting secondary structures with thermodynamic methods is still not feasible when the RNA molecules have long nucleotide sequences and include complex motifs such as pseudoknots. This paper presents RNAVLab (RNA Virtual Laboratory), a virtual laboratory for studying RNA secondary structures including pseudoknots that allows scientists to address this challenge. Two important case studies show the versatility and functionalities of RNAVLab. The first study quantifies its capability to rebuild longer secondary structures from motifs found in systematically sampled nucleotide segments. The extensive sampling and predictions are made feasible in a short turnaround time because of the grid technology used. The second study shows how RNAVLab allows scientists to study the viral RNA genome replication mechanisms used by members of the virus family Nodaviridae.
TL;DR: Performance models are developed for the four-point explicit group (EG) and explicit decoupled group (EDG) methods, and a detailed study is presented of their hypothetical implementation on two distributed memory multicomputers with different computation speeds and communication bandwidths.
Abstract: Since their introduction, the four-point explicit group (EG) and explicit decoupled group (EDG) methods for solving elliptic PDEs have been implemented on various parallel computing architectures such as shared memory parallel computers and distributed computer systems. However, no detailed performance analysis of these algorithms was carried out in any of these implementations. In this paper we develop performance models for these explicit group methods and present a detailed study of their hypothetical implementation on two distributed memory multicomputers with different computation speeds and communication bandwidths. Detailed performance analysis based on these models predicted different theoretical performance if the methods were implemented on the clusters. This was confirmed by experiments performed on the two distinct clusters. Theoretical analysis and experimental results indicate that both explicit group methods are scalable with respect to the number of processors and the problem size.
TL;DR: A matrix factorization called WZ factorization for the solution of symmetric tridiagonal linear systems is presented; when combined with a partitioning scheme, it yields a divide and conquer algorithm.
Abstract: We present a matrix factorization called WZ factorization for the solution of symmetric tridiagonal linear systems. When combined with a partitioning scheme, it yields a divide and conquer algorithm. Existence theorems are presented and a backward error analysis is given. A variant of WZ factorization called WDZ factorization is also presented. Both the WZ and WDZ algorithms for the parallel solution of large tridiagonal symmetric positive definite linear systems are implemented on a parallel machine using MPI for inter-node communication.
TL;DR: A set of library development guidelines is provided and validated with the RCLib case study; RCLib is a set of portable libraries with over 100 cores, targeting a wide range of applications.
Abstract: Portable libraries of highly-optimized hardware cores can significantly reduce the development time of reconfigurable computing applications. This paper presents the tradeoffs and challenges in the design of such libraries. A set of library development guidelines is provided, which has been validated with the RCLib case study. RCLib is a set of portable libraries with over 100 cores, targeting a wide range of applications. RCLib portability has been verified in three major High-Performance reconfigurable computing architectures: SRC6, Cray XD1 and SGI RC100. Compared to full-software implementations, applications using RCLib hardware acceleration cores show speedups ranging from one to four orders of magnitude.
TL;DR: A multi-level hybrid approach appears to offer an effective parallelization method for implementation of the MIT forward model on SMP clusters.
Abstract: This paper describes four parallelization approaches used in a finite-difference-based electromagnetic modeller for application in magnetic induction tomography (MIT) and suitable for implementation on computer systems with symmetric multiprocessor (SMP) architecture. The approaches include: (i) splitting by coils using a distributed memory approach, (ii) splitting by physical domain using a distributed memory approach, (iii) splitting by physical domain using hybrid distributed/shared memory approach and (iv) splitting by both coils and physical domain using multi-level distributed and shared memory approaches respectively. All four approaches were implemented and tested on an IBM SP supercomputer. Coil parallelization was the most efficient method due to low inter-processor communication requirements but was limited by the number of coils in the MIT system. Approaches (ii) and (iii) allowed a larger number of processors to be employed but the efficiency versus number of processors was found to drop at a faster rate in comparison to (i). The fourth approach both allowed a larger number of processors to be employed and was found to provide higher efficiency than the parallelization by physical domain only. This multi-level hybrid approach therefore appears to offer an effective parallelization method for implementation of the MIT forward model on SMP clusters.
TL;DR: A distributed algorithm for the restoration of still frames corrupted by vertical line scratches is reported, and is solved using an ad-hoc Genetic Algorithm following a pipeline logical structure.
Abstract: This paper reports a distributed algorithm for the restoration of still frames corrupted by vertical line scratches. The restoration is approached here as an optimisation problem, and is solved using an ad-hoc Genetic Algorithm. The distributed algorithm is designed following a pipeline logical structure. The front end is a network of standard workstations with heterogeneous operating systems. The image quality is appreciable, and the computational time is quite low with respect to the sequential version.
TL;DR: A system for vision-based tracking of body posture is described to provide a robust yet simple and inexpensive solution to real-time body tracking through a careful choice of visual and kinematic models.
Abstract: Non-intrusive human body tracking is a key issue in advanced human-computer interaction, with applications ranging from virtual reality to videoconference and telepresence. This paper describes a system for vision-based tracking of body posture. The system is explicitly designed to provide a robust yet simple and inexpensive solution to real-time body tracking through a careful choice of visual and kinematic models. Human posture representation is fully compatible with the MPEG-4 standard. Results of system application to a computer graphics scenario (animation of 3D avatars) are presented and discussed.
TL;DR: The foveated retina approach can be considered as a proposal of ad-hoc hardware implementation of biologically inspired solutions, but also as a model for the internal virtual representation of the real world structures.
Abstract: In the literature, different approaches to artificial visual systems inspired by biological solutions can be found. In this paper, a particular solution will be described with an experimental case study: the foveated retina approach. The solution can be considered as a proposal of ad-hoc hardware implementation of biologically inspired solutions, but also as a model for the internal virtual representation of the real world structures.
TL;DR: An algorithm for partitioning a sparse matrix into a matrix that has blocks on the diagonal such that two consecutive blocks can overlap is presented and is suitable for applying the explicit formulation of multiplicative Schwarz preconditioner.
Abstract: We present an algorithm for partitioning a sparse matrix into a matrix that has blocks on the diagonal such that two consecutive blocks can overlap. We refer to this form of the matrix as block diagonal matrix with overlap. The partitioned matrix is suitable for applying the explicit formulation of multiplicative Schwarz (EFMS) [G.A. Atenekeng Kahou, E. Kamgnia, B. Philippe, An explicit formulation of the multiplicative Schwarz preconditioner, Journal of Applied Numerical Mathematics 57 (2007) 1197-1213] used as a preconditioner for solving a sparse unsymmetric system of linear equations Ax=b. The proposed algorithm partitions the graph of the matrix A into k parts such that every part V_i has connecting edges with at most two neighbors, V_{i-1} and V_{i+1}. First, an ordering algorithm that reduces the profile of the matrix is applied, and an initial block-diagonal partition with overlap is obtained. Second, an iterative strategy is used to further refine the partitioning by allowing vertices to be moved between partitions. Experiments performed on real-world matrices show the usefulness of this approach.
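Illustrative note: the structural condition stated in this abstract (each part is connected only to its immediate predecessor and successor) can be checked directly on an edge list. The following is a minimal sketch of such a check only, not the partitioning algorithm of the paper; the function name `is_chain_partition` and the input layout (a vertex-to-part-index mapping) are assumptions made for illustration.

```python
from typing import Dict, Hashable, Iterable, Tuple


def is_chain_partition(
    edges: Iterable[Tuple[Hashable, Hashable]],
    part: Dict[Hashable, int],
) -> bool:
    """Check the 'at most two neighbors' condition on a partition.

    `part[v]` is the index i of the part V_i that contains vertex v
    (hypothetical layout). The condition holds exactly when every edge
    joins two vertices whose part indices differ by at most 1, i.e. no
    edge links V_i to any part other than V_{i-1}, V_i, or V_{i+1}.
    """
    return all(abs(part[u] - part[v]) <= 1 for u, v in edges)


# A path graph 0-1-2-3-4-5 split into three consecutive parts.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
parts = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}

ok = is_chain_partition(path_edges, parts)
# Adding an edge between the first and last parts violates the condition.
broken = is_chain_partition(path_edges + [(0, 5)], parts)
```

Here `ok` is true and `broken` is false, because the extra edge (0, 5) connects V_0 directly to V_2.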
TL;DR: This paper presents coarse grained parallel graph algorithms with small message overheads that solve the following standard graph problems related to graph matching: finding maximum matchings in convex bipartite graphs, and finding maximum weight matchings in trees.
Abstract: Parallel graph algorithm design is a very well studied topic. Many results have been presented for the PRAM model. However, these algorithms are inherently fine grained, and experiments show that PRAM algorithms often do not achieve the expected speedup on real machines because of large message overheads. In this paper, we present coarse grained parallel graph algorithms with small message overheads that solve the following standard graph problems related to graph matching: finding maximum matchings in convex bipartite graphs, and finding maximum weight matchings in trees. To our knowledge, these are the first efficient parallel algorithms for these problems that are designed for standard commercial parallel machines such as off-the-shelf processor clusters.
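Illustrative note: for readers unfamiliar with the second problem, the sketch below computes the weight of a maximum weight matching in a tree with a standard sequential dynamic program. It is a baseline for understanding the problem, not the coarse grained parallel algorithm of the paper. The function name, the vertex numbering 0..n-1, and the `(u, v, weight)` edge format are assumptions made for illustration.

```python
from typing import List, Tuple


def max_weight_matching_tree(n: int, edges: List[Tuple[int, int, float]]) -> float:
    """Weight of a maximum weight matching in a tree (sequential DP).

    Assumes the edges form a connected tree on vertices 0..n-1, n >= 1.
    """
    adj: List[List[Tuple[int, float]]] = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    # Iterative DFS from vertex 0: every parent precedes its children
    # in `order`, so the reversed order visits children first.
    parent = [-1] * n
    seen = [False] * n
    seen[0] = True
    order: List[int] = []
    stack = [0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, _ in adj[u]:
            if not seen[v]:
                seen[v] = True
                parent[v] = u
                stack.append(v)

    free = [0.0] * n  # best weight in subtree of u with u left unmatched
    best = [0.0] * n  # best weight in subtree of u, u matched or not
    for u in reversed(order):
        base = sum(best[v] for v, _ in adj[u] if v != parent[u])
        free[u] = base
        best[u] = base
        for v, w in adj[u]:
            if v != parent[u]:
                # Match u with child v: v must stay unmatched below.
                best[u] = max(best[u], base - best[v] + free[v] + w)
    return best[0]


# Path 0-1-2-3: taking the two outer edges (3 + 3) beats the middle edge (4).
path_result = max_weight_matching_tree(4, [(0, 1, 3), (1, 2, 4), (2, 3, 3)])
# Star centered at 0: only one edge can be chosen, so the heaviest wins.
star_result = max_weight_matching_tree(4, [(0, 1, 2), (0, 2, 7), (0, 3, 3)])
```

The recurrence keeps two values per vertex because matching a vertex to one child forbids that child from being matched further down, while all other children remain unconstrained.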