TL;DR: The existence of a J-shaped distribution is demonstrated, sources of bias that cause this distribution are identified, ways to overcome these biases are proposed, and it is shown that overcoming these biases helps product review systems better predict future product sales.
Abstract: While product review systems that collect and disseminate opinions about products from recent buyers (Table 1) are valuable forms of word-of-mouth communication, evidence suggests that they are overwhelmingly positive. Kadet notes that most products receive almost five stars. Chevalier and Mayzlin also show that book reviews on Amazon and Barnes & Noble are overwhelmingly positive. Is this because all products are simply outstanding? However, a graphical representation of product reviews reveals a J-shaped distribution (Figure 1) with mostly 5-star ratings, some 1-star ratings, and hardly any ratings in between. What explains this J-shaped distribution? If products are indeed outstanding, why do we also see many 1-star ratings? Why aren't there any product ratings in between? Is it because there are no "average" products? Or, is it because there are biases in product review systems? If so, how can we overcome them? The J-shaped distribution also creates some fundamental statistical problems. Conventional wisdom assumes that the average of the product ratings is a sufficient proxy of product quality and product sales. Many studies used the average of product ratings to predict sales. However, these studies showed inconsistent results: some found product reviews to influence product sales, while others did not. The average is statistically meaningful only when it is based on a unimodal distribution, or when it is based on a symmetric bimodal distribution. However, since product review systems have an asymmetric bimodal (J-shaped) distribution, the average is a poor proxy of product quality. This report aims to first demonstrate the existence of a J-shaped distribution, second to identify the sources of bias that cause the J-shaped distribution, third to propose ways to overcome these biases, and finally to show that overcoming these biases helps product review systems better predict future product sales. 
We tested the distribution of product ratings for three product categories (books, DVDs, videos) with data from Amazon collected between February and July 2005: 78%, 73%, and 72% of the product ratings for books, DVDs, and videos are greater than or equal to four stars (Figure 1), confirming our proposition that product reviews are overwhelmingly positive. Figure 1 (left graph) shows a J-shaped distribution of all products. This contradicts the law of "large numbers" that would imply a normal distribution. Figure 1 (middle graph) shows the distribution of three randomly selected products in each category with over 2,000 reviews. The results show that these reviews still have a J-shaped distribution, implying that the J-shaped distribution is not due to a "small number" problem. Figure 1 (right graph) shows that even products with a median average review (around 3 stars) follow the same pattern.
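The claim that the average is a poor proxy under a J-shaped distribution can be illustrated with a minimal sketch. The two histograms below are hypothetical numbers invented for this example (they are not the Amazon data), and the helper names `expand` and `share_near_mean` are assumptions introduced here. Both histograms have the same average rating, yet very different shares of reviewers actually gave a rating near that average.

```python
from statistics import mean

def expand(counts):
    """Turn a {stars: count} histogram into a flat list of ratings."""
    return [stars for stars, n in sorted(counts.items()) for _ in range(n)]

def share_near_mean(ratings, width=0.5):
    """Fraction of ratings lying within `width` stars of the average rating."""
    avg = mean(ratings)
    return sum(1 for r in ratings if abs(r - avg) <= width) / len(ratings)

# Hypothetical histograms (100 ratings each), invented for illustration only.
j_shaped = {1: 20, 2: 3, 3: 4, 4: 13, 5: 60}
unimodal = {1: 2, 2: 8, 3: 17, 4: 44, 5: 29}

for name, counts in (("J-shaped", j_shaped), ("unimodal", unimodal)):
    ratings = expand(counts)
    print(f"{name}: mean={mean(ratings):.2f}, "
          f"within 0.5 stars of mean={share_near_mean(ratings):.0%}")
```

With these made-up counts both averages are 3.90 stars, but only 13% of the J-shaped ratings fall within half a star of the average, compared with 44% for the unimodal histogram, so the same average describes the two sets of reviewers very differently.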
TL;DR: The FLASH3 architecture is described, with emphasis on solutions to the more challenging conflicts arising from solver complexity, portable performance requirements, and legacy codes.
Abstract: FLASH is a publicly available high performance application code which has evolved into a modular, extensible software system from a collection of unconnected legacy codes. FLASH has been successful because its capabilities have been driven by the needs of scientific applications, without compromising maintainability, performance, and usability. In its newest incarnation, FLASH3 consists of inter-operable modules that can be combined to generate different applications. The FLASH architecture allows arbitrarily many alternative implementations of its components to co-exist and interchange with each other, resulting in greater flexibility. Further, a simple and elegant mechanism exists for customization of code functionality without the need to modify the core implementation of the source. A built-in unit test framework providing verifiability, combined with a rigorous software maintenance process, allows the code to operate simultaneously in the dual mode of production and development. In this paper we describe the FLASH3 architecture, with emphasis on solutions to the more challenging conflicts arising from solver complexity, portable performance requirements, and legacy codes. We also include results from user surveys conducted in 2005 and 2007, which highlight the success of the code.
TL;DR: This work presents a new, simple algorithmic idea for the collective communication operations broadcast, reduction, and scan (prefix sums), which beats all previous algorithms for reduction and scan.
Abstract: We present a new, simple algorithmic idea for the collective communication operations broadcast, reduction, and scan (prefix sums). The algorithms concurrently communicate over two binary trees which both span the entire network. By careful layout and communication scheduling, each tree communicates as efficiently as a single tree with exclusive use of the network. Our algorithms thus achieve up to twice the bandwidth of most previous algorithms. In particular, our approach beats all previous algorithms for reduction and scan. Experiments on clusters with Myrinet and InfiniBand interconnects show significant reductions in running time for all three operations, sometimes even close to the best possible factor of two.
TL;DR: The frequent items problem (also known as the heavy hitters problem) is one of the most heavily studied questions in data streams, and is important both in itself, and as a subroutine within more advanced data stream computations.
Abstract: Many data generation processes can be modeled as data streams. They produce huge numbers of pieces of data, each of which is simple in isolation, but which taken together lead to a complex whole. For example, the sequence of queries posed to an Internet search engine can be thought of as a stream, as can the collection of transactions across all branches of a supermarket chain. In aggregate, this data can arrive at enormous rates, easily in the realm of hundreds of gigabytes per day or higher. While this data may be archived and indexed within a data warehouse, it is also important to process the data "as it happens," to provide up to the minute analysis and statistics on current trends. Methods to achieve this must be quick to respond to each new piece of information, and use resources which are very small when compared to the total quantity of data. These applications and others like them have led to the formulation of the so-called "streaming model." In this abstraction, algorithms take only a single pass over their input, and must accurately compute various functions while using resources (space and time per item) that are strictly sublinear in the size of the input---ideally, polynomial in the logarithm of the input size. The output must be produced at the end of the stream, or when queried on the prefix of the stream that has been observed so far. (Other variations ask for the output to be maintained continuously in the presence of updates, or on a "sliding window" of only the most recent updates.) Some problems are simple in this model: for example, given a stream of transactions, finding the mean and standard deviation of the bill totals can be accomplished by retaining a few "sufficient statistics" (sum of all values, sum of squared values, etc.). Others can be shown to require a large amount of information to be stored, such as determining whether a particular search query has already appeared anywhere within a large stream of queries. 
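The "sufficient statistics" remark can be made concrete with a short sketch. The class name `RunningStats` and its methods are hypothetical, introduced only for this illustration; the sketch keeps three numbers (count, sum, sum of squares) and never stores the stream itself.

```python
import math

class RunningStats:
    """Single-pass mean and standard deviation from three sufficient statistics."""

    def __init__(self):
        self.count = 0       # number of values seen so far
        self.total = 0.0     # sum of all values
        self.total_sq = 0.0  # sum of squared values

    def update(self, value):
        """Process one stream item in constant time and constant space."""
        self.count += 1
        self.total += value
        self.total_sq += value * value

    def mean(self):
        return self.total / self.count

    def std(self):
        """Population standard deviation: sqrt(E[x^2] - E[x]^2)."""
        variance = self.total_sq / self.count - self.mean() ** 2
        return math.sqrt(max(variance, 0.0))  # guard against tiny negative rounding error

stats = RunningStats()
for bill_total in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(bill_total)
print(stats.mean(), stats.std())
```

The sum-of-squares formula is the one the text alludes to; it can lose precision when the values are large relative to their spread, in which case an incremental update such as Welford's method is a common alternative.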
Determining which problems can be solved effectively within this model remains an active research area. The frequent items problem (also known as the heavy hitters problem) is one of the most heavily studied questions in data streams. The problem is popular due to its simplicity to state, and its intuitive interest and value. It is important both in itself, and as a subroutine within more advanced data stream computations. Informally, given a sequence of items, the problem is simply to find those items which occur most frequently. Typically, this is formalized as finding all items whose frequency exceeds a specified fraction of the total number of items. This is shown in Figure 1. Variations arise when the items are given weights, and further when these weights can also be negative. This abstract problem captures a wide variety of settings. The items can represent packets on the Internet, and the weights are the size of the packets. Then the frequent items represent the most popular destinations, or the heaviest bandwidth users (depending on how the items are extracted from the flow identifiers). This knowledge can help in optimizing routing decisions, for in-network caching, and for planning where to add new capacity. Or, the items can represent queries made to an Internet search engine, and the frequent items are now the (currently) popular terms. These are not simply hypothetical examples, but genuine cases where algorithms for this problem have been applied by large corporations: AT&T [...] existing work is sometimes claimed to be incapable of a certain guarantee, which in truth it can provide with only minor modifications; and experimental evaluations do not always compare against the most suitable methods. In this paper, we present the main ideas in this area, by describing some of the most significant algorithms for the core problem of finding frequent items using common notation and terminology. 
In doing so, we also present the historical development of these algorithms. Studying these algorithms is instructive, as they are relatively simple, but can be shown to provide formal guarantees on the quality of their output as a function of an accuracy parameter ε. We also provide baseline implementations of many of these algorithms against which future algorithms can be compared, and on top of which algorithms for different problems can be built. We perform experimental evaluation of the algorithms over a variety of data sets to indicate their performance in practice. From this, we are able to identify clear distinctions among the algorithms that are not apparent from their theoretical analysis alone.
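As an illustration of the kind of simple algorithm with a formal guarantee that the text describes, here is a minimal sketch of one classic counter-based method, the Misra-Gries summary. This excerpt does not name the algorithms it surveys, so the choice of Misra-Gries is an assumption made here, as are the function name `misra_gries` and the parameter `k`. With `k - 1` counters over a stream of `n` items, every item occurring more than `n / k` times is guaranteed to remain in the summary, and each stored count underestimates the true count by at most `n / k`.

```python
def misra_gries(stream, k):
    """One pass over `stream`, keeping at most k - 1 (item, counter) pairs."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = list("abacabadabaa")      # 'a' occurs 7 times out of n = 12
summary = misra_gries(stream, k=3)  # candidates for frequency > 12 / 3 = 4
print(summary)
```

The summary may also contain items below the threshold (false positives), so a second pass over the data, when one is possible, is the usual way to confirm the exact counts of the candidates.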
TL;DR: This work exhaustively examined 128 GPU data layout configurations to improve register footprint and running time and conclude higher occupancy has greater impact than reduced latency.
Abstract: MUMmerGPU uses highly-parallel commodity graphics processing units (GPU) to accelerate the data-intensive computation of aligning next generation DNA sequence data to a reference sequence for use in diverse applications such as disease genotyping and personal genomics. MUMmerGPU 2.0 features a new stackless depth-first-search print kernel and is 13x faster than the serial CPU version of the alignment code and nearly 4x faster in total computation time than MUMmerGPU 1.0. We exhaustively examined 128 GPU data layout configurations to improve register footprint and running time and conclude higher occupancy has greater impact than reduced latency. MUMmerGPU is available open-source at http://www.mummergpu.sourceforge.net.
TL;DR: Single precision matrix multiplication kernels are presented implementing the C=C-AxB^T operation and the C=C-AxB operation for matrices of size 64x64 elements, and the performance of 25.55 Gflop/s is reported.
Abstract: Matrix multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single precision floating point performance, aside from special purpose accelerators like Graphics Processing Units (GPUs). In order to fully exploit the potential of the CELL processor for a wide range of numerical algorithms, fast implementation of the matrix multiplication operation is essential. The crucial component is the matrix multiplication kernel crafted for the short vector Single Instruction Multiple Data architecture of the Synergistic Processing Element of the CELL processor. In this paper, single precision matrix multiplication kernels are presented implementing the C=C-AxB^T operation and the C=C-AxB operation for matrices of size 64x64 elements. For the latter case, the performance of 25.55 Gflop/s is reported, or 99.80% of the peak, using as little as 5.9 kB of storage for code and auxiliary data structures.
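For readers unfamiliar with the operation, the following is a plain reference sketch of what C=C-AxB^T computes, together with the arithmetic behind the reported figures. It is not the SIMD kernel from the paper; the function names are hypothetical, and the 2n^3 operation count is the conventional one for an n x n multiply-accumulate.

```python
def matmul_sub_abt(C, A, B):
    """In-place C = C - A * B^T for square matrices stored as lists of rows."""
    n = len(C)
    for i in range(n):
        for j in range(n):
            # Row i of A dotted with row j of B is element (i, j) of A * B^T.
            C[i][j] -= sum(A[i][k] * B[j][k] for k in range(n))
    return C

def flops(n):
    """Conventional count for one n x n update: n^3 multiplies plus n^3 adds/subtracts."""
    return 2 * n ** 3

def implied_peak(rate_gflops, fraction_of_peak):
    """Peak rate implied by an achieved rate and its stated fraction of peak."""
    return rate_gflops / fraction_of_peak

C = matmul_sub_abt([[0, 0], [0, 0]], [[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C)                             # small 2x2 check of the definition
print(flops(64))                     # operations in one 64x64 update
print(implied_peak(25.55, 0.9980))   # Gflop/s, from the two figures in the abstract
```

Under these assumptions one 64x64 update costs 524,288 floating point operations, so at 25.55 Gflop/s a single kernel call takes roughly 20 microseconds, and the stated 99.80% corresponds to a peak of about 25.6 Gflop/s.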
TL;DR: Stanford professor Pat Hanrahan sits down with the noted hedge fund founder, computational biochemist, and (above all) computer scientist to discuss the future of artificial intelligence.
Abstract: Stanford professor Pat Hanrahan sits down with the noted hedge fund founder, computational biochemist, and (above all) computer scientist.
TL;DR: A scalable approach is presented, based on a parallel replay of the target application's communication behavior, that can efficiently identify wait states at the previously inaccessible scale of 65,536 processes and that has potential for even larger configurations.
Abstract: When scaling message-passing applications to thousands of processors, their performance is often affected by wait states that occur when processes fail to reach synchronization points simultaneously. As a first step in reducing the performance impact, we have shown in our earlier work that wait states can be diagnosed by searching event traces for characteristic patterns. However, our initial sequential search method did not scale beyond several hundred processes. Here, we present a scalable approach, based on a parallel replay of the target application's communication behavior, that can efficiently identify wait states at the previously inaccessible scale of 65,536 processes and that has potential for even larger configurations. We explain how our new approach has been integrated into a comprehensive parallel tool architecture, which we use to demonstrate that wait states may consume a major fraction of the execution time at larger scales.
TL;DR: This work introduces the novel graph partitioning and repartitioning heuristic Bubble-FOS/C, and compares it to state-of-the-art libraries Metis and Jostle to reveal that the new heuristic is slower, but generates high-quality solutions that are often superior.
Abstract: The NP-hard graph partitioning problem is an important subtask in load balancing and many other applications. It requires the division of a graph's vertex set into P equally sized subsets such that some objective function is optimized. State-of-the-art libraries addressing this problem show several deficiencies: they are hard to parallelize, focus on small edge-cuts instead of few boundary vertices, and often produce disconnected partitions. This work introduces our novel graph partitioning and repartitioning heuristic Bubble-FOS/C. In contrast to other libraries, Bubble-FOS/C does not try to minimize the edge-cut explicitly, but focuses instead implicitly on good partition shapes. The shapes are optimized by diffusion processes that are embedded into an iterative framework. This approach incorporates a high degree of parallelism. Besides describing the evolution process that led to the new diffusion scheme FOS/C used by Bubble-FOS/C, we reveal some of FOS/C's properties and propose a number of enhancements for a fast and reliable implementation. Our experiments, in which we compare sequential and parallel Bubble-FOS/C implementations to the state-of-the-art libraries Metis and Jostle, reveal that our new heuristic is slower, but generates high-quality solutions that are often superior.
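The distinction between the edge-cut and the number of boundary vertices can be seen on a toy example. The graph, the partition, and the function names below are hypothetical and serve only to show that the two metrics count different things; this is not the Bubble-FOS/C heuristic itself.

```python
def edge_cut(edges, part):
    """Number of edges whose endpoints lie in different subsets."""
    return sum(1 for u, v in edges if part[u] != part[v])

def boundary_vertices(edges, part):
    """Vertices that have at least one neighbour in a different subset."""
    boundary = set()
    for u, v in edges:
        if part[u] != part[v]:
            boundary.update((u, v))
    return boundary

# Hypothetical 6-vertex graph split into P = 2 equally sized subsets.
edges = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (3, 4), (4, 5)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

print(edge_cut(edges, part))                  # cut edges: (0,3), (0,4), (0,5)
print(sorted(boundary_vertices(edges, part)))  # vertices touching a cut edge
```

Here three edges are cut but four vertices sit on the boundary, because vertex 0 contributes to every cut edge; a partitioner that minimizes one quantity is therefore not automatically minimizing the other.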
TL;DR: A number of performance tests that are motivated by typical application scenarios are proposed that cover the overhead of providing the MPI_THREAD_MULTIPLE level of thread safety for user programs, the amount of concurrency in different threads making MPI calls, the ability to overlap communication with computation, and other features.
Abstract: As parallel systems are commonly being built out of increasingly large multicore chips, application programmers are exploring the use of hybrid programming models combining MPI across nodes and multithreading within a node. Many MPI implementations, however, are just starting to support multithreaded MPI communication, often focussing on correctness first and performance later. As a result, both users and implementers need some measure for evaluating the multithreaded performance of an MPI implementation. In this paper, we propose a number of performance tests that are motivated by typical application scenarios. These tests cover the overhead of providing the MPI_THREAD_MULTIPLE level of thread safety for user programs, the amount of concurrency in different threads making MPI calls, the ability to overlap communication with computation, and other features. We present performance results with this test suite on several platforms (Linux cluster, Sun and IBM SMPs) and MPI implementations (MPICH2, Open MPI, IBM, and Sun).
TL;DR: The question the authors would like to address here is to what extent a translation service such as Google can produce adequate results in the language other than that being used to write the query.
Abstract: In multilingual countries (Canada, Hong Kong, India, among others) and large international organizations or companies (such as the WTO or the European Parliament), and among Web users in general, accessing information written in other languages has become a real need (news, hotel or airline reservations, or government information, statistics). While some users are bilingual, others can read documents written in another language but cannot formulate a query to search it, or at least cannot provide reliable search terms in a form comparable to those found in the documents being searched. There are also many monolingual users who may want to retrieve documents in another language and then have them translated into their own language, either manually or automatically. Translation services may, however, be too expensive, not readily accessible, or not available within a short timeframe. On the other hand, many documents contain non-textual information such as images, videos and statistics that do not need translation and can be understood regardless of the language involved. In response to these needs and in order to make the Web universally available regardless of any language barriers, in May 2007 Google launched a translation service that now provides two-way online translation services mainly between English and 41 other languages, for example, Arabic, simplified and traditional Chinese, French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish (http://translate.google.com/). Over the last few years other free Internet translation services have been made available, for example by BabelFish (http://babel.altavista.com/) or Yahoo! (http://babelfish.yahoo.com/). These two systems are similar to that used by Google, given that they are based on technology developed by Systran, one of the earliest companies to develop machine translation. 
Also worth mentioning here is the Promt system (also known as Reverso, http://translation2.paralink.com/), which was developed in Russia to provide mainly translation between Russian and other languages. The question we would like to address here is to what extent a translation service such as Google can produce adequate results in the language other than that being used to write the query. Although we will not evaluate translations per se we will test and analyze various systems in terms of their ability to retrieve items automatically based on a translated query. To be adequate, these tests must be done on a collection of documents written in one given language plus a series of topics (expressing user information needs) written in other languages, plus a series of relevance assessments (relevant documents for each topic).
TL;DR: This paper investigates the accuracy and performance characteristics of GPUs, including results from a preproduction double precision-capable GPU, and accelerates the full Quantum Monte Carlo simulation code DCA++, similarly investigating its tolerance to the precision of arithmetic delivered by GPUs.
Abstract: The tradeoffs of accuracy and performance are as yet an unsolved problem when dealing with Graphics Processing Units (GPUs) as a general-purpose computation device. Their high performance and low cost make them a desirable target for scientific computation, and new language efforts help address the programming challenges of data parallel algorithms and memory management. But the original task of GPUs - real-time rendering - has traditionally kept accuracy as a secondary goal, and sacrifices have sometimes been made as a result. In fact, the widely deployed hardware is generally capable of only single precision arithmetic, and even this accuracy is not necessarily equivalent to that of a commodity CPU. In this paper, we investigate the accuracy and performance characteristics of GPUs, including results from a preproduction double precision-capable GPU. We then accelerate the full Quantum Monte Carlo simulation code DCA++, similarly investigating its tolerance to the precision of arithmetic delivered by GPUs. The results show that while DCA++ has some sensitivity to the arithmetic precision, the single-precision GPU results were comparable to single-precision CPU results. Acceleration of the code on a fully GPU-enabled cluster showed that any remaining inaccuracy in GPU precision was negligible; sufficient accuracy was retained for scientifically meaningful results while still showing significant speedups.
TL;DR: First implementation results are presented for a modular arithmetic library on GPUs for cryptography; the library, in C++ for CUDA, provides modular arithmetic, finite field arithmetic and some ECC support.
Abstract: We present below our first implementation results on a modular arithmetic library on GPUs for cryptography. Our library, in C++ for CUDA, provides modular arithmetic, finite field arithmetic and some ECC support. Several algorithms and memory coding styles have been compared: local, shared and register. For moderate sizes, we report up to a 2.6x speedup compared to a state-of-the-art library.
TL;DR: A survey of IS development professionals, using well-established instruments, examines how process and product factors relate to system project success and whether different players in IS development perceive these factors differently.
Abstract: The success of system development is most often gauged by three primary indicators: the number of days of deviation from scheduled delivery date, the percentage of deviation from the proposed budget, and meeting the needs of the client users. Tools and techniques to help perform well along these dimensions abound in practice and research. However, the project view of systems development should be broader than any particular development tool or methodology. Any given development philosophy or approach can be inserted into a systems development project to best fit the conditions, product, talent, and goals of the markets and organization. In order to best satisfy the three criteria, system development project managers must focus on the process of task completion and look to apply controls that ensure success, promote learning within the team and organization, and end up with a software product that not only meets the requirements of the client but operates efficiently and is flexible enough to be modified to meet changing needs of the organization. In this fashion, the project view must examine both process and product. Often, tasks required for project completion seem contradictory to organizational goals. Within the process, managerial controls are applied in order to retain alignment of the product to the initial, and changing, requirements of the organization. However, freedom from tight controls promotes learning. The product also has contradictions among desired outcomes. Designers must consider tradeoffs between product efficiency and flexibility, with the trend in processing power leading us ever more toward the flexibility side. Still, we rage between conflicting criteria, with the advocates of a waterfall system development lifecycle (SDLC) usually pushing more for control aspects and efficient operations while agile proponents seek more of a learning process and flexible product. 
Regardless of the development methodology followed, project managers must strive to deliver the system on time, within budget, and to meet the requirements of the user. Thus, both product and process are crucial in the determination of success. To compound the difficulties, those in control of choosing an appropriate methodology view success criteria from a different perspective than other stakeholders. Understanding how different stakeholders perceive these factors impacting eventual project success can be valuable in adjusting appropriate methodologies. Our study looks at these relationships using well-established instruments in a survey of IS development professionals to better clarify the importance of these variables in system project success and any perceived differences among different players in IS development (see the sidebar on "How the Study Was Conducted").
TL;DR: It is shown that, in several situations, the oblivious algorithm Dynamic Clustering has scalability performance comparable to non-oblivious algorithms, which is remarkable considering that the authors' oblivious algorithm uses much less information to schedule tasks.
Abstract: Bag-of-Tasks applications are parallel applications composed of independent tasks. Examples of Bag-of-Tasks (BoT) applications include Monte Carlo simulations, massive searches (such as key breaking), image manipulation applications and data mining algorithms. This paper analyzes the scalability of Bag-of-Tasks applications running on master-slave platforms and proposes a scalability-related measure dubbed input file affinity. In this work, we also illustrate how the input file affinity, which is a characteristic of an application, can be used to improve the scalability of Bag-of-Tasks applications running on master-slave platforms. The input file affinity was considered in a new scheduling algorithm dubbed Dynamic Clustering, which is oblivious to task execution times. We compare the scalability of the Dynamic Clustering algorithm to several other algorithms, oblivious and non-oblivious to task execution times, proposed in the literature. We show in this paper that, in several situations, the oblivious algorithm Dynamic Clustering has scalability performance comparable to non-oblivious algorithms, which is remarkable considering that our oblivious algorithm uses much less information to schedule tasks.
TL;DR: The design of a new energy-efficient Cloud infrastructure called Green Open Cloud is presented and the impact of virtual machine aggregation in terms of energy consumption is studied.
Abstract: Virtualization solutions appear as alternative approaches for companies to consolidate their operational services on a physical infrastructure, while preserving specific functionalities inside the Cloud perimeter (e.g., security, fault tolerance, reliability). These consolidation approaches are explored to propose some energy reduction while switching OFF unused computing nodes. We study the impact of virtual machine aggregation in terms of energy consumption. Some load-balancing strategies associated with the migration of virtual machines inside the Cloud infrastructures are shown. We present the design of a new energy-efficient Cloud infrastructure called Green Open Cloud.
TL;DR: A new way of thinking about how computing artifacts can assist us in living is suggested, drawing on German philosopher Martin Heidegger's analysis of the need for equipment to be physically and cognitively available.
Abstract: This article explores a new way of thinking about how computing artifacts can assist us in living. The field of ubiquitous computing was inspired by Mark Weiser's vision of computing artifacts that disappear. “They weave themselves into the fabric of everyday life until they are indistinguishable from it.” Although Weiser cautioned that achieving the vision of ubiquitous computing would require a new way of thinking about computers, that takes into account the natural human environment, to date no one has articulated this new way of thinking. Here, we address this gap, making the argument that ubiquitous computing artifacts need to be physically and cognitively available. We show what this means in practice, translating our conceptual findings into principles for design. Examples and a specific application scenario show how ubiquitous computing that depends on these principles is both physically and cognitively available, seamlessly supporting living.
TL;DR: This paper presents a prototyping system with distinct flexibility and scalability, RAPTOR-X64, that integrates all key components to realize circuit and system designs with a complexity of up to 200 million transistors and can be easily scaled from the emulation of small embedded systems to the emulation of large MPSoCs with hundreds of processors.
Abstract: A number of FPGA-based rapid prototyping systems for ASIC emulation and hardware acceleration have been developed in recent years. In this paper we present a prototyping system with distinct flexibility and scalability. The designs will be described from an architectural view and measurements of the communication infrastructure will be presented. Additionally, the properties of the system will be shown using examples that can be scaled from a single-FPGA implementation to a multi-FPGA, cluster-based implementation.
Introduction
In the process of developing microelectronic systems, a fast and reliable methodology for the realization of new architectural concepts is of vital importance. Prototypical implementations help to convert new ideas into products quickly and efficiently. Furthermore, they allow for the development of hardware and software for a given application in parallel, thus shortening time to market. FPGA-based hardware emulation can be used for functional verification of new MPSoC architectures as well as for HW/SW co-verification and for design-space exploration [1,2,3]. The rapid prototyping systems of the RAPTOR family that have been developed in the System and Circuit Technology group in Paderborn during the last ten years provide the user with a complete hardware and software infrastructure for ASIC and MPSoC prototyping. A distinctive feature of the RAPTOR systems is that the platform can be easily scaled from the emulation of small embedded systems to the emulation of large MPSoCs with hundreds of processors.
1. RAPTOR-X64 – A Platform for Rapid Prototyping of Embedded Systems
The rapid prototyping system RAPTOR-X64, successor of RAPTOR2000 [4], integrates all key components to realize circuit and system designs with a complexity of up to 200 million transistors. 
Along with rapid prototyping, the system can be used to accelerate computationally intensive applications and to perform partial dynamic reconfiguration of Xilinx FPGAs. RAPTOR-X64 is designed as a modular rapid-prototyping system: the base system offers communication and management facilities, which are used by a variety of extension modules, realizing application-specific functionality. For hardware emulation, FPGA modules equipped with the latest Xilinx FPGAs and dedicated memory are used. Prototyping of complete SoCs is enabled by various additional modules providing, e.g., communication interfaces (Ethernet, USB, FireWire, etc.) as well as analog and digital I/Os. The local bus and the broadcast bus, both embedded in the baseboard architecture, add up to a powerful communication infrastructure that guarantees high-speed communication with the host system and between individual modules, as depicted in figure 1. Furthermore, direct links between neighboring modules can be used to exchange data with a bandwidth of more than 20 GBit/s. For communication with the host system, either a PCI-X interface or an integrated USB-2.0 interface can be used. Both interfaces are directly connected to the local bus, thus enabling either closely coupled, high-speed, PCI-X-based communication or loosely coupled, USB-based communication. As configuration and application data can either be supplied directly from the host system or stored on a compact flash card, standalone operation is also supported. Therefore, the system is especially suitable for in-field evaluation and test of embedded applications. 
Footnote 1: This work was partly supported by the Collaborative Research Center 614 – Self-Optimizing Concepts and Structures in Mechanical Engineering – University of Paderborn.
In addition to these features, RAPTOR-X64 offers several diagnostic functions: besides monitoring of the digital system environment (e.g., status of the communication system), relevant environmental information such as voltages and temperatures is recorded. All system clocks are finely adjustable over the whole working range, so that hardware applications can run at their ideal speed. The latest FPGA module currently available for RAPTOR-X64 (called DB-V4) hosts a Xilinx Virtex-4 FX100 FPGA and 4 GByte DDR2 RAM (see figure 1). The FPGA includes two embedded PowerPC processors and 20 serial high-speed transceivers, each capable of transferring 6.5 GBit/s in full duplex. Utilizing these transceivers, four copper-based data links with a throughput of up to 32.5 GBit/s each are realized on the DB-V4 module. By adapting the cabling between the modules, the communication topology can be changed without affecting the communication via the RAPTOR base system. Serial data transmission at data rates of up to 6.5 GBit/s necessitates techniques to maintain signal integrity between the FPGAs. Utilizing all integrated signal integrity features of the FPGA and providing a sophisticated PCB environment
[Figure: block diagram of the RAPTOR-X64 base system. Recoverable labels: PCI-X bus; PCI bus bridge (master, slave, DMA); local bus (32-bit data / 32-bit address); dual-port SRAM; control and configuration logic (arbiter, MMU, diagnostics, CLK, configuration); per-module SelectMAP, CFG-JTAG, CTRL, and SMB connections.]
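As a quick consistency check on the DB-V4 figures quoted above, the aggregate transceiver bandwidth can be divided across the four data links. This is only a sketch: the even split of five transceivers per link is an assumption inferred from the quoted 32.5 GBit/s and is not stated in the text.

```python
# Figures quoted in the text for the DB-V4 module.
TRANSCEIVERS = 20    # serial high-speed transceivers on the FPGA
RATE_GBIT_S = 6.5    # per transceiver, full duplex
LINKS = 4            # copper-based data links on the module


def link_throughput(transceivers: int, rate: float, links: int) -> float:
    """Per-link throughput, assuming an even split of transceivers (assumption)."""
    if transceivers % links != 0:
        raise ValueError("transceivers do not divide evenly across links")
    return (transceivers // links) * rate


per_link = link_throughput(TRANSCEIVERS, RATE_GBIT_S, LINKS)
aggregate = TRANSCEIVERS * RATE_GBIT_S
print(f"per link: {per_link} GBit/s, aggregate: {aggregate} GBit/s")
```

Under that assumption the per-link value matches the 32.5 GBit/s stated for each of the four links.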
TL;DR: An algorithm mainly consisting of a part of Divide and Conquer and the twisted factorization is proposed for bidiagonal SVD, which is highly parallelizable when singular values are isolated and needs only O(n) working memory for every type of matrix.
Abstract: An algorithm mainly consisting of a part of Divide and Conquer and the twisted factorization is proposed for bidiagonal SVD. The algorithm costs O(n^2) flops and is highly parallelizable when singular values are isolated. If strong clusters exist, the singular vector computation needs reorthogonalization. In such a case, the cost of the algorithm increases to O(n^2 + nk^2) flops and the parallelism may worsen depending on the distribution of singular values. Here k is the size of the largest cluster. The algorithm needs only O(n) working memory for every type of matrix.
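To make the role of k concrete, the sketch below groups sorted singular values into clusters and evaluates the leading-order cost model n^2 + nk^2. The relative-gap criterion and the tolerance `tol` are illustrative assumptions; the abstract does not define when singular values form a strong cluster, and constant factors are omitted.

```python
def largest_cluster_size(sigmas, tol=1e-8):
    """Size of the largest run of consecutive singular values whose
    relative gap is at most tol (hypothetical clustering criterion)."""
    s = sorted(sigmas, reverse=True)
    if not s:
        return 0
    best = run = 1
    for prev, cur in zip(s, s[1:]):
        if prev - cur <= tol * max(abs(prev), abs(cur)):
            run += 1          # still inside the same cluster
        else:
            run = 1           # gap is large enough: start a new cluster
        best = max(best, run)
    return best


def cost_model(n, k):
    """Leading-order flop estimate n^2 + n*k^2 (constants omitted)."""
    return n * n + n * k * k


sigmas = [4.0, 3.0, 3.0 + 1e-12, 3.0 - 1e-12, 1.0]
k = largest_cluster_size(sigmas)
print(k, cost_model(len(sigmas), k))
```

With isolated singular values k is 1 and the estimate stays quadratic in n; the extra term only matters once k grows to the order of the square root of n or beyond.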
TL;DR: A parallel algorithm for parameter tuning of parallel applications is presented and evaluated on three benchmark codes; compared to the Nelder-Mead algorithm, it finds better configurations up to seven times faster.
Abstract: In this paper, we present and evaluate a parallel algorithm for parameter tuning of parallel applications. We discuss the impact of performance variability on the accuracy and efficiency of the optimization algorithm and propose a strategy to minimize the impact of this variability. We evaluate our algorithm within the Active Harmony system, an automated online/offline tuning framework. We study its performance on three benchmark codes: PSTSWM, HPL and POP. Compared to the Nelder-Mead algorithm, our algorithm finds better configurations up to seven times faster. For POP, we were able to improve the performance of a production sized run by 59%.
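The abstract does not say which strategy is used to limit the effect of performance variability. Purely as an illustration, one common option is to run each candidate configuration several times and compare the minimum observed time, since interference from other activity tends to slow runs down rather than speed them up. Everything in the sketch below (the synthetic `noisy_runtime` function, the noise model, the parameter grid) is hypothetical and is not taken from Active Harmony or the paper.

```python
import random


def noisy_runtime(config, rng):
    """Hypothetical application: a true cost plus a non-negative noise spike."""
    x, y = config
    true_cost = 10.0 * ((x - 3) ** 2 + (y - 5) ** 2) + 10.0
    spike = rng.expovariate(1.0)   # interference only ever slows a run down
    return true_cost + spike


def measure(config, rng, samples=5):
    """Estimate a configuration's cost as the minimum over repeated runs."""
    return min(noisy_runtime(config, rng) for _ in range(samples))


def best_of(candidates, rng, samples=5):
    """Return the candidate with the smallest min-of-samples estimate."""
    return min(candidates, key=lambda c: measure(c, rng, samples))


rng = random.Random(0)
candidates = [(x, y) for x in range(7) for y in range(9)]
print(best_of(candidates, rng, samples=10))
```

Taking the minimum trades extra measurements for a more stable ranking of configurations; whether that trade is worthwhile depends on how heavy-tailed the variability is on the target system.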
TL;DR: The MADbench2 benchmark is derived directly from a large-scale cosmic microwave background (CMB) data analysis package and is used to evaluate the I/O performance of modern parallel file systems.
Abstract: With the exponential growth of high-fidelity sensor and simulated data, the scientific community is increasingly reliant on ultrascale HPC resources to handle its data analysis requirements. However, to use such extreme computing power effectively, the I/O components must be designed in a balanced fashion, as any architectural bottleneck will quickly render the platform intolerably inefficient. To understand I/O performance of data-intensive applications in realistic computational settings, we develop a lightweight, portable benchmark called MADbench2, which is derived directly from a large-scale cosmic microwave background (CMB) data analysis package. Our study represents one of the most comprehensive I/O analyses of modern parallel file systems, examining a broad range of system architectures and configurations, including Lustre on the Cray XT3, XT4, and Intel Itanium2 clusters; GPFS on IBM Power5 and AMD Opteron platforms; a BlueGene/P installation using GPFS and PVFS2 file systems; and CXFS on the SGI Altix3700. We present extensive synchronous I/O performance data comparing a number of key parameters including concurrency, POSIX- versus MPI-IO, and unique- versus shared-file accesses, using both the default environment and highly tuned I/O parameters. Finally, we explore the potential of asynchronous I/O and show that only two of the nine evaluated systems benefited from MPI-2's asynchronous MPI-IO. On those systems, experimental results indicate that the computational intensity required to hide I/O effectively is already close to the practical limit of BLAS3 calculations. Overall, our study quantifies vast differences in performance and functionality of parallel file systems across state-of-the-art platforms (showing I/O rates that vary by up to 75x on the examined architectures) while providing system designers and computational scientists a lightweight tool for conducting further analysis.
TL;DR: Three approaches that have been suggested for managing de-escalation are briefly examined, and a de-escalation management maturity (DMM) model is introduced that provides a useful framework for improving practice.
Abstract: Introduction Taming runaway Information Technology (IT) projects is a challenge that most organizations have faced and that managers continue to wrestle with. These are projects that grossly exceed their planned budgets and schedules, often by a factor of two to three or more. Many end in failure, not only in the sense of budget or schedule, but in terms of delivered functionality as well. Runaway projects are frequently the result of escalating commitment to a failing course of action, a phenomenon that occurs when investments fail to work out as envisioned and decision-makers compound the problem by persisting irrationally. Keil, Mann, and Rai reported that 30-40% of IT projects exhibit some degree of escalation. To break the escalation cycle, de-escalation of commitment to the failing course of action must occur so that valuable resources can be channeled into more productive use. But making de-escalation happen is neither easy nor intuitive. This article briefly examines three approaches that have been suggested for managing de-escalation. By combining elements from the three approaches, we introduce a de-escalation management maturity (DMM) model that provides a useful framework for improving practice.
TL;DR: This paper introduces a methodology combining efficient thread scheduling and careful data placement to overcome the limitations arising from both the parallel algorithm and the memory hierarchy, and uses the MAi (Memory Affinity interface) to smoothly adapt the memory policy to the underlying architecture.
Abstract: Simulation of large-scale seismic wave propagation is an important tool in seismology for efficient strong motion analysis and risk mitigation. Being particularly CPU-consuming, this three-dimensional problem makes use of parallel computing to improve the performance and the accuracy of the simulations. The trend in parallel computing is to increase the number of cores available at the shared-memory level, with possibly non-uniform cost of memory accesses. We therefore need to consider new approaches more suitable to such parallel systems. In this paper, we first report on the impact of memory affinity on the parallel performance of seismic simulations. We introduce a methodology combining efficient thread scheduling and careful data placement to overcome the limitations arising from both the parallel algorithm and the memory hierarchy. The MAi (Memory Affinity interface) is used to smoothly adapt the memory policy to the underlying architecture. We evaluate our methodology on computing nodes with different NUMA characteristics. A maximum gain of 53% is reported in comparison with a classical OpenMP implementation.
TL;DR: This paper designs an efficient parallel implementation of Fast Fourier Transform (FFT) to fully exploit the architectural features of the Cell/B.E. chip and introduces a novel data decomposition scheme to achieve highly efficient DMA data transfer and vectorization with low programming complexity.
Abstract: Discrete transforms are of primary importance and fundamental kernels in many computationally intensive scientific applications. In this paper, we investigate the performance of two such algorithms, Fast Fourier Transform (FFT) and Discrete Wavelet Transform (DWT), on the Sony-Toshiba-IBM Cell Broadband Engine (Cell/B.E.), a heterogeneous multicore chip architected for intensive gaming applications and high performance computing. We design an efficient parallel implementation of FFT to fully exploit the architectural features of the Cell/B.E. Our FFT algorithm uses an iterative out-of-place approach and, for 1K to 16K complex input samples, outperforms all other parallel implementations of FFT on the Cell/B.E., including FFTW. Our FFT implementation obtains a single-precision performance of 18.6 GFLOP/s on the Cell/B.E., outperforming Intel Duo Core (Woodcrest) for inputs of greater than 2K samples. We also optimize DWT in the context of JPEG2000 for the Cell/B.E. DWT has abundant parallelism; however, due to the low temporal locality of the algorithm, memory bandwidth becomes a significant bottleneck in achieving high performance. We introduce a novel data decomposition scheme to achieve highly efficient DMA data transfer and vectorization with low programming complexity. We also merge the multiple steps in the algorithm to reduce the bandwidth requirement. This leads to a significant enhancement in the scalability of the implementation. Our optimized implementation of DWT demonstrates speedups of 34 and 56 times over the baseline code using one Cell/B.E. chip for the lossless and lossy transforms, respectively. We also provide a performance comparison with the AMD Barcelona (Quad-core Opteron) processor, which the Cell/B.E. outperforms. This highlights the advantage of the Cell/B.E. over general-purpose multicore processors in processing regular and bandwidth-intensive scientific applications.
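The abstract describes the FFT only as iterative and out-of-place. One textbook algorithm of that kind is the radix-2 Stockham autosort FFT, sketched below in plain Python. It is not claimed to be the authors' Cell/B.E. implementation, and it leaves out everything that makes that implementation fast (SPE parallelization, DMA transfers, vectorization). The function names are illustrative.

```python
import cmath


def fft_stockham(x):
    """Iterative out-of-place radix-2 FFT (Stockham autosort form).

    Each stage reads from one buffer and writes to the other, so no
    bit-reversal pass is needed. len(x) must be a power of two.
    """
    n_total = len(x)
    if n_total == 0 or n_total & (n_total - 1):
        raise ValueError("length must be a power of two")
    src = [complex(v) for v in x]
    dst = [0j] * n_total
    n, stride = n_total, 1
    while n > 1:
        half = n // 2
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / n)   # twiddle factor
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src            # ping-pong the two buffers
        n, stride = half, stride * 2
    return src


def dft_naive(x):
    """O(n^2) reference DFT, used only to check the fast version."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


samples = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, 2.5, 7.0]
fast = fft_stockham(samples)
slow = dft_naive(samples)
print(max(abs(a - b) for a, b in zip(fast, slow)))
```

The two-buffer structure is what makes the approach out-of-place: every stage streams through memory in a regular pattern, which is the kind of access pattern that suits bulk data transfers.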
TL;DR: GPU acceleration and other computer performance increases will offer critical benefits to biomedical science and improve the quality of biomedical science research.
Abstract: GPU acceleration and other computer performance increases will offer critical benefits to biomedical science.
TL;DR: This research highlights the importance of customer service for consumer technology companies in retaining their customers by examining how the complaint management process can impact customers' intention to continue or discontinue using a given technology.
Abstract: Introduction In 2003, Dell computers shifted support calls for two of its corporate computer lines from its call center in Bangalore, India back to the U.S. The reason was that its customers were not satisfied with the level of technical support they were getting. Apart from the language difficulties, customers also faced difficulty in reaching senior technicians to, perhaps, resolve their problems more quickly. However, such problems are not just limited to computer vendors such as Dell. Recent research from Accenture finds that 75% of the sample of consumer technology company executives believed their companies provided average customer service. However, to their surprise, 58% of their customers had rated customer service to be either average or below average. A further grim detail was that 81% of the respondents who rated customer service as below average expressed intent to purchase from a different vendor next time. This research highlights the importance of customer service for consumer technology companies in retaining their customers. In general, consumer technology companies spend inordinate amounts of time, cost, and effort to get their innovations to market. However, initial acceptance is only the first step towards technology utilization. It is only after a certain amount of use that customers become aware of a technology's benefits and limitations. Having technology is one thing, using it effectively and persisting with it, is quite another. Hence, the study of factors leading to consumer technology repurchase is of critical importance. Consumer technologies, in particular, demand attention due to their commoditization, increased complexity, advances in technology, and focus on high serviceability. We can note the following when we think of consumer technologies such as PCs, laptops, or mobile phones: • The marketplace for these technologies is characterized by fierce competition amongst numerous players leading to a continuous price decline. 
For instance, almost all computer vendors now offer laptops for a few hundred dollars, compared to thousands of dollars a few years back. As prices continue to decline, it is imperative that companies focus on providing high-level customer service to differentiate themselves from competitors, retain their existing customers, and prevent them from discontinuing the product. • Consumer technologies have also become more complex, with more functionality constantly being added to the core product. Take the case of mobile phones: what was once a simple device for making phone calls has morphed to include a digital camera, mp3 player, organizer, and Web browser, to name a few. With such additional functionality and increased complexity, a customer is likely to encounter problems whose cause is difficult to identify correctly, yet which need to be resolved quickly before the customer switches to a competing product. • Technological advances and a new generation of products have meant that both technology providers and customers have to be knowledgeable in using consumer technologies. Without proper knowledge of the technology, support staff often struggle to resolve problems in a timely manner. For example, in resolving problems with a new release of an operating system such as Windows Vista, both customers and Microsoft technical staff are required to have a certain amount of knowledge about the system. A crucial aspect of customer service is being able to resolve consumer concerns during their use of technology. These factors contribute to difficulties in retaining customers for consumer technology companies. One of the ways to have satisfied customers is to continue addressing customer complaints effectively. Customers expect to have any service or product failures diagnosed and resolved quickly. 
In this context, we chose to examine how the complaint management process can impact customers' intention to continue or discontinue using a given technology. The complaint management process is not just a customer service issue and is not limited to customer service personnel. It also has to do with the overall policies governing the customer service function. As Dell discovered, its policy of not limiting the time junior technical support personnel spent in resolving customer complaints (instead of referring them to senior personnel) had an impact on customer satisfaction. Moreover, encouraging customer participation and feedback while addressing their concerns can lead to innovative practices within the company. For instance, Cingular involves its customers in its usability lab and leverages its interactions with them to design better mobile phone services. Hence, apart from improving customer satisfaction, a good complaint management process can also help leverage customers' input to design better offerings. In our research we open the black box of the consumer technology complaint management process to learn how it affects customer satisfaction and intent to continue or discontinue a given product or service.
TL;DR: The ability to measure the impact of an IT implementation on organizational performance depends on the extent to which the impact both ripples through to the firm's bottom line and can be accurately differentiated from a host of confounding business factors.
Abstract: Introduction It's been over 10 years since corporate America embraced ERP systems, but hard evidence on the financial benefits that ERP systems have provided has been elusive. This debate has spilled into the mainstream media, as America's two largest ERP vendors regularly advertise that their customers have benefited financially by using their products; some of these ads even cite studies that discredit the other's claims of financial superiority. Investments in IT are typically justified by the productivity and profitability improvements that follow their implementation. It seems intuitive that IT will help streamline existing business processes, which should lead to a more efficient and ultimately more profitable company. Organizations of all types and sizes have invested heavily in IT based on this simple rationale, but the associated financial benefits have been difficult to nail down. While managers struggled to value their firms' IT investments, researchers tried to better understand the factors that made it so difficult to value corporate investments in IT. Among the factors that have been suggested, three may be especially useful. They are: • Firm-specific resources and capabilities that meaningfully impact the success of IT implementations; • External forces that exert themselves on the firm; and • The nature of the financial indices used to value IT investments. Much of the early research on the value of IT investments was based on industry-level data that masked the effects of important firm-specific resources and capabilities. When researchers finally examined firm-level data, they realized that differences in, for example, IT expertise and management, the quality of a firm's leadership and other human resources, and the uniqueness of its operations affected the success of IT implementations. Companies with firm-specific advantages generally out-produced their competitors. 
The success of IT implementations also depends on the unique set of unwieldy external forces that exert themselves on the firm. For example, Melville et al. suggest a host of external forces that may affect the impact of IT implementations on organizational performance, including the degree of competition within an industry, the impact of the firm's trading partners, and country characteristics. Unfortunately, studies of the relationship between IT and organizational performance are plagued by disagreements about the external constructs that should be examined, how these constructs are operationalized, and the nature of their interrelationships. Concerns have also been expressed about the financial indices used to measure the effects of IT implementations on corporate performance. Foremost among these indices are measures of corporate productivity and profitability. Productivity is associated with how efficiently a firm manages its business processes to produce a dollar of sales. For example, employee productivity is often calculated as the dollar level of sales generated per dollar paid to employees (net sales/employee cost). Firms usually have substantial control over their business processes, and this makes them potentially easier to measure and value financially. For example, firms often use proprietary processes to manage their inventories more efficiently in order to generate a higher level of sales. Therefore, how efficiently a firm manages its inventory can be measured using inventory turnover (net sales/inventory). On the other hand, profitability is an organizational performance measure that can be affected by factors unrelated to the IT investment. 
These factors include, for example, the number and quality of the firm's competitors and trading partners, intra-firm shifts in spending, and macro-economic changes in interest rates, exchange rates and inflation; many pundits would argue that even the formal recognition as a success by their trading partner should provide a competitive advantage that eventually shows up in measures of organizational performance. However, the ability to measure the impact of an IT implementation on organizational performance depends on the extent to which the impact both ripples through to the firm's bottom line and can be accurately differentiated from a host of confounding business factors. Thus, IT implementations that enhance a firm's ability to better manage business processes are more likely to have an impact that can be both accurately measured and credibly valued. Although IT implementations should also have an impact on organizational performance, accurately measuring their impact has proven to be difficult. The very mixed results regarding the post-adoption profitability of firms implementing ERPs may be an example of this. We suspect that the impact of ERP and other IT implementations on organizational performance will be easy to value financially if we can ever measure it accurately (that is, absent the distortions created primarily by hard-to-control business factors). Based on what we've described above, we would expect to see significant improvements in both productivity and profitability after the SAP implementation period for SAP successes themselves (SAPPROD and SAPPROF). And we would expect SAP successes to improve versus their competitors (DIFPROD and DIFPROF). However, as also noted above, we believe that the inability to accurately measure the component(s) of profitability directly related to the ERP implementation is likely to compromise the profitability results.
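The two productivity ratios defined above (net sales/employee cost and net sales/inventory) can be worked through with a small example. The dollar figures below are invented for illustration only and do not come from the study; the function names are likewise hypothetical.

```python
def employee_productivity(net_sales: float, employee_cost: float) -> float:
    """Sales generated per dollar paid to employees (net sales / employee cost)."""
    return net_sales / employee_cost


def inventory_turnover(net_sales: float, inventory: float) -> float:
    """Net sales divided by inventory."""
    return net_sales / inventory


# Hypothetical figures in millions of dollars, before and after an implementation.
before = {"net_sales": 500.0, "employee_cost": 100.0, "inventory": 50.0}
after = {"net_sales": 600.0, "employee_cost": 100.0, "inventory": 40.0}

for label, year in (("before", before), ("after", after)):
    prod = employee_productivity(year["net_sales"], year["employee_cost"])
    turn = inventory_turnover(year["net_sales"], year["inventory"])
    print(f"{label}: employee productivity {prod:.1f}, inventory turnover {turn:.1f}")
```

Both ratios depend only on quantities the firm largely controls, which is the reason the text gives for expecting productivity effects to be easier to isolate than profitability effects.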