Scispace (Formerly Typeset)
Showing papers on "Central processing unit" published in 2004
Journal Article•10.1145/1015706.1015800•
Brook for GPUs: stream computing on graphics hardware

Ian Buck1, Tim Foley1, Daniel Reiter Horn1, Jeremy Sugerman1, Kayvon Fatahalian1, Mike Houston1, Pat Hanrahan1 •
Stanford University1
1 Aug 2004
TL;DR: This paper presents Brook for GPUs, a system for general-purpose computation on programmable graphics hardware that abstracts and virtualizes many aspects of graphics hardware, and presents an analysis of the effectiveness of the GPU as a compute engine compared to the CPU.
Abstract: In this paper, we present Brook for GPUs, a system for general-purpose computation on programmable graphics hardware. Brook extends C to include simple data-parallel constructs, enabling the use of the GPU as a streaming co-processor. We present a compiler and runtime system that abstracts and virtualizes many aspects of graphics hardware. In addition, we present an analysis of the effectiveness of the GPU as a compute engine compared to the CPU, to determine when the GPU can outperform the CPU for a particular algorithm. We evaluate our system with five applications, the SAXPY and SGEMV BLAS operators, image segmentation, FFT, and ray tracing. For these applications, we demonstrate that our Brook implementations perform comparably to hand-written GPU code and up to seven times faster than their CPU counterparts.

1,389 citations
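
To make the data-parallel idea in the Brook abstract concrete, here is a minimal Python sketch of the SAXPY operator (y = a*x + y) written as a per-element kernel mapped over two input streams. This is not Brook syntax; the names `saxpy_kernel` and `saxpy` are hypothetical and only illustrate why the operation suits a streaming co-processor.

```python
def saxpy_kernel(a, x_i, y_i):
    """Per-element kernel: one output element from one element of each stream."""
    return a * x_i + y_i


def saxpy(a, x, y):
    """Apply the kernel to every pair of elements of the streams x and y."""
    if len(x) != len(y):
        raise ValueError("input streams must have the same length")
    # No output element depends on any other, so on data-parallel hardware
    # all kernel invocations could run concurrently.
    return [saxpy_kernel(a, x_i, y_i) for x_i, y_i in zip(x, y)]


result = saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```

Each input element is read exactly once, which is the "little reuse of input data" pattern that streaming hardware handles well.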

Proceedings Article•10.1145/1058129.1058148•
Understanding the efficiency of GPU algorithms for matrix-matrix multiplication

Kayvon Fatahalian1, Jeremy Sugerman1, Pat Hanrahan1•
Stanford University1
29 Aug 2004
TL;DR: An in-depth analysis of dense matrix-matrix multiplication, which reuses each element of input matrices O(n) times, finds even near-optimal GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches.
Abstract: Utilizing graphics hardware for general purpose numerical computations has become a topic of considerable interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse of input data, has been widely explored on GPUs. We relax the streaming model's constraint on input reuse and perform an in-depth analysis of dense matrix-matrix multiplication, which reuses each element of input matrices O(n) times. Its regular data access pattern and highly parallel computational requirements suggest matrix-matrix multiplication as an obvious candidate for efficient evaluation on GPUs but, surprisingly, we find even near-optimal GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches. We find the key cause of this inefficiency is that the GPU can fetch less data and yet execute more arithmetic operations per clock than the CPU when both are operating out of their closest caches. The lack of high bandwidth access to cached data will impair the performance of GPU implementations of any computation featuring significant input reuse.

379 citations
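
The O(n) input reuse mentioned in the abstract above can be checked with a small Python sketch. It runs the textbook triple-loop multiplication (not the GPU implementations the paper analyzes) and counts how often each input element is read; the function name is hypothetical.

```python
def matmul_with_read_counts(a, b):
    """Multiply two n x n matrices and count the reads of every input element."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    reads_a = [[0] * n for _ in range(n)]
    reads_b = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
                reads_a[i][k] += 1  # a[i][k] is needed once for every column j
                reads_b[k][j] += 1  # b[k][j] is needed once for every row i
    return c, reads_a, reads_b


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
m = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
product, reads_m, reads_identity = matmul_with_read_counts(m, identity)
```

Every entry of both read-count matrices equals n, so performance depends on how cheaply those repeated reads are served from cache.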

Patent•
Split embedded DRAM processor

Eric M. Dowling
2 Jul 2004
TL;DR: A processing architecture splits a single processor and instruction set between a CPU core and logic embedded in DRAM, so that memory-intensive tasks run on the DRAM chip and standard software can be accelerated either with or without the express knowledge of the processor.
Abstract: A processing architecture includes a first CPU core portion coupled to a second embedded dynamic random access memory (DRAM) portion. These architectural components jointly implement a single processor and instruction set. Advantageously, the embedded logic on the DRAM chip implements the memory intensive processing tasks, thus reducing the amount of traffic that needs to be bussed back and forth between the CPU core and the embedded DRAM chips. The embedded DRAM logic monitors and manipulates the instruction stream into the CPU core. The architecture of the instruction set, data paths, addressing, control, caching, and interfaces are developed to allow the system to operate using a standard programming model. Specialized video and graphics processing systems are developed. Also, an extended very long instruction word (VLIW) architecture implemented as a primary VLIW processor coupled to an embedded DRAM VLIW extension processor efficiently deals with memory intensive tasks. In different embodiments, standard software can be accelerated either with or without the express knowledge of the processor.

307 citations

Proceedings Article•10.5555/968878.969044•
Fine-grained dynamic voltage and frequency scaling for precise energy and performance trade-off based on the ratio of off-chip access to on-chip computation times

Kihwan Choi1, Ramakrishna Soma1, Massoud Pedram1•
University of Southern California1
16 Feb 2004
TL;DR: The proposed DVFS technique relies on dynamically-constructed regression models that allow the CPU to calculate the expected workload and slack time for the next time slot, and thus, adjust its voltage and frequency in order to save energy while meeting soft timing constraints.
Abstract: This paper presents an intra-process dynamic voltage and frequency scaling (DVFS) technique targeted toward non real-time applications running on an embedded system platform. The key idea is to make use of runtime information about the external memory access statistics in order to perform CPU voltage and frequency scaling with the goal of minimizing the energy consumption while translucently controlling the performance penalty. The proposed DVFS technique relies on dynamically-constructed regression models that allow the CPU to calculate the expected workload and slack time for the next time slot, and thus, adjust its voltage and frequency in order to save energy while meeting soft timing constraints. This is in turn achieved by estimating and exploiting the ratio of the total off-chip access time to the total on-chip computation time. The proposed technique has been implemented on an XScale-based embedded system platform and actual energy savings have been calculated by current measurements in hardware. For memory-bound programs, a CPU energy saving of more than 70% with a performance degradation of 12% was achieved. For CPU-bound programs, 15-60% CPU energy saving was achieved at the cost of 5-20% performance penalty.

261 citations

Proceedings Article•10.1145/1013235.1013282•
Dynamic voltage and frequency scaling based on workload decomposition

Kihwan Choi1, Ramakrishna Soma1, Massoud Pedram1•
University of Southern California1
9 Aug 2004
TL;DR: In this article, the CPU workload is decomposed in two parts: on-chip and off-chip, and the workload decomposition itself is performed at run time based on statistics reported by a performance monitoring unit (PMU) without a need for application profiling or compiler support.
Abstract: This paper presents a technique called "workload decomposition" in which the CPU workload is decomposed in two parts: on-chip and off-chip. The on-chip workload signifies the CPU clock cycles that are required to execute instructions in the CPU whereas the off-chip workload captures the number of external memory access clock cycles that are required to perform external memory transactions. When combined with a dynamic voltage and frequency scaling (DVFS) technique to minimize the energy consumption, this workload decomposition method results in higher energy savings. The workload decomposition itself is performed at run time based on statistics reported by a performance monitoring unit (PMU) without a need for application profiling or compiler support. We have implemented the proposed DVFS with workload decomposition technique on the BitsyX platform, an Intel PXA255-based platform manufactured by ADS Inc., and performed detailed energy measurements. These measurements show that, for a number of widely used software applications, a CPU energy saving of 80% can be achieved for memory-bound programs while satisfying the user-specified timing constraints.

232 citations
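
The workload decomposition idea in the abstract above can be sketched as a simple timing model. The sketch assumes, as a simplification not stated in the abstract, that off-chip access time does not change with CPU frequency while on-chip time scales as cycles divided by frequency. The frequency levels and function names are hypothetical, not those of the BitsyX platform.

```python
def execution_time(on_chip_cycles, off_chip_seconds, freq_hz):
    """Assumed model: only the on-chip part of the workload scales with frequency."""
    return on_chip_cycles / freq_hz + off_chip_seconds


def lowest_feasible_frequency(on_chip_cycles, off_chip_seconds,
                              deadline_seconds, freq_levels_hz):
    """Pick the slowest frequency level that still meets the deadline."""
    for freq in sorted(freq_levels_hz):
        if execution_time(on_chip_cycles, off_chip_seconds, freq) <= deadline_seconds:
            return freq
    # No level meets the deadline: fall back to the fastest one.
    return max(freq_levels_hz)


LEVELS_HZ = [100e6, 200e6, 300e6, 400e6]  # hypothetical discrete levels

# Memory-bound task: few on-chip cycles, long off-chip time.
memory_bound = lowest_feasible_frequency(20e6, 0.08, 0.3, LEVELS_HZ)
# CPU-bound task: many on-chip cycles, short off-chip time.
cpu_bound = lowest_feasible_frequency(100e6, 0.01, 0.3, LEVELS_HZ)
```

Under this model the memory-bound task can run at the lowest level while the CPU-bound task needs the highest, which is consistent with the larger savings both DVFS abstracts report for memory-bound programs.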

Patent•
System and method for parallel execution of data generation tasks

Jeffrey A. Andrews1, Nicholas R. Baker1, J. Andrew Goossen1, Michael Abrash1•
Microsoft1
30 Dec 2004
TL;DR: In this article, a CPU module includes a host element configured to perform a high-level host-related task, and one or more data-generating processing elements configured to process the input data to produce output data.
Abstract: A CPU module includes a host element configured to perform a high-level host-related task, and one or more data-generating processing elements configured to perform a data-generating task associated with the high-level host-related task. Each data-generating processing element includes logic configured to receive input data, and logic configured to process the input data to produce output data. The amount of output data is greater than an amount of input data, and the ratio of the amount of input data to the amount of output data defines a decompression ratio. In one implementation, the high-level host-related task performed by the host element pertains to a high-level graphics processing task, and the data-generating task pertains to the generation of geometry data (such as triangle vertices) for use within the high-level graphics processing task. The CPU module can transfer the output data to a GPU module via at least one locked set of a cache memory. The GPU retrieves the output data from the locked set, and periodically forwards a tail pointer to a cacheable location within the data-generating elements that informs the data-generating elements of its progress in retrieving the output data.

185 citations

Patent•
Information processing apparatus and method, program, and recording medium

Masahiko Sato1, Kazuhiro Watanabe, Kozo Obayashi•
Sony Broadcast & Professional Research Laboratories1
25 Mar 2004
TL;DR: To update software functions more securely, files downloaded from a server are stored in one bank of a storage unit and copied into a backup bank, per-file and total checksums are compared with those of the server, and any file that has an error is downloaded again. The invention is applicable to personal computers.
Abstract: The present invention is intended to download software more securely to update the functions of software. When files are downloaded from a server into an information processing apparatus, a CPU downloads the files into bank A or bank B of a storage unit and copies the files into bank C. The CPU calculates a checksum of each file and a total checksum of all files and stores the obtained checksums into a memory. The checksums stored in the memory are compared with those of the server. Any file that has an error is downloaded again. In addition, at the end of the operation of the information processing apparatus, a normal end flag is stored in the memory. If, at the next startup of the information processing apparatus, its last operation is found to have been abnormally ended, the apparatus is started up by use of bank C. The present invention is applicable to personal computers.

132 citations
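
The checksum step described in the abstract above can be sketched as follows. The patent abstract does not name a checksum algorithm, so CRC-32 from Python's `zlib` module is an assumption, and the function and file names are hypothetical.

```python
import zlib


def file_checksums(files):
    """Return per-file CRC-32 values and one running checksum over all files.

    `files` maps a file name to its contents as bytes.
    """
    per_file = {name: zlib.crc32(data) for name, data in files.items()}
    total = 0
    for name in sorted(files):  # fixed order so the total is reproducible
        total = zlib.crc32(files[name], total)
    return per_file, total


def files_to_redownload(local_files, server_checksums):
    """List files whose local checksum is missing or differs from the server's."""
    local_checksums, _ = file_checksums(local_files)
    return sorted(name for name, expected in server_checksums.items()
                  if local_checksums.get(name) != expected)


server_files = {"app.bin": b"hello", "data.bin": b"world"}
server_sums, server_total = file_checksums(server_files)
downloaded = {"app.bin": b"hello", "data.bin": b"w0rld"}  # one corrupted file
retry = files_to_redownload(downloaded, server_sums)
```

Comparing the total checksum first would let a client skip the per-file comparison when everything matches.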

Journal Article•
The synergy between power-aware memory systems and processor voltage scaling

Xiaobo Fan, Carla Schlatter Ellis, Alvin R. Lebeck
01 Jan 2004-Lecture Notes in Computer Science
TL;DR: Combining DVS with power-aware memories that can transition into lower power states has a positive synergistic effect, offering greater energy savings than either technique alone (89% vs. 39% and 54%).
Abstract: Energy consumption is becoming a limiting factor in the development of computer systems for a range of application domains. Since processor performance comes with a high power cost, there is increased interest in scaling the CPU voltage and clock frequency. Dynamic Voltage Scaling (DVS) is the technique for exploiting hardware capabilities to select an appropriate clock rate and voltage to meet application requirements at the lowest energy cost. Unfortunately, the power and performance contributions of other system components, in particular memory, complicate some of the simple assumptions upon which most DVS algorithms are based. We show that there is a positive synergistic effect between DVS and power-aware memories that can transition into lower power states. This combination can offer greater energy savings than either technique alone (89% vs. 39% and 54%). We argue that memory-based criteria (information that is available in commonly provided hardware counters) are important factors for effective speed-setting in DVS algorithms and we develop a technique to estimate overall energy consumption based on them.

126 citations

Patent•
Method and apparatus for optimizing code execution using annotated trace information having performance indicator and counter information

Jimmie Earl Dewitt1, Frank Eliot Levine1, Christopher Michael Richardson1, Robert John Urquhart1•
IBM1
14 Jan 2004
TL;DR: A method, apparatus, and computer instructions for processing instructions in a data processing system are provided: instructions are received at a processor and, if a selected indicator is associated with an instruction, counting of each event associated with its execution is enabled.
Abstract: A method, apparatus, and computer instructions in a data processing system for processing instructions are provided. Instructions are received at a processor in the data processing system. If a selected indicator is associated with the instruction, counting of each event associated with the execution of the instruction is enabled. In some embodiments, a compiler may obtain performance profile data, annotated by output obtained from the use of the performance indicators and counters, along with the instructions/data of the computer program and use this information to optimize the manner by which the computer program is executed, instructions/data are stored, and the like. The optimizations may be to optimize the instruction paths, optimize the time spent in initial application load, the manner by which the cache and memory is utilized, and the like.

108 citations

Book•
Ray tracing on a stream processor

Pat Hanrahan, Timothy John Purcell
1 Jan 2004
TL;DR: The results show that a GPU-based streaming ray tracer has the potential to outperform CPU-based algorithms without requiring fundamentally new hardware, helping to bridge the current gap between realistic and interactive rendering.
Abstract: Ray tracing is an image synthesis technique which simulates the interaction of light with surfaces. Most high-quality, photorealistic renderings are generated by global illumination techniques built on top of ray tracing. Real-time ray tracing has been a goal of the graphics community for many years. Unfortunately, ray tracing is a very expensive operation. VLSI technology has just reached the point where the computational capability of a single chip is sufficient for real-time ray tracing. Supercomputers and clusters of PCs have only recently been able to demonstrate interactive ray tracing and global illumination applications. In this dissertation we show how a ray tracer can be written as a stream program. We present a stream processor abstraction for the modern programmable graphics processor (GPU), allowing the GPU to execute stream programs. We describe an implementation of our streaming ray tracer on the GPU and provide an analysis of the bandwidth and computational requirements of our implementation. In addition, we use our ray tracer to evaluate simulated GPUs with enhanced program execution models. We also present an implementation and evaluation of global illumination with photon mapping on the GPU as an extension of our ray tracing system. Finally, we examine hardware trends that favor the streaming model of computation. Our results show that a GPU-based streaming ray tracer has the potential to outperform CPU-based algorithms without requiring fundamentally new hardware, helping to bridge the current gap between realistic and interactive rendering.

94 citations

Patent•
Accelerated video encoding using a graphics processing unit

Guobin Shen1, Shipeng Li1, Guangping Gao1•
Microsoft1
22 Oct 2004
TL;DR: In this article, the authors propose a technique that enables the GPU to perform motion estimation for video encoding in parallel with the video encoding process performed by the CPU, which greatly accelerates the performance of video encoding.
Abstract: A video encoding system uses both a central processing unit (CPU) and a graphics processing unit (GPU) to perform video encoding. The system implements a technique that enables the GPU to perform motion estimation for video encoding. The technique allows the GPU to perform a motion estimation process in parallel with the video encoding process performed by the CPU. The performance of video encoding using such a system is greatly accelerated as compared to encoding using just the CPU. Also, data related to motion estimation is arranged and provided to the GPU in a way that utilizes the capabilities of the GPU. Data about video frames may be collocated to enable multiple channels of the GPU to process tasks in parallel. The depth buffer of the GPU may be used to consolidate repeated calculations and searching tasks during the motion estimation process.
Patent•
System and method for accelerating and optimizing the processing of machine learning techniques using a graphics processing unit

David W. Steinkraus1, Ian Andrew Buck1, Patrice Y. Simard1•
Microsoft1
4 Nov 2004
TL;DR: A system and method are presented for processing machine learning techniques (such as neural networks) and other non-graphics applications on a graphics processing unit (GPU) to accelerate and optimize the processing.
Abstract: A system and method for processing machine learning techniques (such as neural networks) and other non-graphics applications using a graphics processing unit (GPU) to accelerate and optimize the processing. The system and method transfer an architecture that can be used for a wide variety of machine learning techniques from the CPU to the GPU. The transfer of processing to the GPU is accomplished using several novel techniques that overcome the limitations and work well within the framework of the GPU architecture. With these limitations overcome, machine learning techniques are particularly well suited for processing on the GPU because the GPU is typically much more powerful than the typical CPU. Moreover, similar to graphics processing, processing of machine learning techniques involves solving non-trivial problems on large amounts of data.
Patent•
Configuration in a configurable system on a chip

Brian Fox1, Andreas Papaliolios1•
Xilinx1
16 Dec 2004
TL;DR: A user can customize the configuration sequence of a configurable system on a chip (CSoC), adding considerable flexibility to the configuration process, while features transparent to the user optimize system resources and ensure the correct initialization of the CSoC.
Abstract: The present invention allows a user to customize the configuration sequence of a configurable system on a chip (CSoC), thereby adding considerable flexibility to the configuration process. The present invention also provides certain features, transparent to the user, which optimize system resources and ensure the correct initialization of the CSoC. The CSoC leverages an on-chip central processing unit (CPU) to control the configuration process of the configurable system logic (CSL). Advantageously, the CSL configuration memory cells as well as other programmable locations in the CSoC are addressable as part of a system bus address space. The system bus is a multi-use structure that can be used for both configuring and reading of memory cells. In this manner, the CSoC optimizes system resources.
Patent•
Accelerating video decoding using a graphics processing unit

Guobin Shen1, Lihua Zhu1, Shipeng Li1, Ya Qin Zhang1, Richard F. Rashid1 •
Microsoft1
9 Feb 2004
TL;DR: An accelerated video decoding system utilizes a graphics processing unit (GPU) to perform motion compensation, image reconstruction, and color space conversion processes, while utilizing a central processing unit (CPU) to perform other decoding processes.
Abstract: An accelerated video decoding system utilizes a graphics processing unit to perform motion compensation, image reconstruction, and color space conversion processes, while utilizing a central processing unit to perform other decoding processes.
Patent•
Dependable microcontroller, method for designing a dependable microcontroller and computer program product therefor

Riccardo Mariani, Silvano Motto, Monia Chiavacci
9 Jul 2004
TL;DR: A microcontroller comprising a central processing unit and a further fault processing unit that validates its operations is proposed, with automotive System On Chip (SoC) applications in mind; the invention also includes a method for designing and verifying such a fault-robust system on chip and a fault-injection technique based on e-language.
Abstract: A microcontroller comprising a central processing unit and a further fault processing unit suitable for performing validation of operations of said central processing unit. The further fault processing unit is external and different with respect to said central processing unit and said further fault processing unit comprises at least a module for performing validation of operations of said central processing unit and one or more modules suitable for performing validation of operations of other functional parts of said microcontroller. Validation of operations of said central processing unit is performed by using one or more of the following fault tolerance techniques: data shadowing; code and data processing legality check; addressing legality check; ALU concurrent integrity checking; concurrent mode/interrupt check. The proposed microcontroller is particularly suitable for application in System On Chip (SoC) and was developed by paying specific attention to the possible use in automotive System On Chip. The invention also includes a method for designing and verifying such a fault-robust system on chip, and a fault-injection technique based on e-language.
Patent•
Navigation system having means for determining a route with optimized consumption

Gregor Scholl1•
Siemens1
6 Aug 2004
TL;DR: A motor vehicle navigation system uses map data containing fuel consumption information to determine a route that minimizes the anticipated fuel consumption without exceeding a maximum travel time predefined by the user.
Abstract: A navigation system of a motor vehicle has an input unit, a position-determining unit, a central processor unit (CPU) for calculating a route between a first and a second location by reference to map data which contains information for determining a predicted fuel consumption for the route, and an output unit for outputting travel instructions which are matched with the current position of the vehicle. A maximum travel time for a journey from the first location to the second location is predefined by a user using the input unit and the route is determined such that the predefined maximum travel time is not exceeded, and the anticipated fuel consumption is minimized.
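
The route selection described in the abstract above (least fuel subject to a maximum travel time) can be sketched on a toy road graph. The patent abstract does not say how the route is computed, so the exhaustive search below, the graph, and all names are assumptions chosen for clarity rather than efficiency.

```python
def best_route(graph, start, goal, max_time):
    """Return (fuel, time, path) for the least-fuel route within max_time, or None.

    `graph` maps a node to a list of (neighbor, travel_time, fuel) edges.
    Travel times are assumed non-negative, which makes time-based pruning safe.
    """
    best = None

    def visit(node, time_so_far, fuel_so_far, path):
        nonlocal best
        if time_so_far > max_time:
            return  # this partial route already breaks the time limit
        if node == goal:
            if best is None or fuel_so_far < best[0]:
                best = (fuel_so_far, time_so_far, list(path))
            return
        for neighbor, travel_time, fuel in graph.get(node, []):
            if neighbor not in path:  # never revisit a location
                path.append(neighbor)
                visit(neighbor, time_so_far + travel_time,
                      fuel_so_far + fuel, path)
                path.pop()

    visit(start, 0.0, 0.0, [start])
    return best


# Hypothetical map: times in minutes, fuel in litres.
ROADS = {
    "A": [("B", 30, 2.0), ("C", 20, 3.5), ("D", 70, 3.0)],
    "B": [("D", 30, 2.0)],
    "C": [("D", 20, 3.0)],
}
route = best_route(ROADS, "A", "D", max_time=65)
```

With a 65-minute limit the most economical direct road (70 minutes) is excluded, so the search settles on the route through B; tightening the limit to 45 minutes forces the faster but thirstier route through C.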
Proceedings Article•10.1109/ACSSC.2004.1399080•
High Level Exploration of Quantum-Dot Cellular Automata (QCA)

Konrad Walus1, G. Schulhof1, Graham A. Jullien1•
University of Calgary1
7 Nov 2004
TL;DR: A high level evaluation of an emerging nanotechnology to determine a set of technology requirements is presented and two different QCA circuits are presented and the technology requirements based on the specifications of these circuits are evaluated.
Abstract: In this work, we present a high level evaluation of an emerging nanotechnology to determine a set of technology requirements. The technology under question is Quantum-Dot Cellular Automata (QCA). As a vehicle, we present two different QCA circuits and evaluate the technology requirements based on the specifications of these circuits. These circuits are a simple 4-bit arithmetic logic unit (ALU) and a 4×4 memory which are building blocks to more complex systems such as a computer central processing unit (CPU).
Proceedings Article•10.1109/RTTAS.2004.1317263•
Data management in real-time systems: a case of on-demand updates in vehicle control systems

Thomas Gustafsson1, Jörgen Hansson1•
Linköping University1
25 May 2004
TL;DR: A new algorithm (ODTB) for updating data items that can skip unnecessary updates allowing for better utilization of the CPU is proposed.
Abstract: Real-time and embedded applications normally have constraints both with respect to timeliness and freshness of data they use. At the same time it is important that the resources are utilized as efficiently as possible, e.g., for CPU resources unnecessary calculations should be reduced as much as possible. This is especially true for vehicle control systems, which are our target application area. The contribution of this paper is a new algorithm (ODTB) for updating data items that can skip unnecessary updates allowing for better utilization of the CPU. Performance evaluations on an engine electronic control unit for automobiles show that a database system using the new updating algorithm reduces the number of recalculations to zero in steady states. We also evaluate the algorithm using a simulator and show that the ODTB performs better than well-established updating algorithms (up to 50% more committed transactions).
Proceedings Article•10.1145/1027527.1027737•
Practical voltage scaling for mobile multimedia devices

Wanghong Yuan1, Klara Nahrstedt1•
University of Illinois at Urbana–Champaign1
10 Oct 2004
TL;DR: PDVS seeks to minimize the total energy of the whole device while meeting multimedia timing requirements, and extends traditional real-time scheduling by deciding what execution speed in addition to when to execute what applications.
Abstract: This paper presents the design, implementation, and evaluation of a practical voltage scaling (PDVS) algorithm for mobile devices primarily running multimedia applications. PDVS seeks to minimize the total energy of the whole device while meeting multimedia timing requirements. To do this, PDVS extends traditional real-time scheduling by deciding what execution speed in addition to when to execute what applications. PDVS makes these decisions based on the discrete speed levels of the CPU, the total power of the device at different speeds, and the probability distribution of CPU demand of multimedia applications. We have implemented PDVS in the Linux kernel and evaluated it on an HP laptop. Our experimental results show that PDVS saves energy substantially without affecting multimedia performance. It saves energy by 14.4% to 37.2% compared to scheduling algorithms without voltage scaling and by up to 10.4% compared to previous voltage scaling algorithms that assume an ideal CPU with continuous speeds and cubic power-speed relationship.
Book Chapter•10.1016/B978-012088469-8.50059-0•
Steps towards cache-resident transaction processing

Stavros Harizopoulos1, Anastassia Ailamaki1•
Carnegie Mellon University1
31 Aug 2004
TL;DR: Steps is a technique that minimizes instruction cache misses in OLTP workloads by multiplexing concurrent transactions and exploiting common code paths, yielding up to a 96.7% reduction in instruction cache misses for each additional concurrent transaction.
Abstract: Online transaction processing (OLTP) is a multibillion dollar industry with high-end database servers employing state-of-the-art processors to maximize performance. Unfortunately, recent studies show that CPUs are far from realizing their maximum intended throughput because of delays in the processor caches. When running OLTP, instruction-related delays in the memory subsystem account for 25 to 40% of the total execution time. In contrast to data, instruction misses cannot be overlapped with out-of-order execution, and instruction caches cannot grow as the slower access time directly affects the processor speed. The challenge is to alleviate the instruction-related delays without increasing the cache size. We propose Steps, a technique that minimizes instruction cache misses in OLTP workloads by multiplexing concurrent transactions and exploiting common code paths. One transaction paves the cache with instructions, while close followers enjoy a nearly miss-free execution. Steps yields up to 96.7% reduction in instruction cache misses for each additional concurrent transaction, and at the same time eliminates up to 64% of mispredicted branches by loading a repeating execution pattern into the CPU. This paper (a) describes the design and implementation of Steps, (b) analyzes Steps using microbenchmarks, and (c) shows Steps performance when running TPC-C on top of the Shore storage manager.
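
The effect described in the abstract above, where one transaction fills the instruction cache and close followers run nearly miss-free, can be illustrated with a toy simulation. This is not the Steps implementation: the tiny LRU cache of code chunks, the chunk names, and both schedules are assumptions chosen to make the contrast visible.

```python
def count_misses(schedule, cache_slots):
    """Count misses for a sequence of code chunks on a small LRU cache."""
    cache = []  # least recently used chunk first
    misses = 0
    for chunk in schedule:
        if chunk in cache:
            cache.remove(chunk)  # hit: refreshed below as most recent
        else:
            misses += 1
            if len(cache) == cache_slots:
                cache.pop(0)  # evict the least recently used chunk
        cache.append(chunk)
    return misses


def run_to_completion(chunks, n_transactions):
    """Each transaction executes its whole code path before the next starts."""
    return [chunk for _ in range(n_transactions) for chunk in chunks]


def multiplexed(chunks, n_transactions):
    """All transactions execute one chunk before any moves to the next chunk."""
    return [chunk for chunk in chunks for _ in range(n_transactions)]


CODE_PATH = ["parse", "lookup", "update", "commit"]  # hypothetical code chunks
sequential_misses = count_misses(run_to_completion(CODE_PATH, 3), cache_slots=2)
multiplexed_misses = count_misses(multiplexed(CODE_PATH, 3), cache_slots=2)
```

In this toy setup the code path is larger than the cache, so running transactions to completion misses on every chunk, while multiplexing pays for each chunk only once regardless of the number of transactions.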
Patent•
Security system for computer transactions

Ningjun Liu, Glenn Eastlack
9 Jul 2004
TL;DR: A USB Security Key, a remote terminal and a secure access appliance together provide security for a central computer; because the secure access appliance hides the IP address/name of the central computer, the central computer remains secure from unauthorized access and provides an audit trail.
Abstract: A Security system for computer transactions incorporates a USB Security Key, a remote terminal and a secure access appliance to provide Security for a central computer. The USB Security Key is coded with a personal digital certificate and is required to be inserted into the remote terminal, along with the input of a personal identification number, before communications with the secure access appliance can be authenticated. The remote terminal is provided only with a central processing unit, random access memory, and restricted access, non-volatile flash memory storage device, which when used with a central computer, eliminates the need to store data on a permanent memory storage device. Software applications can be downloaded from the central computer for operation by the remote terminal. Since the IP address/name of the central computer is hidden by the secure access appliance, the central computer remains secure from unauthorized access and provides an audit trail.
Patent•
Instruction controlled data processing device

Carlos Antonio Alba Pinto1, Balakrishnan Srinivasan1, Ramanathan Sethuraman1•
Philips1
22 Jun 2004
TL;DR: A data processing device with a plurality of functional units issues instructions in successive instruction cycles: instructions of a first type are each intended for one functional unit at a time, while an instruction of a second type causes a combination of functional units to respond in the same instruction execution cycle.
Abstract: The data processing device has a plurality of functional units and issues instructions in successive instruction cycles. Instructions of a first type are each intended for one functional unit at a time. An instruction of a second type causes a combination of functional units to respond in the same instruction execution cycle, a result from one functional unit being used by another as part of the execution of the same instruction. Preferably, the device supports alternative operation at a number of different instruction cycle rates, dependent on whether an executed program segment contains instructions of the second type. The fastest instruction cycle rate does not allow execution of the instruction of the second type, because operation by different functional units does not fit within the instruction execution cycle. When possible, the device saves power by switching to a slower clock rate, in which case instructions of the second type are executed to save additional power, by reducing the number of instructions that have to be issued.
Patent•
Cellular phone capable of measuring temperature by IR means

Tsai Tony, Ho David, Jing Zhen, Zhu Yi-Min
3 Jun 2004
TL;DR: A cellular phone capable of measuring temperature includes a liquid crystal display, a CPU and an IR temperature measurement means coupled to the CPU; the CPU performs a plurality of processing and conversion steps to show a corresponding temperature on the display.
Abstract: The present invention provides a cellular phone capable of measuring temperature, including a liquid crystal display, a CPU and an IR temperature measurement means coupled to the CPU. A first GPO pin and a second GPO pin of the CPU are adapted to control the power supply of a power supply IC and the start or stop of the IR temperature measurement means, respectively. In response to receiving a temperature measurement signal, the CPU performs a plurality of processing and conversion steps to show a corresponding temperature on the display.
Patent•
Module for a gaming machine

Binh T. Nguyen, Craig A. Paulsen, Mike Kinsley, John Goodman
25 Aug 2004
TL;DR: Various modules for use with gaming machines are provided; some embodiments of the module include a central processing unit (CPU) and a memory device such as a dual-ported random access memory (DPRAM).
Abstract: The present invention provides various modules for use with gaming machines. One such module is configured to receive data from a portable memory device and/or from a network device, e.g., from a game server. In some embodiments, the module includes, or is disposed within, a player tracking unit. Some embodiments of the module include a central processing unit (“CPU”) and a memory device such as a dual-ported random access memory (“DPRAM”). Data, such as software or content, may be downloaded to the module's CPU and written to the module's memory. According to some embodiments, data are written to a DPRAM in the module and simultaneously written from the DPRAM to the gaming machine via a high-speed digital bus. In some implementations, a memory in the module is configured to emulate a memory of the gaming machine. This allows a CPU of the gaming machine to execute software stored in the memory in the module. In alternative implementations, a CPU in the module can execute software stored in the memory in the module.
Proceedings Article•10.1109/RTTAS.2004.1317274•
Dynamic CPU management for real-time, middleware-based systems

Eric Eide1, Tim Stack1, John Regehr1, Jay Lepreau1•
University of Utah1
28 May 2004
TL;DR: The results show that the broker connects to the system transparently and allows it to function in the face of run-time CPU resource contention, and these features allow the broker to be easily combined with other QOS mechanisms and policies, as part of an overall end-to-end QOS management system.
Abstract: Many real-world distributed, real-time, embedded (DRE) systems, such as multiagent military applications, are built using commercially available operating systems, middleware, and collections of pre-existing software. The complexity of these systems makes it difficult to ensure that they maintain high quality of service (QOS). At design time, the challenge is to introduce coordinated QOS controls into multiple software elements in a non-invasive manner. At run time, the system must adapt dynamically to maintain high QOS in the face of both expected events, such as application mode changes, and unexpected events, such as resource demands from other applications. We describe the design and implementation of a CPU broker for these types of DRE systems. The CPU broker mediates between multiple real-time tasks and the facilities of a real-time operating system: using feedback and other inputs, it adjusts allocations over time to ensure that high application-level QOS is maintained. The broker connects to its monitored tasks in a non-invasive manner, is based on and integrated with industry-standard middleware, and implements an open architecture for new CPU management policies. Moreover, these features allow the broker to be easily combined with other QOS mechanisms and policies, as part of an overall end-to-end QOS management system. We describe our experience in applying the CPU Broker to a simulated DRE military system. Our results show that the broker connects to the system transparently and allows it to function in the face of run-time CPU resource contention.
Proceedings Article•10.1109/ISCAS.2004.1329513•
Graphics processor unit (GPU) acceleration of finite-difference time-domain (FDTD) algorithm

S.E. Krakiwsky1, L.E. Turner1, Michal Okoniewski1•
University of Calgary1
23 May 2004
TL;DR: Off-the-shelf graphics processor units (GPUs) can be successfully used to accelerate FDTD simulations and it is demonstrated that the GPU outperforms a central processing unit (CPU) of comparable technology generation.
Abstract: The finite-difference time-domain (FDTD) algorithm has become a tool of choice in many areas of RF and microwave engineering and optics. However, FDTD runs too slowly for some simulations to be practical, even when carried out on supercomputers. The development of dedicated hardware to accelerate FDTD computations has been investigated. In this paper, we demonstrate that off-the-shelf graphics processor units (GPUs) can be successfully used to accelerate FDTD simulations. Using C++, OpenGL, and several OpenGL extensions, a modern GPU has been programmed to solve a simple two-dimensional electromagnetic scattering problem. The GPU outperforms a central processing unit (CPU) of comparable technology generation.
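To illustrate why FDTD maps well onto data-parallel hardware, the following is a minimal sketch of the leapfrog field update, not the paper's implementation. It is a one-dimensional CPU version in pure Python with normalized units; the paper itself solves a two-dimensional problem on a GPU through OpenGL. The function name `fdtd_1d`, its parameters, the Gaussian source, and the zero-field boundaries are all assumptions made for illustration.

```python
import math


def fdtd_1d(n_cells=200, n_steps=150, source_cell=100, courant=0.5):
    """Leapfrog update of a 1D electromagnetic field (illustrative, normalized units)."""
    ez = [0.0] * n_cells          # electric field samples
    hy = [0.0] * (n_cells - 1)    # magnetic field samples, staggered by half a cell
    for step in range(n_steps):
        # Update H from the spatial difference of E. Each cell depends only on
        # its two neighbours, so all cells could be updated in parallel.
        for i in range(n_cells - 1):
            hy[i] += courant * (ez[i + 1] - ez[i])
        # Update E from the spatial difference of H (interior cells only).
        # The two boundary cells are never written and stay at zero.
        for i in range(1, n_cells - 1):
            ez[i] += courant * (hy[i] - hy[i - 1])
        # Soft source (assumed): add a Gaussian pulse at a single cell.
        ez[source_cell] += math.exp(-((step - 30) / 10.0) ** 2)
    return ez, hy
```

Each inner loop reads only neighbouring cells from the previous half-step, which is the kind of per-element stencil that a GPU can evaluate for every cell at once.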
Patent•
Processor, data processing system and method for synchronizing access to data in shared memory

Guy Lynn Guthrie1, Sheldon B. Levenstein1, William J. Starke1, Derek Edward Williams1•
IBM1
14 Oct 2004
TL;DR: A processor core includes a store-through upper level cache, a reservation register, and sequencer logic that, by reference to the reservation register, can fail a store-conditional operation without communicating with the reservation logic in the lower level cache.
Abstract: A processing unit for a multiprocessor data processing system includes a processor core and a lower level cache including a reservation logic that records reservations of the processor core. The reservation logic passes or fails store-conditional operations received from the processor core based upon whether the processor core has reservations for target store addresses of the store-conditional operations. The processor core includes a store-through upper level cache, a reservation register, and sequencer logic that, by reference to the reservation register, fails a store-conditional operation without communication with said reservation logic.
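The two-level check described in the abstract can be pictured with a small software model. This is a hedged sketch only: the class names, the `snoop_store` method, the request counter, and the rule that a store-conditional attempt always clears the core's reservation register are assumptions for illustration and are not taken from the patent.

```python
class LowerLevelCache:
    """Hypothetical model of the lower level cache and its reservation logic."""

    def __init__(self):
        self.memory = {}
        self.reserved_address = None
        self.requests_seen = 0  # store-conditionals that actually reached this level

    def reserve(self, address):
        self.reserved_address = address

    def snoop_store(self, address):
        # Assumed behaviour: a store by another processor cancels the reservation.
        if self.reserved_address == address:
            self.reserved_address = None

    def store_conditional(self, address, value):
        self.requests_seen += 1
        if self.reserved_address != address:
            return False  # fail: no reservation held for this address
        self.memory[address] = value
        self.reserved_address = None
        return True


class ProcessorCore:
    """Hypothetical core with a local reservation register."""

    def __init__(self, lower_cache):
        self.lower_cache = lower_cache
        self.reservation_register = None

    def load_reserve(self, address):
        self.reservation_register = address
        self.lower_cache.reserve(address)
        return self.lower_cache.memory.get(address, 0)

    def store_conditional(self, address, value):
        held = self.reservation_register == address
        self.reservation_register = None  # assumed: any attempt consumes the reservation
        if not held:
            return False  # failed locally, no request sent to the lower level
        return self.lower_cache.store_conditional(address, value)
```

In this model, a store-conditional issued without a matching local reservation returns `False` while `requests_seen` stays unchanged, which mirrors the idea of failing the operation without communicating with the reservation logic.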
Book•
Fundamentals of computer organization and architecture

Mostafa Abd-El-Barr, Hesham El-Rewini
3 Dec 2004
TL;DR: This book covers the fundamentals of computer organization and architecture, from instruction set design, assembly language programming, and computer arithmetic through processing unit, memory, and input-output design to pipelining, reduced instruction set computers (RISCs), and multiprocessors.
Abstract: Preface.
1. Introduction to Computer Systems. 1.1. Historical Background. 1.2. Architectural Development & Styles. 1.3. Technological Development. 1.4. Performance Measures. 1.5. Summary. Exercises. References and Further Reading.
2. Instruction Set Architecture & Design. 2.1. Memory Locations and Operations. 2.2. Addressing Modes. 2.3. Instruction Types. 2.4. Programming Examples. 2.5. Summary. Exercises. References and Further Reading.
3. Assembly Language Programming. 3.1. A Simple Machine. 3.2. Instructions Mnemonics and Syntax. 3.3. Assembler Directives and Commands. 3.4. Assembly and Execution of Programs. 3.5. Example: The X86 Family. 3.6. Summary. Exercises. References and Further Reading.
4. Computer Arithmetic. 4.1. Number Systems. 4.2. Integer Arithmetic. 4.3. Floating Point Arithmetic. 4.4. Summary. Exercises. References and Further Readings.
5. Processing Unit Design. 5.1. CPU Basics. 5.2. Register Set. 5.3. Data Path. 5.4. The CPU Instruction Cycle. 5.5. Control Unit. 5.6. Summary. Exercises. References.
6. Memory System Design I. 6.1. Basic Concepts. 6.2. Cache Memory. 6.3. Summary. Exercises. References and Further Readings.
7. Memory System Design II. 7.1. Main Memory. 7.2. Virtual Memory. 7.3. Read-Only Memory. 7.4. Summary. Exercises. References and Further Readings.
8. Input-Output Design and Organization. 8.1. Basic Concepts. 8.2. Programmed I/O. 8.3. Interrupt-Driven I/O. 8.4. Direct Memory Access (DMA). 8.5. Busses. 8.6. Input-Output Interfaces. 8.7. Summary. Exercises. References and Further Readings.
9. Pipelining Design Techniques. 9.1. General Concepts. 9.2. Instruction Pipeline. 9.3. Arithmetic pipeline. 9.4. Summary. Exercises. References and Further Reading.
10. Reduced Instruction Set Computers (RISCs). 10.1. RISC/CISC Evolution Cycle. 10.2. RISCs Design Principles. 10.3. Overlapped Register Windows. 10.4. RISCs Versus CISCs. 10.5. Pioneer (University) RISC Machines. 10.6. Example of Advanced RISC Machines. 10.7. Summary. Exercises. References and Further Readings.
11. Introduction to Multiprocessors. 11.1. Introduction. 11.2. Classification of Computer Architectures. 11.3. SIMD Schemes. 11.4. MIMD Schemes. 11.5. Interconnection Networks. 11.6. Analysis and Performance Metrics. 11.7. Summary. Exercises. References and Further Readings.
Index.
Book Chapter•10.1016/B978-075065759-4/50002-2•
Introduction to Microcontrollers

David Calcutt, Fred Cowan, Hassan Parchizadeh
1 Jan 2004
TL;DR: A microcontroller is a computer with most of the necessary support chips onboard and a central processing unit (CPU) that ‘executes’ programs.
Abstract: A microcontroller is a computer with most of the necessary support chips onboard. All computers have several things in common, namely: • A central processing unit (CPU) that ‘executes’ programs
Patent•
Instruction group formation and mechanism for SMT dispatch

Brian W. Curran1, Brian R. Konigsburg1, Hung Qui Le1, David A. Luick1, Dung Quoc Nguyen1 •
IBM1
14 Oct 2004
TL;DR: A more efficient method of handling instructions in a computer processor associates resource fields with program instructions to indicate which hardware resources they require, calculates the resource requirements for merging two or more instructions from those fields, and determines resource availability for executing the merged instructions simultaneously.
Abstract: A more efficient method of handling instructions in a computer processor, by associating resource fields with respective program instructions wherein the resource fields indicate which of the processor hardware resources are required to carry out the program instructions, calculating resource requirements for merging two or more program instructions based on their resource fields, and determining resource availability for simultaneously executing the merged program instructions based on the calculated resource requirements. Resource vectors indicative of the required resource may be encoded into the resource fields, and the resource fields decoded at a later stage to derive the resource vectors. The resource fields can be stored in the instruction cache associated with the respective program instructions. The processor may operate in a simultaneous multithreading mode with different program instructions being part of different hardware threads. When the resource availability equals or exceeds the resource requirements for a group of instructions, those instructions can be dispatched simultaneously to the hardware resources. A start bit may be inserted in one of the program instructions to define the instruction group. The hardware resources may in particular be execution units such as a fixed-point unit, a load/store unit, a floating-point unit, or a branch processing unit.
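The grouping check in this abstract (decode resource fields, sum the requirements, compare against availability) can be sketched in a few lines. Everything concrete here is an assumption for illustration: the four resource names, the 2-bit-per-resource encoding, and the function names are hypothetical and are not the patent's actual field layout.

```python
# Assumed execution-unit kinds, in a fixed order.
RESOURCES = ("fixed_point", "load_store", "floating_point", "branch")
BITS_PER_RESOURCE = 2  # assumed: each count fits in 2 bits (0..3)


def encode_resource_field(vector):
    """Pack a per-resource requirement vector into one integer field."""
    field = 0
    for i, count in enumerate(vector):
        if not 0 <= count < (1 << BITS_PER_RESOURCE):
            raise ValueError("count does not fit in the assumed field width")
        field |= count << (BITS_PER_RESOURCE * i)
    return field


def decode_resource_field(field):
    """Recover the requirement vector from an encoded field."""
    mask = (1 << BITS_PER_RESOURCE) - 1
    return tuple((field >> (BITS_PER_RESOURCE * i)) & mask
                 for i in range(len(RESOURCES)))


def merged_requirements(fields):
    """Sum the decoded requirement vectors of a candidate instruction group."""
    total = [0] * len(RESOURCES)
    for field in fields:
        for i, count in enumerate(decode_resource_field(field)):
            total[i] += count
    return tuple(total)


def can_dispatch(fields, available):
    """True when availability equals or exceeds the merged requirements."""
    required = merged_requirements(fields)
    return all(need <= have for need, have in zip(required, available))
```

For example, with two fixed-point units available, a group of two fixed-point instructions and one load/store instruction passes the check, while a group of three fixed-point instructions does not.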