TL;DR: This paper demonstrates the benefits of cache sharing, measures the overhead of the existing protocols, and proposes a new protocol called "summary cache", which reduces the number of intercache protocol messages, reduces the bandwidth consumption, and eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP.
Abstract: The sharing of caches among Web proxies is an important technique to reduce Web traffic and alleviate network bottlenecks. Nevertheless it is not widely deployed due to the overhead of existing protocols. In this paper we demonstrate the benefits of cache sharing, measure the overhead of the existing protocols, and propose a new protocol called "summary cache". In this new protocol, each proxy keeps a summary of the cache directory of each participating proxy, and checks these summaries for potential hits before sending any queries. Two factors contribute to our protocol's low overhead: the summaries are updated only periodically, and the directory representations are very economical, as low as 8 bits per entry. Using trace-driven simulations and a prototype implementation, we show that, compared to existing protocols such as the Internet cache protocol (ICP), summary cache reduces the number of intercache protocol messages by a factor of 25 to 60, reduces the bandwidth consumption by over 50%, eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP. Hence summary cache scales to a large number of proxies. (This paper is a revision of Fan et al. 1998; we add more data and analysis in this version.).
TL;DR: In this paper, a preloader uses a cache replacement manager to manage requests for retrievals, insertions, and removal of web page components in a component cache, and a profile server predicts a user's next content request.
Abstract: A preloader works in conjunction with a web/app server and optionally a profile server to cache web page content elements or components for faster on-demand and anticipatory dynamic web page delivery. The preloader uses a cache manager to manage requests for retrievals, insertions, and removal of web page components in a component cache. The preloader uses a cache replacement manager to manage the replacement of components in the cache. While the cache replacement manager may utilize any cache replacement policy, a particularly effective replacement policy utilizes predictive information to make replacement decisions. Such a policy uses a profile server, which predicts a user's next content request. The components that can be cached are identified by tagging them within the dynamic scripts that generate them. The preloader caches components that are likely to be accessed next, thus improving a web site's scalability.
TL;DR: In this paper, a plurality of cache servers capable of caching WWW information provided by the information servers are provided in association with the wireless network, and the cache servers can be managed by receiving a message indicating at least a connected location of a mobile computer in a wireless network from the mobile computer, selecting one or more cache servers located nearby the mobile computers according to the message, and controlling these one or multiple cache servers to cache selected web information selected for the mobile Computer, so as to enable faster accesses to the selected Web information by the Mobile Computer.
Abstract: In the disclosed information delivery scheme for delivering WWW information provided by information servers on the Internet to mobile computers connected to the Internet through a wireless network, a plurality of cache servers capable of caching WWW information provided by the information servers are provided in association with the wireless network. The cache servers can be managed by receiving a message indicating at least a connected location of a mobile computer in the wireless network from the mobile computer, selecting one or more cache servers located nearby the mobile computer according to the message, and controlling these one or more cache servers to cache selected WWW information selected for the mobile computer, so as to enable faster accesses to the selected WWW information by the mobile computer. Also, the cache servers can be managed by selecting one or more cache servers located within a geographic range defined for an information provider who provides WWW information from an information server, and controlling these one or more cache servers to cache selected WWW information selected for the information provider, so as to enable faster accesses to the selected WWW information by the mobile computer.
TL;DR: This paper focuses on the features of the M340 cache sub-system and illustrates the effect on power and performance through benchmark analysis and actual silicon measurements.
Abstract: Advances in technology have allowed portable electronic devices to become smaller and more complex, placing stringent power and performance requirements on the devices components. The M7CORE M3 architecture was developed specifically for these embedded applications. To address the growing need for longer battery life and higher performance, an 8-Kbyte, 4-way set-associative, unified (instruction and data) cache with pro-grammable features was added to the M3 core. These features allow the architecture to be optimized based on the applications requirements. In this paper, we focus on the features of the M340 cache sub-system and illustrate the effect on power and perfor-mance through benchmark analysis and actual silicon measure-ments.
TL;DR: A practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement is presented.
Abstract: As DRAM access latencies approach a thousand instruction-execution times and on-chip caches grow to multiple megabytes, it is not clear that conventional cache structures continue to be appropriate. Two key features—full associativity and software management—have been used successfully in the virtual-memory domain to cope with disk access latencies. Future systems will need to employ similar techniques to deal with DRAM latencies. This paper presents a practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement. We see this structure as the first step toward OS- and application-aware management of large on-chip caches.This paper has two primary contributions: a practical design for a fully associative memory structure, the indirect index cache (IIC), and a novel replacement algorithm, generational replacement, that is specifically designed to work with the IIC. We analyze the behavior of an IIC with generational replacement as a drop-in, transparent substitute for a conventional secondary cache. We achieve miss rate reductions from 8% to 85% relative to a 4-way associative LRU organization, matching or beating a (practically infeasible) fully associative true LRU cache. Incorporating these miss rates into a rudimentary timing model indicates that the IIC/generational replacement cache could be competitive with a conventional cache at today's DRAM latencies, and will outperform a conventional cache as these CPU-relative latencies grow.
TL;DR: In this article, a field programmable gate array (FPGA) which includes first and second arrays of configurable logic blocks, and first-and second-level configuration cache memories coupled to the first and the second arrays, respectively, is described.
Abstract: A field programmable gate array (FPGA) which includes first and second arrays of configurable logic blocks, and first and second configuration cache memories coupled to the first and second arrays of configurable logic blocks, respectively. The first configuration cache memory array can either store values for reconfiguring the first array of configurable logic blocks, or operate as a RAM. Similarly, the second configuration cache array can either store values for reconfiguring the second array of configurable logic blocks, or operate as a RAM. The first configuration cache memory array and the second configuration cache memory array are independently controlled, such that partial reconfiguration of the FPGA can be accomplished. In addition, the second configuration cache memory array can store values for reconfiguring the first (rather than the second) array of configurable logic blocks, thereby providing a second-level reconfiguration cache memory.
TL;DR: In this paper, a relatively high-speed, intermediate-volume storage device is operated as a user-configurable cache, where data is preloaded and responsively cached in the cache memory based on user preferences.
Abstract: An apparatus and method for caching data in a storage device (26) of a computer system (10). A relatively high-speed, intermediate-volume storage device (25) is operated as a user-configurable cache. Requests to access a mass storage device (46) such as a disk or tape (26, 28) are intercepted by a device driver (32) that compares the access request against a directory (51) of the contents of the user-configurable cache (25). If the user-configurable cache contains the data sought to be accessed, the access request is carried out in the user-configurable cache instead of being forwarded to the device driver for the target mass storage device (46). Because the user-cache is implemented using memory having a dramatically shorter access time than most mechanical mass storage devices, the access request is fulfilled much more quickly than if the originally intended mass storage device was accessed. Data is preloaded and responsively cached in the user-configurable cache memory based on user preferences.
TL;DR: Dynamic zero compression reduces the energy required for cache accesses by only writing and reading a single bit for every zero-valued byte and an instruction recoding technique is described that increases instruction cache energy savings to 18%.
Abstract: Dynamic zero compression reduces the energy required for cache accesses by only writing and reading a single bit for every zero-valued byte. This energy-conscious compression is invisible to software and is handled with additional circuitry embedded inside the cache RAM arrays and the CPU. The additional circuitry imposes a cache area overhead of 9% and a read latency overhead of around two F04 gate delays. Simulation results show that we can reduce total data cache energy by around 26% and instruction cache energy by around 10% for SPECint95 and MediaBench benchmarks. We also describe the use of an instruction recoding technique that increases instruction cache energy savings to 18%.
TL;DR: In this paper, the authors propose a Web page cache that stores Web pages such that servers will be able to retrieve valid dynamic pages without going to a dynamic content server or origin Web server for the page every time a user requests that dynamic page.
Abstract: A Web page cache that stores Web pages such that servers will be able to retrieve valid dynamic pages without going to a dynamic content server or origin Web server for the page every time a user requests that dynamic page. The dynamic content cache receives information that defines data upon which each dynamic page is dependent, such that when the value of any dependency data item changes, the associated dynamic page is marked as invalid or deleted. The dynamic page cache stores dependency data, receives change event information, and indicates when pages in the cache are invalidated or need to be refreshed.
TL;DR: In this article, a disk drive consisting of a cache memory and a cache control system with a tag memory having a plurality of tag records, and means for allocating a tag record for responding to a host command is described.
Abstract: The present invention relates to a disk drive 10 comprising a cache memory 14 and a cache control system having a tag memory having a plurality of tag records, and means for allocating a tag record for responding to a host command. The cache memory has a plurality of sequentially-ordered memory clusters 46 for caching disk data stored in sectors (not shown) on disks of a disk assembly 38 . Conventionally the disk sectors are identified by logical block addresses (LBAs). The cache control system 12 along with the tag memory 22 and means for allocating tag records are embedded within the cache control system 12 and thereby configured only for use in defining variable length segments of the memory clusters 46 . The segments are defined without regard to the sequential order of the memory clusters 46.
TL;DR: In this paper, the authors present a cache control method for caching disk data in a disk drive configured to receive commands for both streaming and non-streaming data from a host, where a lossy state record is provided for memory segments in a cache memory.
Abstract: The present invention may be embodied in a cache control method for caching disk data in a disk drive configured to receive commands for both streaming and non-streaming data from a host. A lossy state record is provided for memory segments in a cache memory. The lossy state record allows hosts commands to be mixed for streaming and non-streaming data without flushing of cache data for a command mode change.
TL;DR: A way to improve the performance of embedded processors running data-intensive applications by allowing software to allocate on-chip memory on an application-specific basis via a novel hardware mechanism, called column caching.
Abstract: We propose a way to improve the performance of embedded processors running data-intensive applications by allowing software to allocate on-chip memory on an application-specific basis. On-chip memory in the form of cache can be made to act like scratch-pad memory via a novel hardware mechanism, which we call column caching. Column caching enables dynamic cache partitioning in software, by mapping data regions to a specified sets of cache “columns” or “ways.” When a region of memory is exclusively mapped to an equivalent sized partition of cache, column caching provides the same functionality and predictability as a dedicated scratchpad memory for time-critical parts of a real-time application. The ratio between scratchpad size and cache size can be easily and quickly varied for each application, or each task within an application. Thus, software has much finer software control of on-chip memory, providing the ability to dynamically tradeoff performance for on-chip memory.
TL;DR: In this paper, a disk drive including a cache memory having a plurality of sequentially-ordered memory clusters for caching disk data stored in sectors (not shown) on disks of a disk assembly is described.
Abstract: The present invention relates to a disk drive including a cache memory having a plurality of sequentially-ordered memory clusters for caching disk data stored in sectors (not shown) on disks of a disk assembly. The disk sectors are identified by logical block addresses (LBAs). A cache control system of the disk drive comprises a cluster control block memory, having a plurality of cluster control blocks (CCB), and a tag memory 22, having a plurality of tag records, that are embedded within the cache control system. Each CCB includes a cluster segment record with an entry for associating the CCB with a particular memory cluster and for forming variable length segments of the memory clusters without regard to the sequential order of the memory clusters. Each tag record assigns a segment to a continuous range of LBAs and defines the CCBs forming the segment. Each segment of the memory clusters is for caching data from a contiguous range of the logical block addresses.
TL;DR: The importance of different Web proxy workload characteristics in making good cache replacement decisions is analyzed, and results indicate that higher cache hit rates are achieved using size-based replacement policies.
TL;DR: In this paper, the cache employs one or more prefetch ways for storing prefetch cache lines and accessed cache lines, while cache lines fetched in response to cache misses for requests initiated by a microprocessor connected to the cache are stored into non-prefetch ways.
Abstract: A cache employs one or more prefetch ways for storing prefetch cache lines and one or more ways for storing accessed cache lines. Prefetch cache lines are stored into the prefetch way, while cache lines fetched in response to cache misses for requests initiated by a microprocessor connected to the cache are stored into the non-prefetch ways. Accessed cache lines are thereby maintained within the cache separately from prefetch cache lines. When a prefetch cache line is presented to the cache for storage, the prefetch cache line may displace another prefetch cache line but does not displace an accessed cache line. A cache hit in either the prefetch way or the non-prefetch ways causes the cache line to be delivered to the requesting microprocessor in a cache hit fashion. The cache is further configured to move prefetch cache lines from the prefetch way to the non-prefetch way if the prefetch cache lines are requested (i.e. they become accessed cache lines). Instruction cache lines may be moved immediately upon access, while data cache line accesses may be counted and a number of accesses greater than a predetermined threshold value may occur prior to moving the data cache line from the prefetch way to the non-prefetch way. Additionally, movement of an accessed cache line from the prefetch way to the non-prefetch way may be delayed until the accessed cache line is to be replaced by a prefetch cache line.
TL;DR: In this paper, the cache memory is partitioned among a set of threads of a multi-threaded processor, and when a cache miss occurs, a replacement line is selected in a partition of the cache space which is allocated to the particular thread from which the access causing the cache miss originated, thereby preventing pollution to partitions belonging to other threads.
Abstract: A method and apparatus which provides a cache management policy for use with a cache memory for a multi-threaded processor. The cache memory is partitioned among a set of threads of the multi-threaded processor. When a cache miss occurs, a replacement line is selected in a partition of the cache memory which is allocated to the particular thread from which the access causing the cache miss originated, thereby preventing pollution to partitions belonging to other threads.
TL;DR: In this article, a cache system for looking up one or more elements of an external memory includes a set of cache memory elements coupled to the external memory, a cache cache cache memory cells (CAMs) containing an address and a pointer to a cache memory element, and a matching circuit having an input such that the CAM asserts a match output when the input is the same as the address in the CAM cell.
Abstract: A cache system for looking up one or more elements of an external memory includes a set of cache memory elements coupled to the external memory, a set of content addressable memory cells (CAMs) containing an address and a pointer to one of the cache memory elements, and a matching circuit having an input such that the CAM asserts a match output when the input is the same as the address in the CAM cell. The cache memory element which a particular CAM points to changes over time. In the preferred implementation, the CAMs are connected in an order from top to bottom, and the bottom CAM points to the least recently used cache memory element.
Abstract: A write cache that reduces the number of memory accesses required to write data to main memory. When a memory write request is executed, the request not only updates the relevant location in cache memory, but the request is also directed to updating the corresponding location in main memory. A separate write cache is dedicated to temporarily holding multiple write requests so that they can be organized for more efficient transmission to memory in burst transfers. In one embodiment, all writes within a predefined range of addresses can be written to memory as a group. In another embodiment, entries are held in the write cache until a minimum number of entries are available for writing to memory, and a least-recently-used mechanism can be used to decide which entries to transmit first. In yet another embodiment, partial writes are merged into a single cache line, to be written to memory in a single burst transmission.
TL;DR: In this paper, a cache system and method in accordance with the invention includes a cache near the target devices and another cache at the requesting host side so that the data traffic across the computer network is reduced.
Abstract: A cache system and method in accordance with the invention includes a cache near the target devices and another cache at the requesting host side so that the data traffic across the computer network is reduced. A cache updating and invalidation method are described.
TL;DR: In this article, a technique for cache segregation utilizes logic for storing and communicating thread identification (TID) bits, which can be inserted at the most significant bits of the cache index.
Abstract: A processor includes logic (612) for tagging a thread identifier (TID) for usage with processor blocks that are not stalled. Pertinent non-stalling blocks include caches, translation look-aside buffers (TLB) (1258, 1220), a load buffer asynchronous interface, an external memory management unit (MMU) interface (320, 330), and others. A processor (300) includes a cache that is segregated into a plurality of N cache parts. Cache segregation avoids interference, 'pollution', or 'cross-talk' between threads. One technique for cache segregation utilizes logic for storing and communicating thread identification (TID) bits. The cache utilizes cache indexing logic. For example, the TID bits can be inserted at the most significant bits of the cache index.
TL;DR: In this article, the data access layer determines whether a data item required by an application program is in the cache, and if it is, the access layer obtains the item from the cache; otherwise, it obtains from the data source.
Abstract: A middle-tier Web server (230) with a queryable cache (219) that contains items from one or more data sources (241). Items are included in the cache (223) on the basis of the probability of future hits on the items. When the data source (241) determines that an item that has been included in the cache (223) has changed, it sends an update message to the server (230), which updates the item if it is still included in the cache. In a preferred embodiment, the data source is a database system and triggers in the database system are used to generate update messages. In a preferred embodiment, the data access layer determines whether a data item required by an application program is in the cache. If it is, the data access layer obtains the item from the cache; otherwise, it obtains the item from the data source. The queryable cache includes a miss table that accelerates the determination of whether a data item is in the cache. The miss table is made up of miss table entries that relate the status of a data item to the query used to access the data item. There are three statuses: miss, indicating that the item is not in the cache, hit, indicating that it is, and unknown, indicating that it is not known whether the item is in the cache. When an item is referenced, the query used to access it is presented to the table. If the entry for the query has the status miss, the data access layer obtains the item from the data source instead of attempting to obtain it from the cache. If the entry has the status unknown, the data access layer attempts to obtain it from the cache and the miss table entry for the item is updated in accordance with the result. When a copy of an item is added to the cache, miss table entries with the status miss are set to indicate unknown.
TL;DR: In this paper, a cache server among a plurality of cache servers among a cluster server apparatus cache server, while optimally distributing loads on the plurality of the cache servers, is considered.
Abstract: A cluster server apparatus operable to continuously carrying out data distribution to terminals even if among a plurality of cache servers of the cluster server apparatus cache server, while optimally distributing loads on the plurality of cache servers. A cluster control unit of the cluster server apparatus distributes requests from the terminals based on the load of each of the plurality of cache servers. A cache server among the plurality of cache servers distributes, requested data (streaming data) to a terminal if the requested data is stored in a streaming data storage unit of the cache server, while distributing data from a content server the requested data if it is not stored in the streaming data storage unit. The data distributed from the content server is redundantly stored in the respective streaming data storage units of two or more cache servers. One cache server detects the state of distribution of the other cache server that stores the same data as that stored in the one cache server. If the one cache server becomes unable to carry out distribution, the other cache server continues data distribution instead.
TL;DR: This paper studies the memory system behavior of Java programs by analyzing memory reference traces of several SPECjvm98 applications running with a Just-In-Time (JIT) compiler and finds that the overall cache miss ratio is increased due to garbage collection, which suffers from higher cache misses compared to the application.
Abstract: This paper studies the memory system behavior of Java programs by analyzing memory reference traces of several SPECjvm98 applications running with a Just-In-Time (JIT) compiler. Trace information is collected by an exception-based tracing tool called JTRACE, without any instrumentation to the Java programs or the JIT compiler.First, we find that the overall cache miss ratio is increased due to garbage collection, which suffers from higher cache misses compared to the application. We also note that going beyond 2-way cache associativity improves the cache miss ratio marginally. Second, we observe that Java programs generate a substantial amount of short-lived objects. However, the size of frequently-referenced long-lived objects is more important to the cache performance, because it tends to determine the application's working set size. Finally, we note that the default heap configuration which starts from a small initial heap size is very inefficient since it invokes a garbage collector frequently. Although the direct costs of garbage collection decrease as we increase the available heap size, there exists an optimal heap size which minimizes the total execution time due to the interaction with the virtual memory performance.
TL;DR: In this article, a multiple virtual machine system with a microprocessor for executing instructions and issuing memory reads and writes to a current partition of a plurality of partitions is described, and a next partition cache is selected by the partition management unit when the next partition becomes active on the microprocessor to receive next partition memory read hits and misses.
Abstract: A multiple virtual machine system with a microprocessor for executing instructions and issuing memory reads and writes to a current partition of a plurality of partitions. The multiple virtual machine may contain a partition management unit for receiving the memory reads and writes from the microprocessor for the current partition. A current partition cache is selected by the partition management unit for the current partition to receive the current partition memory reads and writes from the partition management unit. The current partition cache resolves memory read hits and misses. A next partition cache is selected by the partition management unit when the next partition becomes active on the microprocessor to receive next partition memory reads and writes from the microprocessor. An external memory containing data organized in frame blocks and containing cache block addresses for the plurality of partitions provides data to the microprocessor when the current partition cache resolves the memory miss, and provides appropriate frame block data to the next partition cache to restore the next partition cache to a previous state.
TL;DR: This work investigates the memory system performance of several algorithms for transposing an N/spl times/N matrix in-place, where N is large and the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time.
Abstract: We investigate the memory system performance of several algorithms for transposing an N/spl times/N matrix in-place, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time of the algorithms. We use various memory models to capture and analyze the effect of various facets of cache memory architecture that guide the choice of a particular algorithm, and attempt to experimentally validate the predictions of the model. Our major conclusions are as follows: limited associativity in the mapping from main memory addresses to cache sets can significantly degrade running time; the limited number of TLB entries can easily lead to thrashing; the fanciest optimal algorithms are not competitive on real machines even at fairly large problem sizes unless cache miss penalties are quite high: low-level performance tuning "hacks", such as register tiling and array alignment, can significantly distort the effects of improved algorithms; and hierarchical non-linear layouts are inherently superior to the standard canonical layouts (such as row- or column-major) for this problem.
TL;DR: A simple, yet efficient optimization technique is presented that improves memory-hierarchy performance for pointer-centric applications by up to 24% and reduces cache misses byUp to 35% by selecting an improved ordering for the data members of pointer-based data structures.
Abstract: We present and evaluate a simple, yet efficient optimization technique that improves memory-hierarchy performance for pointer-centric applications by up to 24% and reduces cache misses by up to 35%. This is achieved by selecting an improved ordering for the data members of pointer-based data structures. Our optimization is applicable to all type-safe programming languages that completely abstract from physical storage layout; examples of such languages are Java and Oberon. Our technique does not involve programmers in the optimization process, but runs fully automatically, guided by dynamic profiling information that captures which paths through the program are taken with that frequencey. The algorithm first strives to cluster data members that are accessed closely after one another onto the same cache line, increasing spatial locality. Then, the data members that have been mapped to a particular cache line are ordered to minimize load latency in case of a cache miss.
TL;DR: In this article, a cache management processor is coupled with a cache cache management memory by a second link to manipulate the cache management structure in a hash table with linked lists at each hash queue element in accordance with cache management command and search key.
Abstract: A system and method for managing data stored in a cache block in a cache memory includes a cache block is located at a cache block address in the cache memory, and the data in the cache block corresponds to a storage location in a storage array identified by a storage location identifier. A storage processor accesses the cache block in the cache memory and provides a cache management command to a command processor. A processor memory coupled to the storage processor stores a search key based on the storage location identifier corresponding to the cache block. A command processor coupled to the storage processor receives a cache management command specified by the storage processor and transfers the storage location identifier from the processor memory. A cache management memory stores a cache management structure including the cache block address and the search key. A cache management processor is coupled to the cache management memory by a second link to manipulate the cache management structure in a hash table with linked lists at each hash queue element within the cache management memory in accordance with the cache management command and the search key.
TL;DR: In this paper, the authors describe a system for servicing a full cache line in response to a partial cache line request, which includes a storage to store at least one cache line, a hit/miss detector, and a data mover.
Abstract: A system is described for servicing a full cache line in response to a partial cache line request. The system includes a storage to store at least one cache line, a hit/miss detector, and a data mover. The hit/miss detector receives a partial cache line read request from a requesting agent and dispatches a fetch request to a memory device to fetch a full cache line data that contains data requested in the partial cache line read request from the requesting agent. The data mover loads the storage with the full cache line data returned from the memory device and forwards a portion of the full cache line data requested by the requesting agent. If data specified in a subsequent partial cache line request from the requesting agent is contained within the full cache line data specified in the previously dispatched fetch request, the hit/miss detector will send a command to the data mover to forward another portion of the full cache line data stored in the storage to the requesting agent. In one embodiment, the system also includes a write combining logic to combine two or more consecutive write requests that meet defined conditions into a single write request.
TL;DR: An active cache as mentioned in this paper is an OAP system that can not only answer queries that match data stored in the cache, but can also handle queries that require aggregation or other computation of the stored data.
Abstract: An “active cache”, for use by On-Line Analytic Processing (OLAP) systems, that can not only answer queries that match data stored in the cache, but can also answer queries that require aggregation or other computation of the data stored in the cache.
TL;DR: This work proposes a novel clustered VLIW architecture which has all its resources partitioned among clusters, including the cache memory, and proposes a modulo scheduling scheme for this architecture which outperforms previous cluster-oriented schedulers.
Abstract: Clustering is an approach that many microprocessors are adopting in recent times in order to mitigate the increasing penalties of wire delays. We propose a novel clustered VLIW architecture which has all its resources partitioned among clusters, including the cache memory. A modulo scheduling scheme for this architecture is also proposed. This algorithm takes into account both register and memory inter-cluster communications so that the final schedule results in a cluster assignment that favors cluster locality in cache references and register accesses. It has been evaluated for both 2- and 4-cluster configurations and for differing numbers and latencies of inter-cluster buses. The proposed algorithm produces schedules with very low communication requirements and outperforms previous cluster-oriented schedulers.