TL;DR: A 128-channel neural recording integrated circuit with on-the-fly spike feature extraction and wireless telemetry is reported; computationally efficient spike detection and feature extraction algorithms contribute to a favorable on-chip DSP implementation.
Abstract: This paper reports a 128-channel neural recording integrated circuit (IC) with on-the-fly spike feature extraction and wireless telemetry. The chip consists of eight 16-channel front-end recording blocks, a spike detection and feature extraction digital signal processor (DSP), an ultra-wideband (UWB) transmitter, and on-chip bias generators. Each recording channel has amplifiers with programmable gain and bandwidth to accommodate different types of biological signals. An analog-to-digital converter (ADC) shared by 16 amplifiers through time-multiplexing results in a balanced trade-off between power consumption and chip area. A nonlinear energy operator (NEO) based spike detector is implemented for identifying spikes, which are further processed by a digital frequency-shaping filter. The computationally efficient spike detection and feature extraction algorithms contribute to a favorable on-chip DSP implementation. UWB telemetry is designed to wirelessly transfer raw data from 128 recording channels at a data rate of 90 Mbit/s. The chip is realized in a 0.35 μm complementary metal-oxide-semiconductor (CMOS) process with an area of 8.8 × 7.2 mm² and consumes 6 mW by employing a sequential turn-on architecture that selectively powers off idle analog circuit blocks. The chip has been tested for electrical specifications and verified in an ex vivo biological environment.
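The NEO-based spike detector named in this abstract can be illustrated with a minimal sketch. It uses the standard discrete NEO definition, psi[n] = x[n]^2 - x[n-1]*x[n+1]. The threshold rule (a scaled mean of the NEO output), the `scale` value, and the toy signal are assumptions made for illustration and are not taken from the paper.

```python
def neo(x):
    """Nonlinear energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The two edge samples have no complete neighborhood and are left at 0.
    """
    psi = [0.0] * len(x)
    for n in range(1, len(x) - 1):
        psi[n] = x[n] * x[n] - x[n - 1] * x[n + 1]
    return psi


def detect_spikes(x, scale=8.0):
    """Return indices where the NEO output exceeds scale * mean(NEO).

    The scaled-mean threshold is an assumed, commonly used rule; the
    paper's actual threshold is not given in the abstract.
    """
    psi = neo(x)
    threshold = scale * sum(psi) / len(psi)
    return [n for n, value in enumerate(psi) if value > threshold]


# Toy trace (hypothetical): low-level alternating baseline with one spike.
signal = [0.1, -0.1] * 10
signal[10] = 5.0
spikes = detect_spikes(signal)  # indices flagged as spikes
```

Because the operator needs only two multiplications and one subtraction per sample, it suits a low-power on-chip DSP, which is consistent with the abstract's emphasis on computational efficiency.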
TL;DR: An overview of the state of the art of recent research activities in noncoherent ultra-wideband receiver front ends is provided with a focus on architectures that perform the initial signal processing tasks in the analog domain, such that the receiver does not need to sample the UWB received signals at Nyquist rate.
Abstract: The need for low-complexity devices with low-power consumption motivates the application of suboptimal noncoherent ultra-wideband (UWB) receivers. This article provides an overview of the state of the art of recent research activities in this field. It introduces energy detection and autocorrelation receiver front ends with a focus on architectures that perform the initial signal processing tasks in the analog domain, such that the receiver does not need to sample the UWB received signals at Nyquist rate. Common signaling and multiple access schemes are reviewed for both front ends. An elaborate section illustrates various performance tradeoffs to highlight preferred system choices. Practical issues are discussed, including, for low-data-rate schemes, the allowed power allocation per pulse according to the regulator's ruling and the estimated power consumption of a receiver chip. A large part is devoted to signal processing steps needed in a digital receiver. It starts with synchronization and time-of-arrival estimation schemes, introduces studies about the narrowband interference problem, and describes solutions for high-data-rate and multiple access communications. Drastic advantages concerning complexity and robustness justify the application of noncoherent UWB systems, particularly for low-data-rate systems.
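As a rough illustration of the energy-detection idea surveyed here, the sketch below demodulates binary pulse-position modulation by comparing the energy in the two halves of each frame. The choice of binary PPM, the frame layout, and the pulse shape are assumptions for illustration. In the receivers the article describes, the squaring and integration happen in the analog domain, so the sums below only stand in for an analog integrate-and-dump stage.

```python
def energy(samples):
    """Energy of a window: the digital stand-in for square-and-integrate."""
    return sum(s * s for s in samples)


def demod_ppm(frame):
    """Binary PPM decision: 0 if the first half-frame holds more energy, else 1."""
    half = len(frame) // 2
    return 0 if energy(frame[:half]) >= energy(frame[half:]) else 1


# Hypothetical noiseless frames: a short pulse in either the first or second half.
pulse = [0.0, 1.0, -1.0, 0.0]
silence = [0.0] * 4
frames = [pulse + silence, silence + pulse, silence + pulse]
bits = [demod_ppm(f) for f in frames]
```

Only two energy values per frame reach the decision stage, which is why such a receiver does not need to sample the received UWB signal at Nyquist rate.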
TL;DR: In this article, a planar chalcogenide waveguide is demonstrated to be able to perform radio-frequency spectral measurements with a terahertz bandwidth and high bit-rate tests show that the chip-based system is potentially useful for ultrafast signal processing.
Abstract: A system based on a highly nonlinear planar chalcogenide waveguide is demonstrated to be able to perform radio-frequency spectral measurements with a terahertz bandwidth. High bit-rate tests show that the chip-based system is potentially useful for ultrafast signal processing.
TL;DR: A chip-level power control algorithm that is systematically designed based on optimal control theory that can precisely control the power of a CMP chip to the desired set point while maintaining the temperature of each core below a specified threshold is proposed.
Abstract: As chip multiprocessors (CMP) become the main trend in processor development, various power and thermal management strategies have recently been proposed to optimize system performance while controlling the power or temperature of a CMP chip to stay below a constraint. The availability of per-core DVFS (dynamic voltage and frequency scaling) also makes it possible to develop advanced management strategies. However, most existing solutions rely on open-loop search or optimization with the assumption that power can be estimated accurately, while others adopt oversimplified feedback control strategies to control power and temperature separately, without any theoretical guarantees. In this paper, we propose a chip-level power control algorithm that is systematically designed based on optimal control theory. Our algorithm can precisely control the power of a CMP chip to the desired set point while maintaining the temperature of each core below a specified threshold. Furthermore, an online model estimator is designed to achieve analytical assurance of control accuracy and system stability, even in the face of significant workload variations or unpredictable chip or core variations. Empirical results on a physical testbed show that our controller outperforms two state-of-the-art control algorithms by having better SPEC benchmark performance and more precise power control. In addition, extensive simulation results demonstrate the efficacy of our algorithm for various CMP configurations.
TL;DR: The design and implementation of 64-bit and 128-bit plastic transponder chips for radio-frequency identification tags and the reading distance that can be achieved with the authors' plastic rectifiers are discussed, and it is shown that this reading distance is not limited by the performance of the plastic rectifier or transponder chip.
TL;DR: In this paper, an approach is presented for high-density packaging of semiconductor chips using silicon space transformer chip-level package structures, which allow high-density chip interconnection and/or integration of multiple chips or chip stacks, high I/O interconnection, and heterogeneous chip or function integration.
Abstract: Apparatus and methods are provided for high-density packaging of semiconductor chips using silicon space transformer chip-level package structures, which allow high-density chip interconnection and/or integration of multiple chips or chip stacks, high I/O interconnection, and heterogeneous chip or function integration.
TL;DR: In this article, a semiconductor package structure consisting of a substrate unit and a first chip stack structure is described, where the substrate unit comprises a circuit structure having test pads and each of the chips has a plurality of through silicon plugs connecting to the test pads.
Abstract: Semiconductor package structures and methods for manufacturing the same are provided. The semiconductor package structure comprises a substrate unit and a first chip stack structure. The substrate unit comprises a circuit structure having test pads. The first chip stack structure comprises chips, and each of the chips has a plurality of through silicon plugs. The through silicon plugs of two adjacent chips are electrically connected and further electrically connected to the test pads of the substrate unit for electrical testing. Another semiconductor package structure provided by the present invention comprises a first semiconductor chip and a second semiconductor chip. Each of the semiconductor chips has test pads for electrical testing and a plurality of through silicon plugs connecting to the test pads. The second semiconductor chip is mounted on the first semiconductor chip, and a portion of the through silicon plugs of two semiconductor chips are electrically connected with each other.
TL;DR: A case is made for a new approach to designing on-chip interconnection networks that addresses the challenges of energy, space, and design complexity.
Abstract: Buffers in on-chip networks consume significant energy, occupy chip area, and increase design complexity. In this paper, we make a case for a new approach to designing on-chip interconnection netwo...
TL;DR: This technology may provide a solution to the need for a high sensitivity, rapid, and automated ChIP assay, and in doing so facilitate the use of ChIP for many interesting and valuable applications.
Abstract: Chromatin immunoprecipitation (ChIP) is a powerful assay used to probe DNA–protein interactions. Traditional methods of implementing this assay are lengthy, cumbersome and require a large number of cells, making it difficult to study rare cell types such as certain cancer and stem cells. We have designed a microfluidic device to perform sensitive ChIP analysis on low cell numbers in a rapid, automated fashion while preserving the specificity of the assay. Comparing ChIP results for two modified histone protein targets, we showed our automated microfluidic ChIP (AutoChIP) from 2,000 cells to be comparable to that of conventional ChIP methods using 50,000–500,000 cells. This technology may provide a solution to the need for a high sensitivity, rapid, and automated ChIP assay, and in doing so facilitate the use of ChIP for many interesting and valuable applications.
TL;DR: In this paper, a glass substrate with flip-chip-bonded photodiode chips was vertically stacked on a Si substrate with the bonded laser diode chip using Au-Au SAB.
Abstract: Hybrid integration of multiple optical chips in three dimensions is an important technology for realizing highly functional, compact optoelectronic microsystems. In this paper, we report successful 3-D integration of optical chips performed using Au-Au surface-activated bonding (SAB) at a relatively low bonding temperature of 150 °C. A glass substrate with the flip-chip-bonded photodiode chips was vertically stacked on a Si substrate with the bonded laser diode chip using Au-Au SAB. By applying this technique, compact and thin optical microsensors (2.8 mm × 2.8 mm × 1 mm thick) were fabricated. The feasibility of measuring velocity was demonstrated using prototype microsensors.
TL;DR: In this article, a very-large-scale-integrated chip that emulates nonlinear active cochlear signal processing is presented. The chip's responses resemble those of a living cochlea: frequency responses become larger and more sharply tuned when active coupling is turned on.
Abstract: We present a mixed-signal very-large-scale-integrated chip that emulates nonlinear active cochlear signal processing. Modeling the cochlea's micromechanics, including outer hair cell (OHC) electromotility, this silicon (Si) cochlea features active coupling between neighboring basilar membrane (BM) segments, a first. Neighboring BM segments, each implemented as a class AB log-domain second-order section, exchange currents representing OHC forces. This novel active-coupling architecture overcomes the major shortcomings of existing cascade and parallel filter-bank architectures, while achieving the highest number of digital outputs in an Si cochlea to date. An active-coupling architecture Si cochlea with 360 frequency channels and 2160 pulse-stream outputs occupies 10.9 mm² in a five-metal 1-poly 0.25-μm CMOS process. The chip's responses resemble those of a living cochlea: frequency responses become larger and more sharply tuned when active coupling is turned on. For instance, gain increases by 18 dB and Q10 increases from 0.45 to 1.14. This enhancement decreases with increasing input intensity, realizing frequency-selective automatic gain control. Further work is required to improve performance by reducing large variations from tap to tap.
TL;DR: In this paper, a logic block area is defined to include a first chip level in which layout features are placed according to a first virtual grate, and a second chip level in which layout features are placed according to a second virtual grate.
Abstract: A semiconductor chip is defined to include a logic block area having a first chip level in which layout features are placed according to a first virtual grate, and a second chip level in which layout features are placed according to a second virtual grate. A rational spatial relationship exists between the first and second virtual grates. A number of cells are placed within the logic block area. Each of the number of cells is defined according to an appropriate one of a number of cell phases. The appropriate one of the number of cell phases causes layout features in the first and second chip levels of a given placed cell to be aligned with the first and second virtual grates as positioned within the given placed cell.
TL;DR: In this article, a new ultrathin chip fabrication and assembly process, consisting of a preprocess module Chipfilm and a postprocess module Pick, Crack, and Place, is presented.
Abstract: A new ultrathin chip fabrication and assembly process, consisting of a preprocess module Chipfilm and a postprocess module Pick, Crack, and Place, is presented. In contrast to the established wafer thinning technique, the preprocessed wafer substrates are prepared with extremely narrow buried cavities beneath the chip areas at a well-defined distance from the wafer surface, thus precisely defining the chip thickness a priori. After CMOS integration on those dedicated wafer substrates, chips are detached from the wafer surface by etching trenches at the chip edges into the buried cavities and breaking of residual anchors by mechanical force in the postprocess. The feasibility of the new process is demonstrated through a mixed-signal circuit having 38 000 digital and 2700 analog transistors, showing full functionality within specifications for 20-μm-thin chips even under a bending stress of up to 110 MPa.
TL;DR: A new growth recipe for producing carbon nanotubes (CNTs) combined with a new bonding technique was implemented in a microfabricated gas chromatography (micro-GC) chip, resulting in better separation efficiency and peak symmetry in the separation of a mixture of five n-alkanes.
TL;DR: A chip package including at least a shielding layer for better electromagnetic interference shielding is provided in this article, where the shielding layer disposed over the top surface of the laminate substrate can protect the chip package from the underneath EMI radiation.
Abstract: A chip package including at least a shielding layer for better electromagnetic interference shielding is provided. The shielding layer disposed over the top surface of the laminate substrate can protect the chip package from the underneath EMI radiation. The chip package may further include another shielding layer over the molding compound of the chip package.
TL;DR: This thesis has created an automatic stencil code tuner, or auto-tuner, that incorporates several optimizations into a single software framework, thereby allowing for much greater productivity than hand-tuning.
Abstract: As clock frequencies have tapered off and the number of cores on a chip has taken off, the challenge of effectively utilizing these multicore systems has become increasingly important. However, the diversity of multicore machines in today's market compels us to individually tune for each platform. This is especially true for problems with low computational intensity, since the improvements in memory latency and bandwidth are much slower than those of computational rates.
One such kernel is a stencil, a regular nearest neighbor operation over the points in a structured grid. Stencils often arise from solving partial differential equations, which are found in almost every scientific discipline. In this thesis, we analyze three common three-dimensional stencils: the 7-point stencil, the 27-point stencil, and the Gauss-Seidel Red-Black Helmholtz kernel.
We examine the performance of these stencil codes over a spectrum of multicore architectures, including the Intel Clovertown, Intel Nehalem, AMD Barcelona, the highly-multithreaded Sun Victoria Falls, and the low power IBM Blue Gene/P. These platforms not only have significant variations in their core architectures, but also exhibit a 32× range in available hardware threads, a 4.5× range in attained DRAM bandwidth, and a 6.3× range in peak flop rates. Clearly, designing optimal code for such a diverse set of platforms represents a serious challenge.
Unfortunately, compilers alone do not achieve satisfactory stencil code performance on this varied set of platforms. Instead, we have created an automatic stencil code tuner, or auto-tuner, that incorporates several optimizations into a single software framework. These optimizations hide memory latency, account for non-uniform memory access times, reduce the volume of data transferred, and take advantage of special instructions. The auto-tuner then searches over the space of optimizations, thereby allowing for much greater productivity than hand-tuning. The fully auto-tuned code runs up to 5.4× faster than a straightforward implementation and is more scalable across cores.
By using performance models to identify performance limits, we determined that our auto-tuner can achieve over 95% of the attainable performance for all three stencils in our study. This demonstrates that auto-tuning is an important technique for fully exploiting available multicore resources.
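For readers unfamiliar with the kernel, a naive 7-point stencil sweep can be sketched as follows. This is the kind of straightforward baseline an auto-tuner would start from, not the thesis's tuned code. The weights `alpha` and `beta`, the nested-list grid layout, and the fixed-boundary treatment are assumptions made for illustration.

```python
def stencil7(grid, alpha, beta):
    """One sweep of a 7-point stencil over the interior of a 3-D grid.

    Each interior point becomes alpha * center + beta * (sum of its six
    face neighbors). Boundary points are copied through unchanged.
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[grid[i][j][k] for k in range(nz)]
            for j in range(ny)]
           for i in range(nx)]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                neighbors = (grid[i - 1][j][k] + grid[i + 1][j][k] +
                             grid[i][j - 1][k] + grid[i][j + 1][k] +
                             grid[i][j][k - 1] + grid[i][j][k + 1])
                out[i][j][k] = alpha * grid[i][j][k] + beta * neighbors
    return out


# Hypothetical input: a constant 3x3x3 grid. With alpha = -6 and beta = 1 the
# stencil is a discrete Laplacian, which vanishes on a constant field.
ones = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
result = stencil7(ones, alpha=-6.0, beta=1.0)
```

Each output point reads seven values and performs only a handful of floating-point operations, which illustrates the low computational intensity the abstract identifies as the reason memory behavior dominates performance.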
TL;DR: The principle of joint FDE/antenna diversity combining is introduced and various SC multiple access techniques with one-tap FDE are reviewed, i.e., SC-FDMA, SC-CDMA, block spread CDMA, and so on.
Abstract: Single-carrier (SC) multiple access is a promising uplink multiple access technique because of its low peak-to-average power ratio (PAPR) property and high frequency diversity gain that is achievable through simple one-tap frequency-domain equalization (FDE) in a strong frequency-selective channel. The multiple access capability can be obtained by combining either frequency division multiple access (FDMA) or code division multiple access (CDMA) with SC transmission. In this article, we review the recent research on the SC multiple access techniques with one-tap FDE. After introducing the principle of joint FDE/antenna diversity combining, we review various SC multiple access techniques with one-tap FDE, i.e., SC-FDMA, SC-CDMA, block spread CDMA, and
TL;DR: A single-chip ultra-wideband (UWB) receiver was developed using 0.18 μm CMOS technology, inter-chip wireless data communication by integrated antennas was confirmed, and a dual-Dirac model could explain the probability distribution of the cycle of the double GMP template.
Abstract: A single-chip ultra-wideband (UWB) receiver was developed using 0.18 μm CMOS technology, and inter-chip wireless data communication by integrated antennas was confirmed. Timing pulses and data pulses with on-off keying were alternately sent from a transmitting antenna. Double Gaussian monocycle pulse (GMP) template generators performed detection of timing and data pulses. A single GMP template, whose pulse repetition cycle follows a Gaussian probability distribution, showed a random jitter of 4.87 ps. A dual-Dirac model could explain the probability distribution of the cycle of the double GMP template. The random jitter and deterministic jitter obtained were 4.6 ps and 14.4 ps, respectively. The receiver successfully recovered 200 Mbps data at a distance of 0.5 mm.
TL;DR: In this paper, a modulation device includes a spread code generation unit which generates a spread code having a predetermined cycle; an audio signal input unit to which an audio signal is input; a first modulation unit which phase-modulates the spread code in each cycle on the basis of a data code; and a combining unit which combines the audio signal with a modulation signal which has been generated on the basis of the phase-modulated spread code and distributed in a frequency range higher than a predetermined frequency.
Abstract: A modulation device includes: a spread code generation unit which generates a spread code having a predetermined cycle; an audio signal input unit to which an audio signal is input; a first modulation unit which phase-modulates the spread code in each cycle on the basis of a data code; and a combining unit which combines the audio signal with a modulation signal which has been generated on the basis of the phase-modulated spread code and distributed in a frequency range higher than a predetermined frequency to output a combined signal.
TL;DR: In this paper, a chip package including a shielding layer conformally covering the underlying molding compound is provided; the shielding layer can smoothly cover the molding compound, including its rounded or blunted top edges.
Abstract: A chip package including a shielding layer conformally covering the underlying molding compound is provided. The shielding layer can smoothly cover the molding compound, including the rounded or blunted top edges of the molding compound, which provides better electromagnetic interference shielding and better shielding performance.
TL;DR: A flip-chip package as discussed by the authors includes a package carrier, a semiconductor die having a die face and a die edge, a rewiring laminate structure between the semiconductor die and the package carrier, and bumps arranged on the rewiring laminate structure for electrically connecting the semiconductor die with the package carrier.
Abstract: A flip-chip package includes a package carrier; a semiconductor die having a die face and a die edge, the semiconductor die being assembled face-down to a chip side of the package carrier, with contact pads situated on the die face; a rewiring laminate structure between the semiconductor die and the package carrier, the rewiring laminate structure including a re-routed metal layer, at least a portion of which projects beyond the die edge; and bumps arranged on the rewiring laminate structure for electrically connecting the semiconductor die with the package carrier.
TL;DR: In this paper, the authors present a method, apparatus, and computer instructions for managing a set of signal paths for a chip when a defective signal path within the set of signal paths for the chip is detected.
Abstract: A method, apparatus, and computer instructions for managing a set of signal paths for a chip. A defective signal path within the set of signal paths for the chip is detected. In response to detecting the defective signal path, signals are re-routed through the set of signal paths such that the defective signal path is removed from the set, and signals are sent using the remaining data signal paths in the set together with an extra signal path.
TL;DR: In this paper, a microfluidic buffer exchange of particle and cell suspensions in a continuous flow format on a chip is presented, where ultrasonic standing wave technology is utilized to confine particulate matter to the center of a buffer exchange channel while particle-free buffer is sequentially aspirated via capillaries that branch off from the buffer exchange channel.
Abstract: A microfluidic strategy to perform buffer exchange of particle and cell suspensions in a continuous flow format on-chip is presented. Ultrasonic standing wave technology is utilized to confine particulate matter to the centre of a buffer exchange channel while particle-free buffer is sequentially aspirated via capillaries that branch off from the buffer exchange channel. At each such branch, clean buffer is supplied at an equal flow rate from a capillary at the opposing channel wall, generating a sideways translation of the original buffer, laminated with a wash buffer stream. Each such junction increases the buffer exchange ratio accordingly. The reported buffer exchange system provides means to adjust buffer exchange conditions on-line by tuning the ratio of the cross-flow wash buffer relative to the sample suspension flow rate. The system performance was evaluated using 5 μm polystyrene microbeads and a dye as the model contaminant. Wash efficiencies up to 96.4% were accomplished with a 0.2% solid content bead suspension, using eight cross-flow junctions, effectively exchanging the carrier buffer twice. The corresponding figure for erythrocyte washing was recorded to be 98.3% at a haematocrit of 2%.
TL;DR: An all-digital outphasing modulator to be used in a software-defined radio (SDR) transmitter is presented and current consumption of the phase-to-digital converter is 70 mA, which can be reduced by gating off the circuit during idle times.
Abstract: An all-digital outphasing modulator to be used in a software-defined radio (SDR) transmitter is presented. A digitally-controlled oscillator (DCO) followed by two digital phase rotators (DPR) are the main building blocks. The DCO and DPRs are enclosed, respectively, by a phase-locked loop and two delay-locked loops. All functions, including filtering and calibration, are performed in a digital signal processor (DSP). The chip was fabricated in a 90 nm CMOS process and tested for GSM and WCDMA. The total active area of the chip is 3 mm². Semi-analog blocks (DCO core, DPR core and their buffers) and digital processor consume 30 mA and 25 mA, respectively. Current consumption of the phase-to-digital converter is 70 mA, which can be reduced to 30% by gating off the circuit during idle times.
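The outphasing principle behind such a modulator can be sketched with the textbook decomposition: a signal with amplitude A and phase phi is split into two constant-envelope components with phases phi + theta and phi - theta, where theta = arccos(A / A_max). The code below illustrates that identity only. The function name, the `a_max` parameter, and the sample value are hypothetical, and the abstract does not say how the chip's DSP computes the phases.

```python
import cmath
import math


def outphase(s, a_max=1.0):
    """Split complex sample s into two constant-envelope components.

    Both outputs have magnitude a_max / 2 and their sum reconstructs s,
    provided |s| <= a_max.
    """
    amplitude, phase = abs(s), cmath.phase(s)
    # Clamp to 1.0 to guard acos against rounding just above the limit.
    theta = math.acos(min(amplitude / a_max, 1.0))
    half = a_max / 2.0
    s1 = cmath.rect(half, phase + theta)
    s2 = cmath.rect(half, phase - theta)
    return s1, s2


# Hypothetical sample: amplitude 0.6, phase 0.8 rad.
sample = cmath.rect(0.6, 0.8)
s1, s2 = outphase(sample)
reconstructed = s1 + s2
```

Because each branch carries phase information only, the amplitude modulation is recovered when the two branches are summed, which is what allows the two phase rotators in the abstract to serve as the modulating elements.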
TL;DR: In this paper, a new semiconductor chip area-based approach is proposed to compare and assess different motor drive converter topologies, based on the drive's operating point and the optimal partitioning of the transistor and diode chip areas.
Abstract: In order to find the optimal converter topology for a given ac motor drive, as defined by its mission profile, suitable assessment criteria have to be applied. A new semiconductor chip area based approach is proposed to compare and assess different motor drive converter topologies. It determines the total semiconductor chip area based on the drive's operating point and the optimal partitioning of the transistor and diode chip areas. This approach not only provides a distinct figure-of-merit for comparison but also enables the semiconductor costs of different converter topologies to be determined. The chip area based comparison has been successfully used to assess three 3-phase ac-dc-ac converter topologies for a 15 kW (20 HP) motor drive. It is shown that the Voltage DC-Link Back-to-Back Converter based drive provides the best overall performance in terms of chip area, cost, efficiency, and available nominal torque.
TL;DR: In this paper, the authors proposed an analytically-based approach for the design of a miniaturized single-band and dual-band two-way Wilkinson power divider.
Abstract: This paper proposes an analytically-based approach for the design of a miniaturized single-band and dual-band two-way Wilkinson power divider. This miniaturization is achieved by realizing the power divider's impedance transformers using slow wave structures. These slow wave structures are designed by periodically loading transmission lines with capacitances, which reduces the phase velocity of the propagating waves and hence yields higher electrical lengths using smaller physical lengths. The dispersive analysis of the slow wave structure used is included in the design approach to ensure a smooth nondispersive transmission line operation in the case of dual-band applications. The design methodology is validated with the design of a single-band, reduced size, two-way Wilkinson power divider at 850 and 620 MHz. An approximate length reduction of 25%-35% is achieved with this technique. For dual-band applications, this paper describes the design of a reduced size, two-way Wilkinson power divider for dual-band global system for mobile communications and code division multiple access applications at 850 and 1960 MHz, respectively. An overall reduction factor of 28%, in terms of chip area occupied by the circuit, is achieved. The electromagnetic simulation and experimental results validate the design approach. The circuit is realized with microstrip technology, which can be easily fabricated using conventional printed circuit processes.
TL;DR: This work demonstrates a chip for detection, alignment, and feature extraction simultaneously for 64 channels on ASIC using a Matlab/Simulink-based architecture design framework.
Abstract: Spike sorting is an important processing step in various neuroscientific and clinical studies. An on-chip spike-sorting DSP must provide data-rate reduction while maintaining a power density much less than 800 μW/mm². Most existing designs either provide only spike detection for multi-channel processing, or they provide detection and feature extraction only for a single channel. We demonstrate a chip for detection, alignment, and feature extraction simultaneously for 64 channels. Spike-sorting algorithms identified from a complexity-performance analysis are implemented on ASIC using a Matlab/Simulink-based architecture design framework. The chip has a modular architecture, which allows it to be configured to process 16, 32, 48, or 64 channels. Inactive cores are power-gated to reduce power consumption when the chip operates for less than 64 channels. The chip is implemented in a 90-nm CMOS process and has a power dissipation of 130 μW (power density of 30 μW/mm²) when processing all 64 channels. A data-rate reduction of 91.25% (11.71 Mbps to 1.02 Mbps) is achieved.
TL;DR: A semiconductor chip package includes a substrate comprising a substrate body having a first main surface, a second main surface and a cavity that defines an opening in the first main surface, and a layer of electrically conductive material integral with the substrate body as discussed by the authors.
Abstract: A semiconductor chip package includes a substrate having a cavity, a stacked package comprising the semiconductor chip package, and methods of fabricating the chip and the stacked packages. According to an example embodiment, the semiconductor chip package includes a substrate comprising a substrate body having a first main surface, a second main surface, and a cavity that defines an opening in the first main surface, and a layer of electrically conductive material integral with the substrate body. The layer of electrically conductive material constitutes an interconnection pattern of the substrate. The semiconductor chip package further includes a semiconductor chip disposed within the cavity and mounted to the substrate. The chip includes electrical contacts in the form of pads, and the pads face in a direction towards the bottom of the cavity such that the chip has a flip-chip orientation with respect to the substrate. The pads are electrically conductively bonded to respective portions of the interconnection pattern.
TL;DR: The MIPA4k, a 64×64 cell mixed-mode image processor array chip, includes an image sensor, A/D/A conversion, embedded digital and analog memories and hardware-optimized grayscale and binary processing cores.
Abstract: This paper presents the MIPA4k, a 64×64 cell mixed-mode image processor array chip. The processor cell includes an image sensor, A/D/A conversion, embedded digital and analog memories and hardware-optimized grayscale and binary processing cores. This paper presents the architecture of the processor cell and the different functional hardware. The processor has been manufactured in a 0.13 μm CMOS technology and the chip size is 5.1×4.5 mm² with a cell area of 72×61 μm².
TL;DR: The crosspoint-queued (CQ) switch is introduced, a new buffered-crossbar switch architecture with large crosspoint buffers and no input queues, and it is shown how it can be readily implemented in a single SRAM-based chip using current technology.
Abstract: This paper calls for rethinking packet-switch architectures by cutting all dependencies between the switch fabric and the linecards. Most single-stage packet-switch architectures rely on an instantaneous communication between the switch fabric and the linecards. Today, however, this assumption is breaking down, because effective propagation times are too high and keep increasing with the line rates. In this paper, we argue for a self-sufficient switch fabric by moving all the buffering from the linecards to the switch fabric. We introduce the crosspoint-queued (CQ) switch, a new buffered-crossbar switch architecture with large crosspoint buffers and no input queues, and show how it can be readily implemented in a single SRAM-based chip using current technology. For a crosspoint buffer size of one, we provide a closed-form throughput formula for all work-conserving schedules under uniform Bernoulli i.i.d. arrivals. Furthermore, we study the performance of the switch for larger buffer sizes and show that it nearly behaves as an ideal output-queued switch. Finally, we confirm our results using synthetic as well as trace-based simulations.