TL;DR: Deep Compression as mentioned in this paper proposes a three-stage pipeline: pruning, quantization, and Huffman coding to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
Abstract: Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "deep compression", a three stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing, finally, we apply Huffman coding. After the first two steps we retrain the network to fine tune the remaining connections and the quantized centroids. Pruning, reduces the number of connections by 9x to 13x; Quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49x from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU and mobile GPU, compressed network has 3x to 4x layerwise speedup and 3x to 7x better energy efficiency.
TL;DR: The author explains the design and implementation of the Levinson-Durbin Algorithm, which automates the very labor-intensive and therefore time-heavy and expensive process of designing and implementing a Quantizer.
Abstract: 1 Introduction- 11 Signals, Coding, and Compression- 12 Optimality- 13 How to Use this Book- 14 Related Reading- I Basic Tools- 2 Random Processes and Linear Systems- 21 Introduction- 22 Probability- 23 Random Variables and Vectors- 24 Random Processes- 25 Expectation- 26 Linear Systems- 27 Stationary and Ergodic Properties- 28 Useful Processes- 29 Problems- 3 Sampling- 31 Introduction- 32 Periodic Sampling- 33 Noise in Sampling- 34 Practical Sampling Schemes- 35 Sampling Jitter- 36 Multidimensional Sampling- 37 Problems- 4 Linear Prediction- 41 Introduction- 42 Elementary Estimation Theory- 43 Finite-Memory Linear Prediction- 44 Forward and Backward Prediction- 45 The Levinson-Durbin Algorithm- 46 Linear Predictor Design from Empirical Data- 47 Minimum Delay Property- 48 Predictability and Determinism- 49 Infinite Memory Linear Prediction- 410 Simulation of Random Processes- 411 Problems- II Scalar Coding- 5 Scalar Quantization I- 51 Introduction- 52 Structure of a Quantizer- 53 Measuring Quantizer Performance- 54 The Uniform Quantizer- 55 Nonuniform Quantization and Companding- 56 High Resolution: General Case- 57 Problems- 6 Scalar Quantization II- 61 Introduction- 62 Conditions for Optimality- 63 High Resolution Optimal Companding- 64 Quantizer Design Algorithms- 65 Implementation- 66 Problems- 7 Predictive Quantization- 71 Introduction- 72 Difference Quantization- 73 Closed-Loop Predictive Quantization- 74 Delta Modulation- 75 Problems- 8 Bit Allocation and Transform Coding- 81 Introduction- 82 The Problem of Bit Allocation- 83 Optimal Bit Allocation Results- 84 Integer Constrained Allocation Techniques- 85 Transform Coding- 86 Karhunen-Loeve Transform- 87 Performance Gain of Transform Coding- 88 Other Transforms- 89 Sub-band Coding- 810 Problems- 9 Entropy Coding- 91 Introduction- 92 Variable-Length Scalar Noiseless Coding- 93 Prefix Codes- 94 Huffman Coding- 95 Vector Entropy Coding- 96 Arithmetic Coding- 97 Universal and Adaptive Entropy Coding- 98 Ziv-Lempel Coding- 99 Quantization and Entropy Coding- 910 Problems- III Vector Coding- 10 Vector Quantization I- 101 Introduction- 102 Structural Properties and Characterization- 103 Measuring Vector Quantizer Performance- 104 Nearest Neighbor Quantizers- 105 Lattice Vector Quantizers- 106 High Resolution Distortion Approximations- 107 Problems- 11 Vector Quantization II- 111 Introduction- 112 Optimality Conditions for VQ- 113 Vector Quantizer Design- 114 Design Examples- 115 Problems- 12 Constrained Vector Quantization- 121 Introduction- 122 Complexity and Storage Limitations- 123 Structurally Constrained VQ- 124 Tree-Structured VQ- 125 Classified VQ- 126 Transform VQ- 127 Product Code Techniques- 128 Partitioned VQ- 129 Mean-Removed VQ- 1210 Shape-Gain VQ- 1211 Multistage VQ- 1212 Constrained Storage VQ- 1213 Hierarchical and Multiresolution VQ- 1214 Nonlinear Interpolative VQ- 1215 Lattice Codebook VQ- 1216 Fast Nearest Neighbor Encoding- 1217 Problems- 13 Predictive Vector Quantization- 131 Introduction- 132 Predictive Vector Quantization- 133 Vector Linear Prediction- 134 Predictor Design from Empirical Data- 135 Nonlinear Vector Prediction- 136 Design Examples- 137 Problems- 14 Finite-State Vector Quantization- 141 Recursive Vector Quantizers- 142 Finite-State Vector Quantizers- 143 Labeled-States and Labeled-Transitions- 144 Encoder/Decoder Design- 145 Next-State Function Design- 146 Design Examples- 147 Problems- 15 Tree and Trellis Encoding- 151 Delayed Decision Encoder- 152 Tree and Trellis Coding- 153 Decoder Design- 154 Predictive Trellis Encoders- 155 Other Design Techniques- 156 Problems- 16 Adaptive Vector Quantization- 161 Introduction- 162 Mean Adaptation- 163 Gain-Adaptive Vector Quantization- 164 Switched Codebook Adaptation- 165 Adaptive Bit Allocation- 166 Address VQ- 167 Progressive Code Vector Updating- 168 Adaptive Codebook Generation- 169 Vector Excitation Coding- 1610 Problems- 17 Variable Rate Vector Quantization- 171 Variable Rate Coding- 172 Variable Dimension VQ- 173 Alternative Approaches to Variable Rate VQ- 174 Pruned Tree-Structured VQ- 175 The Generalized BFOS Algorithm- 176 Pruned Tree-Structured VQ- 177 Entropy Coded VQ- 178 Greedy Tree Growing- 179 Design Examples- 1710 Bit Allocation Revisited- 1711 Design Algorithms- 1712 Problems
TL;DR: A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized.
Abstract: An optimum method of coding an ensemble of messages consisting of a finite number of members is developed. A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized.
TL;DR: The state of the art in data compression is arithmetic coding, not the better-known Huffman method, which gives greater compression, is faster for adaptive models, and clearly separates the model from the channel encoding.
Abstract: The state of the art in data compression is arithmetic coding, not the better-known Huffman method. Arithmetic coding gives greater compression, is faster for adaptive models, and clearly separates the model from the channel encoding.
TL;DR: This work introduces "deep compression", a three stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
Abstract: Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "deep compression", a three stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing, finally, we apply Huffman coding. After the first two steps we retrain the network to fine tune the remaining connections and the quantized centroids. Pruning, reduces the number of connections by 9x to 13x; Quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49x from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU and mobile GPU, compressed network has 3x to 4x layerwise speedup and 3x to 7x better energy efficiency.