BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models

Question

1. What are the key architectural properties that binary vanilla ViTs miss out on compared to CNNs?

2. What is the threshold vector in binary fully-connected layer?

3. What are the key components and operations in a binary vision transformer?

4. What is the representational capability gap between binary ResNet-34 and binary ViT?

Accepted Answer

Binary vanilla ViTs lack several architectural properties that give CNNs a higher representational capability:

1. Average pooling layer: Binary vanilla ViTs use cls-token pooling, while CNNs use an average pooling layer that accounts for information from all tokens/patches.

2. Multiple average pooling branches: Binary vanilla ViTs have nothing to compensate for the representational capability lost by using binary fully-connected layers where CNNs use binary convolutions. BinaryViT adds a block with multiple average pooling branches for this purpose.

3. Affine transformation: Binary vanilla ViTs do not have an affine transformation right before the addition of each main residual connection, so nothing keeps the scale of one branch from overwhelming the other.

4. Pyramid structure: Binary vanilla ViTs do not have a pyramid structure, which allows binary features to be processed at higher resolution in the early stages without increasing computational complexity.

These properties, inspired by CNNs, are introduced in BinaryViT to enhance the representational capability of binary pure ViTs without significantly increasing the number of operations and parameters.

Accepted Answer

The threshold vector, $b \in \mathbb{R}^{D_{in}}$, is applied to the real-valued inputs right before the sign function, which allows these inputs to have some distributional shift. In effect, it moves the point at which each input activation changes sign. The threshold vector can be determined by computing the mean value of all elements inside the matrix, as mentioned in references [30], [31], and [32].
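The thresholding step can be illustrated with a minimal NumPy sketch. The function name `binarize_with_threshold`, the convention of subtracting the threshold, and the mapping of zero to +1 are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def binarize_with_threshold(x, b):
    """Shift each input feature by its threshold, then map to {-1, +1}.

    x: real-valued inputs, shape (num_tokens, d_in)
    b: threshold vector, shape (d_in,), broadcast over tokens
    """
    shifted = x - b  # assumed convention: subtract the threshold
    # Assumed convention: values exactly at the threshold map to +1.
    return np.where(shifted >= 0, 1.0, -1.0)

x = np.array([[0.2, -0.5, 1.5],
              [-0.1, 0.4, 0.9]])
b = np.array([0.0, -1.0, 1.0])
out = binarize_with_threshold(x, b)
```

With a per-feature threshold, the second column binarizes to +1 even for a negative input (-0.5), because its threshold is -1.0.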

Accepted Answer

A binary vision transformer (ViT) consists of N transformer encoder blocks, each containing a multi-head attention (MHA) module and a feed-forward network (FFN) module. The image is first split into fixed-size patches, which are linearly projected and combined with a cls-token embedding. The output of the embedding layer becomes the input to the first transformer block. In each transformer block, the input undergoes pre-batch normalization, and the MHA module calculates the query, key, and value for each head from the normalized inputs. The attention score is computed using the softmax function, and the binary attention probability matrix is obtained by rounding the softmax output. The outputs from all heads are concatenated and passed through a fully-connected layer. A residual connection is applied, followed by a second pre-batch normalization and the FFN. The block output is obtained by adding the FFN output to the output of the first residual connection. The parameters in the softmax, normalization layers, and classifier are kept in full precision. Knowledge distillation is used to improve performance by minimizing the soft cross-entropy loss between the student's logits and the teacher's logits. The attention scores and FFN outputs are not distilled, and the partially-random-initialization method is used for the first patch embedding layer to enhance performance.
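The distillation objective mentioned above can be sketched as follows. This is a generic soft cross-entropy between teacher and student logits written in NumPy; the function names are hypothetical, and details such as temperature or loss weighting are omitted because the text does not specify them.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def soft_cross_entropy(student_logits, teacher_logits):
    """Mean over the batch of -sum_c p_teacher(c) * log p_student(c)."""
    teacher_probs = np.exp(log_softmax(teacher_logits))
    per_example = -(teacher_probs * log_softmax(student_logits)).sum(axis=-1)
    return float(per_example.mean())

teacher = np.array([[2.0, 0.0, -1.0]])
student_close = np.array([[2.0, 0.0, -1.0]])
student_far = np.array([[-1.0, 0.0, 2.0]])
loss_close = soft_cross_entropy(student_close, teacher)
loss_far = soft_cross_entropy(student_far, teacher)
```

The loss is smallest when the student's distribution matches the teacher's, so `loss_close` is lower than `loss_far`.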

Accepted Answer

The representational capability gap between binary ResNet-34 and binary ViT is significant. For a binary DeiT-S, the element-wise representational capability is calculated to be R(DeiT-S) = 153,216, whereas for a binary ResNet-34 it is R(ResNet-34) = 71,193,472, roughly 465 times larger. This gap of more than two orders of magnitude contributes to the performance gap between binary ResNet-34 and binary ViT. The representational capability of a model is determined by the number of possible absolute values that each element in a matrix/tensor can have. The gap can be attributed to the different architectural designs and element-wise operations used in binary CNNs and binary ViTs. To narrow it, design modifications taken from CNNs have been proposed for binary ViTs, such as global average pooling, multiple average pooling branches, and a pyramid structure.

Accepted Answer

Global average pooling in binary CNNs increases the element-wise representational capability by up to 196x, because it considers information from all tokens rather than a single one. This gives the final classifier layer more flexibility in adjusting its output during training. In binary ViT models, replacing cls-token pooling with average pooling increased performance from 48.5% to 56.4% top-1 on ImageNet-1k, while the impact of average pooling on the number of operations (OPs) is negligible. The approach is more useful for binary networks than for full-precision networks, because each output token before the classifier layer has limited representational capability.
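The difference between the two pooling choices can be shown with a toy NumPy sketch. It is only an illustration, not the paper's capability calculation: the tokens are restricted to ±1, the sequence has 196 patch tokens plus one cls token, and the average excludes the cls token. All three choices are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 196, 8  # assumed toy sizes

# Row 0 plays the role of the cls token; rows 1..196 are patch tokens.
tokens = rng.choice([-1.0, 1.0], size=(1 + num_patches, dim))

# cls-token pooling: the classifier sees a single token.
cls_pooled = tokens[0]

# Global average pooling: the classifier sees the mean over all patch tokens.
avg_pooled = tokens[1:].mean(axis=0)

# Each pooled element is a sum of 196 values divided by 196, so it can take
# many more distinct levels than a single ±1 token element.
levels_cls = np.unique(np.abs(cls_pooled))
levels_avg = np.unique(np.abs(avg_pooled))
```

In this toy setting every element of `cls_pooled` has magnitude 1, while the elements of `avg_pooled` fall on a much finer grid of multiples of 1/196.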

Accepted Answer

A binary convolution has more representational capability than a binary fully connected layer with the same number of parameters. In a binary fully connected layer with a 384x384 weight matrix, the output tensor has an element-wise representational capability of 384. In contrast, a binary convolutional layer with a 3x3x128x128 weight filter has an element-wise representational capability of 1152, which is 3 times larger. This increased capability makes the binary convolutional layer more flexible in adjusting its output. The additional branches in the binary ViT, consisting of average pooling layers with different kernel sizes, further enhance the representational capability without significantly increasing the number of parameters or operations. This multi-branch structure improved the binary ViT's top-1 accuracy from 56.4% to 60.2%.
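The numbers above can be checked with a few lines of arithmetic. The sketch assumes, consistent with the figures quoted in the answer, that the element-wise capability of a layer's output equals the number of binary products summed per output element (its fan-in). The helper names are hypothetical.

```python
def fc_capability(d_in):
    """Fan-in of one output element of a fully connected layer."""
    return d_in

def conv_capability(kernel_h, kernel_w, c_in):
    """Fan-in of one output element of a convolutional layer."""
    return kernel_h * kernel_w * c_in

# Both layers have the same number of weights.
fc_params = 384 * 384
conv_params = 3 * 3 * 128 * 128

fc_cap = fc_capability(384)
conv_cap = conv_capability(3, 3, 128)
ratio = conv_cap / fc_cap
```

Both layers hold 147,456 weights, yet the convolution sums 1152 products per output element against 384 for the fully connected layer.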

Accepted Answer

In networks commonly used to study binarization, such as ResNet and MobileNet, batch normalization is placed right before the branch output is added to the residual connection. This placement helps normalize the signal propagating through the residual path, or prevents the residual branch from overwhelming the main branch's signal. Three configurations were tested: res-post-norm, sandwich configuration, and pre-norm with LayerScale. The res-post-norm configuration achieved 61.4% top-1 accuracy, while the sandwich configuration and the pre-norm with LayerScale configuration both achieved 61.8%. Having any affine transformation right before the residual connection is better than having none. In BinaryViT, each main residual connection in the transformer is preceded by an affine transformation, improving performance from 60.2% to 61.8%.
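The three configurations can be sketched as follows, using their commonly cited forms. This is a minimal NumPy illustration, not the paper's implementation: `normalize` is a per-token normalization with a learnable affine that stands in for whichever normalization layer the model uses, and `branch` stands in for the MHA or FFN module.

```python
import numpy as np

def normalize(x, gamma, beta, eps=1e-5):
    """Per-token normalization followed by an affine transform (stand-in)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def res_post_norm(x, f, gamma, beta):
    # Normalize the branch output right before the residual addition.
    return x + normalize(f(x), gamma, beta)

def sandwich(x, f, g_in, b_in, g_out, b_out):
    # Normalize both the branch input and the branch output.
    return x + normalize(f(normalize(x, g_in, b_in)), g_out, b_out)

def pre_norm_layerscale(x, f, gamma, beta, scale):
    # Normalize the branch input, then rescale the output per channel.
    return x + scale * f(normalize(x, gamma, beta))

rng = np.random.default_rng(0)
dim = 4
x = rng.standard_normal((3, dim))
W = rng.standard_normal((dim, dim))
branch = lambda h: h @ W  # toy stand-in for MHA or FFN
ones, zeros = np.ones(dim), np.zeros(dim)

y_post = res_post_norm(x, branch, ones, zeros)
y_sand = sandwich(x, branch, ones, zeros, ones, zeros)
y_scale = pre_norm_layerscale(x, branch, ones, zeros, np.full(dim, 0.1))
```

In all three variants, a learnable affine transformation (`gamma`/`beta` or `scale`) sits directly before the addition, which is the property the answer identifies as the important one.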

Accepted Answer

The pyramid structure improves element-wise representational capability by allowing binary neural networks to reach a higher representational capability with a lower hidden dimension. In a network with a pyramid structure, a binary fully connected layer with a low hidden dimension applied to a high-resolution feature map can contribute significantly more to the element-wise representational capability than a binary fully connected layer with a high hidden dimension applied to a low-resolution feature map. This is demonstrated in the example of a network with 4 stages, where the first stage, with a hidden dimension of 64, contributes up to 200,704 to the element-wise representational capability. The pyramid structure allows for increased representational capability without increasing computational complexity. The architecture is designed to match the number of parameters of the DeiT-S model and is compatible with the transition from the 2nd to the 3rd stage. Downsampling layers and attention to large sequence sizes are used to manage computational cost. Overall, the pyramid structure enhances the representational capability of binary neural networks, making them more efficient and effective in processing high-resolution feature maps.
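The first-stage figure can be reproduced with simple arithmetic under stated assumptions. The sketch assumes that a layer's contribution is its hidden dimension multiplied by the number of tokens, that the input is 224x224, and that the first stage works on a 56x56 feature map. None of these are stated in the answer, and the single-scale comparison (384 channels on a 14x14 map) is purely illustrative.

```python
def stage_contribution(hidden_dim, height, width):
    """Assumed contribution: hidden dimension times number of tokens."""
    return hidden_dim * height * width

# First pyramid stage: low hidden dimension, high resolution.
first_stage = stage_contribution(64, 56, 56)

# Illustrative single-scale layer: high hidden dimension, low resolution.
single_scale = stage_contribution(384, 14, 14)

advantage = first_stage / single_scale
```

Under these assumptions the first stage contributes 200,704, which matches the figure in the answer, despite using one sixth of the hidden dimension of the single-scale layer.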

Accepted Answer

BinaryViT outperforms other SOTA binary models such as ReActNet and BiMLP. It achieves performance comparable to the MobileNet version of ReActNet-B while using fewer parameters and fewer operations. BinaryViT also outperforms the ResNet-34 version of ReActNet with significantly fewer FLOPs and OPs. Compared to BiMLP-S, BinaryViT has fewer FLOPs and OPs because its patch embedding layers do not use overlapping convolutional layers. Overall, BinaryViT's architecture, including global average pooling, multi-branch layers, and a pyramid structure, contributes to its competitive accuracy against SOTA binary models.

Accepted Answer

Binary CNNs possess key architectural properties that give them a higher representational capability than binary vanilla ViTs. These properties include the use of convolutions, which allow CNNs to capture spatial hierarchies and local patterns effectively. CNNs also use operations such as pooling, which reduces the spatial dimensions of the input while preserving essential information. Binary vanilla ViTs lack these operations, which limits their representational capacity. BinaryViT introduces operations from the CNN architecture into a pure ViT architecture to bridge this gap and enhance the representational capability of ViTs without relying on convolutions. This approach includes an average pooling layer, a novel block with multiple average pooling branches, an affine transformation before each main residual connection, and a pyramid structure. These modifications enable the proposed architecture to capture more complex patterns and achieve competitive performance on the ImageNet-1k dataset, rivaling that of prior binary CNNs.

Most frequently asked questions

1. What are the key architectural properties that binary vanilla ViTs miss out on compared to CNNs?

2. What is the threshold vector in binary fully-connected layer?

3. What are the key components and operations in a binary vision transformer?

4. What is the representational capability gap between binary ResNet-34 and binary ViT?

5. How does global average pooling affect binary CNNs?

6. Binary convolution vs binary fully connected layer?

7. Where is batch normalization placed before residual connection?

8. How does the pyramid structure improve element-wise representational capability?

9. How does BinaryViT compare to other SOTA binary models?

10. What architectural properties do binary CNNs have over binary vanilla ViTs?
