1. What are the key architectural properties that binary vanilla ViTs miss out on compared to CNNs?
Binary vanilla ViTs lack four architectural properties that CNNs have and that contribute to CNNs' higher representational capability: (1) Average pooling layer: binary vanilla ViTs use a token pooling layer, whereas CNNs use an average pooling layer that accounts for information from all tokens/patches. (2) Multiple average pooling branches: CNNs have multiple average pooling branches, which compensate for the loss of representational capability caused by a binary fully-connected layer; binary vanilla ViTs have no such branches. (3) Affine transformation: binary vanilla ViTs have no affine transformation right before each main branch is added to its residual connection; this transformation keeps the scale of the residual branch from overwhelming the scale of the main branch. (4) Pyramid structure: binary vanilla ViTs have no pyramid structure, which would let binary features be processed at higher resolution in the early stages without increasing computational complexity. BinaryViT introduces these CNN-inspired properties to enhance the representational capability of binary pure ViTs without significantly increasing the number of operations and parameters.
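The difference between token pooling and average pooling can be illustrated with a small sketch. This is not BinaryViT's code: the function names, the toy values, and the choice to store the cls token at index 0 are all assumptions made for illustration.

```python
from statistics import fmean

def cls_token_pooling(tokens, cls_index=0):
    """Return only the cls-token embedding; every other token is ignored."""
    return list(tokens[cls_index])

def average_pooling(tokens):
    """Average each embedding dimension over all tokens."""
    return [fmean(column) for column in zip(*tokens)]

# Toy binary sequence: one cls token and three patch tokens, each of width 2.
# Placing the cls token first is an assumption of this sketch.
tokens = [
    [1.0, -1.0],  # cls token
    [1.0, 1.0],   # patch token
    [-1.0, 1.0],  # patch token
    [1.0, 1.0],   # patch token
]

cls_summary = cls_token_pooling(tokens)
avg_summary = average_pooling(tokens)
print(cls_summary)  # [1.0, -1.0]
print(avg_summary)  # [0.5, 0.5]
```

In this toy case the token-pooled summary can only contain the values +1 and -1, while the averaged summary takes intermediate values because it mixes information from every token.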
2. What is the threshold vector in binary fully-connected layer?
The threshold vector, b ∈ ℝ^{D_in}, is applied to the real-valued inputs right before the sign function, so that the inputs can undergo a distributional shift before they are binarized. In effect, it sets the point at which each input feature switches between the two binary values. The threshold vector can be determined by computing the mean value of all elements inside the matrix, as mentioned in references [30], [31], and [32].
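A minimal sketch of thresholded binarization follows. It assumes the threshold is subtracted from the input and that zero maps to +1; the source does not state either convention, and the function names are hypothetical.

```python
def sign(value):
    """Binarize to +1 or -1. Mapping 0 to +1 is an assumption of this sketch."""
    return 1 if value >= 0 else -1

def binarize_with_threshold(x, b):
    """Shift each input feature by its threshold, then apply sign.

    x and b are both vectors of length D_in. Subtracting b (rather than
    adding it) is an assumed convention.
    """
    if len(x) != len(b):
        raise ValueError("x and b must both have length D_in")
    return [sign(x_i - b_i) for x_i, b_i in zip(x, b)]

x = [0.3, -0.2, 1.5, 0.1]
no_shift = binarize_with_threshold(x, [0.0, 0.0, 0.0, 0.0])
shifted = binarize_with_threshold(x, [0.5, -0.5, 0.5, 0.0])
print(no_shift)  # [1, -1, 1, 1]
print(shifted)   # [-1, 1, 1, 1]
```

The first two features flip once a non-zero threshold is applied, which shows how the threshold changes which inputs end up on each side of the sign function.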
3. What are the key components and operations in a binary vision transformer?
A binary vision transformer (ViT) consists of N transformer encoder blocks, each containing a multi-head attention (MHA) module and a feed-forward network (FFN) module. The image is first split into fixed-size patches, which are linearly projected and joined by a cls-token embedding. The output of the embedding layer becomes the input to the first transformer block. In each transformer block, the input undergoes pre-batch normalization, and the MHA module computes the query, key, and value for each head from the normalized input. The attention scores are passed through the softmax function, and the binary attention probability matrix is obtained by rounding the softmax output. The outputs of all heads are concatenated and passed through a fully-connected layer. A residual connection is applied, followed by a second pre-batch normalization and the FFN. The block's final output is the sum of the FFN output and the residual output. The parameters in the softmax, the normalization layers, and the classifier are kept in full precision. Knowledge distillation is used to improve performance by minimizing the soft cross-entropy loss between the student's logits and the teacher's logits. The attention scores and FFN outputs are not distilled, and the partially-random-initialization method is used for the first patch embedding layer to enhance performance.
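Two of the operations described here, rounding the softmax output and the soft cross-entropy distillation loss, can be sketched in plain Python. This is an illustrative sketch only: the function names are hypothetical, ties at 0.5 are rounded up by assumption, and no distillation temperature is applied because the source does not mention one.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def binary_attention_probs(scores):
    """Softmax followed by rounding, so every entry is 0 or 1.

    Rounding a probability of exactly 0.5 up to 1 is an assumption.
    """
    return [1 if p >= 0.5 else 0 for p in softmax(scores)]

def soft_cross_entropy(student_logits, teacher_logits):
    """Cross-entropy of the student's softmax against the teacher's softmax."""
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

peaked_row = binary_attention_probs([4.0, 0.0, 0.0])
flat_row = binary_attention_probs([1.0, 1.0, 1.0])
print(peaked_row)  # [1, 0, 0]
print(flat_row)    # [0, 0, 0]

# The loss is smaller when the student agrees with the teacher.
matched = soft_cross_entropy([2.0, 0.0], [2.0, 0.0])
mismatched = soft_cross_entropy([0.0, 2.0], [2.0, 0.0])
print(matched < mismatched)  # True
```

Under this rounding rule, a row whose scores are spread evenly rounds to all zeros, since no single probability reaches 0.5.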
4. What is the representational capability gap between binary ResNet-34 and binary ViT?
The representational capability gap between binary ResNet-34 and binary ViT is significant. The representational capability of a model is determined by the number of possible absolute values that each element in a matrix/tensor can have. For a binary DeiT-S, the element-wise representational capability is R(DeiT-S) = 153,216, whereas for a binary ResNet-34 it is R(ResNet-34) = 71,193,472. This difference of more than two orders of magnitude contributes to the performance gap between binary ResNet-34 and binary ViT. The gap can be attributed to the different architectural designs and element-wise operations used in binary CNNs and binary ViTs. To narrow it, design modifications borrowed from CNNs have been proposed for ViTs, such as global average pooling, multiple average pooling branches, and a pyramid structure.
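The size of the gap can be checked directly from the two quoted figures. The variable names below are illustrative; only the two values come from the text.

```python
import math

r_deit_s = 153_216       # R(DeiT-S), as quoted
r_resnet34 = 71_193_472  # R(ResNet-34), as quoted

ratio = r_resnet34 / r_deit_s
orders = math.log10(ratio)

# Roughly 465x, which is between two and three orders of magnitude.
print(f"{ratio:.1f}x, about {orders:.2f} orders of magnitude")
```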