Content-based product image retrieval using squared-hinge loss trained convolutional neural networks

Question

1. What are the visual features used in image retrieval?

2. What is the proposed method for extracting features from product images?

3. What CNN models are used for feature extraction?

4. What are the three steps in the method of this study?

Accepted Answer

Visual features used in image retrieval include shape-based features like edges or moment invariants, color-based features using a histogram of pixel values, and key-point-based features such as SIFT and SURF. Recent studies have also focused on convolutional neural networks (CNN) for image matching and retrieval, where CNN-based features are obtained by training the model for image classification and then modifying it for feature extraction. These features represent shapes, color distribution, and other visual aspects of the image, aiding in the retrieval of similar images from a database.

Accepted Answer

The proposed method involves training CNN models with squared-hinge loss as an alternative to softmax loss for feature extraction. The extracted image features are then indexed using the nearest-neighbour (NN) indexing technique. This method aims to improve content-based product image retrieval by providing an alternative to existing CNN-based feature extraction methods. The study evaluates different CNN models, training parameters, and loss functions to determine the best configuration for achieving optimal results. The extracted features and the NN indexing technique can be applied to content-based retrieval in e-commerce shops, making it a valuable contribution to the field.
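As a rough illustration of the loss named above, the following sketch computes a multiclass squared-hinge loss in plain Python. The answer does not give the formula, so the one-vs-rest form used here is an assumption: targets are +1 for the true class and -1 otherwise, the loss is summed over classes and averaged over samples. The function name `squared_hinge_loss` is hypothetical.

```python
def squared_hinge_loss(scores, labels):
    """Mean one-vs-rest squared-hinge loss (assumed form, not taken from the paper).

    scores: list of per-sample lists of class scores (outputs of the last FC layer)
    labels: list of integer class indices, one per sample
    """
    total = 0.0
    for sample_scores, label in zip(scores, labels):
        for j, score in enumerate(sample_scores):
            # Assumed target encoding: +1 for the true class, -1 for every other class.
            target = 1.0 if j == label else -1.0
            margin = max(0.0, 1.0 - target * score)
            # Squaring the hinge term penalises large margin violations more heavily.
            total += margin ** 2
    return total / len(scores)


if __name__ == "__main__":
    # A confident, correct prediction violates no margin.
    print(squared_hinge_loss([[2.0, -2.0]], [0]))
    # Uninformative scores violate the margin for every class.
    print(squared_hinge_loss([[0.0, 0.0]], [0]))
```

Unlike softmax loss, this form yields exactly zero loss once every class score clears its margin, which is one commonly cited reason for trying it as an alternative.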

Accepted Answer

Various CNN models are used for feature extraction in research. For example, in [11] and [12], the FC6 and FC7 layers of the AlexNet model are utilized. Razavian et al. [14] applied the OverFeat model, extracting features from the first FC layer (layer 22). The HybridNet model [16] was used in [17] to extract features using the activation of the first FC layer (FC6). These models are applied for general image classification and retrieval, as well as specific image applications like product images. [18] used a self-built network model, while [19] applied CNN features from the VGG-19 model for fashion product image retrieval. Elleuch et al. [20] used features from the Inception V3 model's bottleneck layer on a clothing dataset.

Accepted Answer

The method in this study consists of three steps. First, transfer learning is applied to a pre-trained CNN model. Second, the model is trained with images and category labels using squared-hinge loss. Finally, image features are extracted and indexed using the NN indexing technique. These steps result in a fine-tuned CNN model for content-based product image retrieval. Figure 2 illustrates the proposed method.

Accepted Answer

FCn-1 nodes in CNN models represent the image feature vector extracted from the model. Each node reflects a feature vector element, and the number of nodes affects the feature dimension. When selecting a CNN model for image feature extraction, it is crucial to consider the FCn-1 nodes. Different CNN models with varying FCn-1 nodes can be used to observe the correlation between feature dimension and retrieval accuracy. Table 1 illustrates the use of CNN models with different FCn-1 nodes to analyze this relationship.

Accepted Answer

Transfer learning enhances accuracy in CNN models by adopting a pre-trained model from a high-accuracy domain. In the case of product image retrieval, the source model's convolution and pooling layers are transferred to the target domain. The last fully connected layer (FCn) of the source model is replaced with a new layer (FC'n) that matches the number of classes in the product dataset. This fine-tuning process allows the model to adapt to the specific characteristics of the target domain, resulting in improved accuracy. The fine-tuned model is then retrained using the product image dataset, as shown in Figure 3. Overall, transfer learning enables the leveraging of existing knowledge and models to achieve better performance in new domains.

Accepted Answer

The typical loss function for multiclass classification tasks is softmax loss (L_S), given in equation (2):

$$L_S = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{f_{y_i}}}{\sum_{j=1}^{K} e^{f_j}} \tag{2}$$

Here, f_{y_i} is the output of the fully-connected layer for the class y_i that labels the i-th input, while f_j is the output for class j. N is the number of samples, and K is the total number of classes. Softmax loss calculates probabilities for all labels, which makes it suitable for multiclass classification tasks.
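To make the softmax loss concrete, here is a minimal sketch in plain Python that averages the negative log-probability of the true class over N samples. The function name `softmax_loss` and the list-of-lists input layout are assumptions made for illustration. Subtracting the maximum score is a standard numerical-stability step that is not part of the answer above.

```python
import math


def softmax_loss(scores, labels):
    """Average softmax (cross-entropy) loss.

    scores: list of N lists, each holding the K fully-connected outputs f_1..f_K
    labels: list of N integer class indices y_i
    """
    total = 0.0
    for f, y in zip(scores, labels):
        # Subtracting the maximum keeps exp() from overflowing; the loss is unchanged.
        shift = max(f)
        log_sum = shift + math.log(sum(math.exp(fj - shift) for fj in f))
        # log_sum - f[y] equals -log(e^{f_y} / sum_j e^{f_j}).
        total += log_sum - f[y]
    return total / len(scores)


if __name__ == "__main__":
    # Two classes with equal scores: each has probability 0.5, so the loss is log(2).
    print(softmax_loss([[0.0, 0.0]], [0]))
```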

Accepted Answer

Feature extraction in CNN models serves as the inference process, where fine-tuned CNN models extract CNN-based features. These features are vectors representing the node values of the fully-connected layer before the last layer in the model, FCn-1. This layer has a global receptive field, making it suitable for global image features. The extracted vector features are normalized using l2-norm to ensure values are within a specific range. These normalized features, F, are stored in a feature database for efficient image searching. Nearest-neighbor indexing is applied for efficient storage and retrieval of features, making it suitable for image search with CNN-based features.
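The l2 normalization step can be sketched as follows in plain Python. The helper name `l2_normalize` is hypothetical, and leaving an all-zero vector unchanged is an assumption added here to avoid division by zero.

```python
import math


def l2_normalize(vector):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        # Assumption: return an all-zero vector unchanged rather than divide by zero.
        return list(vector)
    return [v / norm for v in vector]


if __name__ == "__main__":
    # A toy 2-dimensional "feature"; real FCn-1 features have many more elements.
    print(l2_normalize([3.0, 4.0]))
```

After normalization every stored feature has length 1, so Euclidean distances between features fall in the range 0 to 2.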

Accepted Answer

Image retrieval involves preprocessing the image query, extracting features using a fine-tuned CNN model, normalizing the feature using l2 normalization, and measuring similarity using Euclidean distance. The k-NN search algorithm is then used to find the most similar features in the indexed database, returning the k smallest distance feature vectors as the query result. This process enables efficient retrieval of similar images based on the query image's features.
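The retrieval step can be illustrated with a brute-force k-NN search over stored feature vectors. This is only a sketch: it scans every feature linearly instead of using the NN index mentioned in the answers, and the names `euclidean_distance` and `knn_search` are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_search(query, database, k):
    """Return (distance, index) pairs for the k stored features closest to the query.

    Linear scan used as a stand-in for an NN index; results are sorted nearest first.
    """
    scored = [(euclidean_distance(query, feature), index)
              for index, feature in enumerate(database)]
    scored.sort()
    return scored[:k]


if __name__ == "__main__":
    # Toy database of already l2-normalized 2-dimensional features.
    database = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
    for distance, index in knn_search([1.0, 0.0], database, k=2):
        print(index, round(distance, 3))
```

A linear scan costs one distance computation per stored image, which is the cost an NN index is meant to reduce on large product catalogues.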

Accepted Answer

The experiments utilized labeled product image datasets, specifically the Stanford online product (SOP) and InShop DeepFashion (InShop) datasets. SOP includes home product images and consists of 12 superclasses, 22,634 classes, and 120,053 images. InShop contains clothing images with 23 superclasses, 7,982 classes, and 52,712 images. Both datasets are also used in [28]. These datasets were chosen for their fine-grained categories and diverse product images, making them suitable for training and evaluating the CNN model.

Accepted Answer

The loss function used to train the CNN model affects retrieval accuracy. Retrieval experiments were conducted using CNN-based features from models trained with softmax (S) and squared-hinge (H) loss, including ResNet18-H and MobileNetV2-H. The results showed that ResNet18-H and MobileNetV2-H achieved better accuracy on the SOP dataset, with ResNet18-H performing slightly better in the mAP@k metric. The gap between the two models was more pronounced on the InShop dataset, where MobileNetV2-H delivered superior performance. The average improvement in retrieval accuracy from features trained with squared-hinge loss over features trained with softmax loss was 3.3% on SOP and 3.7% on InShop. This indicates that CNN features from models trained with squared-hinge loss give higher accuracy than features from models trained with softmax loss. Additionally, the feature vectors extracted from different CNN models vary in dimension, which affects computational resource requirements. ResNet18, with the lowest feature dimension, still provides competitive accuracy, making it a preferred choice.

Most frequently asked questions

1. What are the visual features used in image retrieval?

2. What is the proposed method for extracting features from product images?

3. What CNN models are used for feature extraction?

4. What are the three steps in the method of this study?

5. What is the role of FCn-1 nodes in CNN models?

6. How does transfer learning improve accuracy in CNN models?

7. What is the typical loss function for multiclass classification tasks?

8. What is the purpose of feature extraction in CNN models?

9. What is the process of image retrieval?

10. What datasets were used for the experiments?

11. How do CNN models trained with softmax and squared-hinge loss affect retrieval accuracy?

References

Deep Residual Learning for Image Recognition

ImageNet classification with deep convolutional neural networks

Distinctive Image Features from Scale-Invariant Keypoints

Going deeper with convolutions
