Fine-grained visual recognition with salient feature detection

Blog editor (84) 2024-06-26 20:01:01

Hui Feng, Shanshan Wang, Shuzhi Sam Ge

School of Transportation, Wuhan University of Technology
Department of Electrical and Computer Engineering, National University of Singapore

National University of Singapore: NUS
salient ['seɪlɪənt]: adj. notable, prominent, leaping; n. projecting angle, protruding part
fine-grained ['fain-'ɡreind]: adj. fine-grained, having a fine texture

Abstract

Computer-vision-based fine-grained recognition has received great attention in recent years. Existing works focus on discriminative part localization and feature learning. In this paper, to improve the performance of fine-grained recognition, we first try to precisely locate as many salient parts of the object as possible. Then, we determine the classification probability that can be obtained by using separate parts for object classification. Finally, by extracting efficient features from each part, combining them, and feeding them to a classifier for recognition, we obtain improved accuracy over state-of-the-art algorithms on the CUB200-2011 bird dataset.

discriminative [dɪs'krɪmɪnətɪv]: adj. distinguishing, discriminatory, discerning
Caltech-UCSD Birds-200-2011: CUB-200-2011

1 Introduction

Fine-grained recognition is an active topic in computer vision and pattern recognition, and is now widely applied in industry and academia, for instance, to classify different species of birds or plants in order to evaluate changes in natural ecosystems [24], or to recognize car models for visual census estimation [10]. Compared with the coarse-grained recognition of traditional object recognition tasks, its purpose is to identify finer subordinate categories, such as bird species [27], car models [18], and aircraft types [21]. Fine-grained recognition is very challenging because of the significant differences between samples of the same category and the obvious similarities between different categories [38, 28].

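academia [,ækə'diːmɪə]: n. the academic world, academic career
census ['sensəs]: vt. to carry out a statistical survey; n. population census
species ['spiːʃiːz; -ʃɪz; 'spiːs-]: n. species, kind; adj. of a species
plant [plɑːnt]: n. factory, workshop, plant (vegetation), equipment, crop; vt. to plant, to cultivate, to place; vi. to plant
coarse-grained ['kɔ:sɡreind]: adj. crude, coarse in grain
subordinate [sə'bɔːdɪnət]: n. subordinate, junior; adj. subordinate, secondary; vt. to place in a lower rank, to make subservient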

Exciting progress has been made in this area in recent years thanks to the involvement of many researchers in the community. Generally, part localization and feature description are the two key factors that affect classification accuracy. To obtain more precise part localization, a pose-normalized descriptor [35] or pose alignment [17] is applied to all images before they are used for feature extraction. Then, convolutional neural networks are employed as descriptors to learn discriminative features. Although convolutional neural networks are very powerful at learning features, they have poor interpretability [31, 1]. Therefore, the questions of which parts have more discriminative features than others, and how the parts with less discriminative features affect the classification accuracy, are still open.

involvement [ɪn'vɒlvm(ə)nt]: n. participation, implication, inclusion, confusion, financial difficulty
interpretability: n. the ability to be explained

When we, as humans, face the problem of fine-grained recognition, what do we do? Figure 1 shows a guide that ornithologists use to identify common birds. From Figure 1, we can see that, to recognize five species of birds from two categories, several parts (e.g., bill, plumage, leg) and features (e.g., length, color, shape) are used as indicators. Intuitively, human beings rely on plenty of information when they recognize the species of a bird, for example, the length and shape of the bill, the color of the plumage and legs, and so on. There is an idiom in China called The Blind Men and the Elephant: four blind men wished to know what an elephant looked like. The man who touched the elephant’s ear claimed that it was like a great fan, while the man who felt the elephant’s leg regarded it as a big pillar. Of course, none of them could be right before feeling all parts of the elephant. The principle behind this idiom also applies to fine-grained recognition: the more information we get, the better our judgment will be.

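ornithologist ['ɔrnɪ'θɑlədʒɪst]: n. a scholar of birds
bill [bɪl]: n. bill (legislation), advertisement, invoice, note, banknote, list; vt. to announce, to invoice, to advertise with posters
plumage ['pluːmɪdʒ]: n. wings, feathers of a bird
idiom ['ɪdɪəm]: n. idiom, set phrase, dialect
pillar ['pɪlə]: n. column, column-shaped object, mainstay, pier; vt. to support with columns
elephant ['elɪf(ə)nt]: n. elephant, large-size drawing paper
intuitively [ɪn'tjʊɪtɪvli]: adv. in an intuitive way
redshank ['redʃæŋk]: n. a red-legged wading bird
greenshank ['griːnʃæŋk]: n. a green-legged wading bird
marsh [mɑːʃ]: n. swamp, wetland; adj. of or growing in marshes
sandpiper ['sæn(d)paɪpə]: n. a wading shorebird
curlew ['kɜːl(j)uː]: n. a wading bird of the curlew genus
plover ['plʌvə]: n. a wading bird of the plover family
underpart ['ʌndəpɑːt]: n. lower part, minor role, subordinate position
mottle ['mɒt(ə)l]: n. spot, variegated color, blotch; vt. to mark with patches of color
pacific [pə'sɪfɪk]: adj. peaceful, mild, calm; n. the Pacific Ocean; adj. of the Pacific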

Figure 1: A guide for ornithologists to identify common birds.

In this paper, to improve the performance of fine-grained recognition, we first try to precisely locate as many parts of the object as possible. Then, we determine the classification probability that can be obtained by using separate parts for object classification. Finally, by extracting efficient features from each part, combining them, and feeding them to a classifier for recognition, we obtain an accuracy that exceeds the state of the art on the CUB200-2011 dataset [27]. We call the overall method fine-granularity part-CNN (FP-CNN).

fine-granularity part-CNN: FP-CNN

The key contributions of this work can be summarized as follows:

We trained a deep neural network to detect and locate salient parts of the object with high probability by generating labeled part images from the part annotations.
We compared and analyzed the effects of different parts on recognition accuracy and found that, on the bird dataset, the classification accuracy of every part except the head is relatively low.
Experimental results on the CUB200-2011 bird dataset illustrate the state-of-the-art performance of the proposed approach.

This paper is organized as follows. Related work is reviewed in Section 2. Section 3 describes the proposed fine-grained recognition method, followed by the experimental evaluation in Section 4. Finally, we conclude the paper in Section 5.

2 Related Work

In this section, we introduce state-of-the-art work on fine-grained recognition according to whether human-labeled information (e.g., bounding boxes and part annotations) is leveraged, i.e., strongly-supervised and weakly-supervised fine-grained recognition. Note that both categories of methods require class labels, which is why neither can be called unsupervised recognition.

2.1 Strongly-supervised fine-grained recognition

A large corpus of strongly-supervised fine-grained recognition methods has been proposed in recent works [33, 19, 13, 32, 4, 7, 5, 9, 17, 28], where bounding boxes, part annotations, or both are used during the training stage for part localization and representative feature learning, and in some cases bounding boxes are also used in the test stage. Part R-CNN [33] was proposed to leverage deep convolutional features computed on bottom-up region proposals for detection and part description based on pose normalization [7, 4]. Segmentation-based methods are also very effective for fine-grained recognition, where region-level cues are used to infer foreground segmentation masks that eliminate background interference [5, 9, 30, 17, 28]. The recently proposed Mask-CNN [28] achieves state-of-the-art classification accuracy on CUB200-2011. To locate the parts of birds during the test phase, two masks are generated with the help of part key points, and a fully convolutional network is trained on these masks. Then, a three-stream CNN model is constructed for fine-grained recognition. The authors report a state-of-the-art accuracy of 87.3%. However, the limitation of this work is that, apart from the original object image, only two parts (i.e., head and torso) are used to learn identifiable features, while the other parts are ignored, so some important details are insufficiently recognized. In our work, we make no a priori assumption about the importance of the various parts for fine-grained recognition, and all parts are taken into account. As in [28], only part annotations are used in the training stage, and we obtain an average accuracy of 88.2% on CUB200-2011.

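corpus ['kɔːpəs]: n. collection of texts, body of writings, principal (capital)
presentative [prɪ'zentətɪv]: adj. presentational, intuitive, having the right of presentation to a benefice
foreground ['fɔːgraʊnd]: n. foreground, the most prominent position
torso ['tɔːsəʊ]: n. trunk of the body, sculpture of a nude trunk, unfinished work, incomplete thing
priori [praɪ'ɔ:raɪ]: adj. a priori, prior; adv. a priori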

2.2 Weakly-supervised fine-grained recognition

Weakly-supervised recognition requires only image-level class labels and uses no part annotations, bounding boxes, or segmentation masks [16, 25, 20, 36, 38, 37]. Some works generate parts using segmentation and alignment [9, 16], while others leverage visual attention mechanisms [29, 38, 37]. Jonathan et al. [16] proposed to discover parts without any part annotations by aligning images with similar poses, and then used a convolutional neural network to train a feature descriptor. A bilinear convolutional neural network was proposed to capture part-feature interactions, motivated by the idea that the modular separation of two CNNs can affect the overall appearance [25]. A multi-attention convolutional neural network (MA-CNN) was presented in [38] to generate more effective distinguishable parts and to learn better fine-grained features from the parts in a mutually reinforcing manner. The parts were located by detecting the convolutional feature channels whose peak responses occur at adjacent locations. Zhao et al. [37] proposed a diversified visual attention network (DVAN), in which multiple attention canvases with various locations and scales are generated for incremental object representation. Instead of finding multiple attention areas in an image at the same time, they suggested finding different regions of attention over multiple steps and using a recurrent neural network to predict the object class.

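incline [ɪn'klaɪn]: vi. to slope, to tend, to be liable; vt. to tilt, to dispose toward; n. slope, inclined surface
mutual ['mjuːtʃʊəl; -tjʊəl]: adj. shared, reciprocal
distinguishable [dɪ'stɪŋgwɪʃəbl]: adj. able to be told apart, recognizable
diversify [daɪˈvɜːsɪˌfaɪ]: vt. to vary, to broaden the range of products
canvas ['kænvəs]: n. canvas (heavy cloth); vt. to cover or fit with canvas; adj. made of canvas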

3 Approach

In this section, we present the proposed method. We first introduce the method for precisely localizing the parts of the object given the part annotations. Then, we compare and analyze the classification accuracy obtained when using different parts of the object.

3.1 Local Feature Location and Detection

The localization of potentially discriminative parts is one of the core issues of fine-grained recognition. Existing methods that leverage attention mechanisms for part localization are based on the intuition that some parts have higher visual saliency than others. This intuition does, to some extent, reflect the way human beings inspect the world, because it is a heavy burden for our visual system and brain to process such a huge amount of information [14]. However, when we intend to perform fine-grained classification, it may mislead us, especially when the objects we want to recognize differ so marginally in appearance that only field experts can distinguish them.

intuition [ɪntjʊ'ɪʃ(ə)n]: n. intuition, intuitive power, intuitive knowledge
mislead [mɪs'liːd]: vt. to misguide, to lead astray

In this paper, we suggest that, in the context of fine-grained recognition, the more information we get, the better our judgment will be. Based on this idea, we first propose a local feature location strategy that aims to accurately locate as many parts as possible with the help of part annotations in the training stage. Then, we convert the part localization problem into an object detection problem. This differs from traditional object detection, whose goal is to detect objects in raw images, because we focus on detecting the parts within images that contain the object.

Table 1: Part region generation

3.1.1 Ground truth part region generation

Note that part annotations are available in some fine-grained datasets, for example, CUB200-2011 [27], Birdsnap [3], and FGVC Aircraft [21]. In this paper, we take CUB200-2011 as an example, but the idea can be easily extended to other datasets. CUB200-2011 defines fifteen part key points, and we leverage these points to construct ground truth part regions (also called bounding boxes). In our proposed local feature location strategy, five discriminative part regions (i.e., head, breast, tail, wing and leg) are generated, as shown in Table 1. Because the accuracy of the part regions has a significant impact on part detection, three strategies are used to generate them:

(1) Two region generation styles: For the head and breast regions, we adopt the minimal rectangle that includes all the key points annotated on the corresponding part, and a square envelope (i.e., a key-point-centered square) is used for the remaining regions, as shown in Table 1.
(2) Self-tuning region size: The key points in the part annotations represent the centers of specific bird parts. If we simply draw the minimal rectangle that includes all of these points, as in [28], to generate the ground truth part region, some detailed features may be lost, as shown in Figure 2. For the head region, the size is self-tuned according to the width and height of the minimal rectangle, which can be written as

[Equation image missing in the source: it expresses $W_{head}$ and $H_{head}$ in terms of $W_{mini-rect}$ , $H_{mini-rect}$ and the tuning factors $λ_w$ , $λ_h$ .]

where $W_{mini-rect}$ and $H_{mini-rect}$ are the width and height of the minimal rectangle that includes the key points, $W_{head}$ and $H_{head}$ are the width and height of the generated head region, and $λ_w$ and $λ_h$ are the tuning factors used to pad the head region. For the part regions generated by the square envelope, the region size must also be chosen carefully: if the region is too large, other parts of the object will be included, whereas if it is too small, distinguishable features will be lost. Besides, due to the different sizes of the images and the different proportions of the objects within them, the size of the object varies significantly. In this paper, the region sizes are self-adjusted according to the size of the head because, from our observation of a large number of images, the head size is not seriously affected by changes in scale, viewpoint, or occlusion, so it serves as a better reference.

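envelope ['envələʊp; 'ɒn-]: n. envelope, cover, enclosing membrane, cladding, envelope curve
beak [biːk]: n. bird's bill, hooked nose, magistrate, schoolmaster
crown [kraʊn]: n. royal crown, garland, sovereignty, summit; vt. to crown, to top, to honor, to complete successfully
nape [neɪp]: n. back of the neck
throat [θrəʊt]: n. throat, voice, narrow passage; vt. to cut a groove in, to utter in a throaty voice
proportion [prə'pɔːʃ(ə)n]: n. ratio, share, part, area, balance; vt. to make proportionate, to balance, to apportion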

(3) Redundant region elimination: Both sides (i.e., left and right) of the same part may appear in an image, for example, the left and right wings or the left and right legs, as shown in Figure 2 (the two images on the left). The same problem may occur during the part detection phase on the test images, as illustrated later. The region with the minimum intersection over union (IoU) is chosen for the current part, where the IoU is defined as

$$IoU = \frac{\mathrm{area}(R_{currentpart} \cap R_{otherparts})}{\mathrm{area}(R_{currentpart} \cup R_{otherparts})}$$

where $R_{currentpart}$ , $R_{otherparts}$ are the regions of current part and the other parts, respectively. If the IoUs for both sides are the same, we randomly choose one of them.
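The redundant region elimination step can be sketched as follows. This is only an illustration, not the authors' code: boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, the helper names `iou` and `pick_side` are hypothetical, and taking the largest IoU against any other part as the overlap of a candidate is an assumption, since the text does not say how several other parts are aggregated.

```python
import random
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def area(box: Box) -> float:
    """Area of a box; degenerate (inverted) boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_box = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter = area(inter_box)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def pick_side(left: Box, right: Box, others: List[Box],
              rng: Optional[random.Random] = None) -> Box:
    """Keep the side (left or right) that overlaps least with the other parts.

    Assumption: a candidate's overlap is its largest IoU against any other part.
    Ties are broken at random, as described in the text.
    """
    rng = rng or random.Random()

    def overlap(box: Box) -> float:
        return max((iou(box, other) for other in others), default=0.0)

    left_overlap, right_overlap = overlap(left), overlap(right)
    if left_overlap == right_overlap:
        return rng.choice([left, right])
    return left if left_overlap < right_overlap else right
```

For example, if the left wing region overlaps the breast region while the right wing region does not, `pick_side` keeps the right wing.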

Figure 2: Comparison of region generation with and without size self-tuning. The rectangle includes all the part points (red points) on the head without self-tuning (top) and with self-tuning (bottom).
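The two region generation styles can be sketched as below. This is a minimal sketch under stated assumptions: the exact expression for the head size is not available in this text, so scaling the minimal rectangle about its center by $λ_w$ and $λ_h$ is an assumed form, and tying the square side to the larger head dimension through a hypothetical `scale` parameter is likewise an assumption. All function names are illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def minimal_rectangle(points: List[Point]) -> Box:
    """Smallest axis-aligned rectangle that contains all key points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def padded_head_region(points: List[Point], lam_w: float, lam_h: float) -> Box:
    """Pad the minimal rectangle about its center.

    Assumption: W_head = lam_w * W_mini_rect and H_head = lam_h * H_mini_rect.
    """
    x0, y0, x1, y1 = minimal_rectangle(points)
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = lam_w * (x1 - x0) / 2.0
    half_h = lam_h * (y1 - y0) / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def square_envelope(center: Point, head: Box, scale: float) -> Box:
    """Key-point-centered square whose side follows the head size.

    Assumption: side = scale * max(head width, head height).
    """
    side = scale * max(head[2] - head[0], head[3] - head[1])
    half = side / 2.0
    return (center[0] - half, center[1] - half, center[0] + half, center[1] + half)
```

Using the head as the reference means the tail, wing and leg squares shrink or grow with the bird rather than with the image, which is the motivation given in the text for self-adjusting the sizes.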

Figure 3 shows some examples of our generated part regions. From Figure 3, we can see that distinguishable features are clearly visible in the generated part regions for birds from the same category (e.g., Bohemian Waxwing and Cedar Waxwing).

Figure 3: Some examples of part regions generated from part annotations. Left: Laysan Albatross (top), Sooty Albatross (bottom). Middle: Cedar Waxwing (top), Bohemian Waxwing (bottom). Right: American Three Toed Woodpecker (top), Red Bellied Woodpecker (bottom).

cedar ['siːdə]: n. cedar, cedar wood
Bird species named in Figure 3: Laysan Albatross, Sooty Albatross, Cedar Waxwing, Bohemian Waxwing, Red Bellied Woodpecker, Three-toed Woodpecker

3.1.2 Local part detection and localization

In the second step, with the part regions in hand, we convert the part localization problem into part detection in images that contain the object. Object detection has been an active research topic in recent years, and promising results have been reported in the literature using deep neural networks [11, 23, 22]. The earlier work [34] employed R-CNN [11] to detect objects and localize their parts for recognition. However, the recognition was conducted in a strongly supervised way (i.e., both bounding box and part annotations were used at training time), and only two parts (i.e., head and torso) were detected in the CUB-200-2011 dataset. In contrast, our method requires only part annotations for training and no supervision at test time. Our work leverages YOLO v3 to detect and locate all five parts defined in Table 1. Compared with R-CNN (and other classifier-based object detection approaches, e.g., Fast and Faster R-CNN), YOLO is much faster while obtaining comparable detection accuracy because, for a single image, it makes predictions with a single network evaluation, whereas R-CNN requires thousands. Note that two thresholds should be carefully selected for part detection and localization when using YOLO. One threshold $τ_1$ is compared with the IoU of the predicted and ground truth part regions to determine what percentage of bounding boxes is preserved during the training phase. In the test phase, a detected part is considered valid only if its confidence is higher than another threshold $τ_2$ . The trained model is available on GitHub (https://github.com/wuyun8210/part-detection).

preserve [prɪ'zɜːv]: vt. to keep, to protect, to maintain, to pickle, to protect (game); n. reserve, game preserve, preserved food, exclusive domain

3.2 The proposed method

In this section, besides recognizing the subcategories of the object, we are also very interested in the impact of detected parts on the accuracy of recognition.

3.2.1 The importance of the parts

To clarify the recognition performance obtained from the whole object versus the different parts, we train separate models on separate image sets.

First, we generate several groups of part image sets based on the ground truth regions of the training set, as shown in Figure 4. Then, for each group, we use a deep convolutional neural network to train a separate model, assigning the object label to the corresponding parts. We use ResNet [12] as the backbone network and fine-tune the parameters of a model pre-trained on ImageNet. In Figure 4, we take one image of a Bohemian Waxwing (upper left corner) from the training set as an example. Seven images (i.e., the original image, the center-cropped image of the object, and five local images of the parts) are generated and resized to the same size w × h (in this paper, we set both w and h to 224) to form seven groups of image sets $S_{i} (i = 1, ... , 7)$ . The center-cropped image and the five part images are assigned the same label as the original image. After training, we obtain seven learned models (i.e., the weights of the CNNs) $M_{i} (i = 1, ... , 7)$ .

Figure 4: Generating part ground truth bounding box based upon part annotations for part location and detection, and constructing the object and parts datasets for classification. [Best viewed in color]

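clarify ['klærɪfaɪ]: vt. to make clear, to elucidate; vi. to become clear, to be purified
leverage ['liːv(ə)rɪdʒ; 'lev(ə)rɪdʒ]: n. means, influence, lever action, gearing; v. to make use of, to operate with borrowed capital
Bohemian [bo'himɪən]: adj. of Bohemia, unconventional, of the Bohemian language; n. a native of Bohemia, the Bohemian language, an unconventional person
waxwing ['wækswɪŋ]: n. waxwing (a songbird)
breast [brest]: n. breast, chest, bosom, feelings; vt. to face squarely, to struggle against
tail [teɪl]: n. tail, trail, braid, tailcoat; vt. to follow, to fit with a tail; vi. to track, to dwindle; adj. coming from behind, rear
wing [wɪŋ]: n. wing, flight, faction, side hall, side room; vt. to make fly, to fly over, to transport by air, to speed up, to fit with wings; vi. to fly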

In the test phase, the same procedure is used to generate the test sets, except that the ground truth part regions are replaced by the part regions detected and localized as proposed in Section 3.1. The number of test groups is the same as the number of training groups, and the test sets are denoted by $T_{i} (i = 1, ... ,7)$ . For the images in each test group, the corresponding learned model $M_i$ is used to predict the category to which each image belongs. Note that parts that are not visible in the training set or not detected in the test set are ignored. The experimental results are presented in Section 4.2.

3.2.2 Fine-grained recognition

In recent works, once the part regions have been obtained, a straightforward approach is to design a multi-stream CNN framework for end-to-end fine-grained recognition, as in [38, 28]. However, if some parts are not visible or not properly detected, these methods can easily run into a label conflict problem in model training and prediction: the same empty features will correspond to different labels. Some machine learning algorithms (e.g., SVM [26], Decision Tree [8]) are robust when learning from datasets with missing information. In this paper, to avoid the label conflict problem, we use libSVM [6] to combine all of the features, owing to its convenience in parameter tuning.

confliction: n. conflict
convenience [kən'viːnɪəns]: n. ease of use, public toilet, something convenient

In fine-grained recognition, the learned CNN models are used to extract discriminative features. In the training stage, for each sample, the two object images (original and center-cropped) and the detected part images (possibly fewer than five) are fed to their respective learned models. Then, the activation tensor output by the ResNet pool5 layer, with a dimension of 4096 (for an input image size of 224 × 224), is taken as the feature of each image. Missing features (corresponding to invisible parts) are set to zero vectors before all of the features are concatenated and used to train the SVM. In the prediction stage, the same features are extracted and concatenated, and the SVM classifier outputs the subcategory of each test image. Note that the missing features of undetected parts are also replaced by zero vectors. We present the detailed results in Section 4.2.
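The zero-filling and concatenation step described above can be sketched as follows. The stream names, their order, and the function name `concat_features` are illustrative assumptions; the text only states that missing part features become zero vectors before concatenation.

```python
from typing import Dict, List, Sequence

# Assumed stream order: two object images followed by the five parts.
STREAMS = ("original", "cropped", "head", "breast", "tail", "wing", "leg")


def concat_features(features: Dict[str, Sequence[float]], dim: int,
                    streams: Sequence[str] = STREAMS) -> List[float]:
    """Concatenate per-stream feature vectors into one fixed-length vector.

    A stream that is absent from `features` (invisible or undetected part)
    contributes a zero vector of length `dim`, so every sample ends up with
    len(streams) * dim values regardless of how many parts were found.
    """
    combined: List[float] = []
    for name in streams:
        vector = features.get(name)
        if vector is None:
            combined.extend([0.0] * dim)
        else:
            if len(vector) != dim:
                raise ValueError(f"stream {name!r} has length {len(vector)}, expected {dim}")
            combined.extend(vector)
    return combined
```

Keeping the vector length fixed is what lets a single classifier be trained on samples with different numbers of detected parts.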

invisible [ɪn'vɪzɪb(ə)l]: adj. intangible, unseen, inconspicuous, hidden

4 Experimental results

In this section, we present the experimental results of the proposed FP-CNN on part detection and localization and on fine-grained recognition, using the widely used and challenging CUB200-2011 dataset. This dataset contains 200 categories and a total of 11,788 bird images. We split the dataset into three parts: 50% for training, 20% for validation, and the rest for testing.

4.1 Part detection and localization performance

From Section 3.1, we know that two thresholds play an important role in the performance of part detection and localization. We choose a relatively small threshold (i.e., $τ_1 = 0.6$ ) for the training set to ensure that effective parts can be detected with higher probability. During the test stage, two criteria determine which parts are properly detected: 1) for each part type, only the detection with the highest score is kept, and 2) the score of a detected part must be larger than the threshold set for the test phase. In this paper, we set $τ_2 = 0.3$ . Some examples of part detection and localization are shown in Figure 5. We randomly select four birds that were shown in Figure 3 so that readers can compare the ground truth and predicted part bounding boxes. From Figure 5, we can see that, although the pictures are taken at different scales, viewpoints and backgrounds, the main parts are precisely detected and located in the majority of test images. In the last column, we also show some examples of parts that are not well detected due to the low scores they obtained. In Table 2, we give the localization accuracy of all types of parts using the Percentage of Correctly Localized Parts (PCP) metric as in [34, 28], and we also compare the PCP of the bird head with recent works (the tail, breast, leg and wing were not detected in these works).
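The two test-time criteria can be sketched as a small post-processing step over detector outputs. The `Detection` record and `select_parts` function are hypothetical names, and the detector itself is not modeled; only the selection rule from the text is shown, with $τ_2 = 0.3$ as the default.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Detection(NamedTuple):
    """One detector output (assumed record layout)."""
    part: str                                # e.g. "head", "wing"
    score: float                             # detector confidence
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)


def select_parts(detections: Iterable[Detection], tau2: float = 0.3) -> Dict[str, Detection]:
    """Keep, for each part type, the highest-scoring detection above tau2."""
    best: Dict[str, Detection] = {}
    for det in detections:
        if det.score <= tau2:
            continue  # criterion 2: the score must be larger than the threshold
        current = best.get(det.part)
        if current is None or det.score > current.score:
            best[det.part] = det  # criterion 1: one detection per part type
    return best
```

Part types that are missing from the returned dictionary are the undetected parts, which are later ignored or replaced by zero features.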

Percentage of Correctly Localized Parts: PCP

Figure 5: Some results of part detection and localization. We just select four birds which have been shown in Figure 3 as the examples: Laysan Albatross (the first row), Sooty Albatross (the second row), Bohemian Waxwing (the third row), and Red Bellied Woodpecker (the last row). The last column shows some parts of the birds that are not well detected and located.

Table 2: Comparison of part localization accuracy on the CUB-200-2011 dataset.

From Table 2, we can see that our method obtains the highest PCP (88.20%), improving on Mask-CNN by 1.44% and outperforming the other works by a significant margin. In addition, the tail, breast and wing are also located with high probability (their PCPs are all larger than 76%). The leg is the exception, obtaining a score of only 58.66%. A possible reason is that the feet of birds share some similarities in shape, texture and color with the places they inhabit (e.g., branches, grass).
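A PCP computation can be sketched as below. The text does not state the overlap criterion, so counting a part as correctly localized when its overlap with the ground truth region is at least 0.5 is an assumption, as is counting an undetected part as incorrect; `pcp` is an illustrative name.

```python
from typing import Optional, Sequence


def pcp(overlaps: Sequence[Optional[float]], threshold: float = 0.5) -> float:
    """Percentage of ground truth parts that are correctly localized.

    `overlaps` holds one entry per annotated part: the overlap (e.g. IoU)
    between the predicted and ground truth region, or None when the part
    was not detected. The 0.5 threshold is an assumed default.
    """
    if not overlaps:
        raise ValueError("no ground truth parts to evaluate")
    correct = sum(1 for value in overlaps if value is not None and value >= threshold)
    return 100.0 * correct / len(overlaps)
```

The metric is computed separately for each part type, which is how per-part scores such as those for the head and the leg arise.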

inhabit [ɪn'hæbɪt]: vt. to dwell in, to live in, to occupy; vi. to dwell

4.2 Fine-grained Recognition

We first report the recognition results on the seven groups of datasets defined in Section 3.2. All the models are fine-tuned from the pre-trained ResNet model in Caffe [15]. Figure 6 shows the recognition accuracy on the validation set with respect to the iteration (a total of 50,000 iterations are conducted). The detailed recognition results on the test sets are shown in Table 3. We can see that the experiment on cropped images obtains the highest accuracy (82.70%) and the smallest loss (0.6779) among all groups of image sets. The accuracy on the bird head (77.02%) outperforms the other four parts by a large margin and is comparable to that of the original images (78.92%) and the cropped images. Additionally, although the wing and breast are not sufficient to recognize the whole bird with high probability (both reach approximately 50%), they do provide some useful information. The leg and tail obtain the lowest scores among all of these parts, 31.72% and 29.48% respectively. From the experimental results, we can conclude that the bird head contains more discriminative features than the other parts, whereas it is difficult to recognize birds using the leg and tail.

conclude [kən'kluːd]: vt. to infer, to decide, to draw a conclusion, to end; vi. to infer, to determine, to decide

Figure 6: Validation accuracy of the different images and parts during training of the convolutional neural networks. [Best viewed in color]

Through the above analysis, we know that different parts perform differently when used for recognition independently. Next, we compare the classification accuracies obtained with features extracted from different parts by combining them incrementally. That is, we take the combination of the original and cropped images as the baseline, and then add one part image at a time in order of its standalone performance (as shown in Table 3). The combined features are classified by libSVM as discussed in Section 3.2. The experimental results are shown in Table 4. As can be seen from Table 4, the classification accuracy generally increases as more part features are combined. The best performance (88.23%) appears at the combination of the baseline and three parts (i.e., the head, wing and breast) and is slightly superior (by 0.17%) to the feature combination that contains all the parts.
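The incremental combination scheme can be sketched as follows: rank the parts by their standalone accuracy, then grow the feature set one part at a time starting from the baseline. The function names are hypothetical, and no accuracy values are hard-coded; the caller supplies them.

```python
from typing import Dict, List, Sequence


def rank_parts(accuracy: Dict[str, float]) -> List[str]:
    """Part names sorted from highest to lowest standalone accuracy."""
    return sorted(accuracy, key=accuracy.get, reverse=True)


def incremental_combinations(baseline: Sequence[str],
                             ranked_parts: Sequence[str]) -> List[List[str]]:
    """Baseline first, then the baseline plus the top-1, top-2, ... parts."""
    combinations = [list(baseline)]
    for part in ranked_parts:
        combinations.append(combinations[-1] + [part])
    return combinations
```

With a baseline of the original and cropped images and five parts, this yields six combinations; the fourth one is the baseline plus the three best parts, which matches the best-performing setting reported above.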

Table 3: Comparison of test accuracy on different images and parts on the CUB200-2011 dataset.

superior [suːˈpɪərɪə]: adj. higher in rank, excellent, outstanding, haughty; n. superior, senior officer, winner, expert, elder

Finally, we compare the proposed FP-CNN method with state-of-the-art works on the CUB200-2011 dataset. The detailed results are presented in Table 5. In our method, we select the fourth combination in Table 4 as the final feature for fine-grained recognition. All the input images are resized to 224 × 224 as discussed in Section 3.2. Three types of state-of-the-art works are selected for comparison: 1) strongly supervised methods using both bounding box and part annotations [34, 13, 4], 2) strongly supervised methods using only one of the annotations ([38, 17, 28], and this paper), and 3) weakly supervised methods using only class labels [25, 20]. Our proposed method outperforms all of these state-of-the-art works in fine-grained recognition accuracy. Note that higher-resolution images can improve the classification accuracy of our method, as they provide more precise details. Although the two weakly supervised methods [25, 20] obtain attractive results, our method outperforms them by a clear margin (7.2% higher than [25] and 4.1% higher than [20]).

Table 4: Comparison of different combinations of images and parts.

Table 5: Comparison of recognition accuracy with state-of-the-art approaches on the CUB200-2011 dataset.

5 Conclusion

In this paper, based on the part annotations provided with the dataset, ground truth part regions are generated to train an FP-CNN model, so that fine-granularity parts can be precisely detected and localized in test images. We then proposed a fine-grained recognition method that uses these fine-granularity parts. Experimental results show that the proposed method improves the state-of-the-art recognition performance on the widely used CUB200-2011 bird dataset. In the future, we will explore an accurate fine-granularity part localization method that does not rely on part annotations.

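reveal [rɪ'viːl]: vt. to show, to disclose, to expose, to leak; n. exposure, side of a door or window opening
fine granularity: n. fineness of grain
constellation [,kɒnstə'leɪʃ(ə)n]: n. constellation, group of stars, assemblage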

References

Caltech-UCSD Birds 200
Birdsnap: Large-Scale Fine-Grained Visual Categorization of Birds
Fine-Grained Visual Classification of Aircraft
Learning Multi-attention Convolutional Neural Network for Fine-Grained Image Recognition
Mask-CNN: Localizing parts and selecting descriptors for fine-grained bird species categorization
LIBSVM: A Library for Support Vector Machines, https://www.csie.ntu.edu.tw/~cjlin/libsvm/

WORDBOOK

Percentage of Correctly Localized Parts，PCP

KEY POINTS

In this paper, to improve the performance of fine-grained recognition, we first try to precisely locate as many salient parts of the object as possible. Then, we work out the classification probability that can be obtained by using separate parts for object classification. Finally, by extracting efficient features from each part, combining them, and feeding them to a classifier for recognition, we obtain improved accuracy over state-of-the-art algorithms on the CUB200-2011 bird dataset.

Fine-grained recognition is very challenging due to the significant differences between samples of the same category and the obvious similarities between different categories [38, 28].

We know that although convolutional neural networks are remarkably powerful at learning features, they have poor interpretability [31, 1].

The principle behind this idiom also applies to fine-grained recognition, because the more information we get, the better our judgment will be.

i.e., strongly-supervised and weakly-supervised fine-grained recognition. Note that both categories of methods require class labels, which is why they cannot be called unsupervised recognition.

In our work, we do not make any a priori assumption about the importance of the various parts for fine-grained recognition, and all components are taken into account.

Instead of finding multiple attention areas in an image at the same time, they suggested finding different regions of attention over multiple passes, and using a recurrent neural network to predict the object class.

We first introduce a method to precisely localize the parts of the object using the available part annotations. Then, we compare and analyze the classification accuracy obtained when using different parts of the object.

The localization of possible discriminative parts is one of the core issues of fine-grained recognition.

In this paper, we suggest that, in the context of fine-grained recognition, the more information we get, the better our judgment will be.

This differs from traditional object detection, whose goal is to detect objects in raw images, because we focus on detecting parts in images that contain the object.

Note that part annotations are available in some fine-grained datasets, for example, CUB200-2011 [27], Birdsnap [3], and FGVC Aircraft [21].

The reason is that if the region is too large, other parts of the object will be included, whereas if it is too small, the distinguishable features will be lost.
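To make the size trade-off concrete, here is a minimal sketch of how a ground-truth part region might be built from a single annotated part keypoint. The function name part_region, the square shape, and the size parameter are assumptions for illustration; this excerpt does not state the exact rule the authors used to choose the region size.

```python
def part_region(cx, cy, size, img_w, img_h):
    """Return (x1, y1, x2, y2) for a square of side `size` centered on a
    part keypoint (cx, cy), clipped to the image bounds.

    `size` is a hypothetical parameter: too large a value pulls in
    neighboring parts, too small a value drops distinguishing detail.
    """
    half = size / 2.0
    x1 = max(0.0, cx - half)
    y1 = max(0.0, cy - half)
    x2 = min(float(img_w), cx + half)
    y2 = min(float(img_h), cy + half)
    return (x1, y1, x2, y2)
```

A keypoint near the image border yields a smaller, clipped box, so downstream code should not assume every region has the same area.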

Note that two thresholds should be carefully selected for part detection and localization when using YOLO. One threshold, $τ_1$, is compared with the IoU between the predicted and ground-truth part regions to determine what percentage of bounding boxes are preserved during the training phase. In the test phase, a detected part is considered valid only if its confidence is higher than another threshold, $τ_2$. The trained model is available on GitHub (https://github.com/wuyun8210/part-detection).
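The role of the two thresholds can be sketched as follows. This is an illustrative sketch only, not the authors' YOLO code: the box format (x1, y1, x2, y2), the helper names, the "confidence" dictionary key, and the use of a non-strict comparison for the IoU threshold are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def keep_for_training(pred_box, gt_box, tau1):
    """Training phase: keep a predicted box if its IoU with the
    ground-truth part region reaches tau1 (assumed to be >=)."""
    return iou(pred_box, gt_box) >= tau1


def valid_detections(detections, tau2):
    """Test phase: keep detections whose confidence is higher than tau2.
    Each detection is a dict with a hypothetical "confidence" key."""
    return [d for d in detections if d["confidence"] > tau2]
```

Raising either threshold trades recall for precision: fewer boxes survive, but those that do are better aligned or more confident.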

We use ResNet [12] as the backbone neural network, and fine-tune the parameters of the model pre-trained on ImageNet.

However, if some parts are not visible or not properly detected, these methods can easily run into a label conflict problem in model training and prediction: the same empty features will correspond to different labels. Some machine learning algorithms (e.g., SVM [26], Decision Tree [8]) are robust when learning from datasets with missing information. In this paper, to avoid the label conflict problem, we use libSVM [6] to combine all of the features, owing to its convenience in parameter tuning.

Then, the activation tensors output by the ResNet pool5 layer, with a dimension of 4096 (for an input image size of 224 × 224), are taken as the feature of the image. Missing features (corresponding to invisible parts) are set to zero vectors before all of the features are concatenated and used to train the SVM.
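The zero-filling and concatenation step can be sketched as below. This is a minimal illustration under assumptions: the function name, the dictionary layout, and the fixed part ordering are hypothetical, and the SVM training that follows (libSVM in the paper) is not shown.

```python
def concat_part_features(part_features, part_names, dim):
    """Concatenate per-part feature vectors in a fixed part order.

    part_features: dict mapping part name -> list of floats (length dim);
                   parts that were invisible or undetected are absent or None.
    part_names:    fixed ordering of parts, so every sample has the same layout.
    dim:           per-part feature dimension (4096 in the text above).
    """
    combined = []
    for name in part_names:
        feat = part_features.get(name)
        if feat is None:
            feat = [0.0] * dim  # missing part -> zero vector
        if len(feat) != dim:
            raise ValueError("feature for %r has length %d, expected %d"
                             % (name, len(feat), dim))
        combined.extend(feat)
    return combined
```

Keeping the part order fixed matters: it guarantees that a given slice of the combined vector always refers to the same part, whether or not that part was detected.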

We split the dataset into three parts: 50% for training, 20% for validation, and the remaining 30% for testing.
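A 50/20/30 split of this kind can be sketched as follows. The random shuffle, the seed, and the function name are assumptions; the text does not say whether the authors split randomly, per class, or by some other rule.

```python
import random


def split_indices(n, train=0.5, val=0.2, seed=0):
    """Shuffle sample indices 0..n-1 and split them into
    (train, validation, test); the test set takes the remainder."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility
    n_train = int(n * train)
    n_val = int(n * val)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])
```

For a fine-grained dataset with many classes, a per-class (stratified) split would keep every species represented in all three subsets; the plain shuffle here does not guarantee that.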

From the experimental results, we can safely conclude that the bird's head contains more discriminative features than the other parts; in contrast, it is difficult to recognize birds using the leg and tail alone.

Through the above analysis, we know that different parts have different performance when they are used for recognition independently.

Note that higher-resolution images can improve the classification accuracy of our method, as they provide more precise details.
