Article

Detecting Wheat Heads from UAV Low-Altitude Remote Sensing Images Using Deep Learning Based on Transformer

by Jiangpeng Zhu, Guofeng Yang, Xuping Feng, Xiyao Li, Hui Fang, Jinnuo Zhang, Xiulin Bai, Mingzhu Tao and Yong He

1 College of Biosystems Engineering and Food Science, Key Laboratory of Spectroscopy, Ministry of Agriculture and Rural Affairs, Zhejiang University, Hangzhou 310058, China
2 Huzhou Institute of Zhejiang University, Huzhou 313000, China
3 Department of Agricultural and Biological Engineering, Purdue University, West Lafayette, IN 47907-2093, USA
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(20), 5141; https://doi.org/10.3390/rs14205141
Submission received: 19 September 2022 / Revised: 10 October 2022 / Accepted: 10 October 2022 / Published: 14 October 2022
(This article belongs to the Special Issue Crop Biophysical Parameters Retrieval Using Remote Sensing Data)

Abstract

Object detection methods based on deep learning convolutional neural networks (CNNs) significantly improve the detection performance of wheat heads in images obtained near the ground. Nevertheless, for wheat head images with different growth stages, high density, and overlaps captured at aerial scale by an unmanned aerial vehicle (UAV), existing deep learning-based object detection methods often perform poorly. Because the receptive field of a CNN is usually small, it is not well suited to capturing global features. The visual Transformer, in contrast, can capture the global information of an image; hence, we introduce the Transformer to improve the detection effect and reduce the computation of the network. Three Transformer-based object detection networks are designed and developed, including the two-stage method FR-Transformer and the one-stage methods R-Transformer and Y-Transformer. The FR-Transformer method achieves 88.3% AP50 and 38.5% AP75, outperforming various other prevalent CNN-based object detection methods. The experiments show that the FR-Transformer method can, to a certain extent, satisfy the requirements of rapid and precise detection of wheat heads by a UAV in the field. This more relevant and direct information provides a reliable reference for further estimation of wheat yield.

Graphical Abstract">

Graphical Abstract

1. Introduction

Wheat is one of the main cereal crops and plays a vital role in human survival, health, and diet [1,2]. Nevertheless, the current world wheat supply faces severe threats such as population growth, COVID-19, and climate change [3,4,5]. In wheat production, from the heading stage to the maturity stage, the growth and health of wheat heads have a significant impact on wheat yield and quality [6]. More specifically, the number of wheat heads per unit area is one of the most critical agronomic factors in wheat production. To continuously improve the quality of wheat seeds, breeders need to repeatedly measure numerous wheat traits to identify new varieties, such as disease resistance [7], climate resilience [8], and yield [9]. Among these traits, wheat head density is a critical adaptive trait in the pursuit of high-quality wheat breeding [10]. Estimating wheat head density requires detecting all wheat heads in a given region. In the traditional breeding process, wheat head density is usually estimated by manual observation or sampling, which makes the process inefficient, tedious, and error-prone. Therefore, there is a critical need for a fast, convenient, and efficient detection method that can satisfy the requirement for efficient measurement of wheat heads in the field.
At present, remote sensing technology can be roughly divided into three scales: space, air, and ground. Compared with air-based UAV remote sensing, space-based remote sensing, such as satellites, is easily affected by meteorological conditions and has a longer revisit period and lower spatial resolution, whereas ground-based remote sensing, such as fixed monitoring stations, has limited coverage, high costs, and is difficult to deploy over large areas [11,12,13,14]. These limitations restrict their efficient application in many fields. In recent years, with the continuous maturation of unmanned aerial vehicle (UAV) flight platforms and data acquisition sensors [15,16,17,18], it has become feasible to obtain images with higher spatial, temporal, and spectral resolution at low operating cost, with high flexibility, rapid acquisition, and wide coverage. We used a high-definition zoom visible-light sensor to obtain high-quality visible-light images of wheat heads on the ground. The sensor has a higher spatial resolution and is therefore more sensitive to small objects such as wheat heads, and the temporal resolution of image acquisition is controllable, making it suitable for routine remote sensing collection of wheat head images. Correspondingly, compared with images obtained by common-resolution sensors, the images we acquire require more storage space. By comparison, hyperspectral and multispectral sensors mounted on UAVs can obtain images in more bands and provide abundant data; they have been used in fields such as land cover classification [19,20,21] and have achieved good results. However, their prices [22], operating procedures [23,24], analysis methods [25], and maintenance requirements [11] have limited their large-scale promotion and application to a certain extent.
With the rapid development of data processing technologies represented by machine learning and deep learning [26,27,28], wheat head detection methods based on UAV remote sensing images have gradually become a new alternative to traditional manual methods. One reason for the significant success of computer vision in the past can be attributed to the emergence of large-scale labeled datasets, which means that the task of wheat head detection likewise requires a large and diverse wheat head dataset. Nevertheless, most of the existing publicly available wheat head datasets were collected with ground-based phenotyping platforms or handheld devices [29,30,31,32], and their quantity and quality are often unsatisfactory. At the same time, these datasets contain a limited number of images and varieties, which makes deep learning networks less robust. To solve these problems, we first collected, organized, and cleaned a multi-variety wheat head visible-light dataset based on UAV remote sensing. Nonetheless, the detection of wheat heads remains a challenging task with the following obstacles:
  • The size, color, shape, and texture of wheat heads vary greatly depending on the variety and growth stage of wheat (Figure 1a).
  • Wheat heads grow at different heights, angles, and directions, environmental illumination is uneven and unstable (Figure 1b), and wind causes wheat heads to shake and blur the image (Figure 1c), resulting in large differences in the visual characteristics of wheat heads that affect their accurate identification.
  • The intensive planting of wheat leads to extremely dense distribution and severe occlusion, with different wheat organs blocking each other (Figure 1d).
Note that some of the above challenges also arise in general object detection tasks. Fortunately, with the advent of large-scale datasets such as ImageNet [33] and high-performance graphics processing units (GPUs), deep learning and CNNs have facilitated the development of general object detection [34,35,36,37,38,39]. Currently, most researchers apply popular deep learning CNNs directly [40,41,42] or improve them [43,44,45,46,47,48,49] to achieve efficient detection of wheat heads. Zhao et al. [44] improved the YOLO network using images captured by a UAV, reconstructing the network by adding a micro-scale detection layer, setting prior anchors, and adjusting the confidence loss function of the detection layer based on the intersection over union (IoU). These changes improve feature extraction for small wheat heads and increase detection accuracy. Since the release of the global wheat head detection (GWHD) dataset [31,32], more and more wheat head detection studies have used GWHD as experimental data [50,51,52]. The RGB images in the dataset were collected by institutions and universities in multiple countries and cover genotypes from Europe, Africa, Asia, Australia, and North America. Similarly, our study included 305 wheat genotypes from around the world. The GWHD dataset contains wheat heads at post-flowering and maturity; most images were acquired in the nadir viewing orientation using various ground-based phenotyping platforms and cameras, mainly gantries, tractors, carts, and handheld devices, so the resulting image attributes differ. Zhang et al. [47] also improved the one-stage YOLO network on GWHD, adding a GAN to the attention module to improve the performance of the main detection network, and the experimental results showed better detection performance. Sun et al. [48] used two public datasets, tested on datasets obtained by UAVs, and applied data augmentation such as contrast adjustment and Gaussian blur to improve generalization ability; the original information is adaptively pooled using an augmented feature pyramid, detection bias and errors of wheat heads are reduced with the help of the underlying information, and high detection accuracy is achieved. Wen et al. [49] used GWHD together with a supplementary wheat head dataset captured with a handheld camera, improving RetinaNet to efficiently detect and count wheat heads. Different from the datasets used in the above studies, we constructed a wheat head image dataset obtained by a UAV with high temporal and spatial resolution to satisfy the requirements of efficient measurement of wheat head traits, and developed an automatic deep learning-based wheat head detection network.
Although CNNs have become the standard method in computer vision, the impact of the above challenges can be mitigated by stronger representation capabilities. The backbone network has always been an essential part of feature extraction. Convolutional operations are good at extracting local details, but grasping global information often requires stacking many convolutional layers [53,54]. The attention mechanism is good at grasping the whole, but requires a large amount of data to learn discriminative features during training [55]. A Transformer that combines attention with convolution-like locality can improve performance while reducing computational overhead. In 2020, the emergence of the Vision Transformer (ViT) [56], DETR [57] for object detection, SETR [58] for image segmentation, and METRO [59] for 3D human pose triggered a new round of paradigm shifts, and using the Transformer for object detection has become a new research direction. Compared with existing mainstream CNN methods, Transformer-based methods also show good performance on vision tasks. Therefore, we introduce the Transformer into wheat head detection and design object detection networks with the Transformer as the backbone of Faster-RCNN, RetinaNet, and YOLOv3, in order to reduce the complexity of the network structure, explore the scalability of the network, and improve wheat head detection performance and training efficiency.

2. Materials and Methods

2.1. Data Collection and Processing

2.1.1. Experimental Site and Plant Material

The experimental site is located at the Changxing Agricultural Experiment Station of Zhejiang University (119.63°E, 30.89°N, Figure 2a). The site has a subtropical monsoon climate, with an average annual temperature of 15.6 °C, an average relative humidity of 76%, an average annual precipitation of 1309 mm, and an average annual sunshine duration of 1810 h.
Wheat was sown on 2 November 2021, with a row spacing of 25 cm and a density of 80 grains/m2. The trial comprised 305 plots of 2 m × 1.25 m corresponding to 305 wheat varieties from around the world (Figure 2d), managed with unified fertilization and irrigation according to local water and fertilizer management practices. In particular, before sowing, we applied 1 ton/mu of organic fertilizer and 25 kg/mu of compound fertilizer (N:P:K = 16:16:16).

2.1.2. Dataset

We constructed the UAV Wheat Head Dataset (UWHD), as shown in Figure 3d. During the wheat ripening stage, on 17 May 2022, a DJI Matrice M300 RTK UAV equipped with a Zenmuse H20T captured high-quality images of the wheat planting environment at a height of 10 m (Figure 2b), with a spatial resolution of 0.06 cm/px. We first screened and cleaned the original dataset and selected only the zoomed visible-light images containing wheat heads (Figure 2c), as indicated by the dashed box. Each original image was then cropped into 1120 × 1120 pixel images for subsequent operations. Finally, UWHD contained a total of 550 images, and the wheat heads in each image were labeled using the open-source LabelMe annotation tool.
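The cropping step can be sketched as follows (this is not the authors' code; a minimal tiling routine assuming OpenCV, with the 1120-pixel tile size taken from the text and the file naming purely illustrative):

```python
import os
import cv2

TILE = 1120  # crop size used for UWHD, in pixels

def crop_to_tiles(image_path, out_dir):
    """Crop a full-resolution UAV image into non-overlapping TILE x TILE patches."""
    img = cv2.imread(image_path)
    h, w = img.shape[:2]
    os.makedirs(out_dir, exist_ok=True)
    name = os.path.splitext(os.path.basename(image_path))[0]
    count = 0
    for y in range(0, h - TILE + 1, TILE):
        for x in range(0, w - TILE + 1, TILE):
            tile = img[y:y + TILE, x:x + TILE]
            cv2.imwrite(os.path.join(out_dir, f"{name}_{y}_{x}.png"), tile)
            count += 1
    return count
```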
After that, a series of data augmentations was performed on the dataset, including translation, flipping, brightness change, and blurring. After augmentation, 2750 images were obtained, and the dataset was divided into training, validation, and test sets in a ratio of 7:2:1. All other datasets use the same division ratio.
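A minimal sketch of how such an augmentation pipeline and 7:2:1 split could be implemented is shown below, assuming the Albumentations library; the augmentation parameters are not reported in the paper and are illustrative only.

```python
import random
import albumentations as A

# Illustrative pipeline mirroring the listed operations
# (translation, flipping, brightness change, blurring); all parameters are assumptions.
augment = A.Compose(
    [
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.0, rotate_limit=0, p=0.5),  # translation
        A.HorizontalFlip(p=0.5),                                                      # flipping
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.0, p=0.5),  # brightness change
        A.Blur(blur_limit=3, p=0.3),                                                  # blurring
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

def split_dataset(image_ids, seed=0):
    """Shuffle image ids and split them into train/val/test with a 7:2:1 ratio."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_train, n_val = int(0.7 * len(ids)), int(0.2 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```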
We also used the proposed method on three publicly available wheat head datasets: (1) Global Wheat Head Detection (GWHD) [31], (2) SPIKE [30], and (3) the Annotated Crop Image Dataset (ACID) [29]. Example images from the three datasets are shown in Figure 3a–c, and the details of each dataset are given in Table 1.
The GWHD dataset is a large dataset for wheat head detection. From 2016 to 2019, data were collected over four years by nine institutions and universities in various locations, aiming to cover as many genotypes, seeding densities, growth stages, and collection conditions as possible, including 507 genotypes from around the world. The dataset mainly consists of visible-light images from handheld devices, tractors, gantries, and rope-suspended phenotyping platforms. We use the version of the dataset released for the Kaggle competition (https://www.kaggle.com/c/global-wheat-detection, accessed on 10 February 2022).
The SPIKE dataset contains a total of 335 field images captured at multiple growth stages across 10 varieties, with only the bounding boxes of wheat heads annotated in each image. Images were taken directly in the field and then cropped to preserve only the region of interest, within which the bounding boxes of the wheat heads were annotated.
The ACID dataset was captured in a greenhouse environment and contains 520 images with about 4100 wheat heads in total. The dataset provides point-wise labels for each wheat head, but since wheat head detection requires bounding-box annotations, we manually labeled each wheat head using the LabelMe annotation tool.

2.2. Wheat Head Detection Method

A traditional CNN treats the image as a matrix or grid and aggregates neighboring pixels or feature pixels through a sliding window; a visual Transformer divides the input image into several image blocks to form a sequence and uses attention mechanisms or fully connected layers to handle the sequential relationships. At present, there are two main challenges when the Transformer is applied to the image domain: (1) visual entities vary greatly, so the performance of a visual Transformer may not hold across different scenarios; and (2) image resolution is high and there are a great many pixels, and a Transformer based on global self-attention incurs a large amount of computation. Based on this, this study proposes a wheat head detection method whose backbone is a hierarchically designed Transformer with a sliding (shifted) window operation. Specifically, it includes the FR-Transformer based on the two-stage algorithm Faster-RCNN, the R-Transformer based on the one-stage algorithm RetinaNet, and the Y-Transformer based on the one-stage algorithm YOLOv3.
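To make the computational argument concrete, the snippet below shows a generic window partition of the kind used by window-based self-attention (a Swin-style mechanism, not the authors' implementation): the feature map is split into non-overlapping windows and attention is computed inside each window rather than over all H × W positions, so the cost grows with the number of windows instead of quadratically with the image size. The tensor shapes are illustrative.

```python
import torch

def window_partition(x, window_size):
    """Split a feature map (B, H, W, C) into non-overlapping windows of shape
    (num_windows * B, window_size, window_size, C); H and W must be divisible by window_size."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return windows.view(-1, window_size, window_size, C)

feat = torch.randn(1, 56, 56, 96)      # e.g., an early-stage feature map
wins = window_partition(feat, window_size=7)
print(wins.shape)                      # torch.Size([64, 7, 7, 96]) -> attention runs per 7x7 window
```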

2.2.1. Transformer

The architecture of the Transformer backbone is shown in Figure 4. The input image (H, W) first passes through a Patch Partition module that splits it into blocks, and the resulting patches are sent to a linear embedding module that adjusts the number of channels. The final prediction is then obtained through multiple stages of feature extraction and downsampling. It is worth noting that after each stage, the feature map size is reduced to 1/2 of the original and the number of channels is doubled, similar to the ResNet network. The Transformer block in each stage uses the block design of the Swin Transformer [60], which consists of two consecutive Transformer blocks based on W-MSA and SW-MSA, respectively, and improves computational efficiency through the Window and Shifted Window mechanisms.
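A minimal sketch of the between-stage downsampling described above (each stage halves the spatial size and doubles the channels) is given below, written in the style of the Swin Transformer's patch merging; the 448 × 448 input, 4 × 4 patch partition, and embedding dimension of 96 are assumptions for illustration, not values reported in the paper.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Downsampling between stages: concatenate each 2x2 neighborhood of patches (4C channels),
    then project to 2C, so the resolution halves and the channel count doubles."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                          # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)    # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)

x = torch.randn(1, 112, 112, 96)    # after a 4x4 patch partition of a 448 x 448 image
print(PatchMerging(96)(x).shape)    # torch.Size([1, 56, 56, 192])
```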

2.2.2. Wheat Head Transformer Detection Method

We chose Transformer backbone implementations and compared them based on the classic two-stage method Faster-RCNN [61] and the one-stage methods RetinaNet [62] and YOLOv3 [63]. Based on Section 2.2.1, the two-stage FR-Transformer (Figure 5a) was constructed. Faster-RCNN creatively uses the convolutional network itself to generate proposal boxes and shares the convolutional features with the object detection network; its region proposal network and anchors greatly improve the quality of region proposals and, at the same time, the accuracy and speed of the detector. The one-stage R-Transformer and Y-Transformer were likewise constructed based on Section 2.2.1 (Figure 5b). RetinaNet is a classic one-stage architecture consisting of a ResNet backbone, an FPN structure, and two sub-networks for regressing object positions and predicting object categories, respectively; using Focal Loss during training solves the problem of unbalanced foreground and background classes in traditional one-stage detectors and further improves their accuracy [62]. YOLOv3 is a one-stage detector that achieves nearly twice the inference speed of conventional object detection methods at the same accuracy [63].
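The paper does not provide code, but the general pattern of swapping a Transformer backbone into a two-stage detector can be sketched with torchvision's FasterRCNN, as below. The TransformerBackbone stub, anchor sizes, and image sizes are placeholders; a real implementation would plug in a hierarchical Transformer (e.g., a Swin-style network) in place of the stand-in convolution.

```python
import torch
import torch.nn as nn
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

class TransformerBackbone(nn.Module):
    """Placeholder for a hierarchical Transformer backbone. Any module that maps
    (B, 3, H, W) -> (B, C, H', W') and exposes `out_channels` can be plugged into FasterRCNN."""
    def __init__(self, out_channels=768):
        super().__init__()
        self.out_channels = out_channels
        self.body = nn.Conv2d(3, out_channels, kernel_size=32, stride=32)  # stand-in for Transformer stages

    def forward(self, x):
        return self.body(x)

backbone = TransformerBackbone()
anchors = AnchorGenerator(sizes=((16, 32, 64),), aspect_ratios=((0.5, 1.0, 2.0),))
model = FasterRCNN(backbone, num_classes=2, rpn_anchor_generator=anchors)  # background + wheat head

images = [torch.randn(3, 448, 448)]
targets = [{"boxes": torch.tensor([[10.0, 10.0, 60.0, 60.0]]), "labels": torch.tensor([1])}]
losses = model(images, targets)  # in training mode, returns a dict of RPN and ROI-head losses
```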

2.2.3. Model Training and Testing

In our experiments, the models were trained on an Intel Xeon Silver 4210 CPU and an NVIDIA Tesla V100 GPU. The programming language is Python 3.9, based on PyTorch 1.5. During the training phase, we randomly resize images to a resolution of 448 × 448. To speed up the Transformer's convergence and take advantage of its resistance to overfitting, the AdamW method was chosen as the optimizer. The specific training hyper-parameter settings are shown in Table 2.
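A minimal training-loop sketch with the AdamW settings from Table 2 (learning rate 0.001, betas 0.9/0.999, weight decay 0.05, 100 epochs, batch size 8) is shown below; `model` and `train_loader` are assumed to be the detector and a DataLoader over UWHD, and the loss aggregation follows a torchvision-style loss dictionary rather than the authors' exact code.

```python
import torch

# Hyper-parameters taken from Table 2; `model` is the detector to be trained (assumed defined).
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.001,
    betas=(0.9, 0.999),
    weight_decay=0.05,
)

for epoch in range(100):                    # 100 epochs (Table 2)
    for images, targets in train_loader:    # assumed DataLoader yielding batches of 8
        loss_dict = model(images, targets)  # torchvision-style dict of detection losses
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```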

2.2.4. Evaluation Metrics

To evaluate the effectiveness of wheat head detection, we use precision (P) and recall (R), computed from the numbers of true positives (TP), false positives (FP), and false negatives (FN):

P = \frac{TP}{TP + FP}    (1)

R = \frac{TP}{TP + FN}    (2)

Average precision (AP) is the area under the precision-recall curve, as shown in Formula (3):

AP = \int_{0}^{1} P(R) \, dR    (3)

Mean average precision (mAP) is the sum of the average precisions across all classes divided by the number of classes:

mAP = \frac{\sum_{i=1}^{N} AP_i}{N}    (4)

where N is the number of categories. Since we only detect the single category of wheat heads, AP equals mAP. We use AP to denote the AP averaged over IoU = 0.5:0.05:0.95, AP50 to denote the AP at an IoU threshold of 0.5, and AP75 to denote the AP at an IoU threshold of 0.75. APS denotes the AP for object boxes with a pixel area smaller than 32^2, APM for boxes with a pixel area between 32^2 and 96^2, and APL for boxes with a pixel area larger than 96^2.
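For concreteness, the sketch below computes the IoU between two boxes and the precision/recall at a single IoU threshold by greedily matching confidence-sorted predictions to ground truth. COCO-style AP additionally sweeps thresholds from 0.5 to 0.95 and integrates the precision-recall curve, so this is a simplified illustration rather than the full evaluation protocol.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(pred_boxes, gt_boxes, iou_thr=0.5):
    """Greedily match predictions (assumed sorted by confidence) to ground truth
    at one IoU threshold and return (precision, recall)."""
    matched, tp = set(), 0
    for p in pred_boxes:
        best_j, best_iou = -1, iou_thr
        for j, g in enumerate(gt_boxes):
            if j in matched:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    fp = len(pred_boxes) - tp
    fn = len(gt_boxes) - tp
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)
```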

3. Results

3.1. Wheat Head Detect Results

A qualitative comparison of the proposed FR-Transformer method with R-Transformer and Y-Transformer is shown in Figure 6; the proposed FR-Transformer performs best in wheat head detection. The FR-Transformer method best overcomes the challenges of wheat head detection, such as different growth stages of wheat heads (Figure 6(I)), changes in image brightness (Figure 6(II)), image blur caused by wind or other disturbances (Figure 6(III)), and overlap between wheat heads (Figure 6(IV)). Additionally, we believe that the detection performance of our method largely benefits from the completeness and distinctiveness of wheat head characteristics at the maturity stage.

3.2. Comparison with Other Object Detection Methods

To demonstrate the detection performance of our method rigorously, we trained and tested other popular object detection methods on the same dataset (UWHD) and compared the test results, including the one-stage methods SSD, FCOS, YOLOF, and YOLOX; the multi-stage method Cascade R-CNN; and the Transformer-based DETR. Table 3 shows the detection performance of each object detection method.
The results show that the proposed FR-Transformer method can better detect most of the wheat heads in the real field environment (Table 3). Specifically, our method achieves an AP of 43.7%, AP50 of 88.3%, AP75 of 38.5%, APS of 6.4%, APM of 44.0%, and APL of 54.1%. Furthermore, its parameters (Params) and floating point operations (FLOPs) lie in the middle of all methods. Compared with YOLOX, a powerful one-stage method, FR-Transformer improves AP, AP50, AP75, APS, and APM on UWHD by 0.7%, 2.9%, 0.3%, 0.3%, and 0.9%, respectively. Compared with the multi-stage method Cascade R-CNN, AP, AP50, AP75, APS, APM, and APL improve by 5.2%, 10.2%, 8%, 3.2%, 7.5%, and 6.1%, respectively. Compared with the Transformer-based DETR, FR-Transformer surpasses DETR on all metrics except APS, often by a large margin. In general, the FR-Transformer method can accurately detect wheat heads in UAV images with complex field backgrounds, and its detection results are used for comparison in the following sections.

3.3. Comparing the Proposed Method on Different Common Wheat Head Datasets

For the GWHD, SPIKE, and ACID datasets, the proposed FR-Transformer method obtains the results shown in Table 4. Notably, the detection results obtained by the FR-Transformer method are all higher than those reported in [52].

4. Discussion

4.1. Robustness Interpretation of the Proposed Method and Optimization Direction of Detection Results

For UWHD, we find that the proposed method can predict bounding boxes of wheat heads to a certain extent even when the ground truth bounding boxes are missing for some wheat heads (Figure 7a). Correspondingly, although the method predicts a correct wheat head, it is counted as a false positive because no ground truth is available, so the AP score decreases; adding the ground truth for these wheat heads to UWHD would improve the results. In addition, due to the combined effects of illumination and bare soil, the FR-Transformer method occasionally mistakes background regions for wheat heads, a situation that research and applications should avoid (Figure 7b).

4.2. Comparison of Detection Metrics for Different Input Image Sizes

Images obtained at different scales, such as from satellites, gantries, and handheld devices, often differ greatly in resolution and size, which severely affects detection performance. Our exploration shows that the input size of the training images has a significant impact on the detection results (Table 5). For images captured by the same UAV, the larger the input training image, the higher the detection AP. In addition, we found that when the input size changed from 896 × 896 to 1120 × 1120, the metrics AP, APS, and APL increased while AP50, AP75, and APM decreased slightly. This indicates that the input size of the training images needs to be chosen appropriately to balance accuracy and training efficiency, rather than simply training on the whole image.

4.3. Detection of Wheat Head Using Different Backbones

To illustrate the superiority of the Transformer as the backbone, we use the CNN backbones ResNet-50, ResNet-101, DarkNet-53, and MobileNetV2 and conduct experiments and comparisons based on the three classic object detection methods. While keeping the neck and head of each network unchanged, ResNet-50, ResNet-101, DarkNet-53, MobileNetV2, and our introduced Transformer backbone are used for experiments (Figure 8). See Table A1 for the specific values plotted in Figure 8.
Under the same detection method, the FR-Transformer, R-Transformer, and Y-Transformer designed with the Transformer backbone surpass the other backbones on all metrics (Figure 8). For AP50, FR-Transformer improves over the ResNet-101 and ResNet-50 backbones by 7% and 8.9%, respectively; R-Transformer by 3.3% and 6.3%, respectively; and Y-Transformer improves over DarkNet-53 and MobileNetV2 by 2.3% and 8.7%, respectively. These comprehensive tests show that the Transformer backbone has great potential to further improve wheat head detection. In terms of Params and FLOPs, FR-Transformer requires fewer than Faster-RCNN with ResNet-101, R-Transformer requires fewer than RetinaNet with ResNet-101, and Y-Transformer requires fewer than YOLOv3 with DarkNet-53. Our Transformer-based methods thus achieve superior performance with smaller Params and FLOPs, demonstrating superior data efficiency.

5. Conclusions

Wheat head detection is of great significance for wheat yield estimation, field management, and phenotypic analysis. Based on UAV visible-light images, rapid and precise detection of wheat heads can greatly reduce the labor invested by traditional wheat breeders. Hence, to improve detection performance, we introduce the Transformer, a new visual network architecture. Three object detection methods were designed and developed with the Transformer as the backbone, based on the two-stage method Faster-RCNN and the one-stage methods RetinaNet and YOLOv3. The experimental results show that the Transformer-based FR-Transformer method overcomes the related visual obstacles, and its detection performance is significantly better than that of the other methods. For the three Transformer methods, replacing the original backbones with the Transformer maintains a good detection effect while relatively reducing the Params and FLOPs. The AP of the FR-Transformer method is 43.7%, AP50 is 88.3%, AP75 is 38.5%, APS is 6.4%, APM is 44.0%, and APL is 54.1%. The proposed FR-Transformer method can better satisfy the requirements of rapid and accurate detection of wheat heads in field environments photographed by UAVs and provides reliable reference data for further estimation of wheat yield.

Author Contributions

Conceptualization, J.Z. (Jiangpeng Zhu), G.Y. and X.F.; methodology, J.Z. (Jiangpeng Zhu) and G.Y.; validation, G.Y., X.F. and Y.H.; investigation, J.Z. (Jiangpeng Zhu), X.L., J.Z. (Jinnuo Zhang), X.B. and M.T.; resources, H.F.; writing—original draft preparation, J.Z. (Jiangpeng Zhu) and G.Y.; writing—review and editing, X.F., Y.H. and J.Z. (Jinnuo Zhang). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (2021YFD2000102).

Data Availability Statement

Not applicable.

Acknowledgments

We thank the Changxing Agricultural Experiment Station of Zhejiang University for providing the experimental base. We are grateful to the editors and anonymous reviewers for their constructive and helpful comments, which improved the quality of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Use of different backbones for wheat head detection.
Method | Backbone | AP | AP50 | AP75 | APS | APM | APL | Params | FLOPs
Faster-RCNN | ResNet-50 | 37.8 | 79.4 | 32.0 | 2.5 | 38.0 | 49.6 | 41.1 M | 38.8 G
Faster-RCNN | ResNet-101 | 39.9 | 81.3 | 34.2 | 3.0 | 40.8 | 51.5 | 60.1 M | 53.7 G
Faster-RCNN | Transformer | 43.7 | 88.3 | 38.5 | 6.4 | 44.0 | 54.1 | 44.8 M | 38.8 G
RetinaNet | ResNet-50 | 35.2 | 75.6 | 26.8 | 2.6 | 32.5 | 42.1 | 36.1 M | 40.1 G
RetinaNet | ResNet-101 | 38.8 | 78.6 | 30.1 | 3.5 | 36.3 | 45.6 | 55.1 M | 54.9 G
RetinaNet | Transformer | 40.3 | 81.9 | 32.3 | 4.9 | 40.6 | 48.2 | 36.8 M | 40.4 G
YOLOv3 | DarkNet-53 | 37.1 | 78.6 | 31.8 | 3.5 | 36.7 | 42.1 | 61.5 M | 37.9 G
YOLOv3 | MobileNetV2 | 34.7 | 72.2 | 24.9 | 2.4 | 33.1 | 38.3 | 3.7 M | 3.2 G
YOLOv3 | Transformer | 39.0 | 80.9 | 33.6 | 3.8 | 39.6 | 45.3 | 48.3 M | 26.8 G

References

  1. Hansen, L.B.S.; Roager, H.M.; Sondertoft, N.B.; Gobel, R.J.; Kristensen, M.; Valles-Colomer, M.; Vieira-Silva, S.; Ibrugger, S.; Lind, M.V.; Maerkedahl, R.B.; et al. A Low-Gluten Diet Induces Changes in the Intestinal Microbiome of Healthy Danish Adults. Nat. Commun. 2018, 9, 4630. [Google Scholar] [CrossRef] [Green Version]
  2. Delabre, I.; Rodriguez, L.O.; Smallwood, J.M.; Scharlemann, J.P.W.; Alcamo, J.; Antonarakis, A.S.; Rowhani, P.; Hazell, R.J.; Aksnes, D.L.; Balvanera, P.; et al. Actions on Sustainable Food Production and Consumption for the Post-2020 Global Biodiversity Framework. Sci. Adv. 2021, 7, eabc8259. [Google Scholar] [CrossRef]
  3. Deutsch, C.A.; Tewksbury, J.J.; Tigchelaar, M.; Battisti, D.S.; Merrill, S.C.; Huey, R.B.; Naylor, R.L. Increase in Crop Losses to Insect Pests in a Warming Climate. Science 2018, 361, 916–919. [Google Scholar] [CrossRef] [Green Version]
  4. Kinnunen, P.; Guillaume, J.H.A.; Taka, M.; D’Odorico, P.; Siebert, S.; Puma, M.J.; Jalava, M.; Kummu, M. Local Food Crop Production Can Fulfil Demand for Less than One-Third of the Population. Nat. Food 2020, 1, 229–237. [Google Scholar] [CrossRef]
  5. Laborde, D.; Martin, W.; Swinnen, J.; Vos, R. COVID-19 Risks to Global Food Security. Science 2020, 369, 500–502. [Google Scholar] [CrossRef] [PubMed]
  6. Watson, A.; Ghosh, S.; Williams, M.J.; Cuddy, W.S.; Simmonds, J.; Rey, M.-D.; Hatta, M.A.M.; Hinchliffe, A.; Steed, A.; Reynolds, D.; et al. Speed Breeding Is a Powerful Tool to Accelerate Crop Research and Breeding. Nat. Plants 2018, 4, 23–29. [Google Scholar] [CrossRef] [Green Version]
  7. Wang, H.; Sun, S.; Ge, W.; Zhao, L.; Hou, B.; Wang, K.; Lyu, Z.; Chen, L.; Xu, S.; Guo, J.; et al. Horizontal Gene Transfer of Fhb7 from Fungus Underlies Fusarium Head Blight Resistance in Wheat. Science 2020, 368, eaba5435. [Google Scholar] [CrossRef] [PubMed]
  8. Xiong, W.; Reynolds, M.P.; Crossa, J.; Schulthess, U.; Sonder, K.; Montes, C.; Addimando, N.; Singh, R.P.; Ammar, K.; Gerard, B.; et al. Increased Ranking Change in Wheat Breeding under Climate Change. Nat. Plants 2021, 7, 1207–1212. [Google Scholar] [CrossRef] [PubMed]
  9. Zhang, X.; Jia, H.; Li, T.; Wu, J.; Nagarajan, R.; Lei, L.; Powers, C.; Kan, C.-C.; Hua, W.; Liu, Z.; et al. TaCol-B5 Modifies Spike Architecture and Enhances Grain Yield in Wheat. Science 2022, 376, 180–183. [Google Scholar] [CrossRef] [PubMed]
  10. Yao, H.; Xie, Q.; Xue, S.; Luo, J.; Lu, J.; Kong, Z.; Wang, Y.; Zhai, W.; Lu, N.; Wei, R.; et al. HL2 on Chromosome 7D of Wheat (Triticum Aestivum L.) Regulates Both Head Length and Spikelet Number. Theor. Appl. Genet. 2019, 132, 1789–1797. [Google Scholar] [CrossRef]
  11. Yao, H.; Qin, R.; Chen, X. Unmanned Aerial Vehicle for Remote Sensing Applications—A Review. Remote Sens. 2019, 11, 1443. [Google Scholar] [CrossRef] [Green Version]
  12. Jin, X.; Zarco-Tejada, P.J.; Schmidhalter, U.; Reynolds, M.P.; Hawkesford, M.J.; Varshney, R.K.; Yang, T.; Nie, C.; Li, Z.; Ming, B.; et al. High-Throughput Estimation of Crop Traits: A Review of Ground and Aerial Phenotyping Platforms. IEEE Geosci. Remote Sens. Mag. 2021, 9, 200–231. [Google Scholar] [CrossRef]
  13. Zhang, B.; Wu, Y.; Zhao, B.; Chanussot, J.; Hong, D.; Yao, J.; Gao, L. Progress and Challenges in Intelligent Remote Sensing Satellite Systems. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 1814–1822. [Google Scholar] [CrossRef]
  14. Wang, X.; Luo, Z.; Li, W.; Hu, X.; Zhang, L.; Zhong, Y. A Self-Supervised Denoising Network for Satellite-Airborne-Ground Hyperspectral Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5503716. [Google Scholar] [CrossRef]
  15. Maimaitijiang, M.; Sagan, V.; Sidike, P.; Hartling, S.; Esposito, F.; Fritschi, F.B. Soybean Yield Prediction from UAV Using Multimodal Data Fusion and Deep Learning. Remote Sens. Environ. 2020, 237, 111599. [Google Scholar] [CrossRef]
  16. Radoglou-Grammatikis, P.; Sarigiannidis, P.; Lagkas, T.; Moscholios, I. A Compilation of UAV Applications for Precision Agriculture. Comput. Netw. 2020, 172, 107148. [Google Scholar] [CrossRef]
  17. de Almeida, D.R.A.; Broadbent, E.N.; Ferreira, M.P.; Meli, P.; Zambrano, A.M.A.; Gorgens, E.B.; Resende, A.F.; de Almeida, C.T.; do Amaral, C.H.; Corte, A.P.D.; et al. Monitoring Restored Tropical Forest Diversity and Structure through UAV-Borne Hyperspectral and Lidar Fusion. Remote Sens. Environ. 2021, 264, 112582. [Google Scholar] [CrossRef]
  18. Reddy Maddikunta, P.K.; Hakak, S.; Alazab, M.; Bhattacharya, S.; Gadekallu, T.R.; Khan, W.Z.; Pham, Q.-V. Unmanned Aerial Vehicles in Smart Agriculture: Applications, Requirements, and Challenges. IEEE Sens. J. 2021, 21, 17608–17619. [Google Scholar] [CrossRef]
  19. Cimtay, Y.; Ilk, H.G. A Novel Bilinear Unmixing Approach for Reconsideration of Subpixel Classification of Land Cover. Comput. Electron. Agric. 2018, 152, 126–140. [Google Scholar] [CrossRef]
  20. Lee, M.-K.; Golzarian, M.R.; Kim, I. A New Color Index for Vegetation Segmentation and Classification. Precis. Agric. 2021, 22, 179–204. [Google Scholar] [CrossRef]
  21. Li, H.; Li, Z.; Dong, W.; Cao, X.; Wen, Z.; Xiao, R.; Wei, Y.; Zeng, H.; Ma, X. An Automatic Approach for Detecting Seedlings per Hill of Machine-Transplanted Hybrid Rice Utilizing Machine Vision. Comput. Electron. Agric. 2021, 185, 106178. [Google Scholar] [CrossRef]
  22. Uto, K.; Seki, H.; Saito, G.; Kosugi, Y.; Komatsu, T. Development of a Low-Cost, Lightweight Hyperspectral Imaging System Based on a Polygon Mirror and Compact Spectrometers. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 861–875. [Google Scholar] [CrossRef]
  23. Honkavaara, E.; Rosnell, T.; Oliveira, R.; Tommaselli, A. Band Registration of Tuneable Frame Format Hyperspectral UAV Imagers in Complex Scenes. ISPRS J. Photogramm. Remote Sens. 2017, 134, 96–109. [Google Scholar] [CrossRef]
  24. Banerjee, B.P.; Raval, S.; Cullen, P.J. UAV-Hyperspectral Imaging of Spectrally Complex Environments. Int. J. Remote Sens. 2020, 41, 4136–4159. [Google Scholar] [CrossRef]
  25. Heylen, R.; Parente, M.; Gader, P. A Review of Nonlinear Hyperspectral Unmixing Methods. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 1844–1868. [Google Scholar] [CrossRef]
  26. Su, J.; Yi, D.; Su, B.; Mi, Z.; Liu, C.; Hu, X.; Xu, X.; Guo, L.; Chen, W.-H. Aerial Visual Perception in Smart Farming: Field Study of Wheat Yellow Rust Monitoring. IEEE Trans. Ind. Inform. 2021, 17, 2242–2249. [Google Scholar] [CrossRef] [Green Version]
  27. Onishi, M.; Ise, T. Explainable Identification and Mapping of Trees Using UAV RGB Image and Deep Learning. Sci. Rep. 2021, 11, 903. [Google Scholar] [CrossRef]
  28. Kattenborn, T.; Leitloff, J.; Schiefer, F.; Hinz, S. Review on Convolutional Neural Networks (CNN) in Vegetation Remote Sensing. ISPRS J. Photogramm. Remote Sens. 2021, 173, 24–49. [Google Scholar] [CrossRef]
  29. Pound, M.P.; Atkinson, J.A.; Wells, D.M.; Pridmore, T.P.; French, A.P. Deep Learning for Multi-Task Plant Phenotyping. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Workshop ICCVW 2017, Venice, Italy, 22–29 October 2017; pp. 2055–2063. [Google Scholar]
  30. Hasan, M.M.; Chopin, J.P.; Laga, H.; Miklavcic, S.J. Detection and Analysis of Wheat Spikes Using Convolutional Neural Networks. Plant Methods 2018, 14, 1–13. [Google Scholar] [CrossRef] [Green Version]
  31. David, E.; Madec, S.; Sadeghi-Tehran, P.; Aasen, H.; Zheng, B.; Liu, S.; Kirchgessner, N.; Ishikawa, G.; Nagasawa, K.; Badhon, M.A.; et al. Global Wheat Head Detection (GWHD) Dataset: A Large and Diverse Dataset of High-Resolution RGB-Labelled Images to Develop and Benchmark Wheat Head Detection Methods. Plant Phenom. 2020, 2020, 3521852. [Google Scholar] [CrossRef] [PubMed]
  32. David, E.; Serouart, M.; Smith, D.; Madec, S.; Velumani, K.; Liu, S.; Wang, X.; Pinto, F.; Shafiee, S.; Tahir, I.S.A.; et al. Global Wheat Head Detection 2021: An Improved Dataset for Benchmarking Wheat Head Detection Methods. Plant Phenom. 2021, 2021, 9846158. [Google Scholar] [CrossRef] [PubMed]
  33. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef] [Green Version]
  34. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
  35. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. Scaled-YOLOv4: Scaling Cross Stage Partial Network. In Proceedings of the 2021 IEEE CVF Conference Computer Vision and Pattern Recognition, CVPR 2021, Nashville, TN, USA, 20–25 June 2021; pp. 13024–13033. [Google Scholar]
  36. Srinivas, A.; Lin, T.-Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck Transformers for Visual Recognition. In Proceedings of the 2021 IEEE CVF Conference Computer Vision and Pattern Recognition, CVPR 2021, Nashville, TN, USA, 20–25 June 2021; pp. 16514–16524. [Google Scholar]
  37. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 12009–12019. [Google Scholar]
  38. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 11976–11986. [Google Scholar]
  39. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 16000–16009. [Google Scholar]
  40. Madec, S.; Jin, X.; Lu, H.; De Solan, B.; Liu, S.; Duyme, F.; Heritier, E.; Baret, F. Ear Density Estimation from High Resolution RGB Imagery Using Deep Learning Technique. Agric. For. Meteorol. 2019, 264, 225–234. [Google Scholar] [CrossRef]
  41. Su, W.-H.; Zhang, J.; Yang, C.; Page, R.; Szinyei, T.; Hirsch, C.D.; Steffenson, B.J. Automatic Evaluation of Wheat Resistance to Fusarium Head Blight Using Dual Mask-RCNN Deep Learning Frameworks in Computer Vision. Remote Sens. 2021, 13, 26. [Google Scholar] [CrossRef]
  42. Dandrifosse, S.; Ennadifi, E.; Carlier, A.; Gosselin, B.; Dumont, B.; Mercatoris, B. Deep Learning for Wheat Ear Segmentation and Ear Density Measurement: From Heading to Maturity. Comput. Electron. Agric. 2022, 199, 107161. [Google Scholar] [CrossRef]
  43. Gong, B.; Ergu, D.; Cai, Y.; Ma, B. Real-Time Detection for Wheat Head Applying Deep Neural Network. Sensors 2021, 21, 191. [Google Scholar] [CrossRef] [PubMed]
  44. Zhao, J.; Zhang, X.; Yan, J.; Qiu, X.; Yao, X.; Tian, Y.; Zhu, Y.; Cao, W. A Wheat Spike Detection Method in UAV Images Based on Improved YOLOv5. Remote Sens. 2021, 13, 3095. [Google Scholar] [CrossRef]
  45. Hong, Q.; Jiang, L.; Zhang, Z.; Ji, S.; Gu, C.; Mao, W.; Li, W.; Liu, T.; Li, B.; Tan, C. A Lightweight Model for Wheat Ear Fusarium Head Blight Detection Based on RGB Images. Remote Sens. 2022, 14, 3481. [Google Scholar] [CrossRef]
  46. Khaki, S.; Safaei, N.; Pham, H.; Wang, L. WheatNet: A Lightweight Convolutional Neural Network for High-Throughput Image-Based Wheat Head Detection and Counting. Neurocomputing 2022, 489, 78–89. [Google Scholar] [CrossRef]
  47. Zhang, Y.; Li, M.; Ma, X.; Wu, X.; Wang, Y. High-Precision Wheat Head Detection Model Based on One-Stage Network and GAN Model. Front. Plant Sci. 2022, 13, 787852. [Google Scholar] [CrossRef] [PubMed]
  48. Sun, J.; Yang, K.; Chen, C.; Shen, J.; Yang, Y.; Wu, X.; Norton, T. Wheat Head Counting in the Wild by an Augmented Feature Pyramid Networks-Based Convolutional Neural Network. Comput. Electron. Agric. 2022, 193, 106705. [Google Scholar] [CrossRef]
  49. Wen, C.; Wu, J.; Chen, H.; Su, H.; Chen, X.; Li, Z.; Yang, C. Wheat Spike Detection and Counting in the Field Based on SpikeRetinaNet. Front. Plant Sci. 2022, 13. [Google Scholar] [CrossRef] [PubMed]
  50. Liu, C.; Wang, K.; Lu, H.; Cao, Z. Dynamic Color Transform for Wheat Head Detection. In Proceedings of the 2021 IEEE CVF International Conference Computer Vision Workshop, ICCVW 2021, Montreal, BC, Canada, 11–17 October 2021; pp. 1278–1283. [Google Scholar]
  51. Najafian, K.; Ghanbari, A.; Stavness, I.; Jin, L.; Shirdel, G.H.; Maleki, F. A Semi-Self-Supervised Learning Approach for Wheat Head Detection Using Extremely Small Number of Labeled Samples. In Proceedings of the 2021 IEEE CVF International Conference Computer Vision Workshop, ICCVW 2021, Montreal, BC, Canada, 11–17 October 2021; pp. 1342–1351. [Google Scholar]
  52. Bhagat, S.; Kokare, M.; Haswani, V.; Hambarde, P.; Kamble, R. WheatNet-Lite: A Novel Light Weight Network for Wheat Head Detection. In Proceedings of the 2021 IEEE CVF International Conference Computer Vision Workshop, ICCVW 2021, Montreal, BC, Canada, 11–17 October 2021; pp. 1332–1341. [Google Scholar]
  53. Yang, G.; Chen, G.; He, Y.; Yan, Z.; Guo, Y.; Ding, J. Self-Supervised Collaborative Multi-Network for Fine-Grained Visual Categorization of Tomato Diseases. IEEE Access 2020, 8, 211912–211923. [Google Scholar] [CrossRef]
  54. Yang, G.; Chen, G.; Li, C.; Fu, J.; Guo, Y.; Liang, H. Convolutional Rebalancing Network for the Classification of Large Imbalanced Rice Pest and Disease Datasets in the Field. Front. Plant Sci. 2021, 12, 1150. [Google Scholar] [CrossRef]
  55. Yang, G.; He, Y.; Yang, Y.; Xu, B. Fine-Grained Image Classification for Crop Disease Based on Attention Mechanism. Front. Plant Sci. 2020, 11, 600854. [Google Scholar] [CrossRef]
  56. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  57. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  58. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the 2021 IEEE CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021, Nashville, TN, USA, 20–25 June 2021; pp. 6877–6886. [Google Scholar]
  59. Lin, K.; Wang, L.; Liu, Z. End-to-End Human Pose and Mesh Reconstruction with Transformers. In Proceedings of the 2021 IEEE CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021, Nashville, TN, USA, 20–25 June 2021; pp. 1954–1963. [Google Scholar]
  60. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar]
  61. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2015; Volume 28, pp. 91–99. [Google Scholar]
  62. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision, ICCV, Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
  63. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  64. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In European Conference on Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016; Volume 9905, pp. 21–37. [Google Scholar]
  65. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364. [Google Scholar] [CrossRef]
  66. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the 2019 IEEE CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea, 27 October–2 November 2019; pp. 9626–9635. [Google Scholar]
  67. Chen, Q.; Wang, Y.; Yang, T.; Zhang, X.; Cheng, J.; Sun, J. You Only Look One-Level Feature. In Proceedings of the 2021 IEEE CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021, Berkeley, CA, USA, 19–25 June 2021; pp. 13034–13043. [Google Scholar]
  68. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
Figure 1. Common challenges faced in wheat head detection: (a) different growth stages, (b) brightness variation, (c) image blur, (d) overlapping.
Figure 2. Experimental site and design. (a) Location of the experimental site; (b) UAV and sensor; (c) three types of images captured by the sensor; (d) plot design of the experimental site.
Figure 3. Examples of publicly available wheat head datasets: (a) GWHD, (b) SPIKE, (c) ACID, and (d) UWHD.
Figure 4. The architecture of the Transformer.
Figure 5. Wheat head Transformer detection methods: two-stage method FR-Transformer (a); one-stage methods R-Transformer and Y-Transformer (b).
Figure 6. Detection results on UWHD: (a) RGB image, (b) R-Transformer, (c) Y-Transformer, (d) FR-Transformer; the challenges are (I) different growth stages, (II) brightness variation, (III) image blur, and (IV) overlapping (green bounding boxes represent the ground truth of wheat heads, and red bounding boxes represent predicted wheat head regions).
Figure 7. Special detection results on UWHD. (a) The FR-Transformer method performs well even without a ground truth bounding box; (b) due to the combined effects of illumination and bare soil, the FR-Transformer method mistakes background for wheat heads, a situation that research and applications should avoid (green bounding boxes represent the ground truth of wheat heads, and red bounding boxes represent predicted wheat head regions).
Figure 8. Results of wheat head detection on UWHD using different backbones: (a) FLOPs versus AP for each method; (b) number of parameters versus AP.
Table 1. Details of common wheat head datasets and our UWHD.
Dataset | Release Time | Environment | Resolution | Numbers | Instances
GWHD [31] | 2020 | Field | 1024 × 1024 | 3422 | 188,445
SPIKE [30] | 2018 | Lab | 2001 × 1501 | 335 | 25,000
ACID [29] | 2017 | Field | 1956 × 1530 | 534 | 4100
UWHD | 2022 | Field | 1120 × 1120 | 550 | 30,500
Table 2. Hyper-parameter settings for training.
Epochs | Batch Size | Learning Rate | Betas | Weight Decay
100 | 8 | 0.001 | 0.9, 0.999 | 0.05
Table 3. Comparison of various object detection methods.
Method | Backbone | AP | AP50 | AP75 | APS | APM | APL | Params | FLOPs
Faster-RCNN | Transformer | 43.7 | 88.3 | 38.5 | 6.4 | 44.0 | 54.1 | 44.8 M | 38.8 G
RetinaNet | Transformer | 40.3 | 81.9 | 32.3 | 4.9 | 40.6 | 48.2 | 36.8 M | 40.4 G
YOLOv3 | Transformer | 39.0 | 80.9 | 33.6 | 3.8 | 39.6 | 45.3 | 48.3 M | 26.8 G
SSD [64] | VGG-16 | 35.1 | 77.3 | 26.1 | 3.1 | 35.0 | 46.3 | 23.8 M | 67.2 G
Cascade R-CNN [65] | ResNet-50 | 38.5 | 78.1 | 30.5 | 3.2 | 36.5 | 48.0 | 68.9 M | 40.8 G
FCOS [66] | ResNet-50 | 37.8 | 80.2 | 31.3 | 3.8 | 35.6 | 47.5 | 31.8 M | 38.6 G
DETR [57] | ResNet-50 | 41.1 | 87.5 | 35.6 | 8.2 | 41.5 | 50.5 | 41.3 M | 18.5 G
YOLOF [67] | R-50-C5 | 42.8 | 82.1 | 39.9 | 5.8 | 42.9 | 55.2 | 42.0 M | 19.2 G
YOLOX [68] | YOLOX-M | 43.0 | 85.4 | 38.2 | 6.1 | 43.1 | 54.6 | 25.3 M | 18.0 G
Table 4. Comparison of the proposed FR-Transformer on different common wheat head datasets.
Dataset | AP50 | Reference
GWHD | 91.3 | [52]
GWHD | 92.6 | [49]
GWHD | 93.2 | Ours
SPIKE | 67.6 | [30]
SPIKE | 86.1 | [52]
SPIKE | 88.4 | Ours
ACID | 76.9 | [52]
ACID | 79.5 | Ours
Table 5. Performance comparison of FR-Transformer on input images of different resolutions.
Input Size | AP | AP50 | AP75 | APS | APM | APL
448 × 448 | 43.7 | 88.3 | 38.5 | 6.4 | 44.0 | 54.1
672 × 672 | 44.8 | 89.1 | 40.9 | 9.9 | 45.3 | 55.4
896 × 896 | 46.3 | 90.3 | 47.2 | 11.4 | 48.6 | 57.9
1120 × 1120 | 46.7 | 89.8 | 47.1 | 12.5 | 48.5 | 58.1
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
