A CNN- and Self-Attention-Based Maize Growth Stage Recognition Method and Platform from UAV Orthophoto Images
Figure 1. UAV for image acquisition of maize growth stages and its operational process.
Figure 2. Study area and positions of ground modeling.
Figure 3. Main growth stages of maize.
Figure 4. Image sample taken by UAV after resizing.
Figure 5. Redesigned classifier for GoogLeNet.
Figure 6. Example network architecture for MaizeHT with an input image size of 224 × 224.
Figure 7. Example Residual Block architecture for MaizeHT.
Figure 8. Example MLP block architecture for MaizeHT.
Figure 9. Platform architecture schematic.
Figure 10. Intelligent recognition platform for maize growth stage: (a) home page, (b) technology introduction, (c) case presentation, (d) intelligent identification.
Figure 11. Image input of size 224 × 224 for each batch after data enhancement.
Figure 12. Image input of size 512 × 512 for each batch after data enhancement.
Figure 13. Validation accuracy curves generated during training; the input image size of MaizeHT_512 is 512 × 512, and the input images of the other models have a size of 224 × 224.
Figure 14. Validation loss curves generated during training; the input image size of MaizeHT_512 is 512 × 512, and the input images of the other models have a size of 224 × 224.
Figure 15. Confusion matrix generated by MaizeHT_224 on the test set with input image size 224 × 224.
Figure 16. Confusion matrix generated by MaizeHT_512 on the test set with input image size 512 × 512.
Figure 17. Maize growth stage recognition results.
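Figures 6–8 are the only architectural description in this excerpt: MaizeHT couples convolutional residual blocks with a self-attention stage. Below is a minimal, illustrative PyTorch sketch of such a CNN-plus-self-attention hybrid; the block counts, channel widths, and token handling are assumptions for illustration, not the published MaizeHT layers.

```python
# A minimal, illustrative CNN + self-attention hybrid in the spirit of MaizeHT
# (Figures 6-8). Block counts, channel widths, and token handling below are
# assumptions for illustration, not the published MaizeHT configuration.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with an identity shortcut (cf. Figure 7)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class TransformerBlock(nn.Module):
    """Pre-norm self-attention followed by an MLP block (cf. Figure 8)."""
    def __init__(self, dim, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class HybridNet(nn.Module):
    """Convolutional stem + residual blocks, then self-attention over tokens."""
    def __init__(self, num_classes=4, dim=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        self.cnn = nn.Sequential(
            ResidualBlock(64),
            nn.Conv2d(64, dim, 3, stride=2, padding=1),
            ResidualBlock(dim),
        )
        self.transformer = TransformerBlock(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.cnn(self.stem(x))             # B x dim x 28 x 28 for 224 input
        tokens = x.flatten(2).transpose(1, 2)  # B x 784 x dim token sequence
        tokens = self.transformer(tokens)
        return self.head(tokens.mean(dim=1))   # average pooling over tokens

logits = HybridNet()(torch.randn(1, 3, 224, 224))  # -> shape (1, 4)
```

For a 224 × 224 input this produces one logit per growth stage; the real model reports 15.446 M parameters, which this sketch makes no attempt to match.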
Abstract
1. Introduction
2. Materials and Methods
2.1. Data Acquisition and Datasets
2.1.1. Image Data Acquisition
2.1.2. Image Pre-Processing and Dataset Construction
2.2. Maize Growth Stage Recognition Method and Platform
2.2.1. Adaptation of Classic Models
2.2.2. The Proposed Hybrid Model: MaizeHT
2.2.3. Intelligent Recognition Platform
2.3. Experiment and Evaluation Metrics
2.3.1. Experimental Steps
2.3.2. Evaluation Metrics
3. Results and Discussion
3.1. Model Performance Analysis Results
3.2. Platform Application Results
4. Conclusions and Future Work
4.1. Conclusions
4.2. Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Number of UAV images per growth stage in each split:

Growth Stage | Training Images | Validation Images | Test Images |
---|---|---|---|
Seedling | 2012 | 249 | 251 |
Jointing | 2010 | 250 | 250 |
Small trumpet | 2004 | 251 | 250 |
Big trumpet | 2006 | 254 | 253 |
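The split above is roughly 8:1:1 per growth stage. Below is a minimal sketch of how such a dataset could be assembled and split with torchvision; the directory layout, augmentations, and seed are illustrative assumptions rather than the authors' exact pipeline.

```python
# A minimal sketch of assembling the split above with torchvision. The
# directory layout, augmentations, and seed are illustrative assumptions,
# not the authors' exact pipeline.
import torch
from torchvision import datasets, transforms

train_tf = transforms.Compose([
    transforms.Resize((224, 224)),      # match the 224 x 224 model input
    transforms.RandomHorizontalFlip(),  # one example of "data enhancement"
    transforms.ToTensor(),
])

# One subfolder per stage: seedling, jointing, small_trumpet, big_trumpet
full = datasets.ImageFolder("data/maize_stages", transform=train_tf)

# Roughly the 8:1:1 split implied by the table
n = len(full)
n_train, n_val = int(0.8 * n), int(0.1 * n)
train_set, val_set, test_set = torch.utils.data.random_split(
    full, [n_train, n_val, n - n_train - n_val],
    generator=torch.Generator().manual_seed(0),
)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
```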
Hardware and software environment used for the experiments:

Component | Specification |
---|---|
CPU | Intel Xeon Gold 6142 |
GPU | NVIDIA RTX 3080 |
Operating system | Ubuntu 18.04 |
Accelerated environment | CUDA 11.2, cuDNN 8.1.1 |
Development environment | PyCharm 2021.1 |
Random access memory | 27.1 GB |
Video memory | 10.5 GB |
PyTorch version | v1.10 |
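Given the environment above (PyTorch on CUDA), a training loop might look like the sketch below. `HybridNet` and `train_loader` come from the earlier sketches; the optimizer choice, learning rate, and epoch count are assumptions not stated in this excerpt.

```python
# A minimal training-loop sketch for the environment above (PyTorch + CUDA).
# HybridNet and train_loader come from the earlier sketches; the optimizer,
# learning rate, and epoch count are assumptions not stated in this excerpt.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = HybridNet().to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(50):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)  # cross-entropy on 4 stages
        loss.backward()
        optimizer.step()
```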
Per-stage results of MaizeHT_224 on the test set (input image size 224 × 224):

Stage | Precision | Recall | Specificity | F1 Score |
---|---|---|---|---|
Seedling | 0.976 | 0.976 | 0.992 | 0.976 |
Jointing | 0.972 | 0.964 | 0.991 | 0.968 |
Small trumpet | 0.957 | 0.980 | 0.985 | 0.968 |
Big trumpet | 0.980 | 0.964 | 0.993 | 0.972 |
Per-stage results of MaizeHT_512 on the test set (input image size 512 × 512):

Stage | Precision | Recall | Specificity | F1 Score |
---|---|---|---|---|
Seedling | 0.996 | 0.984 | 0.999 | 0.990 |
Jointing | 0.996 | 0.980 | 0.999 | 0.988 |
Small trumpet | 0.973 | 0.996 | 0.991 | 0.984 |
Big trumpet | 0.984 | 0.988 | 0.995 | 0.986 |
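Each per-stage row above can be derived one-vs-rest from the corresponding confusion matrix (Figures 15 and 16). A small NumPy sketch of that computation follows; the row/column convention is an assumption, while the formulas themselves are the standard definitions.

```python
# One-vs-rest metrics from a 4x4 confusion matrix (rows = true stage,
# columns = predicted stage), as in the tables above. The row/column
# convention is an assumption; the formulas are standard.
import numpy as np

def per_stage_metrics(cm):
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)              # correctly recognized images per stage
    fp = cm.sum(axis=0) - tp      # other stages predicted as this stage
    fn = cm.sum(axis=1) - tp      # this stage predicted as another stage
    tn = cm.sum() - tp - fp - fn  # everything else
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)       # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f1
```

For instance, a stage's specificity counts how many images of the other three stages were not assigned to it; feeding the matrices of Figures 15 and 16 through this function should reproduce the rows above.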
Comparison of all models on the test set (parameter count, accuracy, loss, and computational cost):

Algorithms | Input Resolution | Params (M) | Accuracy (%) | Loss | FLOPs (G) |
---|---|---|---|---|---|
CNN | |||||
AlexNet | 224 × 224 | 25.551 | 97.81 | 0.07843 | 0.633 |
VGG16 | 224 × 224 | 21.138 | 96.31 | 0.15340 | 14.311 |
VGG16 | 512 × 512 | 21.138 | 97.81 | 0.06835 | 74.743 |
ResNet18 | 224 × 224 | 11.706 | 97.31 | 0.07815 | 1.694 |
ResNet34 | 224 × 224 | 21.815 | 98.21 | 0.06974 | 3.419 |
ResNet50 | 512 × 512 | 25.617 | 99.48 | 0.01923 | 19.997 |
DenseNet121 | 512 × 512 | 7.217 | 99.23 | 0.03678 | 13.939 |
GoogLeNet | 512 × 512 | 5.863 | 99.46 | 0.01909 | 7.318 |
EfficientNet_V1 (B6) | 512 × 512 | 40.745 | 99.24 | 0.02252 | 16.601 |
EfficientNet_V2 (L) | 512 × 512 | 117.239 | 99.60 | 0.01848 | 59.755 |
Self-attention | |||||
Vision Transformer (base-p16) | 224 × 224 | 85.611 | 97.51 | 0.06673 | 15.691 |
Vision Transformer (base-p32) | 224 × 224 | 87.381 | 98.41 | 0.05135 | 4.063 |
Vision Transformer (large-p16) | 224 × 224 | 303.002 | 98.21 | 0.06088 | 55.550 |
Swin-Transformer (tiny) | 224 × 224 | 27.474 | 97.61 | 0.06904 | 4.051 |
Swin-Transformer (small) | 224 × 224 | 48.750 | 97.61 | 0.06139 | 7.927 |
Swin-Transformer (base) | 224 × 224 | 86.626 | 97.71 | 0.06968 | 14.087 |
MaizeHT_224 (proposed) | 224 × 224 | 15.446 | 97.71 | 0.10176 | 4.148 |
MaizeHT_512 (proposed) | 512 × 512 | 15.446 | 98.71 | 0.04638 | 5.416 |
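The Params (M) column is a straightforward count of trainable weights, while FLOPs require a profiler (third-party tools such as fvcore or ptflops are possibilities, not something this excerpt specifies). A sketch of the parameter count in PyTorch, applied to the illustrative HybridNet above, whose count will not match the 15.446 M reported for MaizeHT:

```python
# Counting trainable parameters, as in the Params (M) column above.
# HybridNet is the illustrative model from the earlier sketch.
import torch

def count_params_m(model: torch.nn.Module) -> float:
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

print(f"{count_params_m(HybridNet()):.3f} M trainable parameters")
```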
Algorithms | Input Resolution | Train Accuracy (%) | Train Loss | Validation Accuracy (%) | Validation Loss | Test Accuracy (%) | Test Loss |
---|---|---|---|---|---|---|---|
CNN | |||||||
AlexNet | 224 × 224 | 96.89 | 0.08802 | 97.21 | 0.08293 | 97.81 | 0.07843 |
VGG16 | 224 × 224 | 98.63 | 0.04460 | 95.71 | 0.19580 | 96.31 | 0.15340 |
VGG16 | 512 × 512 | 98.63 | 0.06064 | 97.21 | 0.10111 | 97.81 | 0.06835 |
ResNet18 | 224 × 224 | 95.80 | 0.11380 | 96.51 | 0.10940 | 97.31 | 0.07815 |
ResNet34 | 224 × 224 | 96.28 | 0.10350 | 96.02 | 0.10790 | 98.21 | 0.06974 |
ResNet50 | 512 × 512 | 99.23 | 0.02630 | 98.90 | 0.04082 | 99.48 | 0.01923 |
DenseNet121 | 512 × 512 | 97.82 | 0.06604 | 98.71 | 0.05063 | 99.23 | 0.03678 |
GoogLeNet | 512 × 512 | 98.63 | 0.03910 | 98.51 | 0.04487 | 99.46 | 0.01909 |
EfficientNet_V1 (B6) | 512 × 512 | 99.23 | 0.02016 | 98.80 | 0.03989 | 99.24 | 0.02252 |
EfficientNet_V2 (L) | 512 × 512 | 99.61 | 0.00922 | 98.61 | 0.05329 | 99.60 | 0.01848 |
Self-attention | |||||||
Vision Transformer (base-p16) | 224 × 224 | 98.64 | 0.04666 | 97.21 | 0.07104 | 97.51 | 0.06673 |
Vision Transformer (base-p32) | 224 × 224 | 98.44 | 0.05128 | 97.41 | 0.09522 | 98.41 | 0.05135 |
Vision Transformer (large-p16) | 224 × 224 | 98.82 | 0.03576 | 97.61 | 0.06434 | 98.21 | 0.06088 |
Swin-Transformer (tiny) | 224 × 224 | 97.88 | 0.06934 | 97.41 | 0.08803 | 97.61 | 0.06904 |
Swin-Transformer (small) | 224 × 224 | 98.26 | 0.05437 | 97.21 | 0.09485 | 97.61 | 0.06139 |
Swin-Transformer (base) | 224 × 224 | 98.29 | 0.05546 | 98.11 | 0.06973 | 97.71 | 0.06968 |
MaizeHT_224 (proposed) | 224 × 224 | 97.39 | 0.08262 | 96.41 | 0.1139 | 97.71 | 0.10176 |
MaizeHT_512 (proposed) | 512 × 512 | 98.06 | 0.06283 | 97.81 | 0.0731 | 98.71 | 0.04638 |
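The platform (Figures 9 and 10) exposes the trained model through a web page for intelligent identification. As a hedged sketch of what such a service could look like, assuming Flask, an /identify route, and an "image" upload field (none of which are specified in this excerpt), with HybridNet standing in for the trained MaizeHT weights:

```python
# A hedged sketch of the platform's "intelligent identification" step
# (Figures 9 and 10): Flask, the /identify route, and the field name are
# assumptions; HybridNet stands in for the trained MaizeHT weights.
import io

import torch
from flask import Flask, jsonify, request
from PIL import Image
from torchvision import transforms

app = Flask(__name__)
STAGES = ["seedling", "jointing", "small trumpet", "big trumpet"]
model = HybridNet()  # from the earlier sketch; load real weights in practice
model.eval()
tf = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

@app.route("/identify", methods=["POST"])
def identify():
    # Expect a multipart upload with an "image" field holding a UAV photo
    img = Image.open(io.BytesIO(request.files["image"].read())).convert("RGB")
    with torch.no_grad():
        logits = model(tf(img).unsqueeze(0))
    return jsonify({"stage": STAGES[int(logits.argmax(dim=1))]})
```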