Advanced Convolutional Neural Network (CNN) Technology in Object Detection and Data Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 20 July 2025 | Viewed by 14118

Special Issue Editor


Guest Editor
INRIA (Institut National de Recherche en Informatique et en Automatique), Le Chesnay, France
Interests: attention mechanism; reinforcement learning; generative learning; causal inference

Special Issue Information

Dear Colleagues,

Convolutional neural networks (CNNs) and related deep neural networks have seen great success in machine learning and computer vision. Advanced CNNs, such as Fast R-CNN and Faster R-CNN, have achieved breakthrough performance in object detection. More recently, transformer models have been widely applied to classification, object detection, and multimodal machine learning tasks.

To further advance the research and application of advanced deep neural networks in computer vision, this Special Issue aims to collect contributions on advanced deep neural networks and algorithms in the field of computer vision and related areas. We encourage the submission of research papers on topics including, but not restricted to, object detection, image segmentation, and classification.

Dr. Shiyang Yan
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • convolutional neural network
  • computer vision
  • object detection
  • image segmentation
  • image classification

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (7 papers)


Research

18 pages, 3452 KiB  
Article
Pupil Refinement Recognition Method Based on Deep Residual Network and Attention Mechanism
by Zehui Chen, Changyuan Wang and Gongpu Wu
Appl. Sci. 2024, 14(23), 10971; https://doi.org/10.3390/app142310971 - 26 Nov 2024
Cited by 1 | Viewed by 542
Abstract
This study aims to capture subtle changes in the pupil, identify relatively weak inter-class changes, extract more abstract and discriminative pupil features, and develop a pupil refinement recognition method based on attention mechanisms. A pupil refinement recognition model is established on a deep learning framework with the ResNet101 deep residual network as the backbone. The image preprocessing module preprocesses pupil images captured in the infrared spectrum, removing internal noise from the pupil images. The ResNet101 backbone captures subtle changes in the pupil, identifies weak inter-class changes, and extracts different features of the pupil image. A channel attention module screens the pupil features to obtain key pupil features, and an external attention module enhances the expression of key pupil feature information and extracts more abstract and discriminative pupil features. Finally, a Softmax classifier processes the extracted features and outputs the refined pupil recognition result. Experimental results show that this method effectively preprocesses infrared pupil images, extracts pupil features, and achieves good fine-grained pupil recognition performance.
Figures:
Figure 1: Structure diagram of fine pupil recognition model.
Figure 2: Channel attention module.
Figure 3: External attention module.
Figure 4: Pupil image preprocessing results in visible and near-infrared spectral domains. (a) Raw near-infrared images. (b) Preprocessed near-infrared images.
Figure 5: Partial pupil feature extraction results.
Figure 6: t-SNE visualization.
Figure 7: Fine pupil recognition effect of different methods. (a) Iris recognition method of ripple transformation. (b) CNN's iris recognition method. (c) Iris recognition method for residual images. (d) Iris recognition method based on local gradient pattern. (e) Iris recognition method for Levenshtein distance. (f) Iris recognition method used in this paper.
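
For readers who want a concrete picture of the channel attention step described above, the following is a minimal PyTorch sketch of a generic squeeze-and-excitation-style channel attention module; it illustrates the general idea only, not the authors' exact implementation, and the channel and reduction sizes are assumptions.

```python
# Hypothetical sketch of a channel-attention module of the kind described in the
# abstract above (squeeze-and-excitation style); the paper's exact design may differ.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # reweight ("screen") channel features

# Example: reweighting a ResNet101-style feature map of shape (batch, 2048, 7, 7)
features = torch.randn(2, 2048, 7, 7)
attended = ChannelAttention(2048)(features)
print(attended.shape)  # torch.Size([2, 2048, 7, 7])
```

In this kind of setup, the learned per-channel weights act as the "screening" the abstract refers to, amplifying informative channels and suppressing noisy ones.
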
17 pages, 4745 KiB  
Article
Implementing YOLO Convolutional Neural Network for Seed Size Detection
by Jakub Pawłowski, Marcin Kołodziej and Andrzej Majkowski
Appl. Sci. 2024, 14(14), 6294; https://doi.org/10.3390/app14146294 - 19 Jul 2024
Viewed by 1273
Abstract
The article presents research on the application of image processing techniques and convolutional neural networks (CNNs) for the detection and measurement of seed sizes, focusing on coffee and white bean seeds. The primary objective of the study is to evaluate the potential of using CNNs to develop tools that automate seed recognition and measurement in images. A database was created containing photographs of coffee and white bean seeds with precise annotations of their location and type. Image processing techniques and You Only Look Once v8 (YOLOv8) models were employed to analyze the seeds' position, size, and type. A detailed comparison of the effectiveness and performance of the applied methods was conducted. The experiments demonstrated that the best-trained CNN model achieved a segmentation accuracy of 90.1% IoU, with an average seed size error of 0.58 mm. The conclusions indicate significant potential for using image processing techniques and CNN models to automate seed analysis, which could increase the efficiency and accuracy of these processes.
Figures:
Figure 1: A block diagram of the operations performed by the program for detecting the actual width, height, and area of seeds.
Figure 2: Examples of images containing seeds of (a) white beans and (b) coffee.
Figure 3: Segmentation results for white bean seeds.
Figure 4: Segmentation results for coffee beans.
Figure 5: Detection and segmentation of multiple coffee beans.
Figure 6: Detection and segmentation of multiple white bean seeds.
Figure 7: Dependence of IoU on the number of white bean seeds in the image.
Figure 8: Dependence of IoU on the number of coffee seeds in the image.
Figure 9: Dependence of relative error on the number of white bean seeds in the image.
Figure 10: Dependence of relative error on the number of coffee seeds in the image.
Figure 11: Dependence of IoU on the number of coffee and white bean seeds in the image.
Figure 12: Dependence of relative error on the number of coffee and white bean seeds in the image.
Figure 13: Comparison of YOLO methods and typical seed area detection methods.
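
The measurement task in this paper reduces, at evaluation time, to comparing predicted masks against ground truth with IoU and converting pixel extents into millimetres. Below is a minimal sketch of that post-processing under an assumed pixel-to-millimetre calibration; the calibration constant and helper names are illustrative, not taken from the paper.

```python
# Minimal sketch of the kind of post-processing the abstract describes: turning a
# predicted segmentation mask into a seed size in millimetres and scoring it with IoU.
# The mm_per_pixel value and function names are illustrative assumptions.
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 0.0

def seed_size_mm(mask: np.ndarray, mm_per_pixel: float) -> tuple[float, float]:
    """Width and height (mm) of the mask's axis-aligned bounding box."""
    ys, xs = np.nonzero(mask)
    width_px = xs.max() - xs.min() + 1
    height_px = ys.max() - ys.min() + 1
    return width_px * mm_per_pixel, height_px * mm_per_pixel

# Toy example: a 40 x 25 pixel "seed" imaged at an assumed 0.1 mm per pixel
mask = np.zeros((100, 100), dtype=bool)
mask[30:55, 20:60] = True
print(mask_iou(mask, mask))     # 1.0
print(seed_size_mm(mask, 0.1))  # (4.0, 2.5)
```
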
15 pages, 2625 KiB  
Article
Segmentation of Liver Tumors by Monai and PyTorch in CT Images with Deep Learning Techniques
by Sabir Muhammad and Jing Zhang
Appl. Sci. 2024, 14(12), 5144; https://doi.org/10.3390/app14125144 - 13 Jun 2024
Cited by 2 | Viewed by 2832
Abstract
Image segmentation and identification are crucial to modern medical image processing techniques. This research provides a novel and effective method for identifying and segmenting liver tumors from public CT images. Our approach leverages the hybrid ResUNet model, a combination of the ResNet and UNet models, developed with the Monai and PyTorch frameworks. The ResNet deep dense network architecture is implemented on public CT scans using the MSD Task03 Liver dataset. The novelty of our method lies in several key aspects. First, we introduce innovative enhancements to the ResUNet architecture, optimizing its performance specifically for liver tumor segmentation. Additionally, by harnessing the capabilities of Monai, we streamline the implementation process, eliminating the need for manual script writing and enabling faster, more efficient model development and optimization. Preparing images for analysis by the deep neural network involves several steps: data augmentation, Hounsfield unit (HU) windowing, and image normalization. ResUNet network performance is measured using the Dice coefficient (DC). This approach, which utilizes residual connections, has proven to be more reliable than other existing techniques, achieving DC values of 0.98 for detecting liver tumors and 0.87 for segmentation. Both qualitative and quantitative evaluations show promising results regarding model precision and accuracy. These results suggest the proposed method could increase the precision and accuracy of liver tumor detection and liver segmentation, which could help in the early diagnosis and treatment of liver cancer and ultimately improve patient prognosis.
Figures:
Figure 1: ResUNet architecture representation.
Figure 2: Overview of the liver tumor segmentation workflow.
Figure 3: Windowing steps to obtain an isolated liver.
Figure 4: Slice with different transforms.
Figure 5: Due to simple transformation, the same slice of the patient can show two different body parts.
Figure 6: Illustration of the ResUNet architecture output.
Figure 7: Dice loss and metrics trained over 100 epochs during model training.
Figure 8: Slices sample of the patient liver tumor segmentation mask.
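
Two of the steps named in this abstract, Hounsfield-unit windowing and the Dice coefficient (DC), can be illustrated with a short plain-PyTorch sketch. The window range below is an assumed example, not the value used in the paper, and the authors rely on Monai's built-in transforms and metrics rather than hand-written code like this.

```python
# A minimal sketch, assuming a plain-PyTorch implementation, of HU windowing and the
# Dice coefficient described in the abstract above; values are illustrative only.
import torch

def hu_window(ct: torch.Tensor, hu_min: float = -200.0, hu_max: float = 250.0) -> torch.Tensor:
    """Clip CT intensities to a Hounsfield-unit window and rescale to [0, 1]."""
    ct = ct.clamp(hu_min, hu_max)
    return (ct - hu_min) / (hu_max - hu_min)

def dice_coefficient(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred = pred.float().flatten()
    target = target.float().flatten()
    inter = (pred * target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Toy example: the Dice of a random binary volume with itself is 1.0
mask = (torch.rand(1, 1, 64, 64, 64) > 0.5)
print(dice_coefficient(mask, mask).item())              # ~1.0
print(hu_window(torch.tensor([-1000.0, 0.0, 500.0])))   # tensor([0.0000, 0.4444, 1.0000])
```
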
17 pages, 3884 KiB  
Article
A Method for Underwater Biological Detection Based on Improved YOLOXs
by Heng Wang, Pu Zhang, Mengnan You and Xinyuan You
Appl. Sci. 2024, 14(8), 3196; https://doi.org/10.3390/app14083196 - 10 Apr 2024
Cited by 3 | Viewed by 1310
Abstract
This article proposes a lightweight underwater biological target detection network based on an improved YOLOXs, addressing the challenges of complex and dynamic underwater environments, the limited memory of underwater devices, and constrained computational capabilities. First, in the backbone network, GhostConv and GhostBottleneck are introduced to replace the standard convolutions and the Bottleneck1 structure in CSPBottleneck_1, significantly reducing the model's parameter count and computational load and facilitating the construction of a lightweight network. Next, in the feature fusion network, a Contextual Transformer block replaces the 3 × 3 convolution in CSPBottleneck_2, enhancing self-attention learning by leveraging the rich context between input keys and improving the model's representational capacity. Finally, the Focal_EIoU localization loss is employed to replace the IoU loss, enhancing the model's robustness and generalization ability and leading to faster and more accurate convergence during training. Our experimental results demonstrate that, compared to the YOLOXs model, the proposed YOLOXs-GCE achieves a 1.1% improvement in mAP while reducing the parameter count by 24.47%, the computational load by 26.39%, and the model size by 23.87%. This effectively enhances the detection performance of the model, making it suitable for complex and dynamic underwater environments as well as underwater devices with limited memory, and the model meets the requirements of underwater target detection tasks.
Figures:
Figure 1: YOLOX network structure.
Figure 2: Mosaic data augmentation splicing effect illustration.
Figure 3: CSPBottleneck_1 structure diagram.
Figure 4: SPP network structure diagram.
Figure 5: Ghost Module structure diagram.
Figure 6: Two GhostBottleneck structures.
Figure 7: The schematic diagram of the PAFPN network.
Figure 8: (a) Schematic diagram of CSPBottleneck_2 structure. (b) Schematic diagram of CoTBottleneck structure.
Figure 9: Contextual Transformer (CoT) block.
Figure 10: Dataset samples, namely, (a) fish, (b) jellyfish, (c) penguins, (d) sharks, (e) puffins, (f) stingrays, (g) starfish.
Figure 11: Distribution of labels.
Figure 12: PR curves for the four experimental groups' test results. (a) PR curve for YOLOXs's test results; (b) PR curve for Experiment 1's test results; (c) PR curve for Experiment 2's test results; (d) PR curve for Experiment 3's test results.
Figure 13: Loss function performance comparison plot.
Figure 14: Comparison chart of detection effect. (a) YOLOXs detection results; (b) detection results of YOLOXs after improvement.
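
The GhostConv idea referenced above (from the GhostNet design) generates part of the output feature maps with a small ordinary convolution and the rest with a cheap depthwise convolution, which is where the parameter and FLOP savings come from. A hedged sketch follows; the channel sizes and activation choices are illustrative, not the paper's exact configuration.

```python
# Sketch of a Ghost convolution: a cheap depthwise convolution generates extra
# "ghost" feature maps from a small primary convolution. Illustrative only.
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 1, ratio: int = 2):
        super().__init__()
        primary_ch = out_ch // ratio
        self.primary = nn.Sequential(                       # ordinary convolution, few channels
            nn.Conv2d(in_ch, primary_ch, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(primary_ch),
            nn.SiLU(),
        )
        self.cheap = nn.Sequential(                         # depthwise conv makes the "ghost" maps
            nn.Conv2d(primary_ch, out_ch - primary_ch, 3, padding=1,
                      groups=primary_ch, bias=False),
            nn.BatchNorm2d(out_ch - primary_ch),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)         # primary + ghost feature maps

x = torch.randn(1, 64, 80, 80)
print(GhostConv(64, 128)(x).shape)   # torch.Size([1, 128, 80, 80])
```
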
27 pages, 28358 KiB  
Article
Fast Coherent Video Style Transfer via Flow Errors Reduction
by Li Wang, Xiaosong Yang and Jianjun Zhang
Appl. Sci. 2024, 14(6), 2630; https://doi.org/10.3390/app14062630 - 21 Mar 2024
Viewed by 1417
Abstract
For video style transfer, naively applying still-image techniques to process a video frame by frame independently often causes flickering artefacts. Some works incorporate optical flow into the design of a temporal constraint loss to secure temporal consistency. However, these works still suffer from incoherence (including ghosting artefacts) where large motions or occlusions occur, as optical flow fails to detect the boundaries of objects accurately. To address this problem, we propose a novel framework consisting of two stages: (1) creating new initialization images from the proposed mask techniques, which significantly reduce the flow errors; and (2) processing these initialized images iteratively with the proposed losses to obtain stylized videos free of artefacts, which also increases the speed from over 3 min per frame to less than 2 s per frame for the gradient-based optimization methods. Specifically, we propose a multi-scale mask fusion scheme to reduce untraceable flow errors and obtain an incremental mask to reduce ghosting artefacts. In addition, a multi-frame mask fusion scheme is designed to reduce traceable flow errors. In our proposed losses, the Sharpness Losses deal with potential image blurriness artefacts over long-range frames, and the Coherent Losses restrict temporal consistency at both the multi-frame RGB level and the feature level. Overall, our approach produces stable video stylization outputs even in large-motion or occlusion scenarios. The experiments demonstrate that the proposed method outperforms state-of-the-art video style transfer methods qualitatively and quantitatively on the MPI Sintel dataset.
Figures:
Figure 1: Flickering artefacts in video style transfer. The first row shows two original consecutive video frames (left) and the style image (right). The second row shows the temporal incoherence artefact by Johnson et al. [10]. The green and purple rectangles indicate the different appearances (texture patterns and colours) between these two stylized outputs, which exhibit flickering artefacts. The third row shows the stable results produced by our method, where the outputs preserve consistent texture appearances.
Figure 2: Prerequisite. We test the straightforward idea by recomposing pasted stylized content in untraceable flow regions (see zoom-in rectangles), which shows that the composition result preserves the consistency of content structures better than the warped image w^t, as w^t has unexpected content structures in the red box and duplicated ones in the orange box. ⊗ denotes the warp operation, which warps f_s^{t-1} into w^t with F^t, and ⊕ denotes element-wise addition in this paper.
Figure 3: Texture discontinuity problem. Naively combining w^t and f_s^t via M^t into flow regions causes the texture discontinuity problem. For example, in the green rectangle, the gray colours preserved from w^t lose the consistency of the texture context (in red colours) and look like noise artefacts.
Figure 4: Image blurriness artefacts. Images in the upper rows are original video frames. The blurriness artefacts become more obvious with time step.
Figure 5: System overview. Starting from three consecutive frames, our system takes the corresponding per-frame stylized f_s^t, mask M^t, and warped image w^t as inputs, then computes the initialization x̂_init^t for the gradient-based optimization video stabilization network.
Figure 6: Recurrent strategy for video style transfer.
Figure 7: Network architecture overview. During optimization, the network takes x̂_init^t obtained from initial generation, the current per-frame stylized result f_s^t, mask M^t, and a warped image w^t as inputs, and gradually optimizes the initial image x̂_init^t into x̂_out^t based on gradients computed from the losses.
Figure 8: The process of multi-scale mask fusion and incremental mask. The unexpected untraceable flow errors (see red rectangles) are fixed in this step. The fused mask after the multi-scale scheme may cause worse ghosting artefacts as the flow-untraceable regions become thinner than before; thus, the incremental mask is proposed to thicken the boundaries (see green rectangles).
Figure 9: Initialization generation. M^t is a single-channel per-pixel mask obtained from mask generation. Note that the generated x̂_init^t contains far fewer errors than the warped image w^t in the purple and red rectangles, so far fewer iterations are needed to compensate for correct pixel values.
Figure 10: The decrease in traceable flow errors (white regions on the right side) by using the proposed initialization. The rectangles indicate the error difference between the initialization x̂_init^t and x_init^t without our mask generation. Fusion in a maximum/minimum value manner means retaining the maximum/minimum values from those masks at each pixel location.
Figure 11: Qualitative ablation study on the proposed mask techniques on the Alley_2 scene from the MPI Sintel dataset [66]. The naive method using the flow mask produced by [24] causes ghosting artefacts (see unexpected grids and curves in red and orange rectangles). The multi-scale scheme causes worse ghosting artefacts. By gradually adding the incremental mask and multi-frame mask fusion techniques, the unexpected grids and curves are effectively mitigated, producing better visual quality without ghosting artefacts.
Figure 12: The effect of image sharpness. The top rows are original video frames, the middle rows are outputs without Sharpness Losses, and the bottom rows are outputs with Sharpness Losses. The red rectangles indicate the difference in image sharpness.
Figure 13: The effect of temporal consistency. The per-frame processing methods are Johnson et al. [10] and Huang et al. [13]. The red and green rectangles indicate the discontinuous texture appearances.
Figure 14: Comparison to the latest state-of-the-art methods [36,37,38,62] on the bike-packing scene from the DAVIS 2017 dataset [66]. The red rectangles indicate the differences in long-term temporal consistency results.
Figure 15: Comparison to Li et al. [35] on the Soapbox scene from the DAVIS 2017 dataset [66]. Both the red and green rectangles indicate the differences in two adjacent stabilization results. Please see the Supplementary Video for better observations.
Figure 16: Comparison to Ruder et al. [30] on the Ambush_4 scene from the MPI Sintel dataset [66]. The per-frame processing method for both methods is Johnson et al. [10]. The rectangles indicate the difference in consistent texture appearances over a few adjacent stabilization results. Orange rectangles show that the texture appearances around large occlusion boundaries (including traceable flow errors) by Ruder et al. [30] are discontinuous with the context, while purple boxes demonstrate that our textures are consistent with the context. Please see the Supplementary Video.
Figure 17: Ablation study on Sharpness Losses on the Alley_2 scene from the MPI Sintel dataset [66]. A higher ARISM score is better. The outputs with our Sharpness Losses achieve higher ARISM scores than those without pixel loss and Sharpness Losses, which indicates that the Perceptual Losses and Pixel Loss in the proposed Sharpness Losses both contribute to reducing blurriness artefacts.
Figure 18: Ablation study on image quality assessment of the Alley_2 scene from the MPI Sintel dataset [66]. A lower score indicates better visual image quality. Note that adding the multi-scale scheme (magenta line) causes image quality loss (higher score) compared to the naive method (blue line), while adding the incremental mask and multi-frame fusion (red and green lines) achieves lower scores than the naive method (blue line).
Figure 19: User study. Wang et al. [36] received the most votes in terms of temporal consistency, but our proposed method obtained the majority of votes for overall preference and style transformation effect compared to six other state-of-the-art methods. The number of votes for each method is consistent with Table 3 [30,35,36,37,38,62].
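
The temporal machinery this abstract relies on can be summarized as: warp the previous stylized frame to the current frame with optical flow, then penalize differences only where the flow is traceable (the mask regions). The sketch below illustrates those two ingredients under assumed tensor conventions; it is not the authors' implementation, and the flow format and mask semantics are assumptions.

```python
# Minimal sketch of flow warping plus a masked temporal-consistency loss.
# Assumes flow is in pixels with shape (B, 2, H, W) and mask == 1 where flow is traceable.
import torch
import torch.nn.functional as F

def warp(prev_stylized: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp the previous stylized frame to the current frame using optical flow."""
    b, _, h, w = prev_stylized.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow      # (B, 2, H, W)
    # normalise pixel coordinates to [-1, 1] for grid_sample
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                          # (B, H, W, 2)
    return F.grid_sample(prev_stylized, grid, align_corners=True)

def temporal_loss(current: torch.Tensor, warped_prev: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean squared difference, restricted to flow-traceable pixels."""
    diff = (mask * (current - warped_prev)) ** 2
    return diff.sum() / (mask.sum() * current.shape[1]).clamp(min=1.0)

b, h, w = 1, 64, 64
prev_frame = torch.rand(b, 3, h, w)
flow = torch.zeros(b, 2, h, w)           # zero flow: the warp is (approximately) the identity
mask = torch.ones(b, 1, h, w)
print(temporal_loss(prev_frame, warp(prev_frame, flow), mask).item())  # ~0.0
```
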
23 pages, 7009 KiB  
Article
PGDS-YOLOv8s: An Improved YOLOv8s Model for Object Detection in Fisheye Images
by Degang Yang, Jie Zhou, Tingting Song, Xin Zhang and Yingze Song
Appl. Sci. 2024, 14(1), 44; https://doi.org/10.3390/app14010044 - 20 Dec 2023
Cited by 4 | Viewed by 3991
Abstract
Recently, object detection has become a research hotspot in computer vision, and it typically deals with regular images with small viewing angles. To obtain a field of view without blind spots, fisheye cameras, which have a wide viewing angle, and unmanned aerial vehicles equipped with fisheye cameras are used. However, distorted and discontinuous objects appear in the captured fisheye images due to the unique viewing angle of fisheye cameras, which poses a significant challenge to existing object detectors. To solve this problem, this paper proposes the PGDS-YOLOv8s model for detecting distorted and discontinuous objects in fisheye images. First, two novel downsampling modules are proposed. Among them, the Max Pooling and Ghost's Downsampling (MPGD) module effectively extracts the essential feature information of distorted and discontinuous objects, and the Average Pooling and Ghost's Downsampling (APGD) module acquires rich global features and reduces the feature loss of distorted and discontinuous objects. In addition, the proposed C2fs module uses Squeeze-and-Excitation (SE) blocks to model the interdependence of the channels and acquire richer gradient flow information about the features, providing a better understanding of the contextual information in fisheye images. Subsequently, an SE block is added after the Spatial Pyramid Pooling Fast (SPPF) module, improving the model's ability to capture features of distorted, discontinuous objects. Moreover, the UAV-360 dataset is created for object detection in fisheye images. Finally, experiments show that the proposed PGDS-YOLOv8s model improves mAP@0.5 by 19.8% and mAP@0.5:0.95 by 27.5% on the VOC-360 dataset compared to the original YOLOv8s model. In addition, the improved model achieves 89.0% mAP@0.5 and 60.5% mAP@0.5:0.95 on the UAV-360 dataset. Furthermore, on the MS-COCO 2017 dataset, the PGDS-YOLOv8s model improves AP by 1.4%, AP50 by 1.7%, and AP75 by 1.2% compared with the original YOLOv8s model.
Figures:
Figure 1: Detection process using the improved model.
Figure 2: Fisheye images. (a) Original binocular fisheye image; (b) equirectangular projection image converted from the binocular fisheye image.
Figure 3: The labelimg tool labels objects in an equirectangular projected image; the green box is the bounding box for the labeling.
Figure 4: Some examples of the VOC-360 dataset. (a,b) are synthesized fisheye images.
Figure 5: The network structure of PGDS-YOLOv8s.
Figure 6: Illustration of the convolutional layer and Ghost module generating the same number of feature maps separately. (a) Convolutional layer; (b) Ghost module.
Figure 7: The structure of the MPGD module.
Figure 8: The structure of the APGD module.
Figure 9: A Squeeze-and-Excitation block.
Figure 10: The structure of the C2fs module and the SE-Bottleneck module.
Figure 11: Comparison of test results of four models on the VOC-360 test set. (a) YOLOv8s model; (b) YOLOv8s + MPGD + APGD model; (c) YOLOv8s + C2fs + SPPFSE model; (d) PGDS-YOLOv8s model.
Figure 12: Comparison of the test results of the two models on the UAV-360 test set. (a) YOLOv8s model; (b) PGDS-YOLOv8s model.
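
The abstract names MPGD (Max Pooling and Ghost's Downsampling) without giving its layer layout, so the following is only one plausible reading of the general idea: downsample once by max pooling and once by a Ghost-style strided convolution, then concatenate the two branches. Every structural choice below is an assumption for illustration, not the paper's definition of MPGD.

```python
# Hedged sketch of a pooling-plus-Ghost-style downsampling module; layer layout,
# channel splits, and activations are assumptions, not the paper's MPGD design.
import torch
import torch.nn as nn

class PoolGhostDownsample(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        branch_ch = out_ch // 2
        self.pool_branch = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),                    # downsample by pooling
            nn.Conv2d(in_ch, branch_ch, 1, bias=False),
            nn.BatchNorm2d(branch_ch), nn.SiLU(),
        )
        self.ghost_branch = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch // 2, 1, bias=False),          # small primary conv
            nn.BatchNorm2d(branch_ch // 2), nn.SiLU(),
            nn.Conv2d(branch_ch // 2, branch_ch // 2, 3, stride=2, padding=1,
                      groups=branch_ch // 2, bias=False),             # cheap depthwise, stride 2
            nn.BatchNorm2d(branch_ch // 2), nn.SiLU(),
            nn.Conv2d(branch_ch // 2, branch_ch, 1, bias=False),
            nn.BatchNorm2d(branch_ch), nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.pool_branch(x), self.ghost_branch(x)], dim=1)

x = torch.randn(1, 128, 160, 160)
print(PoolGhostDownsample(128, 256)(x).shape)   # torch.Size([1, 256, 80, 80])
```
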
13 pages, 2895 KiB  
Article
Improved Lightweight Multi-Target Recognition Model for Live Streaming Scenes
by Zongwei Li, Kai Qiao, Jianing Chen, Zhenyu Li and Yanhui Zhang
Appl. Sci. 2023, 13(18), 10170; https://doi.org/10.3390/app131810170 - 10 Sep 2023
Viewed by 1405
Abstract
Nowadays, the commercial potential of live e-commerce is being continuously explored, and machine vision algorithms are gradually attracting the attention of marketers and researchers. During live streaming, the visuals can be effectively captured by algorithms, thereby providing additional data support. Considering the diversity of live streaming devices, this paper proposes an extremely lightweight and high-precision model to meet different requirements in live streaming scenarios. Building upon YOLOv5s, we incorporate the MobileNetV3 module and the CA attention mechanism to optimize the model. Furthermore, we construct a multi-object dataset specific to live streaming scenarios, including anchor facial expressions and commodities. A series of experiments demonstrate that our model achieves a 0.4% improvement in accuracy compared to the original model while reducing the model weight to 10.52% of the original.
Figures:
Figure 1: Improved yolov5s-MobileNetV3-CA network architecture (* MobileNet_Block: [out_ch, hidden_ch, kernel_size, stride, use_se, use_hs]).
Figure 2: MobileNetV3 network structure.
Figure 3: The mAP0.5 histories of the yolov5s and the Yolov5s-MobileNetV3-CA model.
Figure 4: Comparison of accuracy density and mAP0.5 parameter values for the two models.
Figure 5: The losses of the Yolov5s-MobileNetV3-CA model.
Figure 6: Heat map visualization results of the learning effects of different attention mechanisms.
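
Assuming, as is common in the YOLOv5 literature, that the "CA attention mechanism" here denotes coordinate attention (Hou et al., 2021), the sketch below shows the general mechanism: features are pooled separately along height and width so that the attention weights retain positional information. The reduction factor and activation are illustrative choices, not the paper's settings.

```python
# Sketch of coordinate attention (CA), assuming CA = coordinate attention; illustrative only.
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid),
            nn.Hardswish(),
        )
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        pooled_h = x.mean(dim=3, keepdim=True)                            # (B, C, H, 1): pool along width
        pooled_w = x.mean(dim=2, keepdim=True)                            # (B, C, 1, W): pool along height
        y = torch.cat([pooled_h, pooled_w.permute(0, 1, 3, 2)], dim=2)    # (B, C, H+W, 1)
        y = self.conv1(y)
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                             # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))         # (B, C, 1, W)
        return x * a_h * a_w                                              # position-aware reweighting

x = torch.randn(2, 64, 40, 40)
print(CoordinateAttention(64)(x).shape)   # torch.Size([2, 64, 40, 40])
```
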