Research on Machine Learning in Computer Vision

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 25 July 2025 | Viewed by 8530

Special Issue Editors


Guest Editor: Dr. Eleonora Iotti
Department of Mathematical, Physical and Computer Sciences, University of Parma, 43124 Parma, Italy
Interests: computer science; feature extraction; deep learning; meta-learning; computer vision

Guest Editor: Prof. Dr. João M. F. Rodrigues

Special Issue Information

Dear Colleagues,

This Special Issue is dedicated to the exploration of the latest advancements in Machine Learning (ML) as they apply to computer vision. The rapid progress and adoption of ML techniques have significantly enhanced the capabilities of computer vision systems, enabling them to interpret visual data with unprecedented effectiveness.

The aim of this Special Issue is to examine and discuss how the most recent ML approaches, including but not limited to deep learning, are being successfully applied to computer vision tasks such as object detection, image retrieval, segmentation, and recognition.

We are particularly interested in ML techniques such as meta-learning, reinforcement learning, and unsupervised and semi-supervised learning. We especially welcome contributions that address the challenges encountered in deploying these techniques, such as the demand for large datasets and high computational power, and that discuss and propose potential solutions, with a specific focus on one-shot or few-shot approaches. Contributions that highlight the impact of these advancements on application domains such as healthcare, autonomous vehicles, and surveillance are also welcome.

Dr. Eleonora Iotti
Prof. Dr. João M. F. Rodrigues
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • machine learning
  • computer vision
  • one- and few-shot learning
  • meta-learning
  • reinforcement learning
  • unsupervised and semi-supervised learning
  • ML-based computer vision applications

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (7 papers)


Research

16 pages, 6883 KiB  
Article
Integrated AI System for Real-Time Sports Broadcasting: Player Behavior, Game Event Recognition, and Generative AI Commentary in Basketball Games
by Sunghoon Jung, Hanmoe Kim, Hyunseo Park and Ahyoung Choi
Appl. Sci. 2025, 15(3), 1543; https://doi.org/10.3390/app15031543 - 3 Feb 2025
Viewed by 1185
Abstract
This study presents an AI-based sports broadcasting system capable of real-time game analysis and automated commentary. The model first acquires essential background knowledge, including the court layout, game rules, team information, and player details. YOLO-based segmentation is applied to the local camera view to enhance court recognition accuracy. Player action and ball tracking are performed using YOLO algorithms: in each frame, the YOLO detection model detects the players' bounding boxes, and the proposed tracking algorithm computes the IoU against detections from previous frames and links them to trace the players' movement paths. Player behavior recognition is performed with the R(2+1)D action recognition model, covering actions such as running, dribbling, shooting, and blocking. The system demonstrates high performance, achieving an average accuracy of 97% in court calibration, 92.5% in player and object detection, and 85.04% in action recognition. Key game events are identified based on positional and action data, with broadcast lines generated using GPT APIs and converted to natural audio commentary via Text-to-Speech (TTS). This system offers a comprehensive framework for automating sports broadcasting with advanced AI techniques. Full article
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
Figures 1–11: overall procedure of the system; court segmentation in local view; player tracking and action recognition; prompt example for game commentary generation; overall system architecture; application screenshot; calibrated court result; confusion matrices for court segmentation, R(2+1)D action recognition, and YOLO results; AI commentary generation result.
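
For readers unfamiliar with the tracking step summarized in the abstract above, the following is a minimal sketch of IoU-based track linking: each existing track is greedily extended with the current-frame detection whose bounding box has the highest IoU with the track's last box. The box format, threshold, and function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of IoU-based track linking (illustrative, not the paper's code).

def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def link_detections(tracks, detections, iou_threshold=0.5):
    """Greedily extend each track with the current-frame detection of highest IoU."""
    unmatched = list(range(len(detections)))
    for track in tracks:
        best_j, best_iou = None, iou_threshold
        for j in unmatched:
            score = iou(track[-1], detections[j])
            if score > best_iou:
                best_j, best_iou = j, score
        if best_j is not None:
            track.append(detections[best_j])
            unmatched.remove(best_j)
    # Detections that matched no existing track start new tracks.
    tracks.extend([[detections[j]] for j in unmatched])
    return tracks

# Example: two players tracked across two frames.
tracks = [[(10, 10, 50, 100)], [(200, 20, 240, 110)]]
tracks = link_detections(tracks, [(12, 12, 52, 102), (198, 22, 238, 112)])
print(len(tracks))  # 2: each detection extended its nearest track
```
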
23 pages, 20134 KiB  
Article
The Development and Validation of an Artificial Intelligence Model for Estimating Thumb Range of Motion Using Angle Sensors and Machine Learning: Targeting Radial Abduction, Palmar Abduction, and Pronation Angles
by Yutaka Ehara, Atsuyuki Inui, Yutaka Mifune, Kohei Yamaura, Tatsuo Kato, Takahiro Furukawa, Shuya Tanaka, Masaya Kusunose, Shunsaku Takigami, Shin Osawa, Daiji Nakabayashi, Shinya Hayashi, Tomoyuki Matsumoto, Takehiko Matsushita and Ryosuke Kuroda
Appl. Sci. 2025, 15(3), 1296; https://doi.org/10.3390/app15031296 - 27 Jan 2025
Viewed by 613
Abstract
An accurate assessment of thumb range of motion is crucial for diagnosing musculoskeletal conditions, evaluating functional impairments, and planning effective rehabilitation strategies. In this study, we aimed to enhance the accuracy of estimating thumb range of motion using a combination of MediaPipe, an AI-based posture estimation library, and machine learning methods, taking the values obtained using angle sensors as the true values. Radial abduction, palmar abduction, and pronation angles were estimated using MediaPipe based on coordinates detected from videos of 18 healthy participants (nine males and nine females aged 30–49 years) selected to reflect a balanced distribution of height and other physical characteristics. A conical thumb movement model was constructed, and parameters were generated based on the coordinate data. Five machine learning models were evaluated, with LightGBM achieving the highest accuracy across all metrics. Specifically, for radial abduction, palmar abduction, and pronation, the root mean square error (RMSE), mean absolute error (MAE), coefficient of determination (R2), and correlation coefficient were 4.67°, 3.41°, 0.94, and 0.97; 4.63°, 3.41°, 0.95, and 0.98; and 5.69°, 4.17°, 0.88, and 0.94, respectively. These results demonstrate that, when estimating thumb range of motion, the AI model trained on angle sensor data with LightGBM achieved high accuracy, comparable to that of prior methods using MediaPipe and a protractor. Full article
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
Figures 1–25: hand anthropometric measurements; validation and placement of the angle sensor on the thumb metacarpal; conical thumb movement model (CMC joint as apex, second metacarpal as base axis, rotation angle θ from 0° to 90°); tablet recording setup; MediaPipe hand landmarks; data acquisition and machine learning workflow; actual-versus-predicted and residual plots for the linear regression, ElasticNet, SVM, random forest, and LightGBM models for radial abduction, palmar abduction, and pronation; feature importance and SHAP values of LightGBM for each motion.
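
As a rough illustration of the regression setup reported in the abstract above, the sketch below trains a LightGBM regressor and reports RMSE, MAE, R2, and the correlation coefficient. The synthetic features and targets are placeholders for the MediaPipe-derived parameters and sensor-measured angles; none of the hyperparameters are taken from the paper.

```python
# Illustrative LightGBM regression with the metrics used in the abstract (assumed setup).
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                                            # stand-in for coordinate-derived parameters
y = 30 + 20 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=2.0, size=500)    # stand-in for sensor angles (degrees)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = lgb.LGBMRegressor(n_estimators=300, learning_rate=0.05)
model.fit(X_train, y_train)

pred = model.predict(X_test)
rmse = mean_squared_error(y_test, pred) ** 0.5
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
corr = np.corrcoef(y_test, pred)[0, 1]
print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.2f}  r={corr:.2f}")
```
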
16 pages, 2038 KiB  
Article
Enhancing Colony Detection of Microorganisms in Agar Dishes Using SAM-Based Synthetic Data Augmentation in Low-Data Scenarios
by Kim Mennemann, Nikolas Ebert, Laurenz Reichardt and Oliver Wasenmüller
Appl. Sci. 2025, 15(3), 1260; https://doi.org/10.3390/app15031260 - 26 Jan 2025
Viewed by 542
Abstract
In many medical and pharmaceutical processes, continuous hygiene monitoring relies on manual detection of microorganisms in agar dishes by skilled personnel. While deep learning offers the potential for automating this task, it often faces limitations due to insufficient training data, a common issue in colony detection. To address this, we propose a simple yet efficient SAM-based pipeline for Copy-Paste data augmentation to enhance detection performance, even with limited data. This paper explores a method where annotated microbial colonies from real images were copied and pasted into empty agar dish images to create new synthetic samples. These new samples inherited the annotations of the colonies inserted into them so that no further labeling was required. The resulting synthetic datasets were used to train a YOLOv8 detection model, which was then fine-tuned on just 10 to 1000 real images. The best fine-tuned model, trained on only 1000 real images, achieved an mAP of 60.6, while a base model trained on 5241 real images achieved 64.9. Although far fewer real images were used, the fine-tuned model performed comparably well, demonstrating the effectiveness of the SAM-based Copy-Paste augmentation. This approach matches or even exceeds the performance of the current state of the art in synthetic data generation in colony detection and can be expanded to include more microbial species and agar dishes. Full article
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
Figures 1–5: overview of the proposed pipeline (SAM segmentation, filtering of poor segmentations, insertion onto empty agar plates, YOLOv8 pre-training and fine-tuning); examples of good and bad SAM segmentations of AGAR colonies; examples of generated data with colonies that match or do not match the background; mAP of YOLOv8-Nano after pre-training on synthetic images with various inpainting opacity values; comparison of fine-tuning dataset sizes (mAP and AP50).
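
The core of the Copy-Paste augmentation described above is pasting a masked colony crop onto an empty dish image while carrying over its bounding-box annotation. Below is a minimal NumPy sketch of that operation; array shapes, placement, and the compositing rule are illustrative assumptions rather than the authors' pipeline.

```python
# Illustrative mask-based Copy-Paste of a colony crop onto an empty agar dish image.
import numpy as np

def paste_colony(dish, colony, mask, top_left):
    """Composite a masked colony crop onto the dish in place and return its new bbox."""
    y, x = top_left
    h, w = mask.shape
    region = dish[y:y + h, x:x + w]
    region[mask > 0] = colony[mask > 0]   # copy only colony pixels, keep agar background elsewhere
    return (x, y, x + w, y + h)           # bbox annotation inherited by the synthetic sample

# Toy example: an 8x8 "colony" pasted onto a 256x256 empty dish.
dish = np.full((256, 256, 3), 200, dtype=np.uint8)
colony = np.zeros((8, 8, 3), dtype=np.uint8)
mask = np.ones((8, 8), dtype=np.uint8)
bbox = paste_colony(dish, colony, mask, top_left=(100, 120))
print(bbox)  # (120, 100, 128, 108)
```
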
26 pages, 1303 KiB  
Article
On Explainability of Reinforcement Learning-Based Machine Learning Agents Trained with Proximal Policy Optimization That Utilizes Visual Sensor Data
by Tomasz Hachaj and Marcin Piekarczyk
Appl. Sci. 2025, 15(2), 538; https://doi.org/10.3390/app15020538 - 8 Jan 2025
Viewed by 750
Abstract
In this paper, we address the explainability of reinforcement learning-based machine learning agents trained with Proximal Policy Optimization (PPO) that utilizes visual sensor data. We propose an algorithm that allows an effective and intuitive approximation of the PPO-trained neural network (NN). We conduct several experiments to confirm our method's effectiveness. Our proposed method works well for scenarios where semantic clustering of the scene is possible. Our approach is based on the solid theoretical foundation of Gradient-weighted Class Activation Mapping (GradCAM) and Classification and Regression Trees (CART) with additional proxy geometry heuristics. It excels in the explanation process in a virtual simulation system based on a video sensor with relatively low resolution. Depending on the convolutional feature extractor of the PPO-trained neural network, our method obtains 0.945 to 0.968 accuracy in approximating the black-box model. The proposed method has important application aspects. Through its use, it is possible to estimate the causes of specific decisions made by the neural network given the current state of the observed environment. This estimation makes it possible to determine whether the network makes decisions as expected (decision-making related to the model's observation of objects belonging to different semantic classes in the environment) and to detect unexpected, seemingly chaotic behavior that might be, for example, the result of data bias, bad design of the reward function, or insufficient generalization abilities of the model. We publish all source code so that our experiments can be reproduced. Full article
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
Figures 1–14: agent–environment interaction in reinforcement learning; example PPO-trained agent networks with "Simple" and "Nature" convolutional feature extractors (ONNX visualization); proxy geometry used in the proposed method; example input images with semantic segmentations and bird's-eye views of the scene; block diagram of the proposed methodology (GradCAM and proxy geometry features used to build a CART approximation of the agent); cumulative reward during PPO training; example GradCAM results for agents with different image sensor resolutions; CART explanations of forward–backward, left–right, and jump motions; example CART decision paths for individual camera readings.
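
A rough sketch of the surrogate-model idea from the abstract above: per-frame features derived from GradCAM activations (for example, how much activation mass falls on each semantic class) are paired with the agent's logged actions and used to fit a CART tree whose test accuracy measures how well it approximates the black-box policy. The random features and stand-in policy below are placeholders, not the paper's data or algorithm details.

```python
# Illustrative CART approximation of an agent's action choices from GradCAM-derived features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# One row per frame: e.g. GradCAM mass falling on each semantic class (platform, wall, floor, ...).
gradcam_features = rng.random((2000, 5))
agent_actions = (gradcam_features[:, 0] > gradcam_features[:, 1]).astype(int)  # stand-in for logged actions

X_train, X_test, y_train, y_test = train_test_split(
    gradcam_features, agent_actions, random_state=0)
cart = DecisionTreeClassifier(max_depth=4, random_state=0)   # depth limit keeps the tree readable
cart.fit(X_train, y_train)
print("approximation accuracy:", accuracy_score(y_test, cart.predict(X_test)))
```
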
13 pages, 1853 KiB  
Article
Optimizing Deep Learning Acceleration on FPGA for Real-Time and Resource-Efficient Image Classification
by Ahmad Mouri Zadeh Khaki and Ahyoung Choi
Appl. Sci. 2025, 15(1), 422; https://doi.org/10.3390/app15010422 - 5 Jan 2025
Cited by 2 | Viewed by 1349
Abstract
Deep learning (DL) has revolutionized image classification, yet deploying convolutional neural networks (CNNs) on edge devices for real-time applications remains a significant challenge due to constraints in computation, memory, and power efficiency. This work presents an optimized implementation of VGG16 and VGG19, two widely used CNN architectures, for classifying the CIFAR-10 dataset using transfer learning on field-programmable gate arrays (FPGAs). Utilizing the Xilinx Vitis-AI and TensorFlow2 frameworks, we adapt VGG16 and VGG19 for FPGA deployment through quantization, compression, and hardware-specific optimizations. Our implementation achieves high classification accuracy, with Top-1 accuracy of 89.54% and 87.47% for VGG16 and VGG19, respectively, while delivering significant reductions in inference latency (7.29× and 6.6× compared to CPU-based alternatives). These results highlight the suitability of our approach for resource-efficient, real-time edge applications. Key contributions include a detailed methodology for combining transfer learning with FPGA acceleration, an analysis of hardware resource utilization, and performance benchmarks. This work underscores the potential of FPGA-based solutions to enable scalable, low-latency DL deployments in domains such as autonomous systems, IoT, and mobile devices. Full article
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
Figures 1–4: workflow for implementing VGG16 and VGG19 on FPGA using the Xilinx Vitis-AI framework; VGG16 and VGG19 transfer-learning architecture; conceptual Vitis-AI pipeline; confusion matrices of CIFAR-10 test set classification by the FPGA-based VGG16 and VGG19 models.
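
The transfer-learning stage described in the abstract above can be sketched in TensorFlow2/Keras as a frozen ImageNet-pretrained VGG16 backbone with a small CIFAR-10 head. The head sizes and training settings below are illustrative assumptions, and the subsequent Vitis-AI quantization, compilation, and FPGA deployment steps are not shown.

```python
# Illustrative VGG16 transfer learning on CIFAR-10 (assumed head/hyperparameters).
import tensorflow as tf

base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(32, 32, 3))
base.trainable = False                      # transfer learning: reuse ImageNet features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),   # CIFAR-10 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = tf.keras.applications.vgg16.preprocess_input(x_train.astype("float32"))
x_test = tf.keras.applications.vgg16.preprocess_input(x_test.astype("float32"))
model.fit(x_train, y_train, epochs=1, batch_size=128, validation_data=(x_test, y_test))
```
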
15 pages, 1426 KiB  
Article
Attention Score Enhancement Model Through Pairwise Image Comparison
by Yeong Seok Ju, Zong Woo Geem and Joon Shik Lim
Appl. Sci. 2024, 14(21), 9928; https://doi.org/10.3390/app14219928 - 30 Oct 2024
Viewed by 934
Abstract
This study proposes the Pairwise Attention Enhancement (PAE) model to address the limitations of the Vision Transformer (ViT). While the ViT effectively models global relationships between image patches, it encounters challenges in medical image analysis, where fine-grained local features are crucial. Although the ViT excels at capturing global interactions within the entire image, it may underperform due to its inadequate representation of local features such as color, texture, and edges. The proposed PAE model enhances local features by calculating the cosine similarity between the attention maps of training and reference images and integrating the attention maps in regions with high similarity. This approach complements the ViT's global capture capability, allowing subtle visual differences to be reflected more accurately. Experiments using Clock Drawing Test data demonstrated that the PAE model achieved a precision of 0.9383, a recall of 0.8916, an F1-score of 0.9133, and an accuracy of 92.69%, showing a 12% improvement over API-Net and a 1% improvement over the ViT. This study suggests that the PAE model can enhance performance in computer vision fields where local features are crucial by overcoming the limitations of the ViT. Full article
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
Figures 1–5: examples of image augmentations in the CDT data (original, brightness adjustment, contrast adjustment, resizing); Clock Drawing Test severity classification; PAE structure (reference image encoding, attention score enhancement, learning image encoding); ASMS (Attention Score Map Median Cosine Similarity); attention maps of the PAE model for Head 1 before and after attention enhancement.
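
A minimal sketch of the attention-enhancement idea summarized above: attention maps of a training image and a reference image are compared patch-wise with cosine similarity, and the training image's attention is blended toward the reference in regions where the similarity is high. The threshold and blending weight are illustrative assumptions, not the published PAE parameters.

```python
# Illustrative patch-wise cosine-similarity blending of attention score maps.
import numpy as np

def enhance_attention(train_attn, ref_attn, sim_threshold=0.8, weight=0.5):
    """train_attn, ref_attn: (num_patches, dim) attention score maps."""
    # Cosine similarity between corresponding patch rows.
    num = (train_attn * ref_attn).sum(axis=1)
    den = np.linalg.norm(train_attn, axis=1) * np.linalg.norm(ref_attn, axis=1) + 1e-9
    sim = num / den
    enhanced = train_attn.copy()
    high = sim > sim_threshold                 # only regions that agree with the reference
    enhanced[high] = (1 - weight) * train_attn[high] + weight * ref_attn[high]
    return enhanced

rng = np.random.default_rng(0)
train_attn = rng.random((196, 64))             # e.g. 14x14 patch grid
ref_attn = rng.random((196, 64))
print(enhance_attention(train_attn, ref_attn).shape)  # (196, 64)
```
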
15 pages, 1919 KiB  
Article
A Multimodal Recommender System Using Deep Learning Techniques Combining Review Texts and Images
by Euiju Jeong, Xinzhe Li, Angela (Eunyoung) Kwon, Seonu Park, Qinglong Li and Jaekyeong Kim
Appl. Sci. 2024, 14(20), 9206; https://doi.org/10.3390/app14209206 - 10 Oct 2024
Cited by 2 | Viewed by 2153
Abstract
Online reviews that consist of texts and images are an essential source of information for alleviating data sparsity in recommender system studies. Although texts and images provide different types of information, they can offer complementary or substitutive advantages. However, most studies fail to capture the complementary effect between texts and images in recommender systems; specifically, they have overlooked the informational value of images and proposed recommender systems based solely on textual representations. To address this research gap, this study proposes a novel recommender model that captures the dependence between texts and images. This study uses the RoBERTa and VGG-16 models to extract textual and visual information from online reviews and applies a co-attention mechanism to capture the complementarity between the two modalities. Extensive experiments were conducted using Amazon datasets, confirming the superiority of the proposed model. Our findings suggest that the complementarity of texts and images is crucial for enhancing recommendation accuracy and performance. Full article
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
Figures 1–3: example of an online review on Amazon; framework of the CAMRec model; architecture of the CAMRec model.
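
A co-attention mechanism of the kind mentioned in the abstract above can be sketched as an affinity matrix between projected text features (e.g., RoBERTa token embeddings) and image features (e.g., VGG-16 feature-map regions), softmaxed along each axis to produce summaries of each modality conditioned on the other. All dimensions below are illustrative assumptions; this is a generic co-attention sketch, not the CAMRec architecture itself.

```python
# Illustrative co-attention layer between text and image features (PyTorch).
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, text_dim=768, image_dim=512, hidden_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)

    def forward(self, text_feats, image_feats):
        # text_feats: (batch, num_tokens, text_dim); image_feats: (batch, num_regions, image_dim)
        t = self.text_proj(text_feats)                       # (B, T, H)
        v = self.image_proj(image_feats)                     # (B, R, H)
        affinity = torch.bmm(t, v.transpose(1, 2))           # (B, T, R) token-region affinity
        text_attn = affinity.softmax(dim=1)                  # attention over tokens per region
        image_attn = affinity.softmax(dim=2)                 # attention over regions per token
        attended_text = torch.bmm(text_attn.transpose(1, 2), t)   # (B, R, H) text summary per region
        attended_image = torch.bmm(image_attn, v)                 # (B, T, H) image summary per token
        return attended_text, attended_image

layer = CoAttention()
text = torch.randn(2, 20, 768)     # e.g. RoBERTa token embeddings
image = torch.randn(2, 49, 512)    # e.g. 7x7 VGG-16 feature-map regions
attended_text, attended_image = layer(text, image)
print(attended_text.shape, attended_image.shape)  # (2, 49, 256) and (2, 20, 256)
```
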