
Deep Learning for Image, Video and Signal Processing

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Information Applications".

Deadline for manuscript submissions: closed (30 September 2024) | Viewed by 43284

Special Issue Editors


Dr. Nikolaos Mitianoudis
Guest Editor
Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece
Interests: deep learning; computer vision; audio source separation; music information retrieval

Dr. Ilias Theodorakopoulos
Guest Editor
Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece
Interests: deep learning; machine learning; manifold learning; image analysis; biomedical signal processing; biomedical image analysis; pattern recognition

Special Issue Information

Dear Colleagues,

Deep learning has been a major revolution in modern information processing. All major application areas, including image, video and signal processing, have benefited from this breakthrough. Deep learning has rendered traditional approaches that rely on handcrafted features obsolete by allowing neural networks to learn optimized features from data. Current architectures, with millions of parameters, achieve top performance on many image, video and signal processing problems, and the use of GPUs has been instrumental in training them. In addition, extensions of traditional learning strategies, such as contrastive and semi-supervised learning and teacher-student models, have eased the requirement for large amounts of annotated data.

The aim of this Special Issue is to present and highlight the newest trends in deep learning for image, video and signal processing applications. These may include, but are not limited to, the following topics:

  • Object detection;
  • Semantic/instance segmentation;
  • Image fusion;
  • Image/video spatial/temporal inpainting;
  • Generative image/video processing;
  • Image/video classification;
  • Document image processing;
  • Image/video processing for autonomous driving;
  • Audio processing/classification;
  • Audio source separation;
  • Contrastive/semi-supervised learning;
  • Knowledge distillation methods.

Dr. Nikolaos Mitianoudis
Dr. Ilias Theodorakopoulos
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (20 papers)


Research


14 pages, 6789 KiB  
Article
Real-Time Nonlinear Image Reconstruction in Electrical Capacitance Tomography Using the Generative Adversarial Network
by Damian Wanta, Mikhail Ivanenko, Waldemar T. Smolik, Przemysław Wróblewski and Mateusz Midura
Information 2024, 15(10), 617; https://doi.org/10.3390/info15100617 - 9 Oct 2024
Viewed by 266
Abstract
This study investigated the potential of the generative adversarial neural network (cGAN) image reconstruction in industrial electrical capacitance tomography. The image reconstruction quality was examined using image patterns typical for a two-phase flow. The training dataset was prepared by generating images of random test objects and simulating the corresponding capacitance measurements. Numerical simulations were performed using the ECTsim toolkit for MATLAB. A cylindrical sixteen-electrode ECT sensor was used in the experiments. Real measurements were obtained using the EVT4 data acquisition system. The reconstructed images were evaluated using selected image quality metrics. The results obtained using cGAN are better than those obtained using the Landweber iteration and simplified Levenberg–Marquardt algorithm. The suggested method offers a promising solution for a fast reconstruction algorithm suitable for real-time monitoring and the control of a two-phase flow using ECT. Full article
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
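The paper's code is not reproduced on this page; purely as an illustration of the conditional-GAN reconstruction idea described in the abstract, the sketch below shows a pix2pix-style training step in PyTorch, assuming a generator G that maps a normalized 16 × 15 capacitance map to a reconstructed permittivity image and a conditional patch discriminator D. The network interfaces, tensor shapes and loss weight are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed interfaces: G(capacitance) -> reconstructed image,
# D(capacitance, image) -> patch logits (both are stand-ins for the
# paper's U-Net-style generator and its discriminator).

def cgan_train_step(G, D, opt_G, opt_D, capacitance, target, l1_weight=100.0):
    """One pix2pix-style conditional GAN update (adversarial + L1 loss)."""
    bce = nn.BCEWithLogitsLoss()

    # --- update the discriminator on real and generated pairs ---
    fake = G(capacitance)
    d_real = D(capacitance, target)
    d_fake = D(capacitance, fake.detach())
    loss_D = 0.5 * (bce(d_real, torch.ones_like(d_real)) +
                    bce(d_fake, torch.zeros_like(d_fake)))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # --- update the generator: fool D while staying close to the ground truth ---
    d_fake = D(capacitance, fake)
    loss_G = bce(d_fake, torch.ones_like(d_fake)) + l1_weight * F.l1_loss(fake, target)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```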
Figures: training dataset classes (typical flow patterns and random circular objects with their 16 × 15 capacitance maps); ECT sensor filled with PP pellet and sensor sketch; SNR of capacitance measurements from the EVT4 system; U-Net-style cGAN generator architecture; training procedure; discriminator/generator loss and relative image error over 50 epochs; reconstructions with the Landweber method, sLM algorithm and cGAN on noisy test examples (30 dB SNR); MSE and correlation histograms for the test sets; computation time per reconstruction algorithm; reconstructions of four physical test objects mimicking two-phase flow patterns.
17 pages, 9437 KiB  
Article
Utilizing RT-DETR Model for Fruit Calorie Estimation from Digital Images
by Shaomei Tang and Weiqi Yan
Information 2024, 15(8), 469; https://doi.org/10.3390/info15080469 - 7 Aug 2024
Viewed by 1196
Abstract
Estimating the calorie content of fruits is critical for weight management and maintaining overall health as well as aiding individuals in making informed dietary choices. Accurate knowledge of fruit calorie content assists in crafting personalized nutrition plans and preventing obesity and associated health issues. In this paper, we investigate the application of deep learning models for estimating the calorie content in fruits from digital images, aiming to provide a more efficient and accurate method for nutritional analysis. We create a dataset comprising images of various fruits and employ random data augmentation techniques during training to enhance model robustness. We utilize the RT-DETR model integrated into the ultralytics framework for implementation and conduct comparative experiments with YOLOv10 on the dataset. Our results show that the RT-DETR model achieved a precision rate of 99.01% and mAP50-95 of 94.45% in fruit detection from digital images, outperforming YOLOv10 in terms of F1- Confidence Curves, P-R curves, precision, and mAP. Conclusively, in this paper, we utilize a transformer architecture to detect fruits and estimate their calorie and nutritional content. The results of the experiments provide a technical reference for more accurately monitoring an individual’s dietary intake by estimating the calorie content of fruits. Full article
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
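For readers who want to reproduce the general workflow, a minimal sketch of training and running an RT-DETR detector with the ultralytics package is given below; the dataset YAML, weight file name and the per-fruit calorie table are illustrative assumptions, not the authors' actual configuration.

```python
from ultralytics import RTDETR

# Hypothetical per-fruit calorie table used to turn detections into an estimate.
CALORIES_PER_FRUIT = {"apple": 52, "banana": 89, "orange": 47}

# Train on a fruit dataset described by a YOLO-style YAML file (assumed path).
model = RTDETR("rtdetr-l.pt")            # pretrained RT-DETR-L weights
model.train(data="fruits.yaml", epochs=100, imgsz=640)

# Run inference on a single image and sum a rough calorie estimate.
results = model("basket.jpg")
total = 0
for box in results[0].boxes:
    name = results[0].names[int(box.cls)]
    total += CALORIES_PER_FRUIT.get(name, 0)
print(f"Estimated calories: {total} kcal")
```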
Figures: transformer architecture; images used in data augmentation; random HSV augmentation; four augmentation effects from the Albumentations library; RT-DETR architecture; F1–confidence and P–R curves for YOLOv10 and RT-DETR; predictions of both models in various backgrounds; calorie estimation errors in different environments; misidentification cases (an Ambrosia apple and a Rose apple detected as Gala).
16 pages, 3092 KiB  
Article
Epileptic Seizure Detection from Decomposed EEG Signal through 1D and 2D Feature Representation and Convolutional Neural Network
by Shupta Das, Suraiya Akter Mumu, M. A. H. Akhand, Abdus Salam and Md Abdus Samad Kamal
Information 2024, 15(5), 256; https://doi.org/10.3390/info15050256 - 2 May 2024
Cited by 3 | Viewed by 1255
Abstract
Electroencephalogram (EEG) has emerged as the most favorable source for recognizing brain disorders like epileptic seizure (ES) using deep learning (DL) methods. This study investigated the well-performed EEG-based ES detection method by decomposing EEG signals. Specifically, empirical mode decomposition (EMD) decomposes EEG signals into six intrinsic mode functions (IMFs). Three distinct features, namely, fluctuation index, variance, and ellipse area of the second order difference plot (SODP), were extracted from each of the IMFs. The feature values from all EEG channels were arranged in two composite feature forms: a 1D (i.e., unidimensional) form and a 2D image-like form. For ES recognition, the convolutional neural network (CNN), the most prominent DL model for 2D input, was considered for the 2D feature form, and a 1D version of CNN was employed for the 1D feature form. The experiment was conducted on a benchmark CHB-MIT dataset as well as a dataset prepared from the EEG signals of ES patients from Prince Hospital Khulna (PHK), Bangladesh. The 2D feature-based CNN model outperformed the other 1D feature-based models, showing an accuracy of 99.78% for CHB-MIT and 95.26% for PHK. Furthermore, the cross-dataset evaluations also showed favorable outcomes. Therefore, the proposed method with 2D composite feature form can be a promising ES detection method. Full article
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
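A minimal sketch of the decomposition-and-feature step described above is shown below, using the PyEMD package (pip package EMD-signal) and one common formulation of the SODP ellipse area; the exact feature definitions and channel layout used in the paper may differ.

```python
import numpy as np
from PyEMD import EMD

def sodp_ellipse_area(s):
    """Ellipse area of the second-order difference plot (one common formulation)."""
    x = s[1:-1] - s[:-2]          # s(n+1) - s(n)
    y = s[2:] - s[1:-1]           # s(n+2) - s(n+1)
    sx2, sy2, sxy = np.mean(x**2), np.mean(y**2), np.mean(x * y)
    d = np.sqrt((sx2 + sy2)**2 - 4 * (sx2 * sy2 - sxy**2))
    a = 1.7321 * np.sqrt(sx2 + sy2 + d)
    b = 1.7321 * np.sqrt(sx2 + sy2 - d)
    return np.pi * a * b

def channel_features(signal, n_imfs=6):
    """Fluctuation index, variance and SODP ellipse area for the first n_imfs IMFs."""
    imfs = EMD().emd(signal)[:n_imfs]
    feats = []
    for imf in imfs:
        fluctuation = np.mean(np.abs(np.diff(imf)))   # fluctuation index
        feats.extend([fluctuation, np.var(imf), sodp_ellipse_area(imf)])
    return np.array(feats)                            # 18 values per channel

# eeg: array of shape (n_channels, n_samples); stacking the per-channel rows gives
# the 2D image-like composite feature (n_channels x 18) fed to the 2D CNN.
# features_2d = np.stack([channel_features(ch) for ch in eeg])
```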
Figures: proposed seizure detection framework; six IMFs of seizure and non-seizure samples from the PHK dataset; 1D and 2D orientations of the 18 features (6 IMFs × 3 features) per channel (22 channels for CHB-MIT, 19 for PHK); CNN structure for the 22 × 18 composite feature (CHB-MIT); training loss and test accuracy versus epochs for the CHB-MIT and PHK datasets.
18 pages, 3172 KiB  
Article
Transformer-Based Approach to Pathology Diagnosis Using Audio Spectrogram
by Mohammad Tami, Sari Masri, Ahmad Hasasneh and Chakib Tadj
Information 2024, 15(5), 253; https://doi.org/10.3390/info15050253 - 30 Apr 2024
Viewed by 1411
Abstract
Early detection of infant pathologies by non-invasive means is a critical aspect of pediatric healthcare. Audio analysis of infant crying has emerged as a promising method to identify various health conditions without direct medical intervention. In this study, we present a cutting-edge machine learning model that employs audio spectrograms and transformer-based algorithms to classify infant crying into distinct pathological categories. Our innovative model bypasses the extensive preprocessing typically associated with audio data by exploiting the self-attention mechanisms of the transformer, thereby preserving the integrity of the audio’s diagnostic features. When benchmarked against established machine learning and deep learning models, our approach demonstrated a remarkable 98.69% accuracy, 98.73% precision, 98.71% recall, and an F1 score of 98.71%, surpassing the performance of both traditional machine learning and convolutional neural network models. This research not only provides a novel diagnostic tool that is scalable and efficient but also opens avenues for improving pediatric care through early and accurate detection of pathologies. Full article
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
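A minimal sketch of the spectrogram-and-patch front end that an Audio Spectrogram Transformer-style model relies on is shown below (log-mel spectrogram, non-overlapping 16 × 16 patches, linear projection); the sample rate, mel settings and embedding size are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torchaudio

# Assumed front-end settings; the paper's actual parameters may differ.
SAMPLE_RATE, N_MELS, PATCH, EMBED_DIM = 16000, 128, 16, 768

melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=400, hop_length=160, n_mels=N_MELS)
to_db = torchaudio.transforms.AmplitudeToDB()
proj = nn.Linear(PATCH * PATCH, EMBED_DIM)   # linear patch embedding

def patch_embeddings(waveform):
    """waveform: (1, n_samples) -> sequence of patch embeddings (n_patches, EMBED_DIM)."""
    spec = to_db(melspec(waveform))                       # (1, n_mels, n_frames)
    spec = spec[:, :, : spec.shape[-1] // PATCH * PATCH]  # trim to whole patches in time
    patches = spec.unfold(1, PATCH, PATCH).unfold(2, PATCH, PATCH)  # (1, m, t, 16, 16)
    patches = patches.reshape(-1, PATCH * PATCH)          # flatten each 16x16 patch
    return proj(patches)                                  # feed to a transformer encoder

# tokens = patch_embeddings(torch.randn(1, SAMPLE_RATE * 5))  # 5 s of audio
```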
Figures: workflow of the proposed model for classifying infant pathological cries; distribution of the three equally represented pathology classes; proposed Audio Spectrogram Transformer (AST) architecture with overlapping 16 × 16 patch embeddings, positional embeddings and a classification token; spectrograms of the healthy, RDS and sepsis classes; validation metrics over epochs; training and validation loss per epoch; confusion matrix of the tuned model; ROC curve of the multiclass AST model.
22 pages, 6807 KiB  
Article
Deep Learning-Based Road Pavement Inspection by Integrating Visual Information and IMU
by Chen-Chiung Hsieh, Han-Wen Jia, Wei-Hsin Huang and Mei-Hua Hsih
Information 2024, 15(4), 239; https://doi.org/10.3390/info15040239 - 20 Apr 2024
Cited by 1 | Viewed by 1603
Abstract
This study proposes a deep learning method for pavement defect detection, focusing on identifying potholes and cracks. A dataset comprising 10,828 images is collected, with 8662 allocated for training, 1083 for validation, and 1083 for testing. Vehicle attitude data are categorized based on three-axis acceleration and attitude change, with 6656 (64%) for training, 1664 (16%) for validation, and 2080 (20%) for testing. The Nvidia Jetson Nano serves as the vehicle-embedded system, transmitting IMU-acquired vehicle data and GoPro-captured images over a 5G network to the server. The server recognizes two damage categories, low-risk and high-risk, storing results in MongoDB. Severe damage triggers immediate alerts to maintenance personnel, while less severe issues are recorded for scheduled maintenance. The method selects YOLOv7 among various object detection models for pavement defect detection, achieving a mAP of 93.3%, a recall rate of 87.8%, a precision of 93.2%, and a processing speed of 30–40 FPS. Bi-LSTM is then chosen for vehicle vibration data processing, yielding 77% mAP, 94.9% recall rate, and 89.8% precision. Integration of the visual and vibration results, along with vehicle speed and travel distance, results in a final recall rate of 90.2% and precision of 83.7% after field testing. Full article
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
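As a rough sketch of the vibration branch, the PyTorch module below classifies fixed-length windows of three-axis acceleration plus attitude (pitch/yaw/roll) readings with a bidirectional LSTM; the window length, hidden size and two-class output (low/high risk) are assumptions for illustration, not the authors' exact network.

```python
import torch
import torch.nn as nn

class VibrationBiLSTM(nn.Module):
    """Bi-LSTM classifier for IMU windows of shape (batch, time, 6): ax, ay, az, pitch, yaw, roll."""
    def __init__(self, n_features=6, hidden=64, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)   # 2*hidden: both directions

    def forward(self, x):
        out, _ = self.lstm(x)          # (batch, time, 2*hidden)
        return self.head(out[:, -1])   # classify from the last time step

# model = VibrationBiLSTM()
# logits = model(torch.randn(8, 200, 6))   # e.g. 8 windows of 200 IMU samples
```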
Figures: annotation examples and crack-percentage categories (0%, 1–50%, 50–70%, 70–100%); proposed system framework and installation diagram; front-end and back-end software flowcharts; dataset processing before and after modification; three-axis accelerations versus time for low- and high-risk uneven pavement; information fusion threaded activity diagram; information obtained by GPRMC; relative position of the front wheels to the markers; integrated information format; schematic of a pothole and the markings; acceleration signals before and after two-stage pre-processing; x-, y- and z-axis acceleration distributions; pitch, yaw, roll and acceleration data for low- and high-risk cases.
15 pages, 3559 KiB  
Article
STAR-3D: A Holistic Approach for Human Activity Recognition in the Classroom Environment
by Vijeta Sharma, Manjari Gupta, Ajai Kumar and Deepti Mishra
Information 2024, 15(4), 179; https://doi.org/10.3390/info15040179 - 25 Mar 2024
Cited by 3 | Viewed by 1401
Abstract
The video camera is essential for reliable activity monitoring, and a robust analysis helps in efficient interpretation. The systematic assessment of classroom activity through videos can help understand engagement levels from the perspective of both students and teachers. This practice can also help in robot-assistive classroom monitoring in the context of human–robot interaction. Therefore, we propose a novel algorithm for student–teacher activity recognition using 3D CNN (STAR-3D). The experiment is carried out using India’s indigenously developed supercomputer PARAM Shivay by the Centre for Development of Advanced Computing (C-DAC), Pune, India, under the National Supercomputing Mission (NSM), with a peak performance of 837 TeraFlops. The EduNet dataset (registered under the trademark of the DRSTATM dataset), a self-developed video dataset for classroom activities with 20 action classes, is used to train the model. Due to the unavailability of similar datasets containing both students’ and teachers’ actions, training, testing, and validation are only carried out on the EduNet dataset with 83.5% accuracy. To the best of our knowledge, this is the first attempt to develop an end-to-end algorithm that recognises both the students’ and teachers’ activities in the classroom environment, and it mainly focuses on school levels (K-12). In addition, a comparison with other approaches in the same domain shows our work’s novelty. This novel algorithm will also influence the researcher in exploring research on the “Convergence of High-Performance Computing and Artificial Intelligence”. We also present future research directions to integrate the STAR-3D algorithm with robots for classroom monitoring. Full article
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
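A minimal sketch of a 3D-CNN action classifier over short video clips (channels × frames × height × width) is shown below; the 20-class output follows the EduNet class count mentioned in the abstract, while the clip size and channel widths are illustrative assumptions and the layout does not reproduce STAR-3D itself.

```python
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    """Toy 3D CNN over clips of shape (batch, 3, frames, H, W)."""
    def __init__(self, n_classes=20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                     # pool only spatially at first
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, clip):
        return self.classifier(self.features(clip).flatten(1))

# logits = Simple3DCNN()(torch.randn(2, 3, 16, 112, 112))  # two 16-frame clips
```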
Figures: architecture of the proposed STAR-3D algorithm; single-shot detector architecture; generative adversarial network (GAN) overview; ResNet50 base model of STAR-3D; sample action classes from the EduNet dataset; architecture of the PARAM Shivay supercomputer; per-class validation accuracy on EduNet; confusion matrix.
22 pages, 12087 KiB  
Article
A Cloud-Based Deep Learning Framework for Downy Mildew Detection in Viticulture Using Real-Time Image Acquisition from Embedded Devices and Drones
by Sotirios Kontogiannis, Myrto Konstantinidou, Vasileios Tsioukas and Christos Pikridas
Information 2024, 15(4), 178; https://doi.org/10.3390/info15040178 - 24 Mar 2024
Cited by 2 | Viewed by 1549
Abstract
In viticulture, downy mildew is one of the most common diseases that, if not adequately treated, can diminish production yield. However, the uncontrolled use of pesticides to alleviate its occurrence can pose significant risks for farmers, consumers, and the environment. This paper presents a new framework for the early detection and estimation of the mildew’s appearance in viticulture fields. The framework utilizes a protocol for the real-time acquisition of drones’ high-resolution RGB images and a cloud-docker-based video or image inference process using object detection CNN models. The authors implemented their framework proposition using open-source tools and experimented with their proposed implementation on the debina grape variety in Zitsa, Greece, during downy mildew outbursts. The authors present evaluation results of deep learning Faster R-CNN object detection models trained on their downy mildew annotated dataset, using the different object classifiers of VGG16, ViTDet, MobileNetV3, EfficientNet, SqueezeNet, and ResNet. The authors compare Faster R-CNN and YOLO object detectors in terms of accuracy and speed. From their experimentation, the embedded device model ViTDet showed the worst accuracy results compared to the fast inferences of YOLOv8, while MobileNetV3 significantly outperformed YOLOv8 in terms of both accuracy and speed. Regarding cloud inferences, large ResNet models performed well in terms of accuracy, while YOLOv5 faster inferences presented significant object classification losses. Full article
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
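A minimal torchvision sketch of a Faster R-CNN detector with a MobileNetV3 backbone, fine-tuned for the two leaf classes (plus background), is shown below; it follows the standard torchvision fine-tuning recipe rather than the authors' exact training setup.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Pretrained Faster R-CNN with a MobileNetV3-Large FPN backbone.
model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(weights="DEFAULT")

# Replace the box predictor: background + normal leaf + mildew-infected leaf.
num_classes = 3
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Training then follows the usual torchvision detection loop:
# images = list of (3, H, W) tensors; targets = list of dicts with "boxes" and "labels";
# losses = model(images, targets); sum(losses.values()).backward()
```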
Figures: object detection framework inputs, outputs and steps; high-level system architecture; IoT plant-level camera nodes, ThingsBoard dashboard and leaf inferences with the MobileNetV3 and ResNet-50 models; drone-monitored field, drone GPS locations from image EXIF metadata and YOLOv5-small video-stream inferences; ResNet-50 and ResNet-152 video-stream inferences; LabelImg annotation of camera-node and drone images (normal and downy mildew-infected classes); precision–recall mAP(0.5:0.95) and classification-loss curves over epochs for the cloud, embedded and mobile object detection models.
23 pages, 2629 KiB  
Article
Detect with Style: A Contrastive Learning Framework for Detecting Computer-Generated Images
by Georgios Karantaidis and Constantine Kotropoulos
Information 2024, 15(3), 158; https://doi.org/10.3390/info15030158 - 11 Mar 2024
Viewed by 1530
Abstract
The detection of computer-generated (CG) multimedia content has become of utmost importance due to the advances in digital image processing and computer graphics. Realistic CG images could be used for fraudulent purposes due to the deceiving recognition capabilities of human eyes. So, there is a need to deploy algorithmic tools for distinguishing CG images from natural ones within multimedia forensics. Here, an end-to-end framework is proposed to tackle the problem of distinguishing CG images from natural ones by utilizing supervised contrastive learning and arbitrary style transfer by means of a two-stage deep neural network architecture. This architecture enables discrimination by leveraging per-class embeddings and generating multiple training samples to increase model capacity without the need for a vast amount of initial data. Stochastic weight averaging (SWA) is also employed to improve the generalization and stability of the proposed framework. Extensive experiments are conducted to investigate the impact of various noise conditions on the classification accuracy and the proposed framework’s generalization ability. The conducted experiments demonstrate superior performance over the existing state-of-the-art methodologies on the public DSTok, Rahmouni, and LSCGB benchmark datasets. Hypothesis testing asserts that the improvements in detection accuracy are statistically significant. Full article
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
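The supervised contrastive loss and stochastic weight averaging used in the framework are both standard components; a minimal PyTorch sketch of a per-batch supervised contrastive loss over L2-normalized embeddings is given below (the temperature and batch handling are illustrative, not the paper's settings), and SWA itself can be added with torch.optim.swa_utils.AveragedModel.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.07):
    """Supervised contrastive loss for a batch of embeddings (N, D) with integer labels (N,)."""
    z = F.normalize(embeddings, dim=1)                      # L2-normalize
    sim = z @ z.t() / temperature                           # pairwise similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, -1e9)                        # exclude self-comparisons

    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    pos_counts = pos_mask.sum(1)

    # average log-probability of positives, for anchors that have at least one positive
    valid = pos_counts > 0
    mean_log_prob_pos = (log_prob * pos_mask).sum(1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()

# loss = supervised_contrastive_loss(encoder(images), labels)
```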
Figures: architecture of the proposed CoStNet; a natural image next to a computer-generated image produced by the style transfer module; components of the style transfer module; learning procedure of the framework; sample DSTok images (natural versus computer-generated); training loss and validation accuracy during the first training stage on DSTok; original CGI and CGIs with salt-and-pepper noise injected at various SNRs; accuracy of CoStNet on DSTok when only portions of the original dataset are retained.
61 pages, 7868 KiB  
Article
Advances in Facial Expression Recognition: A Survey of Methods, Benchmarks, Models, and Datasets
by Thomas Kopalidis, Vassilios Solachidis, Nicholas Vretos and Petros Daras
Information 2024, 15(3), 135; https://doi.org/10.3390/info15030135 - 28 Feb 2024
Viewed by 9505
Abstract
Recent technological developments have enabled computers to identify and categorize facial expressions to determine a person’s emotional state in an image or a video. This process, called “Facial Expression Recognition (FER)”, has become one of the most popular research areas in computer vision. In recent times, deep FER systems have primarily concentrated on addressing two significant challenges: the problem of overfitting due to limited training data availability, and the presence of expression-unrelated variations, including illumination, head pose, image resolution, and identity bias. In this paper, a comprehensive survey is provided on deep FER, encompassing algorithms and datasets that offer insights into these intrinsic problems. Initially, this paper presents a detailed timeline showcasing the evolution of methods and datasets in deep facial expression recognition (FER). This timeline illustrates the progression and development of the techniques and data resources used in FER. Then, a comprehensive review of FER methods is introduced, including the basic principles of FER (components such as preprocessing, feature extraction and classification, and methods, etc.) from the pro-deep learning era (traditional methods using handcrafted features, i.e., SVM and HOG, etc.) to the deep learning era. Moreover, a brief introduction is provided related to the benchmark datasets (there are two categories: controlled environments (lab) and uncontrolled environments (in the wild)) used to evaluate different FER methods and a comparison of different FER models. Existing deep neural networks and related training strategies designed for FER, based on static images and dynamic image sequences, are discussed. The remaining challenges and corresponding opportunities in FER and the future directions for designing robust deep FER systems are also pinpointed. Full article
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
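As a point of reference for the pre-deep-learning pipelines the survey reviews (handcrafted features such as HOG fed to an SVM), a minimal scikit-image/scikit-learn sketch is shown below; the image size, HOG parameters and data arrays are placeholders, not drawn from any surveyed system.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def hog_features(face_gray):
    """HOG descriptor for a cropped grayscale face image (e.g. 48x48 pixels)."""
    return hog(face_gray, orientations=9,
               pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# X_train: list of grayscale face crops, y_train: expression labels (placeholders).
# X = np.stack([hog_features(img) for img in X_train])
# clf = SVC(kernel="rbf", C=10.0).fit(X, y_train)
# pred = clf.predict(hog_features(test_face).reshape(1, -1))
```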
Figures: six basic emotions; arousal–valence space divided into four quadrants; taxonomy of FER methods; evolution timelines of FER methods and datasets; classic feature extraction methods (LBP, AdaBoost, optical flow, AAM, Gabor, SIFT); traditional FER stages; CNN architecture; accuracy versus operations and parameter counts for common networks; DBN structure; simple RNN, GRU/LSTM and GAN architectures; two hybrid (spatial–temporal) model frameworks; loss-layer networks (triplet loss and correlation-based loss); ensemble network combining CNN features with bag-of-visual-words features; multi-task Emotion-GCN; transformer-based models (FER-former and TransFER); samples from 12 well-known FER datasets (JAFFE, CK+, Yale, AffectNet, SFEW 2.0, RaFD, Multi-PIE, TFD, MMI, Oulu-CASIA, FER2013, KDEF); number of FER papers per year.
18 pages, 7127 KiB  
Article
Benchmarking Automated Machine Learning (AutoML) Frameworks for Object Detection
by Samuel de Oliveira, Oguzhan Topsakal and Onur Toker
Information 2024, 15(1), 63; https://doi.org/10.3390/info15010063 - 21 Jan 2024
Cited by 1 | Viewed by 2798
Abstract
Automated Machine Learning (AutoML) is a subdomain of machine learning that seeks to expand the usability of traditional machine learning methods to non-expert users by automating various tasks which normally require manual configuration. Prior benchmarking studies on AutoML systems—whose aim is to compare and evaluate their capabilities—have mostly focused on tabular or structured data. In this study, we evaluate AutoML systems on the task of object detection by curating three commonly used object detection datasets (Open Images V7, Microsoft COCO 2017, and Pascal VOC2012) in order to benchmark three different AutoML frameworks—namely, Google’s Vertex AI, NVIDIA’s TAO, and AutoGluon. We reduced the datasets to only include images with a single object instance in order to understand the effect of class imbalance, as well as dataset and object size. We used the metrics of the average precision (AP) and mean average precision (mAP). Solely in terms of accuracy, our results indicate AutoGluon as the best-performing framework, with a mAP of 0.8901, 0.8972, and 0.8644 for the Pascal VOC2012, COCO 2017, and Open Images V7 datasets, respectively. NVIDIA TAO achieved a mAP of 0.8254, 0.8165, and 0.7754 for those same datasets, while Google’s VertexAI scored 0.855, 0.793, and 0.761. We found the dataset size had an inverse relationship to mAP across all the frameworks, and there was no relationship between class size or imbalance and accuracy. Furthermore, we discuss each framework’s relative benefits and drawbacks from the standpoint of ease of use. This study also points out the issues found as we examined the labels of a subset of each dataset. Labeling errors in the datasets appear to have a substantial negative effect on accuracy that is not resolved by larger datasets. Overall, this study provides a platform for future development and research on this nascent field of machine learning. Full article
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
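For reference, the sketch below shows one common way to compute average precision from ranked detections (VOC-style area under the interpolated precision–recall curve); it is a generic implementation, not the evaluation code used in the study.

```python
import numpy as np

def average_precision(scores, is_true_positive, n_ground_truth):
    """AP from detection scores, per-detection TP/FP flags and the number of ground-truth boxes."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp

    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(n_ground_truth, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)

    # make precision monotonically non-increasing, then integrate over recall
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall, [recall[-1]]))
    precision = np.concatenate(([precision[0]], precision, [0.0]))
    return float(np.sum(np.diff(recall) * precision[1:]))

# mAP is the mean of AP over classes (and, for mAP50-95-style metrics, over IoU thresholds).
```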
Figures: computer vision tasks (image classification, object detection, semantic segmentation, instance segmentation); precision–recall curve; image count by class; mAP per framework and dataset; AP by class, by class size and as a function of bounding-box ratio for VertexAI, AutoGluon and TAO; labeling-error examples in the Open Images V7 dataset (missed dog label, whole image labeled "person", improper false positive).
18 pages, 3164 KiB  
Article
Fast Object Detection Leveraging Global Feature Fusion in Boundary-Aware Convolutional Networks
by Weiming Fan, Jiahui Yu and Zhaojie Ju
Information 2024, 15(1), 53; https://doi.org/10.3390/info15010053 - 17 Jan 2024
Viewed by 1584
Abstract
Endoscopy, a pervasive instrument for the diagnosis and treatment of hollow anatomical structures, conventionally necessitates the arduous manual scrutiny of seasoned medical experts. Nevertheless, the recent strides in deep learning technologies proffer novel avenues for research, endowing it with the potential for amplified robustness and precision, accompanied by the pledge of cost abatement in detection procedures, while simultaneously providing substantial assistance to clinical practitioners. Within this investigation, we usher in an innovative technique for the identification of anomalies in endoscopic imagery, christened as Context-enhanced Feature Fusion with Boundary-aware Convolution (GFFBAC). We employ the Context-enhanced Feature Fusion (CEFF) methodology, underpinned by Convolutional Neural Networks (CNNs), to establish equilibrium amidst the tiers of the feature pyramids. These intricately harnessed features are subsequently amalgamated into the Boundary-aware Convolution (BAC) module to reinforce both the faculties of localization and classification. A thorough exploration conducted across three disparate datasets elucidates that the proposition not only surpasses its contemporaries in object detection performance but also yields detection boxes of heightened precision.
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
Figure 1. Illustration of our framework.
Figure 2. Illustration of our building block of path augmentation.
Figure 3. Architecture of the developed BAC.
Figure 4. The localization target of BAC for barrel estimation and fine regression on the x-axis. The localization target for the y-axis can be calculated similarly.
Figure 5. Comparison of object detection results on CVC-ClinicDB (A–D), Kvasir-SEG (E–G), and EDD2020 (H–J), respectively.
Figure 6. Qualitative results comparison between BAC, BSF, and PA on the CVC-ClinicDB dataset.
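The paper’s CEFF module is not reproduced here, but the following minimal PyTorch sketch illustrates the general idea of balancing feature-pyramid levels by resizing them to a common resolution, averaging, and redistributing the fused context back to each level. The function name, the choice of reference level, and the fusion rule are assumptions for illustration only, not the authors’ method.

```python
import torch
import torch.nn.functional as F

def fuse_pyramid_levels(features):
    # Resize every level to the middle level's resolution, average them to
    # obtain a balanced global-context map, then add it back to each level.
    target = features[len(features) // 2].shape[-2:]
    resized = [F.interpolate(f, size=target, mode="nearest") for f in features]
    fused = torch.stack(resized, dim=0).mean(dim=0)
    return [f + F.interpolate(fused, size=f.shape[-2:], mode="nearest")
            for f in features]

# Three pyramid levels with 256 channels (e.g., strides 8/16/32 of a 256x256 input).
levels = [torch.randn(1, 256, s, s) for s in (32, 16, 8)]
print([tuple(o.shape) for o in fuse_pyramid_levels(levels)])
```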
21 pages, 11275 KiB  
Article
Towards Enhancing Automated Defect Recognition (ADR) in Digital X-ray Radiography Applications: Synthesizing Training Data through X-ray Intensity Distribution Modeling for Deep Learning Algorithms
by Bata Hena, Ziang Wei, Luc Perron, Clemente Ibarra Castanedo and Xavier Maldague
Information 2024, 15(1), 16; https://doi.org/10.3390/info15010016 - 27 Dec 2023
Cited by 4 | Viewed by 2265
Abstract
Industrial radiography is a pivotal non-destructive testing (NDT) method that ensures quality and safety in a wide range of industrial sectors. Conventional human-based approaches, however, are prone to challenges in defect detection accuracy and efficiency, primarily due to the high inspection demand from manufacturing industries with high production throughput. To solve this challenge, numerous computer-based alternatives have been developed, including Automated Defect Recognition (ADR) using deep learning algorithms. At the core of training, these algorithms demand large volumes of data that should be representative of real-world cases. However, the availability of digital X-ray radiography data for open research is limited by non-disclosure contractual terms in the industry. This study presents a pipeline that is capable of modeling synthetic images based on statistical information acquired from X-ray intensity distribution from real digital X-ray radiography images. Through meticulous analysis of the intensity distribution in digital X-ray images, the unique statistical patterns associated with the exposure conditions used during image acquisition, type of component, thickness variations, beam divergence, anode heel effect, etc., are extracted. The realized synthetic images were utilized to train deep learning models, yielding an impressive model performance with a mean intersection over union (IoU) of 0.93 and a mean dice coefficient of 0.96 on real unseen digital X-ray radiography images. This methodology is scalable and adaptable, making it suitable for diverse industrial applications.
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
Figure 1. (a) A raw image acquired using a flat panel detector, showing inhomogeneous intensity distribution; (b) the final image after pixel corrections and flat-fielding operations.
Figure 2. Examples of aluminum plates with cylindrical (a) and cuboidal (b) flat-bottom holes.
Figure 3. A color-spectrum representation of an acquired grayscale X-ray radiography image, showing inhomogeneous intensity distribution across same-thickness regions of the plate.
Figure 4. A cropped X-ray radiography image showing the edge effect due to the geometry of the cylindrical flat-bottom hole and the X-ray beam divergence.
Figure 5. Superimposed histogram of the mean gray values measured in the 1411 real X-ray radiography images with circular features. The blue and orange lines represent the smoothed distributions for the background and the features, respectively.
Figure 6. Box-and-whiskers plot of the mean gray values measured in the 1411 real X-ray radiography images with circular features.
Figure 7. High-level schematic representation of the processes involved in synthetic data generation.
Figure 8. (a) Line-profile pixel intensity readings (yellow line) across the edge of the zoomed-in X-ray radiography image in (c); (b) line-profile pixel intensity readings (yellow line) across the edge of the zoomed-in synthetic image in (d).
Figure 9. Model training setup: the neural network is trained on synthetic data only and tested separately on both synthetic and real test data.
Figure 10. Training curves from YOLOv8, showing the trend over 100 epochs.
Figure 11. (a) Precision–recall curve; (b) F1–confidence score; (c) precision–confidence curve; (d) confusion matrix.
Figure 12. A cross-section of the model’s performance on real X-ray radiography images. Images on the same row represent a single entry, with column (a) showing the prediction, (b) the original input image, and (c) a color-spectrum conversion of the input image for easier visualization of the pixel intensity distribution.
Figure A1. Process flow for acquiring statistical measurements from real digital X-ray radiography, to be utilized in synthetic image generation.
Figure A2. Important steps in synthetic image generation.
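The reported mean IoU of 0.93 and mean Dice coefficient of 0.96 are standard segmentation-overlap measures. As a quick reference, a minimal NumPy sketch of both metrics for binary masks is given below; the masks and values are illustrative, not the study’s data.

```python
import numpy as np

def iou_and_dice(pred, target, eps=1e-7):
    # IoU = |A ∩ B| / |A ∪ B|; Dice = 2|A ∩ B| / (|A| + |B|), for binary masks.
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    iou = inter / (union + eps)
    dice = 2 * inter / (pred.sum() + target.sum() + eps)
    return float(iou), float(dice)

# Toy masks: two overlapping 30x30 squares inside a 64x64 image.
pred = np.zeros((64, 64), dtype=np.uint8); pred[10:40, 10:40] = 1
gt   = np.zeros((64, 64), dtype=np.uint8); gt[15:45, 15:45] = 1
print(iou_and_dice(pred, gt))
```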
18 pages, 6041 KiB  
Article
Dual-Pyramid Wide Residual Network for Semantic Segmentation on Cross-Style Datasets
by Guan-Ting Shen and Yin-Fu Huang
Information 2023, 14(12), 630; https://doi.org/10.3390/info14120630 - 24 Nov 2023
Viewed by 1320
Abstract
Image segmentation is the process of partitioning an image into multiple segments, where the goal is to simplify the representation of the image and make it more meaningful and easier to analyze. In particular, semantic segmentation is an approach that detects the class of objects at each pixel. In the past, most semantic segmentation models were designed for only a single style, such as urban street views, medical images, or even manga. In this paper, we propose a semantic segmentation model called the Dual-Pyramid Wide Residual Network (DPWRN) to solve segmentation on cross-style datasets, which is suitable for diverse segmentation applications. The DPWRN integrates the Pyramid of Kernel paralleled with Dilation (PKD) and Multi-Feature Fusion (MFF) to improve the accuracy of segmentation. To evaluate the generalization of the DPWRN and its superiority over most state-of-the-art models, three datasets with completely different styles are tested in the experiments. As a result, our model achieves 75.95% mIoU on CamVid, an 83.60% F1-score on DRIVE, and an 86.87% F1-score on eBDtheque. This verifies that the DPWRN can be generalized and shows its superiority in semantic segmentation on cross-style datasets.
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
Figure 1. Proposed model.
Figure 2. Residual blocks.
Figure 3. Atrous Spatial Pyramid Pooling.
Figure 4. LKD block.
Figure 5. (a) FPN block and (b) MFF block.
Figure 6. Segmentation results of our model on CamVid.
Figure 7. Segmentation results of our model on DRIVE.
Figure 8. Segmentation results of our model on eBDtheque.
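Atrous Spatial Pyramid Pooling (Figure 3) applies parallel dilated convolutions with different rates to capture multi-scale context at a fixed resolution. The sketch below is a minimal, generic ASPP-style block in PyTorch; it is not the DPWRN’s actual PKD module, and the class name, channel counts, and dilation rates are assumptions.

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    """Parallel 3x3 convolutions with increasing dilation, concatenated and
    projected back to the input channel count (generic ASPP illustration)."""
    def __init__(self, channels, rates=(1, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in rates
        )
        self.project = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        # padding == dilation keeps the spatial size unchanged for 3x3 kernels
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 64, 32, 32)
print(MiniASPP(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```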
21 pages, 4397 KiB  
Article
POSS-CNN: An Automatically Generated Convolutional Neural Network with Precision and Operation Separable Structure Aiming at Target Recognition and Detection
by Jia Hou, Jingyu Zhang, Qi Chen, Siwei Xiang, Yishuo Meng, Jianfei Wang, Cimang Lu and Chen Yang
Information 2023, 14(11), 604; https://doi.org/10.3390/info14110604 - 7 Nov 2023
Viewed by 1609
Abstract
Artificial intelligence is changing and influencing our world. As one of the main algorithms in the field of artificial intelligence, convolutional neural networks (CNNs) have developed rapidly in recent years. Especially after the emergence of NASNet, CNNs have gradually pushed the idea of AutoML to the public’s attention, and large numbers of new structures designed by automatic searches are appearing. These networks are usually based on reinforcement learning and evolutionary learning algorithms. However, the blocks of these networks are sometimes complex, and there is no small model for simpler tasks. Therefore, this paper proposes POSS-CNN, aimed at target recognition and detection, which employs a multi-branch CNN structure with PSNC and a method of automatic parallel selection for super parameters based on a multi-branch CNN structure. Moreover, POSS-CNN can be broken up: by choosing a single branch or a combination of two branches as the “benchmark”, as well as the overall POSS-CNN, we can obtain seven models with different precision and operation counts. The test accuracy of POSS-CNN for a recognition task on the CIFAR10 dataset reaches 86.4%, which is equivalent to AlexNet and VggNet, but the operations and parameters of the whole model are 45.9% and 45.8% of AlexNet, and 29.5% and 29.4% of VggNet. The mAP of POSS-CNN for a detection task on the LSVH dataset is 45.8, inferior to the 62.3 of YOLOv3. However, compared with YOLOv3, the operations and parameters of the model are reduced by 57.4% and 15.6%, respectively. After being accelerated by WRA, POSS-CNN for a detection task on the LSVH dataset achieves 27 fps with an energy efficiency of 0.42 J/f, which is 5 times and 96.6 times better than a GPU 2080Ti in performance and energy efficiency, respectively.
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
Figure 1. Memory access and memory operation.
Figure 2. Blocks of automatic models.
Figure 3. Parameters and calculation amounts of CNNs.
Figure 4. FractalNet.
Figure 5. Multi-branch CNN structure: (a) branch architecture one; (b) branch architecture two.
Figure 6. Seven models with different precision and operations.
Figure 7. Comparison of LeNet before and after PSNC.
Figure 8. Percentage of parameter reduction using the PSNC method.
Figure 9. Generation process of the POSS-CNN overall architecture.
Figure 10. POSS-CNN embedded in the target detection model.
Figure 11. LSVH dataset.
Figure 12. Reduction in operations and parameters normalized to AlexNet and VggNet.
Figure 13. Reduction in operations and parameters normalized to YOLOv3, Fast R-CNN, and YOLOv2.
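The abstract’s comparisons rest on parameter and operation counts relative to baselines such as AlexNet and YOLOv3. A small sketch of how such parameter ratios can be computed in PyTorch follows; the two toy models are placeholders, not POSS-CNN or its baselines, and operation (FLOP/MAC) counts would additionally require a profiler.

```python
import torch.nn as nn

def count_params(model):
    # Total number of learnable parameters in the model.
    return sum(p.numel() for p in model.parameters())

# Two placeholder networks standing in for "small" and "large" models.
small = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))
big = nn.Sequential(nn.Conv2d(3, 128, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 10))

ratio = count_params(small) / count_params(big)
print(f"small model uses {ratio:.1%} of the big model's parameters")
```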
15 pages, 1514 KiB  
Article
Deep-Learning-Based Multitask Ultrasound Beamforming
by Elay Dahan and Israel Cohen
Information 2023, 14(10), 582; https://doi.org/10.3390/info14100582 - 23 Oct 2023
Viewed by 2046
Abstract
In this paper, we present a new method for multitask learning applied to ultrasound beamforming. Beamforming is a critical component in the ultrasound image formation pipeline. Ultrasound images are constructed using sensor readings from multiple transducer elements, with each element typically capturing multiple acquisitions per frame. Hence, the beamformer is crucial for framerate performance and overall image quality. Furthermore, post-processing, such as image denoising, is usually applied to the beamformed image to achieve high clarity for diagnosis. This work shows a fully convolutional neural network that can learn different tasks by applying a new weight normalization scheme. We adapt our model to both high frame rate requirements by fitting weight normalization parameters for the sub-sampling task and image denoising by optimizing the normalization parameters for the speckle reduction task. Our model outperforms single-angle delay and sum on pixel-level measures for speckle noise reduction, subsampling, and single-angle reconstruction.
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
Figure 1. The proposed multi-task beamforming pipeline: raw RF sensor data are scaled to the [−1, 1] range, then the Hilbert transform and time-of-flight correction are applied. The time-of-flight-corrected IQ data are fed into a neural network that reconstructs the beam-summed multitask IQ data according to the specific task.
Figure 2. The proposed multitask beamforming neural network: pre-processed IQ data are fed to our fully convolutional neural network. The network outputs an IQ estimation corresponding to a task-specific output.
Figure 3. Controlling the de-speckling effect. The output is identical to the base task for α = 0 and shows the full task-specific effect for α = 1. By choosing different α values, we can control the scale and bias applied to the convolution kernel weights and hence the strength of the de-speckling effect.
Figure 4. Test set samples of our base task, multi-angle reconstruction from single-angle acquisition. Our model removes most of the noise and scattering, with a −log(speckle SNR) of 2.299 and ρ of 0.93, outperforming all the other challenge participants (Tables 2 and 3). Adapted from Goudarzi et al. [14].
Figure 5. Image reconstruction samples of sub-sampled data along the channel dimension. Our model can reduce noise from noisy measurements caused by the reduced number of elements used, and it can generate an image with higher contrast compared to sub-sampled single-angle reconstruction. Both images are samples from the CUBDL [23] test set.
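Figure 1 mentions the Hilbert-transform and time-of-flight steps that precede the learned beamformer, and the abstract compares against single-angle delay-and-sum. For orientation only, a toy single-pixel delay-and-sum baseline is sketched below with SciPy; the array shapes, delays, and function name are illustrative assumptions, not the authors’ pipeline.

```python
import numpy as np
from scipy.signal import hilbert

def delay_and_sum(rf, delays_samples):
    # rf: (n_elements, n_samples) raw RF data; delays_samples: per-element
    # time-of-flight delay (in samples) for the single pixel reconstructed here.
    iq = hilbert(rf, axis=1)                        # analytic (IQ-like) signal
    n_elem, n_samp = rf.shape
    idx = np.clip(np.round(delays_samples).astype(int), 0, n_samp - 1)
    # Coherently sum the delayed samples and take the envelope.
    return np.abs(iq[np.arange(n_elem), idx].sum())

rf = np.random.randn(64, 2048)        # 64 transducer elements, 2048 samples
delays = np.full(64, 700.0)           # pretend all elements share one delay
print(delay_and_sum(rf, delays))
```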
15 pages, 4049 KiB  
Article
On the Use of Kullback–Leibler Divergence for Kernel Selection and Interpretation in Variational Autoencoders for Feature Creation
by Fábio Mendonça, Sheikh Shanawaz Mostafa, Fernando Morgado-Dias and Antonio G. Ravelo-García
Information 2023, 14(10), 571; https://doi.org/10.3390/info14100571 - 18 Oct 2023
Cited by 1 | Viewed by 1810
Abstract
This study presents a novel approach for kernel selection based on Kullback–Leibler divergence in variational autoencoders using features generated by the convolutional encoder. The proposed methodology focuses on identifying the most relevant subset of latent variables to reduce the model’s parameters. Each latent variable is sampled from the distribution associated with a single kernel of the last encoder’s convolutional layer, resulting in an individual distribution for each kernel. Relevant features are selected from the sampled latent variables to perform kernel selection, which filters out uninformative features and, consequently, unnecessary kernels. Both the proposed filter method and the sequential feature selection (standard wrapper method) were examined for feature selection. Particularly, the filter method evaluates the Kullback–Leibler divergence between all kernels’ distributions and hypothesizes that similar kernels can be discarded as they do not convey relevant information. This hypothesis was confirmed through the experiments performed on four standard datasets, where it was observed that the number of kernels can be reduced without meaningfully affecting the performance. This analysis was based on the accuracy of the model when the selected kernels fed a probabilistic classifier and the feature-based similarity index to appraise the quality of the reconstructed images when the variational autoencoder only uses the selected kernels. Therefore, the proposed methodology guides the reduction of the number of parameters of the model, making it suitable for developing applications for resource-constrained devices.
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
Figure 1. Simplified overview of the proposed methodology, composed of two main steps: the first step develops the VAE, whose encoder is then used for feature creation to feed the classifier created in the second step.
Figure 2. Implemented VAE architecture for the two-dimensional input. The layer operations and their output shapes are also presented.
Figure 3. Structure of the classifier used for the classification analysis. The feature extraction part comprises the encoder developed by the VAE, and its weights were frozen (non-trainable). A feature selection procedure was employed for the classification, and the classifier’s weights were optimized using supervised learning.
Figure 4. Distributions produced by the samples of the kernels of the last convolution layer of the encoder for classes (a) 1 to (j) 10, showing the 16 kernels in sequence from left (kernel 1) to right (kernel 16). The left panel shows the shape of each distribution produced from the samples, whose amplitude was gradually reduced (from the first to the last kernel) to allow visualization of all distributions; the right panel shows the box plot of the samples.
Figure 5. Variation of the accuracy (Acc) through the sequential iterations of the evaluated feature selection algorithms. The total number of model parameters, as the number of kernels varies, is also presented.
Figure 6. Example of the variation in the model’s forecast (the softmax output) for a specific misclassified sample with high epistemic uncertainty.
Figure 7. Variation of the FSIM as the kernels selected by KLDS are progressively used (one by one), from most to least relevant. The image also displays the FSIM variation for a specific example (original image) and the progression of the reconstructed images as the number of used kernels increases.
Figure 8. Variation of the Acc and FSIM as the kernels selected by KLDS are progressively added (increasing the number of model parameters) for the four examined datasets. The asterisk indicates when the variation in both metrics is below 1%.
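The filter method scores kernels by the Kullback–Leibler divergence between their latent Gaussian distributions and discards near-duplicates. The sketch below shows the closed-form KL divergence between univariate Gaussians and a simple redundancy ranking built from it; the per-kernel (mean, variance) values and the pruning rule are illustrative assumptions, not the paper’s exact procedure.

```python
import numpy as np

def kl_gaussians(mu1, var1, mu2, var2):
    # KL( N(mu1, var1) || N(mu2, var2) ) for univariate Gaussians.
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# One (mu, var) pair per kernel of the last encoder layer (illustrative values).
rng = np.random.default_rng(0)
mu = rng.normal(size=16)
var = np.exp(rng.normal(scale=0.2, size=16))

# Pairwise divergence matrix; kernels whose divergences to all others are
# uniformly small behave like near-duplicates and are pruning candidates.
K = np.array([[kl_gaussians(mu[i], var[i], mu[j], var[j])
               for j in range(16)] for i in range(16)])
redundancy = K.sum(axis=1)
print(np.argsort(redundancy)[:4])   # four most redundant kernels
```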
16 pages, 5731 KiB  
Article
Innovative Visualization Approach for Biomechanical Time Series in Stroke Diagnosis Using Explainable Machine Learning Methods: A Proof-of-Concept Study
by Kyriakos Apostolidis, Christos Kokkotis, Evangelos Karakasis, Evangeli Karampina, Serafeim Moustakidis, Dimitrios Menychtas, Georgios Giarmatzis, Dimitrios Tsiptsios, Konstantinos Vadikolias and Nikolaos Aggelousis
Information 2023, 14(10), 559; https://doi.org/10.3390/info14100559 - 12 Oct 2023
Cited by 4 | Viewed by 1848
Abstract
Stroke remains a predominant cause of mortality and disability worldwide. The endeavor to diagnose stroke through biomechanical time-series data coupled with Artificial Intelligence (AI) poses a formidable challenge, especially amidst constrained participant numbers. The challenge escalates when dealing with small datasets, a common scenario in preliminary medical research. While recent advances have ushered in few-shot learning algorithms adept at handling sparse data, this paper pioneers a distinctive methodology involving a visualization-centric approach to navigating the small-data challenge in diagnosing stroke survivors based on gait-analysis-derived biomechanical data. Employing Siamese neural networks (SNNs), our method transforms a biomechanical time series into visually intuitive images, facilitating a unique analytical lens. The kinematic data encapsulated comprise a spectrum of gait metrics, including movements of the ankle, knee, hip, and center of mass in three dimensions for both paretic and non-paretic legs. Following the visual transformation, the SNN serves as a potent feature extractor, mapping the data into a high-dimensional feature space conducive to classification. The extracted features are subsequently fed into various machine learning (ML) models like support vector machines (SVMs), Random Forest (RF), or neural networks (NN) for classification. In pursuit of heightened interpretability, a cornerstone in medical AI applications, we employ the Grad-CAM (Gradient-weighted Class Activation Mapping) tool to visually highlight the critical regions influencing the model’s decision. Our methodology, though exploratory, showcases a promising avenue for leveraging visualized biomechanical data in stroke diagnosis, achieving a perfect classification rate in our preliminary dataset. The visual inspection of generated images elucidates a clear separation of classes (100%), underscoring the potential of this visualization-driven approach in the realm of small data. This proof-of-concept study accentuates the novelty of visual data transformation in enhancing both interpretability and performance in stroke diagnosis using limited data, laying a robust foundation for future research in larger-scale evaluations.
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
Figure 1. Visual representation of image construction.
Figure 2. (a) The concept of the Siamese neural network; (b) the proposed pipeline to classify and interpret the images.
Figure 3. (a) The dissimilarity distance between a stroke and a non-stroke image; (b) the dissimilarity distance between two normal images.
Figure 4. Loss function through iterations during training.
Figure 5. Mean and standard deviations of the Euclidean distance between stroke and non-stroke samples.
Figure 6. CIs of SVM decisions.
Figure 7. CIs of RF decisions.
Figure 8. Grad-CAM in a stroke survivor. Boxes highlight the critical points according to the Grad-CAM algorithm.
Figure 9. Grad-CAM in a non-stroke survivor. Boxes highlight the critical points according to the Grad-CAM algorithm.
Figure 10. Mean Grad-CAM masks for the testing group of stroke and non-stroke participants: (a) non-stroke participants; (b) stroke survivors. Boxes highlight the critical points according to the Grad-CAM algorithm.
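Siamese networks of the kind used here are typically trained so that embeddings of same-class pairs lie close together while different-class pairs are pushed apart by a margin (compare the Euclidean distances in Figure 5). A generic contrastive-loss sketch in PyTorch follows; it is a textbook formulation with assumed names and margin, not necessarily the exact loss used in this study.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, same_class, margin=2.0):
    # Pull embeddings of same-class pairs together; push different-class
    # pairs at least `margin` apart in Euclidean distance.
    d = F.pairwise_distance(z1, z2)
    same_class = same_class.float()
    return torch.mean(same_class * d.pow(2) +
                      (1 - same_class) * torch.clamp(margin - d, min=0).pow(2))

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)  # embeddings from the shared encoder
labels = torch.randint(0, 2, (8,))                 # 1 = same class, 0 = different
print(contrastive_loss(z1, z2, labels))
```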
17 pages, 1251 KiB  
Article
Sound Event Detection in Domestic Environment Using Frequency-Dynamic Convolution and Local Attention
by Grigorios-Aris Cheimariotis and Nikolaos Mitianoudis
Information 2023, 14(10), 534; https://doi.org/10.3390/info14100534 - 30 Sep 2023
Cited by 3 | Viewed by 1461
Abstract
This work describes a methodology for sound event detection in domestic environments. Efficient solutions in this task can support the autonomous living of the elderly. The methodology addresses the “Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE)” 2023, and more specifically Task 4a, “Sound event detection of domestic activities”. This task involves the detection of 10 common events in domestic environments in 10 s sound clips. The events may have arbitrary duration within the 10 s clip. The main components of the methodology are data augmentation on the mel-spectrograms that represent the sound clips, feature extraction by passing spectrograms through a frequency-dynamic convolution network with an extra attention module in sequence with each convolution, concatenation of these features with BEATs embeddings, and use of a BiGRU for sequence modeling. Also, a mean teacher model is employed for leveraging unlabeled data. This research focuses on the effect of data augmentation techniques, the feature extraction models, and self-supervised learning. The main contribution is the proposed feature extraction model, which uses weighted attention on frequency in each convolution, combined in sequence with a local attention module adopted from computer vision. The proposed system features promising and robust performance.
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
Figure 1. A flowchart of the proposed model (including feature extraction). Blocks with trainable parameters are depicted in orange. Two such models are used (student and teacher); the two models are identical. Only the student model is trained on data; the teacher model acquires its weights from a moving average of the student’s weights.
Figure 2. The dLKA module (flowchart on the left) takes its name from the submodule on the right. The orange and green boxes present the attention and FFN submodules in detail. The dLKA submodule is the main part of the attention module (orange box).
Figure 3. A flowchart of the sequence modeling of aggregated features (CNN features and BEATs embeddings) and the classification layer for strong and weak predictions.
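The mean teacher model referenced in the abstract and in Figure 1 maintains a second network whose weights are an exponential moving average (EMA) of the student’s; only the student is trained by gradient descent. A minimal PyTorch sketch of that update is shown below; the decay value and the toy model are assumptions.

```python
import copy
import torch
import torch.nn as nn

def update_teacher(student, teacher, ema_decay=0.999):
    # teacher <- ema_decay * teacher + (1 - ema_decay) * student
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(ema_decay).add_(s, alpha=1.0 - ema_decay)

student = nn.Linear(10, 4)
teacher = copy.deepcopy(student)   # teacher starts as a copy of the student
# ... after every optimizer step on the student:
update_teacher(student, teacher)
print(next(teacher.parameters()).shape)
```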
17 pages, 5637 KiB  
Article
A Deep Neural Network for Working Memory Load Prediction from EEG Ensemble Empirical Mode Decomposition
by Sriniketan Sridhar, Anibal Romney and Vidya Manian
Information 2023, 14(9), 473; https://doi.org/10.3390/info14090473 - 25 Aug 2023
Viewed by 1907
Abstract
Mild Cognitive Impairment (MCI) and Alzheimer’s Disease (AD) are frequently associated with working memory (WM) dysfunction, which is also observed in various neural psychiatric disorders, including depression, schizophrenia, and ADHD. Early detection of WM dysfunction is essential to predict the onset of MCI and AD. Artificial Intelligence (AI)-based algorithms are increasingly used to identify biomarkers for detecting subtle changes in loaded WM. This paper presents an approach using electroencephalograms (EEG), time-frequency signal processing, and a Deep Neural Network (DNN) to predict WM load in normal and MCI-diagnosed subjects. EEG signals were recorded using an EEG cap during working memory tasks, including block tapping and N-back visuospatial interfaces. The data were bandpass-filtered, and independent component analysis was used to select the best electrode channels. The Ensemble Empirical Mode Decomposition (EEMD) algorithm was then applied to the EEG signals to obtain the time-frequency Intrinsic Mode Functions (IMFs). The EEMD and DNN methods perform better than traditional machine learning methods as well as Convolutional Neural Networks (CNN) for the prediction of WM load. Prediction accuracies were consistently higher for both normal and MCI subjects, averaging 97.62%. The average Kappa score was 94.98% for normal subjects and 92.49% for subjects with MCI. Subjects with MCI showed higher values for beta and alpha oscillations in the frontal region than normal subjects. The average power spectral density of the IMFs showed that the IMFs (p = 0.0469 for normal subjects and p = 0.0145 for subjects with MCI) are robust and reliable features for WM load prediction.
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
Figure 1. Model for information processing in the brain.
Figure 2. The proposed model for predicting WM load, outlining the various stages for achieving accurate working memory load predictions.
Figure 3. (a) A 16-channel OpenBCI EEG data collection board; (b) EEG cap worn by a subject; (c) subject performing the block tapping task.
Figure 4. Raw EEG data (red) and Intrinsic Mode Functions (IMFs) (green).
Figure 5. Deep Neural Network architecture.
Figure 6. Confusion matrix format.
Figure 7. Performance metrics for the ICA+EEMD+DNN working memory load prediction method.
Figure 8. Average power spectral density of intrinsic mode functions per subject for the resting state.
Figure 9. Coefficient of variance of intrinsic mode functions per subject for the resting state.
Figure 10. Average power spectral density of intrinsic mode functions per subject for low WM load.
Figure 11. Coefficient of variance of intrinsic mode functions per subject for low WM load.
Figure 12. Average power spectral density of intrinsic mode functions per subject for high WM load.
Figure 13. Coefficient of variance of intrinsic mode functions per subject for high WM load.
Figure 14. Selected electrodes and corresponding brain region for each subject.
Figure 15. Brain topological map for the resting state and a working memory task in a subject between 20 and 40 years of age.
Figure 16. Brain topological map for the resting state and a working memory task in a subject with MCI.
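Ensemble Empirical Mode Decomposition splits a signal into Intrinsic Mode Functions, which in this work feed the DNN. The sketch below decomposes a synthetic EEG-like signal, assuming the third-party PyEMD package (distributed on PyPI as EMD-signal) and its EEMD class; it is an illustration, not the authors’ preprocessing code, and the sampling rate and trial count are arbitrary.

```python
import numpy as np
from PyEMD import EEMD   # assumes `pip install EMD-signal`

# Synthetic "EEG-like" signal: two oscillations plus noise, 2 s at 250 Hz.
fs = 250
t = np.arange(0, 2, 1 / fs)
signal = (np.sin(2 * np.pi * 10 * t)          # alpha-band component
          + 0.5 * np.sin(2 * np.pi * 25 * t)  # beta-band component
          + 0.2 * np.random.randn(t.size))    # noise

eemd = EEMD(trials=50)          # ensemble of noise-assisted EMD runs
imfs = eemd.eemd(signal, t)     # rows are the Intrinsic Mode Functions
print(imfs.shape)               # (n_imfs, n_samples); IMFs would feed the DNN
```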

Review


52 pages, 3960 KiB  
Review
A Critical Analysis of Deep Semi-Supervised Learning Approaches for Enhanced Medical Image Classification
by Kaushlesh Singh Shakya, Azadeh Alavi, Julie Porteous, Priti K, Amit Laddi and Manojkumar Jaiswal
Information 2024, 15(5), 246; https://doi.org/10.3390/info15050246 - 24 Apr 2024
Viewed by 1534
Abstract
Deep semi-supervised learning (DSSL) is a machine learning paradigm that blends supervised and unsupervised learning techniques to improve the performance of various models in computer vision tasks. Medical image classification plays a crucial role in disease diagnosis, treatment planning, and patient care. However, obtaining labeled medical image data is often expensive and time-consuming for medical practitioners, leading to limited labeled datasets. DSSL techniques aim to address this challenge, particularly in various medical image tasks, to improve model generalization and performance. DSSL models leverage both the labeled information, which provides explicit supervision, and the unlabeled data, which can provide additional information about the underlying data distribution. This offers a practical solution to the resource-intensive demands of data annotation and enhances the model’s ability to generalize across diverse and previously unseen data landscapes. The present study provides a critical review of various DSSL approaches and their effectiveness and challenges in enhancing medical image classification tasks. The study categorizes DSSL techniques into six classes: consistency regularization methods, deep adversarial methods, pseudo-learning methods, graph-based methods, multi-label methods, and hybrid methods. Further, a comparative analysis of the performance of the six considered methods is conducted using existing studies. The referenced studies have employed metrics such as accuracy, sensitivity, specificity, AUC-ROC, and F1 score to evaluate the performance of DSSL methods on different medical image datasets. Additionally, challenges of the datasets, such as heterogeneity, limited labeled data, and model interpretability, were discussed and highlighted in the context of DSSL for medical image classification. The current review provides future directions and considerations for researchers to further address the challenges and take full advantage of these methods in clinical practice.
(This article belongs to the Special Issue Deep Learning for Image, Video and Signal Processing)
Figure 1. Deep semi-supervised medical image classification.
Figure 2. Proportion of research reviewed across various references.
Figure 3. PRISMA diagram providing a visual representation of the literature review process. Out of the 809 articles sourced from five academic platforms, 41 were ultimately selected.
Figure 4. Temporal Ensemble and Mean Teacher frameworks utilized for consistency regularization in deep semi-supervised classification methodologies. In the diagram, x_i signifies the input instance, z_i and z̃_i indicate predictions, and y_i denotes the actual ground truth. The z_i output ensures that the model learns from both the original and augmented data, leading to better performance.
Figure 5. Deep adversarial methods: Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). In a GAN, a discriminator D(X) assesses the authenticity of samples produced by the generator G(Z). In a VAE, an encoder q_φ(Z|X) compresses the input data X into a latent space Z, and a decoder p_θ(X|Z) reconstructs the input. Both models traverse distinct processes for data generation (GAN) and reconstruction (VAE), and both techniques are of pivotal significance in medical image classification.
Figure 6. The pseudo-labeling technique in DSSL classification methodologies, exemplified through the co-training and self-training frameworks. Co-training uses two data views v_1 and v_2, whereas self-training begins with data augmentation (Aug), followed by processing to create augmented data pairs x_i, x_j and their processed forms h_i, h_j. Fine-tuning then generates final representations z_i, z_j, aiming to maximize similarity.
Figure 7. Fundamental insights into AutoEncoder- and GNN-based approaches for DSSL medical image classification. The graph-based AutoEncoder employs an encoder to transform input data into a latent representation Z_i, which is decoded to reconstruct the input graph S′_i. The GNN-based model features interconnected nodes A–E representing processing stages; arrows indicate data flow within this network.
Figure 8. Illustration of the two scenarios in multi-label SSL: inductive and transductive. In the inductive scenario, the trained model M can predict labels for any unseen node; in the transductive scenario, only the labels of unlabeled nodes within the training dataset require inference.
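Among the six DSSL families surveyed, pseudo-labeling (Figure 6) is perhaps the simplest to express in code: confident predictions on unlabeled images are recycled as training targets. A minimal PyTorch sketch follows; the confidence threshold, toy model, and loss weighting are illustrative assumptions rather than any specific method from the review.

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(model, unlabeled_batch, threshold=0.95):
    # Predict on unlabeled data, keep only confident predictions, and reuse
    # them as hard targets for an additional cross-entropy term.
    with torch.no_grad():
        probs = F.softmax(model(unlabeled_batch), dim=1)
        confidence, pseudo_targets = probs.max(dim=1)
        mask = confidence >= threshold
    if mask.sum() == 0:
        return torch.zeros((), device=unlabeled_batch.device)
    logits = model(unlabeled_batch[mask])
    return F.cross_entropy(logits, pseudo_targets[mask])

# Toy classifier and unlabeled batch; a low threshold is used here only so the
# untrained model produces a non-zero loss (real methods use values near 0.95).
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x_unlabeled = torch.randn(32, 1, 28, 28)
# total loss would be: supervised_ce + lambda_u * pseudo_label_loss(...)
print(pseudo_label_loss(model, x_unlabeled, threshold=0.1))
```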