MDPI - Publisher of Open Access Journals

24 pages, 8095 KiB

Open Access Article

Signature Genes Selection and Functional Analysis of Astrocytoma Phenotypes: A Comparative Study

by Anna Drozdz, Caitriona E. McInerney, Kevin M. Prise, Veronica J. Spence and Jose Sousa

Cancers 2024, 16(19), 3263; https://doi.org/10.3390/cancers16193263 - 25 Sep 2024

Novel cancer biomarker discoveries are driven by the application of omics technologies. The vast quantity of highly dimensional data necessitates the implementation of feature selection. The mathematical basis of different selection methods varies considerably, which may influence subsequent inferences. In the study, feature selection and classification methods were employed to identify six signature gene sets of grade 2 and 3 astrocytoma samples from the Rembrandt repository. Subsequently, the impact of these variables on classification and further discovery of biological patterns was analysed. Principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and hierarchical clustering revealed that the data set (10,096 genes) exhibited a high degree of noise, feature redundancy, and lack of distinct patterns. The application of feature selection methods resulted in a reduction in the number of genes to between 28 and 128. Notably, no single gene was selected by all of the methods tested. Selection led to an increase in classification accuracy and noise reduction. Significant differences in the Gene Ontology terms were discovered, with only 13 terms overlapping. One selection method did not result in any enriched terms. KEGG pathway analysis revealed only one pathway in common (cell cycle), while two methods did not yield any enriched pathways. The results demonstrated a significant difference in outcomes when classification-type algorithms were utilised in comparison to mixed types (selection and classification). This may result in the inadvertent omission of biological phenomena, while simultaneously achieving enhanced classification outcomes.
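The concordance check described in the abstract (whether any gene is selected by every method, and how much the signature sets overlap) can be sketched with plain set operations. This is a minimal illustration only: the gene names, the set contents, and the helper names `common_to_all` and `pairwise_jaccard` are hypothetical and do not reproduce the article's data or code.

```python
from itertools import combinations

# Hypothetical signature gene sets; names and memberships are illustrative only,
# not the sets reported in the article.
signatures = {
    "DGE": {"GENE_A", "GENE_B", "GENE_C"},
    "Boruta": {"GENE_B", "GENE_C", "GENE_D"},
    "LASSO": {"GENE_D", "GENE_E"},
}


def common_to_all(sets_by_method):
    """Return the genes selected by every method."""
    return set.intersection(*sets_by_method.values())


def pairwise_jaccard(sets_by_method):
    """Return the Jaccard index |A & B| / |A | B| for every pair of methods."""
    scores = {}
    for a, b in combinations(sorted(sets_by_method), 2):
        union = sets_by_method[a] | sets_by_method[b]
        inter = sets_by_method[a] & sets_by_method[b]
        scores[(a, b)] = len(inter) / len(union) if union else 0.0
    return scores


shared = common_to_all(signatures)      # empty here: no gene is in all three sets
jaccard = pairwise_jaccard(signatures)  # keys are alphabetically ordered method pairs
```

An empty `shared` set corresponds to the situation the abstract reports, where no single gene was selected by all methods.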

(This article belongs to the Section Cancer Biomarkers)

Figure 1
A schematic illustrating the workflow of the experiments, including: retrieval of data from the REMBRANDT repository, data pre-processing, data inspection, and the two branches of experiments comparing six different methods for feature selection and classification. In the first branch of experiments, DGE, STIR and Boruta methods were used to select signature gene sets, which were further analysed using three classification models: logistic regression, KNN and SVM. In the second branch of experiments, RF, LASSO regression and CACTUS methods were used for signature gene selection and subsequent classification.

Figure 2
Initial evaluation of the transcriptional profiles of grade 2 and 3 astrocytoma samples examined using PCA (A), UMAP (B) and hierarchical clustering (C) with expression levels of the genes displayed as a z-score on a scale from high (red) to low (blue) values. Concordance between the signature gene sets identified by each of the models: DGE, Boruta, RF, STIR, CACTUS, and LASSO (D).

Figure 3
Volcano plots comparing the distribution of up- and down-regulated genes within signature gene sets selected by the following methods: DGE (A), Boruta (B), STIR (C), LASSO (D), RF (E) and CACTUS (F). Purple points indicate genes selected by a given algorithm. The x-axis displays log₂FC in gene expression and the y-axis displays the log odds of a gene being differentially expressed with the adjusted p-value. The dashed lines indicate the significance thresholds for log₂FC and for the adjusted p-values used for DGE.

Figure 4
Heatmaps comparing the hierarchical clustering patterns and gene expression in grade 2 and 3 astrocytoma samples from the signature gene sets selected by the following methods: DGE (A), Boruta (B), STIR (C), LASSO (D), RF (E) and CACTUS (F). Expression levels of the selected genes are displayed as a z-score on a scale from high (red) to low (blue) values.

Figure 5
Comparison of the total number of significantly enriched GO terms identified for the methods: DGE (A), Boruta (B), STIR (C), RF (D), and CACTUS (E). The top 15 most significant GO terms are plotted and the number of genes identified for that GO term is indicated as the Count on the x-axis. The colour bar indicates the significance of the enrichment for that GO term as an adjusted p-value. (F) Overlap in the enriched GO terms identified by the five methods and the total number identified by each method. No enriched GO terms were found for genes selected with LASSO regression.

Figure 6
Comparison of the significant KEGG pathways identified from the signature genes of the models DGE (A), STIR (B), Boruta (C), and CACTUS (D). Overlap in the KEGG pathways identified from the signature gene sets of all models (E). Overlap in the genes identified for the cell cycle pathway (F).

Figure 7
The cell cycle pathway visualisation. Genes marked in green belong to any of the four signature gene sets with significant enrichment of the cell cycle pathway; the colour gradient indicates each gene's log₂FC.

Figure 8
Results of the survival analysis comparing grade 2 and grade 3 astrocytoma where time estimated is in months (A). Results of the univariate Cox proportional hazard analysis of selected gene sets. The forest plot displays the hazard ratio and 95% confidence interval (CI) of months of survival after diagnosis as a function of gene (B). Results of the multivariate Cox proportional hazards model examining astrocytoma grade and gene interaction (C). Overlap between the genes identified to be significant for modifying the HR from the univariate (D) and multivariate (E) Cox model analysis of the signature gene sets.

10 pages, 3467 KiB

Open Access Article

Comprehensive Data Augmentation Approach Using WGAN-GP and UMAP for Enhancing Alzheimer’s Disease Diagnosis

by Emi Yuda, Tomoki Ando, Itaru Kaneko, Yutaka Yoshida and Daisuke Hirahara

Electronics 2024, 13(18), 3671; https://doi.org/10.3390/electronics13183671 - 16 Sep 2024

Abstract

In this study, the Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) was used to improve the diagnosis of Alzheimer’s disease using medical imaging and the Alzheimer’s disease image dataset across four diagnostic classes. The WGAN-GP was employed for data augmentation. The original dataset, the augmented dataset and the combined data were mapped using Uniform Manifold Approximation and Projection (UMAP) in both 2D and 3D space. The same combined interaction network analysis was then performed on the test data. The results showed that the test accuracy was 30.46% for the original (unbalanced) dataset, whereas it improved to 56.84% for the WGAN-GP-augmented (balanced) dataset, indicating that WGAN-GP augmentation can effectively address the class imbalance problem.
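For orientation, the critic objective that gives WGAN-GP its name is usually written as below. This is the standard formulation from the general WGAN-GP literature, not a detail taken from this abstract: the symbols (critic D, real distribution P_r, generator distribution P_g, penalty weight λ) are notation assumed here, and the abstract does not state which value of λ the authors used.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
    - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
    + \lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big],
\qquad \hat{x} = \epsilon x + (1 - \epsilon)\tilde{x},\quad \epsilon \sim U[0, 1]
```

The last term penalises the critic whenever its gradient norm at points interpolated between real and generated samples departs from 1, which is how the method enforces the Lipschitz constraint without weight clipping.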

(This article belongs to the Special Issue Advancements in Cross-Disciplinary AI: Theory and Application—2nd Edition)

14 pages, 3226 KiB

Open Access Article

Identification of Beef Odors under Different Storage Day and Processing Temperature Conditions Using an Odor Sensing System

by Yuanchang Liu, Nan Peng, Jinlong Kang, Takeshi Onodera and Rui Yatabe

Sensors 2024, 24(17), 5590; https://doi.org/10.3390/s24175590 - 29 Aug 2024

Viewed by 292

Abstract

This study used an odor sensing system with a 16-channel electrochemical sensor array to measure beef odors, aiming to distinguish odors under different storage days and processing temperatures for quality monitoring. Six storage time points ranged from the day of purchase (D0) to day eight (D8), with three temperature conditions: no heat (RT), boiling (100 °C), and frying (180 °C). Gas chromatography–mass spectrometry (GC-MS) analysis showed that odorants in the beef varied under different conditions. Compounds like acetoin and 1-hexanol changed significantly with the storage days, while pyrazines and furans were more detectable at higher temperatures. The odor sensing system data were visualized using principal component analysis (PCA) and uniform manifold approximation and projection (UMAP). PCA and unsupervised UMAP clustered beef odors by storage days but struggled with the processing temperatures. Supervised UMAP accurately clustered different temperatures and dates. Machine learning analysis using six classifiers, including support vector machine, achieved 57% accuracy for PCA-reduced data, while unsupervised UMAP reached 49.1% accuracy. Supervised UMAP significantly enhanced the classification accuracy, achieving over 99.5% when the data were reduced to three or more dimensions. Results suggest that the odor sensing system can sufficiently enhance non-destructive beef quality and safety monitoring. This research advances electronic nose applications and explores data downscaling techniques, providing valuable insights for future studies.

(This article belongs to the Special Issue Electronic Nose and Artificial Olfaction)

20 pages, 2302 KiB

Open Access Article

Non-Intrusive Load Monitoring Based on Dimensionality Reduction and Adapted Spatial Clustering

by Xu Zhang, Jun Zhou, Chunguang Lu, Lei Song, Fanyu Meng and Xianbo Wang

Energies 2024, 17(17), 4303; https://doi.org/10.3390/en17174303 - 28 Aug 2024

Viewed by 234

Abstract

Non-invasive load monitoring (NILM) deduces changes in energy consumption patterns and operational statuses of electrical equipment from power signals in the feed line. With the emergence of fine-grained power load distribution, the importance of utilizing this technology for implementing demand-side energy management in smart grid development has become increasingly prominent. To address the issue of low load identification accuracy stemming from complex and diverse load types, this paper introduces a NILM method based on uniform manifold approximation and projection (UMAP) reduction and enhanced density-based spatial clustering of applications with noise (DBSCAN). Firstly, this paper combines the characteristics of user load under transient and steady-state conditions and selects data with significant differences to construct a load-characteristic database. Additionally, UMAP is employed to reduce the dimensionality of high-dimensional load features and rebuild a load feature database. Subsequently, DBSCAN is utilized to categorize typical user loads, followed by a correlation analysis with the load-characteristic database to determine the types or classes of loads that involve switching actions. Finally, this paper simulates and analyzes the proposed method using the electricity consumption data of industrial users from the CER–Electricity–Data dataset. It identifies the electricity load data commonly utilized by users in a specific area of Zhejiang Province in China. The experimental results indicate that the accuracy of the proposed non-invasive load identification method reaches 95%. Compared to the wavelet transform, decision tree, and backpropagation network methods, the improvement is approximately 5%.
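The clustering step can be illustrated with a plain DBSCAN over points that are assumed to have already been reduced to two dimensions. This sketch is textbook DBSCAN, not the enhanced variant the abstract refers to, and the sample coordinates and the `eps` and `min_pts` values are made up for illustration.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one cluster label per point, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE               # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster         # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:    # j is a core point: keep expanding
                queue.extend(j_neighbours)
    return labels


# Hypothetical 2D embedding of load features (illustrative values only).
embedded = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
            (10.0, 0.0)]
labels = dbscan(embedded, eps=0.3, min_pts=3)
```

Points that end up labelled `NOISE` belong to no dense region; in a load-monitoring setting they would be candidates for unrecognised or transient loads.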

(This article belongs to the Section F1: Electrical Power System)

23 pages, 1445 KiB

Open Access Article

Dynamic Edge-Based High-Dimensional Data Aggregation with Differential Privacy

by Qian Chen, Zhiwei Ni, Xuhui Zhu, Moli Lyu, Wentao Liu and Pingfan Xia

Electronics 2024, 13(16), 3346; https://doi.org/10.3390/electronics13163346 - 22 Aug 2024

Viewed by 401

Abstract

Edge computing enables efficient data aggregation for services like data sharing and analysis in distributed IoT applications. However, uploading dynamic high-dimensional data to an edge server for efficient aggregation is challenging. Additionally, there is a significant risk of privacy leakage associated with uploading such data directly. Therefore, we propose an edge-based differential privacy data aggregation method leveraging progressive UMAP with a dynamic time window based on LSTM (EDP-PUDL). Firstly, a model of the dynamic time window based on a long short-term memory (LSTM) network was developed to divide dynamic data. Then, progressive uniform manifold approximation and projection (UMAP) with differential privacy was performed to reduce the dimension of the window data while preserving privacy. The privacy budget was determined by the data volume and the attribute’s Shapley value, and DP noise was added accordingly. Finally, the privacy analysis and experimental comparisons demonstrated that EDP-PUDL ensures user privacy while achieving superior aggregation efficiency and availability compared to other algorithms used for dynamic high-dimensional data aggregation.
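How DP noise relates to a privacy budget can be sketched with the classic Laplace mechanism, where the noise scale is the query sensitivity divided by the budget ε. This is a generic illustration under stated assumptions: the abstract does not say which noise distribution EDP-PUDL uses, the budget here is a fixed number rather than one derived from data volume and Shapley values, and the function names are hypothetical.

```python
import random


def laplace_scale(sensitivity, epsilon):
    """Scale b of the Laplace mechanism: b = sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def privatise(values, sensitivity, epsilon, seed=0):
    """Add independent Laplace noise to each value (seeded for reproducibility)."""
    rng = random.Random(seed)
    scale = laplace_scale(sensitivity, epsilon)
    return [v + laplace_noise(scale, rng) for v in values]


# A smaller epsilon (tighter budget) gives a larger scale and therefore more noise.
noisy = privatise([1.0, 2.0, 3.0], sensitivity=1.0, epsilon=0.5)
```

The fixed seed is only there to make the sketch repeatable; a real deployment would not use a predictable seed.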

19 pages, 10716 KiB

Open Access Article

Crop Water Status Analysis from Complex Agricultural Data Using UMAP-Based Local Biplot

by Jenniffer Carolina Triana-Martinez, Andrés Marino Álvarez-Meza, Julian Gil-González, Tom De Swaef and Jose A. Fernandez-Gallego

Remote Sens. 2024, 16(15), 2854; https://doi.org/10.3390/rs16152854 - 4 Aug 2024

Viewed by 679

Abstract

To optimize growth and management, precision agriculture relies on a deep understanding of agricultural dynamics, particularly crop water status analysis. Leveraging unmanned aerial vehicles, we can efficiently acquire high-resolution spatiotemporal samples by utilizing remote sensors. However, non-linear relationships among data features, localized within specific subgroups, frequently emerge in agricultural data. Interpreting these complex patterns requires sophisticated analysis due to the presence of noise, high variability, and non-stationary behavior in the collected samples. Here, we introduce Local Biplot, a methodological framework tailored for discerning meaningful data patterns in non-stationary contexts for precision agriculture. Local Biplot relies on the well-known uniform manifold approximation and projection (UMAP) method and local affine transformations to codify non-stationary and non-linear data patterns while maintaining interpretability. This lets us find important clusters for transformation and projection within a single global axis pair. Hence, our framework encompasses variable and observational contributions within individual clusters. At the same time, we provide a relevance analysis strategy to help explain why those clusters exist, facilitating the understanding of data dynamics while favoring interpretability. We demonstrated our method’s capabilities through experiments on both synthetic and real-world datasets, covering scenarios involving grass and rice crops. Moreover, we use random forest and linear regression models to predict water status variables from our Local Biplot-based feature ranking and clusters. Our findings revealed enhanced clustering and prediction capability while emphasizing the importance of input features in precision agriculture. As a result, Local Biplot is a useful tool to visualize, analyze, and compare the intricate underlying patterns and internal structures of complex agricultural datasets.

(This article belongs to the Special Issue Application of Satellite and UAV Data in Precision Agriculture)

21 pages, 4306 KiB

Open Access Article

Comparative Analysis of Manifold Learning-Based Dimension Reduction Methods: A Mathematical Perspective

by Wenting Yi, Siqi Bu, Hiu-Hung Lee and Chun-Hung Chan

Mathematics 2024, 12(15), 2388; https://doi.org/10.3390/math12152388 - 31 Jul 2024

Viewed by 651

Abstract

Manifold learning-based approaches have emerged as prominent techniques for dimensionality reduction. Among these methods, t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) stand out as two of the most widely used and effective approaches. While both methods share similar underlying procedures, empirical observations indicate two distinctive properties: global data structure preservation and computational efficiency. However, the underlying mathematical principles behind these distinctions remain elusive. To address this gap, this study presents a comparative analysis of the subprocesses involved in these methods, aiming to elucidate the mathematical mechanisms underlying the observed distinctions. By meticulously examining the equation formulations, the mathematical mechanisms contributing to global data structure preservation and computational efficiency are elucidated. To validate the theoretical analysis, data are collected through a laboratory experiment, and an open-source dataset is utilized for validation across different datasets. The consistent alignment of results obtained from both balanced and unbalanced datasets robustly confirms the study’s findings. The insights gained from this study provide a deeper understanding of the mathematical underpinnings of t-SNE and UMAP, enabling more informed and effective use of these dimensionality reduction techniques in various applications, such as anomaly detection, natural language processing, and bioinformatics.
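As background for the comparison, the low-dimensional similarities and loss functions of the two methods are commonly written as follows. These are the standard textbook forms, given here in notation assumed for this note (embedded points y_i, high-dimensional similarities p_ij, UMAP curve parameters a and b, which the reference UMAP implementation fits from min_dist); the article's own symbols may differ.

```latex
% t-SNE: normalised Student-t similarity and KL-divergence loss
q_{ij}^{\text{t-SNE}} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}},
\qquad
C_{\text{t-SNE}} = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

% UMAP: unnormalised similarity and fuzzy cross-entropy loss
q_{ij}^{\text{UMAP}} = \big(1 + a \lVert y_i - y_j \rVert^{2b}\big)^{-1},
\qquad
C_{\text{UMAP}} = \sum_{i \neq j} \Big[ p_{ij} \log \frac{p_{ij}}{q_{ij}} + (1 - p_{ij}) \log \frac{1 - p_{ij}}{1 - q_{ij}} \Big]
```

Two differences are visible directly in these forms: t-SNE's global normalisation over all pairs, which UMAP omits, and UMAP's extra repulsive term in (1 - p_ij), which penalises placing dissimilar points close together.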

(This article belongs to the Section Mathematics and Computer Science)

Figure 1
Illustration of high-dimensional data and low-dimensional data with manifold learning (ML) and non-manifold learning (non-ML) techniques. Four colors represent four different clusters.

Figure 2
Illustration of global and local structures in high dimension and low dimension with t-SNE and UMAP. Four colors represent four different clusters.

Figure 3
The general algorithm structure of manifold learning techniques.

Figure 4
The impact of the parameters α and β on similarity scores in low dimension.

Figure 5
The curve-fitting results with different values of min_dist.

Figure 6
Comparison of the loss functions of t-SNE and UMAP.

Figure 7
The impact of different σ on the high-dimensional probability.

Figure 8
Laboratory experimental test for data collection.

22 pages, 16238 KiB

Open Access Feature Paper Article

Spectroscopic Phenological Characterization of Mangrove Communities

by Christopher Small and Daniel Sousa

Remote Sens. 2024, 16(15), 2796; https://doi.org/10.3390/rs16152796 - 30 Jul 2024

Viewed by 473

Abstract

Spaceborne spectroscopic imaging offers the potential to improve our understanding of biodiversity and ecosystem services, particularly for challenging and rich environments like mangroves. Understanding the signals present in large volumes of high-dimensional spectroscopic observations of vegetation communities requires the characterization of seasonal phenology and response to environmental conditions. This analysis leverages both spectroscopic and phenological information to characterize vegetation communities in the Sundarban riverine mangrove forest of the Ganges–Brahmaputra delta. Parallel analyses of surface reflectance spectra from NASA’s EMIT imaging spectrometer and MODIS vegetation abundance time series (2000–2022) reveal the spectroscopic and phenological diversity of the Sundarban mangrove communities. A comparison of spectral and temporal feature spaces rendered with low-order principal components and 3D embeddings from Uniform Manifold Approximation and Projection (UMAP) reveals similar structures with multiple spectral and temporal endmembers and multiple internal amplitude continua for both EMIT reflectance and MODIS Enhanced Vegetation Index (EVI) phenology. The spectral and temporal feature spaces of the Sundarban represent independent observations sharing a common structure that is driven by the physical processes controlling tree canopy spectral properties and their temporal evolution. Spectral and phenological endmembers reside at the peripheries of the mangrove forest with multiple outward gradients in amplitude of reflectance and phenology within the forest. Longitudinal gradients of both phenology and reflectance amplitude coincide with LiDAR-derived gradients in tree canopy height and sub-canopy ground elevation, suggesting the influence of surface hydrology and sediment deposition. RGB composite maps of both linear (PC) and nonlinear (UMAP) 3D feature spaces reveal a strong contrast between the phenological and spectroscopic diversity of the eastern Sundarban and the less diverse western Sundarban.

(This article belongs to the Special Issue Remote Sensing of Land Surface Phenology II)

15 pages, 6283 KiB

Open Access Article

Precision Detection of Salt Stress in Soybean Seedlings Based on Deep Learning and Chlorophyll Fluorescence Imaging

by Yixin Deng, Nan Xin, Longgang Zhao, Hongtao Shi, Limiao Deng, Zhongzhi Han and Guangxia Wu

Plants 2024, 13(15), 2089; https://doi.org/10.3390/plants13152089 - 27 Jul 2024

Viewed by 579

Abstract

Soil salinization poses a critical challenge to global food security, impacting plant growth, development, and crop yield. This study investigates the efficacy of deep learning techniques alongside chlorophyll fluorescence (ChlF) imaging technology for discerning varying levels of salt stress in soybean seedlings. Traditional methods for stress identification in plants are often laborious and time-intensive, prompting the exploration of more efficient approaches. A total of six classic convolutional neural network (CNN) models (AlexNet, GoogLeNet, ResNet50, ShuffleNet, SqueezeNet, and MobileNetv2) are evaluated for salt stress recognition based on three types of ChlF images. Results indicate that ResNet50 outperforms other models in classifying salt stress levels across three types of ChlF images. Furthermore, fusing the features extracted from the three types of ChlF images at the average pooling layer of ResNet50 significantly enhanced classification accuracy, with the highest accuracy of 98.61% achieved when features from all three types of ChlF images were fused. UMAP dimensionality reduction analysis confirms the discriminative power of fused features in distinguishing salt stress levels. These findings underscore the efficacy of deep learning and ChlF imaging technologies in elucidating plant responses to salt stress, offering insights for precision agriculture and crop management. Overall, this study demonstrates the potential of integrating deep learning with ChlF imaging for precise and efficient crop stress detection, offering a robust tool for advancing precision agriculture. The findings contribute to enhancing agricultural sustainability and addressing global food security challenges by enabling more effective crop stress management.

(This article belongs to the Special Issue Practical Applications of Chlorophyll Fluorescence Measurements)

Figure 1
Workflow of using chlorophyll fluorescence imaging to analyze and detect various concentrations of salt stress in soybean seedlings. ChlF: Chlorophyll fluorescence; CNN: convolutional neural network; SVM: support vector machine.

Figure 2
Pseudo-color images of Fv/Fm, Y(NO), and Inh parameters of soybean seedlings under different salt concentration stresses.

Figure 3
Soybean seedling images and leaf SPAD values under various salt concentration stresses. (a–e): Phenotypic images of soybean seedlings treated with 0, 50, 100, 150, and 200 mM NaCl, respectively. (f): SPAD values measured from the leaves of soybean seedlings under the different salt concentrations. Values shown are means ± SD of three measurements taken in all leaves per treatment. The lowercase letters in the bar charts represent significant differences between the indicated groups as tested with a one-way analysis of variance (ANOVA, p < 0.05).

Figure 4
Comparison of validation and testing accuracies among six different CNN models. The red asterisk indicates the highest accuracy.

Figure 5
Confusion matrix of different modeling algorithms. The different shades of green represent varying levels of classification accuracy, with darker shades indicating higher accuracy and lighter shades indicating lower accuracy.

Figure 6
The scatter plot of 2D dimensionality reduction test features using UMAP.

43 pages, 8643 KiB

Open Access Article

Diffusion on PCA-UMAP Manifold: The Impact of Data Structure Preservation to Denoise High-Dimensional Single-Cell RNA Sequencing Data

by Padron-Manrique Cristian, Vázquez-Jiménez Aarón, Esquivel-Hernandez Diego Armando, Martinez-Lopez Yoscelina Estrella, Neri-Rosario Daniel, Giron-Villalobos David, Mixcoha Edgar, Sánchez-Castañeda Jean Paul and Resendis-Antonio Osbaldo

Biology 2024, 13(7), 512; https://doi.org/10.3390/biology13070512 - 9 Jul 2024

Cited by 1 | Viewed by 887

Abstract

Single-cell transcriptomics (scRNA-seq) is revolutionizing biological research, yet it faces challenges such as inefficient transcript capture and noise. To address these challenges, methods like neighbor averaging or graph diffusion are used. These methods often rely on k-nearest neighbor graphs from low-dimensional manifolds. However, scRNA-seq data suffer from the ‘curse of dimensionality’, leading to the over-smoothing of data when using imputation methods. To overcome this, sc-PHENIX employs a PCA-UMAP diffusion method, which enhances the preservation of data structures and allows for a refined use of PCA dimensions and diffusion parameters (e.g., k-nearest neighbors, exponentiation of the Markov matrix) to minimize noise introduction. This approach enables a more accurate construction of the exponentiated Markov matrix (cell neighborhood graph), surpassing methods like MAGIC. sc-PHENIX significantly mitigates over-smoothing, as validated through various scRNA-seq datasets, demonstrating improved cell phenotype representation. Applied to a multicellular tumor spheroid dataset, sc-PHENIX identified known extreme phenotype states, showcasing its effectiveness. sc-PHENIX is open-source and available for use and modification.
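The graph-diffusion idea mentioned in the abstract (row-normalise a cell-cell affinity matrix into a Markov matrix, raise it to a power t, and multiply it with the expression data) can be sketched in plain Python. This is a generic diffusion-imputation illustration, not sc-PHENIX's or MAGIC's actual code: the tiny affinity matrix, the expression values, and the function names are assumptions, and the affinity matrix is taken as given rather than built from a PCA-UMAP space.

```python
def row_normalise(affinity):
    """Turn a non-negative affinity matrix into a row-stochastic Markov matrix."""
    return [[a / sum(row) for a in row] for row in affinity]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def diffuse(affinity, data, t):
    """Return (M^t, M^t @ data), where M is the row-normalised affinity, t >= 1."""
    markov = row_normalise(affinity)
    powered = markov
    for _ in range(t - 1):
        powered = matmul(powered, markov)
    return powered, matmul(powered, data)


# Hypothetical neighbourhood graph of three cells (with self-loops) and one gene.
affinity = [[1, 1, 0],
            [1, 1, 1],
            [0, 1, 1]]
data = [[0.0], [3.0], [6.0]]
powered, imputed = diffuse(affinity, data, t=2)
```

Larger values of `t` spread each cell's signal over a wider neighbourhood, which is the source of the over-smoothing the abstract describes: as `t` grows, the rows of `powered` converge and the imputed values of all cells approach a common average.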

(This article belongs to the Special Issue Machine Learning Applications in Biology)

13 pages, 7339 KiB

Open Access Article

Improving the Two-Color Temperature Sensing Using Machine Learning Approach: GdVO₄:Sm³⁺ Prepared by Solution Combustion Synthesis (SCS)

by Jovana Z. Jelic, Aleksa Dencevski, Mihailo D. Rabasovic, Janez Krizan, Svetlana Savic-Sevic, Marko G. Nikolic, Myriam H. Aguirre, Dragutin Sevic and Maja S. Rabasovic

Photonics 2024, 11(7), 642; https://doi.org/10.3390/photonics11070642 - 6 Jul 2024

Viewed by 587

Abstract

Samarium-doped gadolinium vanadate (GdVO₄:Sm³⁺) nanopowder was prepared by the solution combustion synthesis (SCS) method. After synthesis, in order to achieve full crystallinity, the material was annealed in air at 900 °C. Phase identification in the post-annealed powder samples was performed by X-ray diffraction, and morphology was investigated by high-resolution scanning electron microscopy (SEM) and transmission electron microscopy (TEM). Photoluminescence characterization of the emission spectrum and time-resolved analysis were performed using tunable laser optical parametric oscillator excitation and a streak camera. In addition to the samarium emission bands, a weak broad luminescence emission band of the host VO₄³⁻ was also observed by the detection system. In our earlier work, we analyzed the possibility of using the host luminescence for two-color temperature sensing, improving the method by introducing the temporal dependence in line intensity ratio measurements. Here, we showed that further improvements are possible by using the machine learning approach. To facilitate the initial data assessment, we incorporated Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) clustering of GdVO₄:Sm³⁺ spectra at various temperatures. Good predictions of temperature were obtained using deep neural networks. Performance of the deep learning network was enhanced by a data augmentation technique.

(This article belongs to the Special Issue Editorial Board Members’ Collection Series: Photonics Sensors)

16 pages, 3790 KiB

Open Access Article

iNP_ESM: Neuropeptide Identification Based on Evolutionary Scale Modeling and Unified Representation Embedding Features

by Honghao Li, Liangzhen Jiang, Kaixiang Yang, Shulin Shang, Mingxin Li and Zhibin Lv

Int. J. Mol. Sci. 2024, 25(13), 7049; https://doi.org/10.3390/ijms25137049 - 27 Jun 2024

Viewed by 1174

Abstract

Neuropeptides are biomolecules with crucial physiological functions. Accurate identification of neuropeptides is essential for understanding nervous system regulatory mechanisms. However, traditional analysis methods are expensive and laborious, and the development of effective machine learning models continues to be a subject of current research. Hence, in this research, we constructed an SVM-based machine learning neuropeptide predictor, iNP_ESM, by integrating protein language models Evolutionary Scale Modeling (ESM) and Unified Representation (UniRep) for the first time. Our model utilized feature fusion and feature selection strategies to improve prediction accuracy during optimization. In addition, we validated the effectiveness of the optimization strategy with UMAP (Uniform Manifold Approximation and Projection) visualization. iNP_ESM outperforms existing models on a variety of machine learning evaluation metrics, with an accuracy of up to 0.937 in cross-validation and 0.928 in independent testing, demonstrating optimal neuropeptide recognition capabilities. We anticipate improved neuropeptide data in the future, and we believe that the iNP_ESM model will have broader applications in the research and clinical treatment of neurological diseases. Full article

(This article belongs to the Section Molecular Neurobiology)

Figure 1
An overview of the iNP_ESM model. Initially, neuropeptide sequences are input into the protein language models ESM and UniRep, generating 1280D ESM features and 1900D UniRep features for each sequence. Subsequently, these features are combined to form a 3180D fused feature. This fused feature can be directly input into an SVM model. Alternatively, after dimensionality reduction through feature selection to 120 dimensions, the reduced feature can also be input into the SVM model. Following a series of optimizations and performance comparisons, the iNP_ESM model is finalized.
Figure 2
Comparison of 10-fold cross-validation metrics for the combination of six feature encoding methods and seven machine learning algorithms. Here, UniRep is represented in dark green, ESM in light green, SSA in light yellow, LM in dark yellow, BiLSTM in dark red, and TAPE_BERT in light red. The machine learning algorithms include (A) GNB, (B) KNN, (C) LDA, (D) LGBM, (E) LR, (F) RF, and (G) SVM.
Figure 3
Comparison of the average values from 10-fold cross-validation and an independent test between fused feature models and single feature models. Here, UniRep is represented in green, ESM in light yellow, and UniRep+ESM_F3180 in dark yellow. The machine learning algorithms include (A) LGBM (parameters: {'num_trees': 1300, 'learning_rate': 0.28}) and (B) SVM (parameters: {'C': 1.9306977288832496, 'gamma': 'scale'}).
Figure 4
The variation in (A) accuracy and (B) Matthews correlation coefficient during the feature selection process for ESM+UniRep_F3180 with the number of features. Here, 10-fold cross-validation metrics are represented in green, independent test metrics in red, and the average of cross-validation and independent test metrics in yellow. LGBM Classifier parameters: {'num_leaves': 32, 'n_estimators': 888, 'max_depth': 12, 'learning_rate': 0.16, 'min_child_samples': 50, 'random_state': 2020, 'n_jobs': 8}.
Figure 5
UMAP visualization plots (parameters: {'metric': 'correlation', 'n_neighbors': 45, 'min_dist': 0.12}). (A) using the ESM training set; (B) using the UniRep training set; (C) using the ESM+UniRep_F3180 training set; and (D) using the ESM+UniRep_F120 training set. Neuropeptides are represented by orange dots, and non-neuropeptides are represented by blue dots.
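
The feature fusion step in the iNP_ESM overview (1280D ESM plus 1900D UniRep giving a 3180D vector, later reduced to 120 dimensions) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the embeddings are random stand-ins, the helper names `fuse`, `top_k_by_variance`, and `reduce_features` are hypothetical, and ranking columns by variance is only a placeholder for whatever feature selection criterion the paper actually used.

```python
import random
from statistics import pvariance

# Dimensions quoted in the iNP_ESM overview caption.
ESM_DIM, UNIREP_DIM, SELECTED_DIM = 1280, 1900, 120


def fuse(esm_vec, unirep_vec):
    """Concatenate one sequence's ESM and UniRep embeddings into one vector."""
    if len(esm_vec) != ESM_DIM or len(unirep_vec) != UNIREP_DIM:
        raise ValueError("unexpected embedding size")
    return list(esm_vec) + list(unirep_vec)


def top_k_by_variance(rows, k):
    """Return sorted indices of the k highest-variance columns.

    Variance is a placeholder criterion (an assumption), not the paper's method.
    """
    n_cols = len(rows[0])
    scores = [pvariance([row[j] for row in rows]) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def reduce_features(rows, keep):
    """Keep only the selected columns of every row."""
    return [[row[j] for j in keep] for row in rows]


rng = random.Random(0)
# Random stand-ins for the embeddings of five sequences.
fused_rows = [
    fuse([rng.gauss(0, 1) for _ in range(ESM_DIM)],
         [rng.gauss(0, 1) for _ in range(UNIREP_DIM)])
    for _ in range(5)
]
keep = top_k_by_variance(fused_rows, SELECTED_DIM)
reduced_rows = reduce_features(fused_rows, keep)
print(len(fused_rows[0]), len(reduced_rows[0]))
```

Either the fused or the reduced rows could then be passed to a classifier such as an SVM, which is the arrangement the caption describes.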

28 pages, 9412 KiB

Open Access Article

Deciphering Abnormal Platelet Subpopulations in COVID-19, Sepsis and Systemic Lupus Erythematosus through Machine Learning and Single-Cell Transcriptomics

by Xinru Qiu, Meera G. Nair, Lukasz Jaroszewski and Adam Godzik

Int. J. Mol. Sci. 2024, 25(11), 5941; https://doi.org/10.3390/ijms25115941 - 29 May 2024

Viewed by 973

Abstract

This study focuses on understanding the transcriptional heterogeneity of activated platelets and its impact on diseases such as sepsis, COVID-19, and systemic lupus erythematosus (SLE). Recognizing the limited knowledge in this area, our research aims to dissect the complex transcriptional profiles of activated platelets to aid in developing targeted therapies for abnormal and pathogenic platelet subtypes. We analyzed single-cell transcriptional profiles from 47,977 platelets derived from 413 samples of patients with these diseases, utilizing Deep Neural Network (DNN) and eXtreme Gradient Boosting (XGB) to distinguish transcriptomic signatures predictive of fatal or survival outcomes. Our approach included source data annotations and platelet markers, along with SingleR and Seurat for comprehensive profiling. Additionally, we employed Uniform Manifold Approximation and Projection (UMAP) for effective dimensionality reduction and visualization, aiding in the identification of various platelet subtypes and their relation to disease severity and patient outcomes. Our results highlighted distinct platelet subpopulations that correlate with disease severity, revealing that changes in platelet transcription patterns can intensify endotheliopathy, increasing the risk of coagulation in fatal cases. Moreover, these changes may impact lymphocyte function, indicating a more extensive role for platelets in inflammatory and immune responses. This study identifies crucial biomarkers of platelet heterogeneity in serious health conditions, paving the way for innovative therapeutic approaches targeting platelet activation, which could improve patient outcomes in diseases characterized by altered platelet function. Full article

(This article belongs to the Special Issue New Advances in Platelet Biology and Functions: 2nd Edition)

Figure 1
PBMC profiling from healthy controls, sepsis, similar symptom hospitalized, COVID-19, and SLE patients. (A) Schematic outline depicting the workflow for data collection from published literature and subsequent integrated analysis. Created with biorender.com. (B–F) Bar plots depicting the percentage of different cell types under different disease severities. (B) Platelets, (C) T cells, (D) B cells, (E) monocytes, and (F) neutrophils. (G) DC under different outcome situations. The differences in percentages associated with adjusted p-values below 0.05, 0.01, 0.001, and 0.0001 are indicated as *, **, ***, and ****, respectively, and not significant ones are not shown. The significance analysis was performed using Wilcoxon tests. Standard error bars were also added. (H) Receiver operating characteristic (ROC) curves for the platelet to T cell ratio and other cell type percentages were used to distinguish non-survivors from survivors.
Figure 2
Deep neural networks and XGBoost modeling identify biomarkers of survival and fatal platelets. (A) Comparison of DNN (green) and XGB (blue) model performance. (B) Venn diagram of features from the Deep Neural Network (DNN) and XGBoost (XGB) models, restricted to features that rank in the highest 5% by importance or gain within each model, together with features that show a differentially expressed gene (DEG) profile with an absolute log2 fold change (log2fc) greater than 1. (C) Volcano plot depicting genes that are upregulated or downregulated when comparing platelets from survivors to those from fatal cases. The x-axis represents the log fold change, a measure of the change in expression levels between two conditions. A zero value indicates no change, positive values indicate upregulation, and negative values indicate downregulation in the condition of interest relative to a reference condition. The y-axis represents the negative log10 adjusted p-value. This transformation is used to amplify differences in p-values, so that small p-values (which indicate statistical significance) result in larger values on the plot. The horizontal dashed line typically represents a threshold of significance (e.g., adjusted p-value of 0.05), above which the findings are considered statistically significant. (D,E) Bar charts illustrating the enrichment of biological pathways among genes up-regulated in survival and fatal cases. The color of the bars represents the level of statistical significance after adjustment for multiple comparisons, with darker colors indicating more statistically significant enrichment.
Figure 3
Differential expression of platelets affects endotheliopathy across disease severity states. (A) The expression of the ITGA2B gene in platelets across severity states. (B) The expression module GO: Cytoskeleton organization in platelets across severity states. Violin plots are ordered according to the decreasing average value of the expression. (C) The comparison of GO terms blood coagulation (GO:0007596), inflammatory response (GO:0006954), apoptotic process (GO:0006915), extracellular matrix disassembly (GO:0022617), and platelet activation (GO:0030168) expression. Heatmap coloring represents z-scored scores averaged across all cells in a given sample.
Figure 4
Clustered integrative analysis of platelets’ single-cell transcriptional landscape. (A,B) Cell cluster UMAP representation of all merged platelets (A) colored by data source; (B) colored by clusters. (C,D) Stacked bar plots display platelet cluster proportion under (C) different disease severity, and (D) different outcome situations. (E) Violin plots showing expression levels of marker genes for each cluster in platelets. In (A), we gathered single-cell RNA-seq datasets of peripheral blood mononuclear cells (PBMCs) from COVID-19 [30–38], sepsis [12,39], and systemic lupus erythematosus (SLE) [40] patients.
Figure 5
Clustered platelets and their unique pathway expression changes. (A) Enrichment analysis for human hallmark gene sets for each platelet cluster. Expression level is the normalized enrichment score in the GSEA algorithm. (B–E) Bar plot of hallmark gene sets expression among clusters. (B) Coagulation, (C) Apical Junction, (D) Epithelial Mesenchymal Transition, and (E) E2F Targets. (F) Heatmap displaying fatal cluster C9 up-regulated genes and enriched pathways in gene ontology (GO). (G) Gene-Concept network displaying fatal cluster C4 up-regulated genes and enriched pathways from Gene, Disease Features Ontology-based Overview System (gendoo) [41] and diseases category. (H) Tree plot displaying convalescence cluster C8 enriched pathways in GO.
Figure 6
Platelet module signatures in patients related to survival and the fatal dynamic trend. (A,B) Pseudo-time plot of platelets from clusters C0, C1, C4, and C6 exhibiting trajectory fates. (C,D) Differentially expressed genes in (C) C4 vs. C0; (D) C0 vs. C1. The volcano plot displays the genes constantly up- or down-regulated in the direction of disease severity. Volcano plots were prepared with the R package EnhancedVolcano (v.1.13.2). In the volcano plot, genes with an absolute log2 fold change greater than 0.5 are represented by red dots, while those with a lower absolute log2 fold change are represented by blue dots. (E–H) Ridge plots showing the density of expression level of (E) Fatal module expression under platelet clusters, (F) Fatal module expression under disease severities, (G) Survival module expression under platelet clusters, and (H) Survival module expression under disease severities. Ridge plots are ordered in descending order. (I) Receiver operating characteristic (ROC) curves for each platelet cluster percentage from PBMC were used to distinguish non-survivors from survivors.
Figure 7
Platelets pathway expression among healthy controls, sepsis, similar symptom hospitalized, COVID-19, and SLE patients. (A,B) Venn diagrams describing changes in Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways from COVID-19, similar symptoms hospitalized (SSH), sepsis, and systemic lupus erythematosus (SLE) vs. healthy controls (HC). (A) Up-regulated pathways. (B) Down-regulated pathways. The pathways were filtered for those with an adjusted p-value under 0.05. (C) Heatmap illustration of non-survivor vs. survivor up- or down-regulated pathways. Colors are determined by the product of the COVID-19 and sepsis up/down-regulated enriched pathway log10 (adjusted p value). The GO terms were reduced to representative ones using R package rrvgo (v.1.8.0) [44] (the cutoffs were similarity > 0.4) and then overlapped. (D) Heatmap illustration of diseases vs. healthy controls up- or down-regulated pathways. Color is determined by the product of the COVID-19, SSH, SLE, and sepsis up- or down-regulated enriched pathway log10 (adjusted p value). The GO terms were reduced to representative ones using R package rrvgo (v.1.8.0) (the cutoffs were similarity > 0.1) and then overlapped.
Figure 8
Alterations in platelet and other cell type interactions among healthy controls, sepsis, similar symptom hospitalized, COVID-19, and SLE patients. (A) Comparison of ligand-receptor interaction scores between platelets and other cell types. Heatmap coloration corresponds to z-scored, log-normalized mean interaction scores averaged across all cells from a specific sample. (B–E) Ligand and receptor interaction scores between platelets and (B) T cells across various disease severity levels, (C) T cells across different outcomes, (D) B cells across various disease severity levels, and (E) B cells across different outcomes. The circle size represents the z-scored interaction scores. (F,G) Pathway module scores across different outcome situations in platelets, including (F) T cell differentiation and (G) B cell proliferation. The differences in scores associated with adjusted p-values below 0.05, 0.01, 0.001, and 0.0001 are indicated as *, **, ***, and ****, respectively.
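
The volcano plot conventions and the star notation used in these captions can be expressed as a short sketch. It assumes only what the captions state: the y-axis is the negative log10 of the adjusted p-value, an adjusted p-value of 0.05 is the example significance threshold, and an absolute log2 fold change greater than 1 marks a DEG. The function names and the "up"/"down"/"ns" labels are hypothetical, and the sample values in the usage lines are made up for illustration.

```python
import math


def significance_stars(p_adj):
    """Map an adjusted p-value to the star notation used in the captions."""
    for threshold, stars in ((0.0001, "****"), (0.001, "***"),
                             (0.01, "**"), (0.05, "*")):
        if p_adj < threshold:
            return stars
    return ""  # not significant: nothing is shown


def volcano_point(log2fc, p_adj, fc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, status) for one gene on a volcano plot.

    x is the log2 fold change, y is -log10(adjusted p-value).
    """
    if p_adj <= 0:
        raise ValueError("adjusted p-value must be positive")
    y = -math.log10(p_adj)
    if p_adj < p_cutoff and log2fc > fc_cutoff:
        status = "up"
    elif p_adj < p_cutoff and log2fc < -fc_cutoff:
        status = "down"
    else:
        status = "ns"
    return log2fc, y, status


# Made-up example values, not data from the paper.
print(volcano_point(2.0, 0.001))
print(volcano_point(-1.5, 0.2))
print(significance_stars(0.03))
```

Passing `fc_cutoff=0.5` would reproduce the looser fold-change cutoff quoted for the EnhancedVolcano plots in Figure 6.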

19 pages, 20483 KiB

Open Access Article

Subcellular Feature-Based Classification of α and β Cells Using Soft X-ray Tomography

by Aneesh Deshmukh, Kevin Chang, Janielle Cuala, Bieke Vanslembrouck, Senta Georgia, Valentina Loconte and Kate L. White

Cells 2024, 13(10), 869; https://doi.org/10.3390/cells13100869 - 18 May 2024

Viewed by 1350

Abstract

The dysfunction of α and β cells in pancreatic islets can lead to diabetes. Many questions remain on the subcellular organization of islet cells during the progression of disease. Existing three-dimensional cellular mapping approaches face challenges such as time-intensive sample sectioning and subjective cellular identification. To address these challenges, we have developed a subcellular feature-based classification approach, which allows us to identify α and β cells and quantify their subcellular structural characteristics using soft X-ray tomography (SXT). We observed significant differences in whole-cell morphological and organelle statistics between the two cell types. Additionally, we characterize subtle biophysical differences between individual insulin and glucagon vesicles by analyzing vesicle size and molecular density distributions, which were not previously possible using other methods. These sub-vesicular parameters enable us to predict cell types systematically using supervised machine learning. We also visualize distinct vesicle and cell subtypes using Uniform Manifold Approximation and Projection (UMAP) embeddings, which provides us with an innovative approach to explore structural heterogeneity in islet cells. This methodology presents an innovative approach for tracking biologically meaningful heterogeneity in cells that can be applied to any cellular system. Full article

(This article belongs to the Special Issue Advanced Technology for Cellular Imaging)

Figure 1
3D reconstruction and quantitative analysis of α and β cell morphology. (A) Orthoslice showing the XY plane through the soft X-ray tomogram of representative α and β cells (α_3 and β_6, respectively). Cell constituents and organelles are distinguished from one another based on their LAC values and are identified as follows: nucleus–blue arrowhead, mitochondria–pink arrowhead, glucagon vesicles–red arrowhead, and insulin vesicles–green arrowhead. The overall LAC value range of the orthoslice is between 0.15 and 0.4 μm⁻¹ to optimize contrast. Scale bar: 2 μm. (B) 3D reconstruction of the representative α and β cells (α_3 and β_6, respectively). In detail, the reconstruction shows the nucleus (blue), mitochondria (pink), glucagon vesicles ((left), in red), insulin vesicles ((right), in green), and plasma membrane (gray). (C) Cellular volume of both cell types, showing a significantly higher volume (*** p < 0.001) for β cells (1191 ± 277 μm³) compared with α cells (579 ± 247 μm³). (D) Nuclear volume of both cell types showed no significant difference (p = 0.76). (E) Comparison between mean nuclear occupancy for each cell type, with a significant increase (*** p < 0.001) in percentage occupancy of the nucleus for α cells (21 ± 5%) compared with β cells (10 ± 3%). (F) Number of insulin vesicles normalized by cytosolic volume indicating a significantly higher number of vesicles (* p = 0.03) per cytosolic μm³ for α cells (3.3 ± 1.4 vesicles/μm³) compared with β cells (2 ± 0.6 vesicles/μm³). (G) Plot of mean vesicle diameters of α and β cell vesicles demonstrating a higher vesicle diameter (*** p < 0.001) for α cell vesicles (212 ± 21 nm) compared with β cell vesicles (163 ± 13 nm). (H) Mean Vesicle LAC for secretory vesicles of α and β cells showing a significantly higher mean LAC (** p = 0.003) for α cell vesicles (0.37 ± 0.03 μm⁻¹) compared with β cell vesicles (0.33 ± 0.02 μm⁻¹). Error bars in each plot are representative of the standard deviation. Welch’s t-test was used as a statistical test. n = 8 for α cells (red) and n = 7 for β cells (green).
Figure 2
3D reconstruction and quantitative analysis of pooled insulin and glucagon vesicles. (A) (left) XY Orthoslice through the SXT of representative α and β cells. Glucagon vesicles (red arrowheads) and insulin vesicles (green arrowheads) can be identified based on their high LAC values. The overall LAC value in the orthoslice is thresholded between 0.15 and 0.40 μm⁻¹ (scale bar: 0.5 μm). (right) 3D reconstruction of a section of representative α and β cells (α_3 and β_6, respectively). In detail, the reconstruction shows glucagon vesicles ((top), in red) and insulin vesicles ((bottom), in green), and plasma membrane (gray). (B) Histogram showing the size distribution of glucagon and insulin vesicles. The vesicles for each cell type are pooled together and show a significantly higher diameter (**** p < 0.0001) for insulin vesicles (194.2 ± 49 nm, green dotted line), compared with glucagon vesicles (157 ± 35 nm, red dotted line). (C) Histogram showing LAC distribution of glucagon and insulin vesicles demonstrating significantly higher mean vesicle LAC values (**** p < 0.0001) for insulin vesicles (0.37 ± 0.04 μm⁻¹, red dotted line), compared with glucagon vesicles (0.33 ± 0.03 μm⁻¹, green dotted line). (B,C) n = 10,694 for glucagon vesicles (red) and n = 14,690 for insulin vesicles (green). Welch’s t-test was used as a statistical test.
Figure 3
Description and comparison of LAC-based parameters between insulin and glucagon vesicles. (A) (top) XY Orthoslice through SXT of representative α cell (α_3). Scale bar: 0.5 μm. (middle) 3D reconstruction of a representative glucagon vesicle (red). (bottom) Histogram displaying the LAC distribution of the glucagon vesicle pictured in the (top) and (middle) panels, showing a mean LAC value of 0.34 μm⁻¹ for the vesicle. (B) (top) XY Orthoslice through SXT of representative β cell (β_6). Scale bar: 500 nm. (middle) 3D reconstruction of a representative insulin vesicle (green). (bottom) Histogram displaying the LAC distribution of the insulin vesicle pictured in the (top) and (middle) panels, showing a mean LAC value of 0.318 μm⁻¹ for the vesicle. (C) (top) A comparison of vesicle LAC parameters (minimum LAC, 25th quantile LAC, mean LAC, 75th quantile LAC, maximum LAC) between glucagon vesicles (red) and insulin vesicles (green) showing significantly higher values (**** p < 0.0001; one-way ANOVA with Bonferroni’s correction) for glucagon vesicles for all displayed parameters compared with insulin vesicles. (bottom) LAC histogram curve for a sample vesicle, with arrows indicating the value being compared in the (top) panel. (D) Plot showing a significantly higher (**** p < 0.0001; Welch’s t-test) standard deviation for glucagon vesicles (red) compared with insulin vesicles (green). Error bars in all plots are representative of the standard deviation. n = 10,694 for glucagon vesicles (red) and n = 14,690 for insulin vesicles (green).
Figure 4
Overview of machine learning strategy. (A) Data pre-processing for insulin and glucagon vesicles from β and α cells is conducted. A final vesicle feature matrix, including group labels (denoting which cell a vesicle is from) for vesicles, is used as input for machine learning. (B) Train/test split for grouping vesicles from α and β cells. The process of model building is described, with Leave One Group Out cross-validation used to estimate the performance of predicting vesicle identity from unseen cells. Model building and testing are repeated over 56 combinations to understand variability in performance.
Figure 5
Representing vesicle feature importances in UMAP embeddings. (A) Representative feature importances from each ML model listed in order of accuracy. The radius of circles is scaled to the magnitude of the permutation feature importances. Since the LAC standard deviation in the XGBoost model had the highest overall importance magnitude, the other parameters are scaled to it. In general, LAC mean, LAC standard deviation, and diameter seem to be the most important representative parameters. (B) UMAP embedding of vesicles colored by vesicle identity. Semi-distinct clusters of insulin and glucagon vesicles can be observed. (C) Vesicles colored by cellular origin. The overall trends in the pooled vesicle UMAP space do not seem to be driven by cell-dependent effects. (D) Embeddings colored by vesicle feature values. Gradients of LAC mean, standard deviation, and diameter correspond to regions of insulin and glucagon vesicles. The grouping of heterogeneous vesicle subpopulations can also be visualized.
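
The group-wise train/test split described in the machine learning overview can be illustrated with a small sketch. One reading of the "56 combinations" is that each split holds out one α cell and one β cell (8 × 7 = 56, given n = 8 and n = 7); that reading is an assumption made here, not something the captions state. The cell identifiers, the toy vesicle records, and the helper names are hypothetical.

```python
from itertools import product

# Hypothetical cell identifiers matching the reported group sizes.
alpha_cells = [f"alpha_{i}" for i in range(1, 9)]  # n = 8
beta_cells = [f"beta_{i}" for i in range(1, 8)]    # n = 7


def held_out_pairs(alpha_ids, beta_ids):
    """Every (alpha cell, beta cell) pair that could be held out for testing."""
    return list(product(alpha_ids, beta_ids))


def split_by_group(vesicles, held_out):
    """Split vesicles so that no held-out cell contributes to training.

    Each vesicle is a (cell_id, features, label) tuple.
    """
    held_out = set(held_out)
    train = [v for v in vesicles if v[0] not in held_out]
    test = [v for v in vesicles if v[0] in held_out]
    return train, test


# Toy data: two vesicles per cell with placeholder features.
vesicles = []
for cell in alpha_cells:
    vesicles += [(cell, (0.34, 160.0), "glucagon")] * 2
for cell in beta_cells:
    vesicles += [(cell, (0.32, 190.0), "insulin")] * 2

pairs = held_out_pairs(alpha_cells, beta_cells)
train, test = split_by_group(vesicles, pairs[0])
print(len(pairs), len(train), len(test))
```

Splitting by cell rather than by vesicle keeps vesicles from the same cell out of both sets at once, which is what lets the test score reflect performance on unseen cells.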

16 pages, 1496 KiB

Open Access Article

Identifying Novel Subtypes of Functional Gastrointestinal Disorder by Analyzing Nonlinear Structure in Integrative Biopsychosocial Questionnaire Data

by Sa-Yoon Park, Hyojin Bae, Ha-Yeong Jeong, Ju Yup Lee, Young-Kyu Kwon and Chang-Eop Kim

J. Clin. Med. 2024, 13(10), 2821; https://doi.org/10.3390/jcm13102821 - 10 May 2024

Cited by 1 | Viewed by 734

Abstract

Background/Objectives: Given the limited success in treating functional gastrointestinal disorders (FGIDs) through conventional methods, there is a pressing need for tailored treatments that account for the heterogeneity and biopsychosocial factors associated with FGIDs. Here, we considered the potential of novel subtypes of FGIDs based on biopsychosocial information. Methods: We collected data from 198 FGID patients utilizing an integrative approach that included the traditional Korean medicine diagnosis questionnaire for digestive symptoms (KM), as well as the 36-item Short Form Health Survey (SF-36), alongside the conventional Rome-criteria-based Korean Bowel Disease Questionnaire (K-BDQ). Multivariate analyses were conducted to assess whether KM or SF-36 provided additional information beyond the K-BDQ and its statistical relevance to symptom severity. Questions related to symptom severity were selected using an extremely randomized trees (ERT) regressor to develop an integrative questionnaire. For the identification of novel subtypes, Uniform Manifold Approximation and Projection and spectral clustering were used for nonlinear dimensionality reduction and clustering, respectively. The validity of the clusters was assessed using certain metrics, such as trustworthiness, silhouette coefficient, and accordance rate. An ERT classifier was employed to further validate the clustered result. Results: The multivariate analyses revealed that SF-36 and KM supplemented the psychosocial aspects lacking in K-BDQ. Through the application of nonlinear clustering using the integrative questionnaire data, four subtypes of FGID were identified: mild, severe, mind-symptom predominance, and body-symptom predominance. Conclusions: The identification of these subtypes offers a framework for personalized treatment strategies, thus potentially enhancing therapeutic outcomes by tailoring interventions to the unique biopsychosocial profiles of FGID patients. Full article

(This article belongs to the Special Issue Clinical Innovations in Digestive Disease Diagnosis and Treatment)

Figure 1
Overview of the whole analysis procedure. K-BDQ, Rome-criteria-based Korean Bowel Disease Questionnaire; KM, traditional Korean medicine diagnosis questionnaire for digestive symptoms; SF-36, 36-item Short Form Health Survey; CCA, Canonical Correlation Analysis; MLR, Multiple Linear Regression; FGID, functional gastrointestinal disorder; ERT, extremely randomized tree.
Figure 2
Exploring the similarity between each questionnaire. (A) Result of CCA between each pair of questionnaires, K-BDQ, KM, and SF-36. The width of the arrows represents the strength of the canonical correlation. (B) Distribution of adjusted R² values from MLR models to explain KM and SF-36 variables with a combination of upper and lower gastrointestinal symptom variables in K-BDQ. CCA, Canonical Correlation Analysis; MLR, Multiple Linear Regression; K-BDQ, Rome-criteria-based Korean Bowel Disease Questionnaire; KM, traditional Korean medicine diagnosis questionnaire for digestive symptoms; SF-36, 36-item Short Form Health Survey.
Figure 3
Relevance of the nonlinear clustered structure and the symptom severity. Each dot corresponds to a patient. Each axis corresponds to the UMAP dimensions 1, 2, and 3. (A) Result of dimensionality reduction and clustering analysis after hyperparameter tuning using each questionnaire’s information. Each color represents the result of clustering analysis. Patients’ data from K-BDQ, KM, and SF-36 were divided into six (n_neighbors = 10, n_clusters = 6, trustworthiness = 0.87, silhouette coefficient = 0.48, accordance rate = 0.83), three (n_neighbors = 30, n_clusters = 3, trustworthiness = 0.83, silhouette coefficient = 0.43, accordance rate = 0.84), and two (n_neighbors = 10, n_clusters = 2, trustworthiness = 0.87, silhouette coefficient = 0.64, accordance rate = 0.88) clusters. (B) Data embedding of K-BDQ was color-coded with the clustering result of 2A (the clustered labels when using K-BDQ, KM, and SF-36 information). Each color indicates the label colors of the clustering analysis (the same color as 2A). (C) Relevance of the clustered structure and the symptom severity when using K-BDQ, KM, and SF-36 information, respectively. Color intensity indicates the symptom severity. K-BDQ, Rome-criteria-based Korean Bowel Disease Questionnaire; KM, traditional Korean medicine diagnosis questionnaire for digestive symptoms; SF-36, 36-item Short Form Health Survey; C1, cluster 1; C2, cluster 2; C3, cluster 3.
Figure 4
Subtype identification result of FGID patients using the integrative questionnaire information. (A) Result of dimensionality reduction and clustering analysis after hyperparameter tuning using the integrative questionnaire information. Patients’ data from the integrative questionnaire were divided into four (n_neighbors = 30, n_clusters = 4, trustworthiness = 0.84, silhouette coefficient = 0.39, accordance rate = 0.85) clusters. Each color indicates different subtypes. Each dot corresponds to a patient, and each axis corresponds to the UMAP dimensions 1, 2, and 3. (B) Receiver operating characteristic curves and corresponding area under the curve statistics for the prediction of subtype label based on the integrative questionnaire information. S1, subtype 1; S2, subtype 2; S3, subtype 3; S4, subtype 4.
Figure 5
Characteristics of each FGID subtype. Line plot indicates the normalized average score of 29 body-symptom-related questions and 21 mind-symptom-related questions. Dot plot represents the normalized average score of each question for each subtype. S1, subtype 1; S2, subtype 2; S3, subtype 3; S4, subtype 4.
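
The silhouette coefficient reported alongside each clustering result can be computed as in the sketch below, which follows the standard definition with Euclidean distance. It is not the authors' pipeline: the three-dimensional points stand in for UMAP coordinates of patients and are made up, and the function name is only illustrative.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Made-up 3D coordinates for four patients forming two tight groups.
patients = [(0, 0, 0), (0, 1, 0), (10, 0, 0), (10, 1, 0)]
good_labels = [0, 0, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1]   # splits each group across clusters
good_score = silhouette_score(patients, good_labels)
bad_score = silhouette_score(patients, bad_labels)
print(round(good_score, 3), round(bad_score, 3))
```

Values near 1 indicate compact, well-separated clusters, while values near 0 or below indicate overlapping or misassigned points, which gives some context for the 0.39 to 0.64 range quoted in the captions.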
