Search Results (100)

Search Parameters:
Keywords = UMAP

24 pages, 8095 KiB  
Article
Signature Genes Selection and Functional Analysis of Astrocytoma Phenotypes: A Comparative Study
by Anna Drozdz, Caitriona E. McInerney, Kevin M. Prise, Veronica J. Spence and Jose Sousa
Cancers 2024, 16(19), 3263; https://doi.org/10.3390/cancers16193263 - 25 Sep 2024
Abstract
Novel cancer biomarker discoveries are driven by the application of omics technologies. The vast quantity of high-dimensional data necessitates feature selection. The mathematical basis of different selection methods varies considerably, which may influence subsequent inferences. In this study, feature selection and classification methods were employed to identify six signature gene sets of grade 2 and 3 astrocytoma samples from the Rembrandt repository. Subsequently, the impact of these variables on classification and further discovery of biological patterns was analysed. Principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and hierarchical clustering revealed that the data set (10,096 genes) exhibited a high degree of noise, feature redundancy, and a lack of distinct patterns. The application of feature selection methods reduced the number of genes to between 28 and 128. Notably, no single gene was selected by all of the methods tested. Selection led to an increase in classification accuracy and noise reduction. Significant differences in the Gene Ontology terms were discovered, with only 13 terms overlapping. One selection method did not result in any enriched terms. KEGG pathway analysis revealed only one pathway in common (cell cycle), while two methods did not yield any enriched pathways. The results demonstrated a significant difference in outcomes when classification-type algorithms were utilised in comparison to mixed types (selection and classification). This may result in the inadvertent omission of biological phenomena, while simultaneously achieving enhanced classification outcomes.
(This article belongs to the Section Cancer Biomarkers)
Figure 1. A schematic illustrating the workflow of the experiments, including: retrieval of data from the REMBRANDT repository, data pre-processing, data inspection, and the two branches of experiments comparing six different methods for feature selection and classification. In the first branch of experiments, DGE, STIR and Boruta methods were used to select signature gene sets, which were further analysed using three classification models: logistic regression, KNN and SVM. In the second set of experiments, RF, LASSO regression and CACTUS methods were used for signature gene selection and subsequent classification.
Figure 2. Initial evaluation of the transcriptional profiles of grade 2 and 3 astrocytoma samples examined using PCA (A), UMAP (B) and hierarchical clustering (C) with expression levels of the genes displayed as a z-score on a scale from high (red) to low (blue) values. Concordance between the signature gene sets identified by each of the models: DGE, Boruta, RF, STIR, CACTUS, and LASSO (D).
Figure 3. Volcano plots comparing the distribution of up- and down-regulated genes within signature gene sets selected by the following methods: DGE (A), Boruta (B), STIR (C), LASSO (D), RF (E) and CACTUS (F). Purple points indicate genes selected by a given algorithm. The x-axis displays $\mathrm{Log}_2\mathrm{FC}$ in gene expression and the y-axis displays the log odds of a gene being differentially expressed with the adjusted p-value. The dashed lines indicate the significance thresholds for $\mathrm{Log}_2\mathrm{FC}$ and for the adjusted p-values used for DGE.
Figure 4. Heatmaps comparing the hierarchical clustering patterns and gene expression in grade 2 and 3 astrocytoma samples from the signature gene sets selected by the following methods: DGE (A), Boruta (B), STIR (C), LASSO (D), RF (E) and CACTUS (F). Expression levels of the selected genes are displayed as a z-score on a scale from high (red) to low (blue) values.
Figure 5. Comparison of the total number of significantly enriched GO terms identified for the methods: DGE (A), Boruta (B), STIR (C), RF (D), and CACTUS (E). The top 15 most significant GO terms are plotted and the number of genes identified for that GO term is indicated as the Count on the x-axis. The colour bar indicates the significance of the enrichment for that GO term as an adjusted p-value. (F) Overlap in the enriched GO terms identified by the five methods and the total number identified by each method. No enriched GO terms were found for genes selected with LASSO regression.
Figure 6. Comparison of the significant KEGG pathways identified from the signature genes of the models DGE (A), STIR (B), Boruta (C), and CACTUS (D). Overlap in the KEGG pathways identified from the signature gene sets of all models (E). Overlap in the genes identified for the cell cycle pathway (F).
Figure 7. The cell cycle pathway visualisation. Genes marked in green belong to any of the four signature gene sets with significant enrichment of the cell cycle pathway; the colour gradient indicates each gene's $\log_2\mathrm{FC}$.
Figure 8. Results of the survival analysis comparing grade 2 and grade 3 astrocytoma where time estimated is in months (A). Results of the univariate Cox proportional hazard analysis of selected gene sets. The forest plot displays the hazard ratio and 95% confidence interval (CI) of months of survival after diagnosis as a function of gene (B). Results of the multivariate Cox proportional hazards model examining astrocytoma grade and gene interaction (C). Overlap between the genes identified to be significant for modifying the HR from the univariate (D) and multivariate (E) Cox model analysis of the signature gene sets.
10 pages, 3467 KiB  
Article
Comprehensive Data Augmentation Approach Using WGAN-GP and UMAP for Enhancing Alzheimer’s Disease Diagnosis
by Emi Yuda, Tomoki Ando, Itaru Kaneko, Yutaka Yoshida and Daisuke Hirahara
Electronics 2024, 13(18), 3671; https://doi.org/10.3390/electronics13183671 - 16 Sep 2024
Abstract
In this study, the Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) was used for data augmentation to improve the diagnosis of Alzheimer’s disease from medical imaging, using an Alzheimer’s disease image dataset with four diagnostic classes. The original dataset, the augmented dataset and the combined data were mapped using Uniform Manifold Approximation and Projection (UMAP) in both 2D and 3D space. The same combined interaction network analysis was then performed on the test data. The results showed that the test accuracy was 30.46% for the original (unbalanced) dataset, whereas for the WGAN-GP augmented (balanced) dataset it improved to 56.84%, indicating that WGAN-GP augmentation can effectively address the class imbalance problem.
Figure 1. Flow chart of data preprocessing.
Figure 2. 2D UMAP visualization: (a) original plot; (b) GAN plot; and (c) both.
Figure 3. 3D UMAP visualization: (a) original plot; (b) GAN plot; and (c) both. The colors of the plot points are the same as in Figure 2.
Figure 4. Heatmap showing actual vs. predicted classes: (a) for the original dataset with diseases classified; and (b) for the dataset augmented with the WGAN-GP, an advanced generative adversarial network technique.
14 pages, 3226 KiB  
Article
Identification of Beef Odors under Different Storage Day and Processing Temperature Conditions Using an Odor Sensing System
by Yuanchang Liu, Nan Peng, Jinlong Kang, Takeshi Onodera and Rui Yatabe
Sensors 2024, 24(17), 5590; https://doi.org/10.3390/s24175590 - 29 Aug 2024
Viewed by 292
Abstract
This study used an odor sensing system with a 16-channel electrochemical sensor array to measure beef odors, aiming to distinguish odors under different storage days and processing temperatures for quality monitoring. Six storage days ranged from purchase (D0) to eight days (D8), with three temperature conditions: no heat (RT), boiling (100 °C), and frying (180 °C). Gas chromatography–mass spectrometry (GC-MS) analysis showed that odorants in the beef varied under different conditions. Compounds like acetoin and 1-hexanol changed significantly with the storage days, while pyrazines and furans were more detectable at higher temperatures. The odor sensing system data were visualized using principal component analysis (PCA) and uniform manifold approximation and projection (UMAP). PCA and unsupervised UMAP clustered beef odors by storage days but struggled with the processing temperatures. Supervised UMAP accurately clustered different temperatures and dates. Machine learning analysis using six classifiers, including support vector machine, achieved 57% accuracy for PCA-reduced data, while unsupervised UMAP reached 49.1% accuracy. Supervised UMAP significantly enhanced the classification accuracy, achieving over 99.5% when the data were reduced to three or more dimensions. Results suggest that the odor sensing system can sufficiently enhance non-destructive beef quality and safety monitoring. This research advances electronic nose applications and explores dimensionality reduction techniques, providing valuable insights for future studies.
(This article belongs to the Special Issue Electronic Nose and Artificial Olfaction)
Figure 1. Experimental system of the odor sensing system.
Figure 2. Odor sensing system 16-channel sensor array responses of a beef sample. (A) Apiezon L; (B) SE-30; (C) OV-1; (D) DC-11; (E) SE-52; (F) OV-3; (G) DC-550; (H) DC-710; (I) OV-17; (J) Tween 80; (K) OV-210; (L) Siponate DS-10; (M) PEG 1000; (N) PEG 600; (O) OV-275; (P) PEG 2000.
Figure 3. PCA dimensionality reduction results of beef odor measured by odor sensing system.
Figure 4. Unsupervised UMAP dimensionality reduction results of beef odor measured by odor sensing system.
Figure 5. Supervised UMAP dimensionality reduction results of beef odor measured by odor sensing system.
20 pages, 2302 KiB  
Article
Non-Intrusive Load Monitoring Based on Dimensionality Reduction and Adapted Spatial Clustering
by Xu Zhang, Jun Zhou, Chunguang Lu, Lei Song, Fanyu Meng and Xianbo Wang
Energies 2024, 17(17), 4303; https://doi.org/10.3390/en17174303 - 28 Aug 2024
Viewed by 234
Abstract
Non-intrusive load monitoring (NILM) deduces changes in energy consumption patterns and operational statuses of electrical equipment from power signals in the feed line. With the emergence of fine-grained power load distribution, the importance of utilizing this technology for implementing demand-side energy management in smart grid development has become increasingly prominent. To address the issue of low load identification accuracy stemming from complex and diverse load types, this paper introduces a NILM method based on uniform manifold approximation and projection (UMAP) dimensionality reduction and enhanced density-based spatial clustering of applications with noise (DBSCAN). Firstly, this paper combines the characteristics of user load under transient and steady-state conditions and selects data with significant differences to construct a load-characteristic database. Additionally, UMAP is employed to reduce the dimensionality of high-dimensional load features and rebuild a load feature database. Subsequently, DBSCAN is utilized to categorize typical user loads, followed by a correlation analysis with the load-characteristic database to determine the types or classes of loads that involve switching actions. Finally, this paper simulates and analyzes the proposed method using the electricity consumption data of industrial users from the CER–Electricity–Data dataset. It identifies the electricity load data commonly utilized by users in a specific area of Zhejiang Province in China. The experimental results indicate that the accuracy of the proposed non-intrusive load identification method reaches 95%. Compared to the wavelet transform, decision tree, and backpropagation network methods, the improvement is approximately 5%.
(This article belongs to the Section F1: Electrical Power System)
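The clustering step described in this abstract can be illustrated with a minimal sketch of standard DBSCAN in plain Python. This is not the authors' enhanced variant, and the 2-D points, `eps` and `min_pts` values are hypothetical stand-ins for UMAP-reduced load features.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Plain DBSCAN: returns one label per point, -1 meaning noise."""
    labels = [None] * len(points)          # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE              # may later be claimed as a border point
            continue
        cluster += 1                       # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point: reachable but not core
            if labels[j] is not None:
                continue                   # already assigned to a cluster
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also a core point: keep expanding
    return labels

# Hypothetical reduced features: two dense groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
          (10.0, 0.0)]
labels = dbscan(points, eps=0.3, min_pts=3)
print(labels)
```

In this toy setting the two dense groups receive separate cluster labels and the isolated point is marked as noise, which is the property that makes density-based clustering attractive for separating typical loads from outliers.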
Figure 1. Industrial user load diagram.
Figure 2. Flow chart of DBSCAN feature clustering method.
Figure 3. Flow chart of the NILM method based on UMAP dimensionality reduction and spatial density clustering.
Figure 4. Dimensionality reduction results.
Figure 5. DBSCAN cluster analysis of load characteristics.
Figure 6. DBSCAN cluster analysis of electricity load data.
23 pages, 1445 KiB  
Article
Dynamic Edge-Based High-Dimensional Data Aggregation with Differential Privacy
by Qian Chen, Zhiwei Ni, Xuhui Zhu, Moli Lyu, Wentao Liu and Pingfan Xia
Electronics 2024, 13(16), 3346; https://doi.org/10.3390/electronics13163346 - 22 Aug 2024
Viewed by 401
Abstract
Edge computing enables efficient data aggregation for services like data sharing and analysis in distributed IoT applications. However, uploading dynamic high-dimensional data to an edge server for efficient aggregation is challenging. Additionally, uploading such data directly carries a significant risk of privacy leakage. Therefore, we propose an edge-based differential privacy data aggregation method leveraging progressive UMAP with a dynamic time window based on LSTM (EDP-PUDL). Firstly, a model of the dynamic time window based on a long short-term memory (LSTM) network was developed to divide dynamic data. Then, progressive uniform manifold approximation and projection (UMAP) with differential privacy was performed to reduce the dimension of the window data while preserving privacy. The privacy budget used when adding DP noise was determined by the data volume and the attribute’s Shapley value. Finally, the privacy analysis and experimental comparisons demonstrated that EDP-PUDL ensures user privacy while achieving superior aggregation efficiency and availability compared to other algorithms used for dynamic high-dimensional data aggregation.
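The abstract does not say which noise mechanism EDP-PUDL uses, so the following is only a sketch of the standard Laplace mechanism under an assumed budget split. The proportional `split_budget` helper and the weights (hypothetical stand-ins for Shapley-based importance) are illustrative assumptions, not the authors' allocation rule.

```python
import random

def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two i.i.d.
    exponential variables, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def split_budget(total_epsilon, weights):
    """Hypothetical rule: share a total privacy budget in proportion to weights."""
    total = sum(weights)
    return [total_epsilon * w / total for w in weights]

def privatize(values, sensitivities, epsilons, rng):
    """Laplace mechanism: add noise with scale sensitivity / epsilon per value."""
    return [v + laplace_noise(s / e, rng)
            for v, s, e in zip(values, sensitivities, epsilons)]

rng = random.Random(0)                           # seeded only for repeatability
epsilons = split_budget(1.0, [0.5, 0.3, 0.2])    # assumed attribute weights
noisy = privatize([10.0, 20.0, 30.0], [1.0, 1.0, 1.0], epsilons, rng)
print(epsilons, noisy)
```

A smaller epsilon for an attribute gives a larger noise scale, so under this assumed split the attributes with lower weights are perturbed more strongly.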
Figure 1. IoT data aggregation based on edge computing.
Figure 2. The EDP-PUDL process.
Figure 3. The structures of LSTM cell and stacked LSTM model.
Figure 4. An example of UE selection.
Figure 5. Variance in Adult dataset.
Figure 6. TAVD with various privacy budgets on the Adult and TPC-E datasets.
Figure 7. TL2Error with various privacy budgets on the ACS and NLTCS datasets.
Figure 8. TAVD with various initial window sizes on the Adult and TPC-E datasets.
Figure 9. TL2Error with various initial window sizes on the ACS and NLTCS datasets.
Figure 10. SVM misclassification rates on the Adult and TPC-E datasets.
Figure 11. SVM misclassification rates on the ACS and NLTCS datasets.
19 pages, 10716 KiB  
Article
Crop Water Status Analysis from Complex Agricultural Data Using UMAP-Based Local Biplot
by Jenniffer Carolina Triana-Martinez, Andrés Marino Álvarez-Meza, Julian Gil-González, Tom De Swaef and Jose A. Fernandez-Gallego
Remote Sens. 2024, 16(15), 2854; https://doi.org/10.3390/rs16152854 - 4 Aug 2024
Viewed by 679
Abstract
To optimize growth and management, precision agriculture relies on a deep understanding of agricultural dynamics, particularly crop water status analysis. Leveraging unmanned aerial vehicles, we can efficiently acquire high-resolution spatiotemporal samples by utilizing remote sensors. However, non-linear relationships among data features, localized within specific subgroups, frequently emerge in agricultural data. Interpreting these complex patterns requires sophisticated analysis due to the presence of noise, high variability, and non-stationary behavior in the collected samples. Here, we introduce Local Biplot, a methodological framework tailored for discerning meaningful data patterns in non-stationary contexts for precision agriculture. Local Biplot relies on the well-known uniform manifold approximation and projection (UMAP) method and local affine transformations to codify non-stationary and non-linear data patterns while maintaining interpretability. This lets us find important clusters for transformation and projection within a single global axis pair. Hence, our framework encompasses variable and observational contributions within individual clusters. At the same time, we provide a relevance analysis strategy to help explain why those clusters exist, facilitating the understanding of data dynamics while favoring interpretability. We demonstrated our method’s capabilities through experiments on both synthetic and real-world datasets, covering scenarios involving grass and rice crops. Moreover, we use random forest and linear regression models to predict water status variables from our Local Biplot-based feature ranking and clusters. Our findings revealed enhanced clustering and prediction capability while emphasizing the importance of input features in precision agriculture. As a result, Local Biplot is a useful tool to visualize, analyze, and compare the intricate underlying patterns and internal structures of complex agricultural datasets.
(This article belongs to the Special Issue Application of Satellite and UAV Data in Precision Agriculture)
Figure 1. Local Biplot sketch. Dotted line: cluster-based operation.
Figure 2. RiceClimaRemote dataset site’s location. (a) Colombia; (b) Tolima department; (c) Espinal and experimental field with RGB raster.
Figure 3. Multivariate Gaussians dataset visual inspection results. (Left): SVD-based biplot. (Right): Local Biplot (ours). Gray arrows depict each feature in the dataset (f1–f5), which shed light on their correlations. Examining the scatter points and their colors allows us to visually understand sample distributions. PC stands for principal component (basis).
Figure 4. Multivariate Gaussians: Pearson correlation results. First row: a panel of absolute feature correlation matrices showcases values for both the complete database and each cluster using the SVD-based biplot. The second row displays cluster-specific absolute correlations from our Local Biplot.
Figure 5. Forage Grasses biplots. Left: SVD-based biplot. Middle: Local Biplot. Right: Local Biplot and cluster-based probability boundaries. The colors in the left and middle plots represent the clustering label. The right plot’s color emphasizes the target variable (breeding score), while the flight dates (TF1, TF2, and TF3) determine the color of the curves. PC: principal component (basis).
Figure 6. Forage Grasses Pearson correlation results: SVD-based biplot. We show the absolute correlation between the VIs and the breeding score (target) for each species individually and collectively. We also establish correlations for each cluster separately and throughout the dataset.
Figure 7. Forage Grasses Pearson correlation results: Local Biplot (ours). We show the absolute correlation between the VIs and the breeding score (target) for each species individually and collectively. We also establish correlations for each cluster separately and throughout the dataset.
Figure 8. Forage Grasses feature relevance analysis. SVD-based biplot and Local Biplot (ours) normalized feature relevance are presented. We also show the LR and RF regressor weights. The bar color in the second column stands for the Local Biplot clusters labels (see Figure 5).
Figure 9. RiceClimaRemote biplot results. Left: SVD-based biplot. Middle: Local Biplot. Right: Local Biplot and cluster-based probability boundaries. The colors in the left and middle plots represent the clustering label. The right plot’s color emphasizes the target variable (CWC), while the growth stages of rice (vegetative, reproductive, and ripening) determine the color of the curves. PC: principal component (basis).
Figure 10. RiceClimaRemote Pearson correlation results: SVD-based biplot. For each irrigation treatment, we show the absolute correlation (cluster- and entire-data-based) between the selected feature and the CWC (target), both individually and collectively.
Figure 11. RiceClimaRemote Pearson correlation results: Local Biplot (ours). For each irrigation treatment, we show the absolute correlation (cluster- and entire-data-based) between the selected feature and the CWC (target), both individually and collectively.
Figure 12. RiceClimaRemote feature relevance analysis. SVD-based biplot and Local Biplot (ours) normalized feature relevance are presented. We also present the LR and RF regressor weights. The bar color in the second column stands for the Local Biplot clusters labels (see Figure 9).
21 pages, 4306 KiB  
Article
Comparative Analysis of Manifold Learning-Based Dimension Reduction Methods: A Mathematical Perspective
by Wenting Yi, Siqi Bu, Hiu-Hung Lee and Chun-Hung Chan
Mathematics 2024, 12(15), 2388; https://doi.org/10.3390/math12152388 - 31 Jul 2024
Viewed by 651
Abstract
Manifold learning-based approaches have emerged as prominent techniques for dimensionality reduction. Among these methods, t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) stand out as two of the most widely used and effective approaches. While both methods share similar underlying procedures, empirical observations indicate two distinctive properties: global data structure preservation and computational efficiency. However, the underlying mathematical principles behind these distinctions remain elusive. To address this gap, this study presents a comparative analysis of the subprocesses involved in these methods, aiming to elucidate the mathematical mechanisms underlying the observed distinctions. By meticulously examining the equation formulations, the mathematical mechanisms contributing to global data structure preservation and computational efficiency are elucidated. To validate the theoretical analysis, data are collected through a laboratory experiment, and an open-source dataset is utilized for validation across different datasets. The consistent alignment of results obtained from both balanced and unbalanced datasets robustly confirms the study’s findings. The insights gained from this study provide a deeper understanding of the mathematical underpinnings of t-SNE and UMAP, enabling more informed and effective use of these dimensionality reduction techniques in various applications, such as anomaly detection, natural language processing, and bioinformatics.
(This article belongs to the Section Mathematics and Computer Science)
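One place where the equation forms of the two methods differ is the low-dimensional similarity kernel. The sketch below compares the standard unnormalised t-SNE kernel, 1 / (1 + d^2), with the standard UMAP kernel, 1 / (1 + a * d^(2b)). The values of `A` and `B` are illustrative choices for this sketch, and whether they correspond to the parameters studied in the article is an assumption.

```python
def tsne_kernel(d):
    """Unnormalised t-SNE low-dimensional similarity (Student-t, one degree of freedom)."""
    return 1.0 / (1.0 + d ** 2)

def umap_kernel(d, a, b):
    """UMAP low-dimensional similarity, 1 / (1 + a * d^(2b))."""
    return 1.0 / (1.0 + a * d ** (2.0 * b))

A, B = 1.6, 0.9   # illustrative shape parameters, not fitted values
distances = [0.0, 0.5, 1.0, 2.0, 4.0]
rows = [(d, tsne_kernel(d), umap_kernel(d, A, B)) for d in distances]

for d, t, u in rows:
    print(f"d={d:.1f}  t-SNE={t:.3f}  UMAP={u:.3f}")
```

Setting `a = 1` and `b = 1` makes the UMAP kernel coincide with the t-SNE kernel, which shows that the two parameters control how the UMAP curve departs from the Student-t shape.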
Figure 1. Illustration of high-dimensional data and low-dimensional data with manifold learning (ML) and non-manifold learning (non-ML) techniques. Four colors represent four different clusters.
Figure 2. Illustration of global and local structures in high dimension and low dimension with t-SNE and UMAP. Four colors represent four different clusters.
Figure 3. The general algorithm structure of manifold learning techniques.
Figure 4. The impact of the parameters $\alpha$ and $\beta$ on similarity scores in low dimension.
Figure 5. The curve-fitting results with different values of min_dist.
Figure 6. Comparison of the loss functions of t-SNE and UMAP.
Figure 7. The impact of different $\sigma$ on the high-dimensional probability.
Figure 8. Laboratory experimental test for data collection.
22 pages, 16238 KiB  
Article
Spectroscopic Phenological Characterization of Mangrove Communities
by Christopher Small and Daniel Sousa
Remote Sens. 2024, 16(15), 2796; https://doi.org/10.3390/rs16152796 - 30 Jul 2024
Viewed by 473
Abstract
Spaceborne spectroscopic imaging offers the potential to improve our understanding of biodiversity and ecosystem services, particularly for challenging and rich environments like mangroves. Understanding the signals present in large volumes of high-dimensional spectroscopic observations of vegetation communities requires the characterization of seasonal phenology and response to environmental conditions. This analysis leverages both spectroscopic and phenological information to characterize vegetation communities in the Sundarban riverine mangrove forest of the Ganges–Brahmaputra delta. Parallel analyses of surface reflectance spectra from NASA’s EMIT imaging spectrometer and MODIS vegetation abundance time series (2000–2022) reveal the spectroscopic and phenological diversity of the Sundarban mangrove communities. A comparison of spectral and temporal feature spaces rendered with low-order principal components and 3D embeddings from Uniform Manifold Approximation and Projection (UMAP) reveals similar structures with multiple spectral and temporal endmembers and multiple internal amplitude continua for both EMIT reflectance and MODIS Enhanced Vegetation Index (EVI) phenology. The spectral and temporal feature spaces of the Sundarban represent independent observations sharing a common structure that is driven by the physical processes controlling tree canopy spectral properties and their temporal evolution. Spectral and phenological endmembers reside at the peripheries of the mangrove forest with multiple outward gradients in amplitude of reflectance and phenology within the forest. Longitudinal gradients of both phenology and reflectance amplitude coincide with LiDAR-derived gradients in tree canopy height and sub-canopy ground elevation, suggesting the influence of surface hydrology and sediment deposition. 
RGB composite maps of both linear (PC) and nonlinear (UMAP) 3D feature spaces reveal a strong contrast between the phenological and spectroscopic diversity of the eastern Sundarban and the less diverse western Sundarban.
(This article belongs to the Special Issue Remote Sensing of Land Surface Phenology II)
Figure 1
<p>Index map of the Sundarban mangrove forest at the mouths of the Ganges–Brahmaputra delta. The Sentinel 2 false color composite from 2018 shows river channel network and forest canopy cover variations. Note the contrast of the mangrove canopy with the dry season agriculture (bright green), fallow fields (tan), and aquaculture ponds (black) on embanked islands surrounding the forest. The contrast in mangrove reflectance between the eastern and western tiles is a BRDF effect due to the contrasting view geometries at the opposite edges of adjacent Sentinel 2 swaths. GPS tracks (white) show the extents of boat-based field surveys. The east–west scale is 185 km.</p>
Figure 2
<p>Sundarban EMIT mosaic. Three swaths provide full coverage of the mangroves and surrounding agriculture and aquaculture, with significant swath overlap. Acquisition dates span nearly the full annual phenological cycle from post-monsoon in December to pre-monsoon in April. Solar zenith angles at times of acquisition range from 11.5° (04.24) to 50.5° (12.27). The white vector boundary shows the extent of the Bangladesh Sundarban, for which tree species maps are available. Acquisition times are given as UTC + 6 h.</p>
Figure 3
<p>Spectral feature space of the vegetation-masked Sundarban EMIT mosaic. Orthogonal projections of three low-order principal components (PCs) reveal spectral endmembers (labeled) and distinct amplitude gradients (vectors) within three clusters. Varying amplitude reflectance spectra (right) correspond to vector continua of the same color (left). While the reflectance spectra of the clusters overlap at VNIR wavelengths, each is distinct in the SWIR. The feature space spans amplitude continua with both agricultural (Ag) and forest (F) components. The agricultural continuum arises from the varying abundance of photosynthetic (PV) and non-photosynthetic (NPV) vegetation, but the amplitude of mangrove reflectance is modulated primarily by the canopy structure and varying amounts of crown shadow. There is a considerable overlap in EVI range between adjacent gradients, but a negligible overlap in NDWI range.</p>
Figure 4
<p>Complementary spectral feature spaces of the vegetation-masked Sundarban EMIT mosaic. Orthogonal projections of two 3D UMAP embeddings (nn: 50 and 100) reveal consistent spectral endmembers (labeled) and distinct reflectance amplitude gradients (vectors) within three clusters corresponding to those in <a href="#remotesensing-16-02796-f003" class="html-fig">Figure 3</a>. In both UMAP 3/2 projections (right), the western continuum (yellow) is distinct from the connected eastern and central continua (cyan and magenta). Note the bifurcation of the high-amplitude end of the eastern continuum into the northern (N) and southern (S) peripheries of the mangroves. As in the PC feature space, the surrounding agriculture forms a separate 2D continuum spanning photosynthetic and non-photosynthetic vegetation.</p>
Figure 5
<p>PC and UMAP feature space composites. Reflectance amplitude gradients and inset spectra correspond to those in <a href="#remotesensing-16-02796-f003" class="html-fig">Figure 3</a> and <a href="#remotesensing-16-02796-f004" class="html-fig">Figure 4</a>. Gradient vectors indicate a direction of increasing NIR reflectance. Composite colors are determined by 3D feature space topology in PC and UMAP spaces. The swath edge discontinuity in the center contrasts post-monsoon (west) from dry season (east) reflectance and longitudinal gradients in species composition and environmental conditions.</p>
Figure 6
<p>Bitemporal reflectance change in swath overlap. PCs of the bitemporal reflectance space show a continuum bounded by three endmember reflectance changes for mangrove forest and one for dry season agriculture (Ag). For each forest change endmember, SWIR liquid water absorptions are deeper in December, following the monsoon, but significantly reduced by the April dry season. In contrast, the chlorophyll absorptions in the visible change little. Coherent spatial patterns in bitemporal PC composite suggest aggregate responses to solar illumination (θ) and SWIR water absorption.</p>
Figure 7
<p>Temporal feature space of MODIS EVI time series for the entire Sundarban mangrove forest, 2000–2022. The UMAP composite (<b>upper left</b>) shows both N–S and E–W gradients in seasonal phenology, as well as several abrupt transitions. The 1/3 projection of the 3D UMAP feature space (<b>upper center</b>) has a single root (11) corresponding to lower EVI mixtures of canopy, water, and shadow at riverbanks and narrow channels within the mangrove. As EVI increases with canopy closure, the root diverges (10) into mixing trends, terminating at 9 distinct temporal endmembers. EVI time series (<b>bottom</b>) increase abruptly during the summer monsoon, and then decrease gradually over the rest of the year. The 9 endmembers correspond to peripheral regions of the Sundarban (<b>upper left</b>) with the highest post-monsoon EVI. The correlation matrix (<b>upper right</b>) of all 11 endmembers shows the highest correlations between geographically adjacent endmembers. The 2 lowest- (10, 11) and 2 highest (1, 9)-amplitude endmembers are less intercorrelated than those at the southern and eastern peripheries (2–8). Also apparent in the UMAP composite is the distinction between the phenological diversity of the eastern Sundarban and the more homogeneous center in the west.</p>
Figure 8
<p>Sundarban temporal endmember phenologies from the temporal feature space in <a href="#remotesensing-16-02796-f007" class="html-fig">Figure 7</a>. The mean EVI (white) shows the rapid post-monsoon greening and gradual dry season senescence, while mean-removed residuals (color) of individual endmembers (offset for clarity) show a diversity of periodic excursions from the mean. As seen in <a href="#remotesensing-16-02796-f007" class="html-fig">Figure 7</a>, all endmembers are phase-aligned and differ primarily in the rate and amplitude of dry season EVI decrease. Despite the considerable noise, distinct annual periodicity is apparent in all but the lowest-amplitude (e.g., 2, 4, 7) residuals—which are most similar to the mean. The largest-amplitude residuals are those from the root and branch (10, 11) of the feature space, corresponding to lower EVI associated with partial canopy cover on shorelines and small channels. Note the slight decadal increase in minimum EVI of the mean.</p>
Figure 9
<p>EMIT reflectance mosaic UMAP composite and elevation maps for the Sundarban and surrounding delta. GEDI LiDAR maps (center) reveal that ongoing sediment deposition in the mangrove forest results in 1–2 m higher ground elevation in the eastern Sundarban relative to the surrounding embanked islands, which have been sediment-starved for decades. The higher SRTM elevation of the eastern Sundarban is a result of both the higher ground elevation and the greater canopy height of the tree species. Mono-species epicenters, from a Bangladesh Forest Department species map (2002), are labeled by common (local) names of tree species. Bi-species gradients compose most of the eastern Sundarban. Arrows show two SRTM swath discontinuity artifacts, which are distinct from the numerous height discontinuities occurring across channels.</p>
Figure 10
<p>Field photos illustrate the forest diversity of the Bangladesh Sundarban. The northeast Sundarban (<b>top</b>) reaches canopy heights of 25 m, in contrast with the surrounding embanked islands, which are often below sea level. The sand-dominant islands of the southeast (<b>upper center</b>) are intertidal only around their peripheries and contain different tree species from the rest of the Sundarban. The vegetation gradient of Bird Island on the Bay of Bengal (<b>lower center</b>) illustrates the succession of grasses, shrubs, and trees that colonize sand-dominant islands. River channel networks (<b>bottom</b>) continually deliver silt and mud to intertidal islands throughout the Sundarban. Photos © C. Small 2012–2022.</p>
Figure A1
<p>Multiscale UMAP temporal feature space with 3D PC(UMAP<sub>10+50</sub>) composite for MODIS EVI phenology. The low-order PCs of two 3D UMAP embeddings with contrasting n neighbor scales (nn: 10 and 50) preserve both the global scale limb structure and the finer scale clusters that are both phenologically and geographically distinct—including anomalous tree species assemblages at Hiron Point (HP) and Shelar Char (SC) on the Bay of Bengal shorelines. Compare the map structure with the maps in <a href="#remotesensing-16-02796-f005" class="html-fig">Figure 5</a> and <a href="#remotesensing-16-02796-f007" class="html-fig">Figure 7</a>. Manifold density and UMAP color scale equivalent to those in <a href="#remotesensing-16-02796-f007" class="html-fig">Figure 7</a>.</p>
Figure A2
<p>Variance partition and sparse component distribution for the MODIS EVI phenology of the Ganges–Brahmaputra delta. The singular values (top) of the low-rank component suggest that the temporal feature space is effectively 4D (&gt;1%) with 96% of the total variance, while the sparse component has a nearly uniform noise floor over all dimensions. The spatial standard deviation (σ) and range (ρ) of the sparse component peak during the monsoon as a result of transient cloud cover.</p>
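The "effectively 4D (>1%)" reading of the singular values can be illustrated with a minimal sketch. It assumes that the variance carried by each dimension is proportional to its squared singular value and that the singular values are sorted in descending order. The function names and the example values are hypothetical and are not taken from the article.

```python
def variance_fractions(singular_values):
    """Fraction of total variance per dimension, assuming variance ~ s**2."""
    squared = [s * s for s in singular_values]
    total = sum(squared)
    return [v / total for v in squared]


def effective_dimension(singular_values, threshold=0.01):
    """Count dimensions whose variance fraction exceeds the threshold (1% here)."""
    return sum(1 for f in variance_fractions(singular_values) if f > threshold)


# Hypothetical singular values in descending order, for illustration only.
example_singular_values = [9.0, 3.0, 2.0, 1.5, 0.5, 0.4]
fractions = variance_fractions(example_singular_values)
n_dims = effective_dimension(example_singular_values)
# Cumulative variance of the retained (leading) dimensions.
retained = sum(fractions[:n_dims])
```

With real data, the singular values would come from the low-rank component of the decomposition; this sketch covers only the thresholding step.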
Figure A3
<p>Coregistered overlap between 12.23 and 12.27 EMIT acquisitions. Natural color composites illustrate the difference in aerosol optical depth with a reduced dynamic range and a greater adjacency effect on 12.27.</p>
Figure A4
<p>Coregistered overlap between 12.27 and 04.24 acquisitions. Natural color composites show the difference in aerosol optical depth with reduced dynamic range and greater adjacency effect on 12.27.</p>
Figure A5
<p>Apparent change in reflectance for overlaps on 12.23, 12.27, and 04.24. The mean (white) ± 1 standard deviation (green) of all vegetation spectra in each swath overlap show the effects of residual atmospheric scattering on 12.27 and actual changes in illumination and leaf water content on 04.24. The December 23 and 27 difference suggests short wavelength-dependent scattering on 12.27 with increased visible and reduced NIR but negligible change in SWIR wavelengths. In contrast, a greater VNIR scatter on 12.27 is manifested as a reduced visible and increased NIR scatter relative to 04.24. The reduced leaf water content on 04.24 results in greater SWIR residual from reduced H<sub>2</sub>O absorption after the 4-month dry season. A higher solar elevation in April also contributes to higher NIR and SWIR reflectance.</p>
15 pages, 6283 KiB  
Article
Precision Detection of Salt Stress in Soybean Seedlings Based on Deep Learning and Chlorophyll Fluorescence Imaging
by Yixin Deng, Nan Xin, Longgang Zhao, Hongtao Shi, Limiao Deng, Zhongzhi Han and Guangxia Wu
Plants 2024, 13(15), 2089; https://doi.org/10.3390/plants13152089 - 27 Jul 2024
Viewed by 579
Abstract
Soil salinization poses a critical challenge to global food security, impacting plant growth, development, and crop yield. This study investigates the efficacy of deep learning techniques alongside chlorophyll fluorescence (ChlF) imaging technology for discerning varying levels of salt stress in soybean seedlings. Traditional methods for stress identification in plants are often laborious and time-intensive, prompting the exploration of more efficient approaches. A total of six classic convolutional neural network (CNN) models—AlexNet, GoogLeNet, ResNet50, ShuffleNet, SqueezeNet, and MobileNetv2—are evaluated for salt stress recognition based on three types of ChlF images. Results indicate that ResNet50 outperforms other models in classifying salt stress levels across three types of ChlF images. Furthermore, feature fusion after extracting three types of ChlF image features in the average pooling layer of ResNet50 significantly enhanced classification accuracy, with the highest accuracy, 98.61%, achieved when fusing features from all three types of ChlF images. UMAP dimensionality reduction analysis confirms the discriminative power of fused features in distinguishing salt stress levels. These findings underscore the efficacy of deep learning and ChlF imaging technologies in elucidating plant responses to salt stress, offering insights for precision agriculture and crop management. Overall, this study demonstrates the potential of integrating deep learning with ChlF imaging for precise and efficient crop stress detection, offering a robust tool for advancing precision agriculture. The findings contribute to enhancing agricultural sustainability and addressing global food security challenges by enabling more effective crop stress management.
(This article belongs to the Special Issue Practical Applications of Chlorophyll Fluorescence Measurements)
Figure 1
<p>Workflow of using chlorophyll fluorescence imaging to analyze and detect various concentrations of salt stress in soybean seedlings. ChlF: Chlorophyll fluorescence; CNN: convolutional neural network; SVM: support vector machine.</p>
Figure 2
<p>Pseudo-color images of Fv/Fm, Y(NO), and Inh parameters of soybean seedlings under different salt concentration stresses.</p>
Figure 3
<p>Soybean seedling images and leaf SPAD values under different salt concentrations. (<b>a</b>–<b>e</b>): Phenotypic images of soybean seedlings treated with 0, 50, 100, 150, and 200 mM NaCl, respectively. (<b>f</b>): SPAD values measured from the leaves of soybean seedlings under the different salt stress treatments. Values shown are means ± SD of three measurements taken on all leaves per treatment. The lowercase letters in the bar charts represent significant differences between the indicated groups as tested with a one-way analysis of variance (ANOVA, <span class="html-italic">p</span> &lt; 0.05).</p>
Figure 4
<p>Comparison of validation and testing accuracies among six different CNN models. The red asterisk indicates the highest accuracy.</p>
Figure 5
<p>Confusion matrix of different modeling algorithms. The different shades of green represent varying levels of classification accuracy, with darker shades indicating higher accuracy and lighter shades indicating lower accuracy.</p>
Figure 6
<p>Scatter plot of the test features after 2D dimensionality reduction using UMAP.</p>
43 pages, 8643 KiB  
Article
Diffusion on PCA-UMAP Manifold: The Impact of Data Structure Preservation to Denoise High-Dimensional Single-Cell RNA Sequencing Data
by Padron-Manrique Cristian, Vázquez-Jiménez Aarón, Esquivel-Hernandez Diego Armando, Martinez-Lopez Yoscelina Estrella, Neri-Rosario Daniel, Giron-Villalobos David, Mixcoha Edgar, Sánchez-Castañeda Jean Paul and Resendis-Antonio Osbaldo
Biology 2024, 13(7), 512; https://doi.org/10.3390/biology13070512 - 9 Jul 2024
Cited by 1 | Viewed by 887
Abstract
Single-cell transcriptomics (scRNA-seq) is revolutionizing biological research, yet it faces challenges such as inefficient transcript capture and noise. To address these challenges, methods like neighbor averaging or graph diffusion are used. These methods often rely on k-nearest neighbor graphs from low-dimensional manifolds. However, scRNA-seq data suffer from the ‘curse of dimensionality’, leading to the over-smoothing of data when using imputation methods. To overcome this, sc-PHENIX employs a PCA-UMAP diffusion method, which enhances the preservation of data structures and allows for a refined use of PCA dimensions and diffusion parameters (e.g., k-nearest neighbors, exponentiation of the Markov matrix) to minimize noise introduction. This approach enables a more accurate construction of the exponentiated Markov matrix (cell neighborhood graph), surpassing methods like MAGIC. sc-PHENIX significantly mitigates over-smoothing, as validated through various scRNA-seq datasets, demonstrating improved cell phenotype representation. Applied to a multicellular tumor spheroid dataset, sc-PHENIX identified known extreme phenotype states, showcasing its effectiveness. sc-PHENIX is open-source and available for use and modification.
(This article belongs to the Special Issue Machine Learning Applications in Biology)
Graphical abstract
Figure 1
<p>The imputation process using sc-PHENIX. The sc-PHENIX imputation approach for scRNA-seq data consists of two main steps: (<b>A</b>) The construction of the distance matrix (<math display="inline"><semantics> <mrow> <msub> <mrow> <mi>D</mi> </mrow> <mrow> <mi>D</mi> <mi>i</mi> <mi>s</mi> <mi>t</mi> </mrow> </msub> </mrow> </semantics></math>): sc-PHENIX is characterized by applying PCA and then UMAP (PCA-UMAP). In this PCA-UMAP multidimensional space, sc-PHENIX constructs the best denoised representation of cell distance measurements for the diffusion process to preserve data structures. (<b>B</b>) The diffusion maps for imputation: the imputation process using diffusion maps consists of several steps: (i) Construction of the Markov transition matrix <b><span class="html-italic">M</span></b> from <math display="inline"><semantics> <mrow> <msub> <mrow> <mi>D</mi> </mrow> <mrow> <mi>D</mi> <mi>i</mi> <mi>s</mi> <mi>t</mi> </mrow> </msub> </mrow> </semantics></math>: sc-PHENIX uses the adaptive Gaussian kernel to generate a non-symmetric affinity matrix (<math display="inline"><semantics> <mrow> <msub> <mrow> <mi>A</mi> </mrow> <mrow> <mi>n</mi> <mi>o</mi> <mi>n</mi> <mo>−</mo> <mi>s</mi> <mi>i</mi> <mi>m</mi> </mrow> </msub> </mrow> </semantics></math>), which is then symmetrized and normalized to generate <b><span class="html-italic">M</span></b>. (ii) Diffusion process: <b><span class="html-italic">M</span></b> is exponentiated to a chosen power <span class="html-italic">t</span> (random walk of length <span class="html-italic">t</span> named “diffusion time”) to obtain the exponentiated Markov matrix (<b><span class="html-italic">M<sup>t</sup></span></b>). The <b><span class="html-italic">M<sup>t</sup></span></b> graph preserves the continuum structure better than the previous steps. 
(iii) Imputation: This step consists of multiplying the exponentiated Markov matrix (<b><span class="html-italic">M<sup>t</sup></span></b>) by the single-cell data matrix <b>D</b> to obtain an imputed and denoised scRNA-seq matrix (<math display="inline"><semantics> <mrow> <msub> <mrow> <mi>D</mi> </mrow> <mrow> <mi>i</mi> <mi>m</mi> <mi>p</mi> <mi>u</mi> <mi>t</mi> <mi>e</mi> <mi>d</mi> </mrow> </msub> </mrow> </semantics></math>). Note: The symbol * used in this figure indicates matrix multiplication for <b><span class="html-italic">M<sup>t</sup></span></b> and <b><span class="html-italic">D</span></b> in a computational formalism, which is equivalent to the formal mathematical notation <b><span class="html-italic">M<sup>t</sup></span></b> ⋅ <b><span class="html-italic">D</span></b>. All equations are described in <a href="#sec2-biology-13-00512" class="html-sec">Section 2</a>. (<b>C</b>) Visualization of the exponentiated Markov matrix: We convert <b><span class="html-italic">M<sup>t</sup></span></b> into a distance matrix (<math display="inline"><semantics> <mrow> <msub> <mrow> <mi>D</mi> </mrow> <mrow> <mi>D</mi> <mi>i</mi> <mi>s</mi> <mi>t</mi> </mrow> </msub> </mrow> </semantics></math>). Then, we apply a multidimensional scaling method to project the data into 2D or 3D. This projection can be used as a heuristic method for quality control of imputation.</p>
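The diffusion steps in this caption (affinity matrix, symmetrization, normalization to a Markov matrix, exponentiation, and multiplication with the data matrix) can be sketched in plain Python. This is a simplified illustration, not the sc-PHENIX implementation. It assumes the low-dimensional coordinates (for example, a PCA-UMAP embedding) are already available, and it uses a plain Gaussian kernel whose bandwidth is the distance to the knn-th nearest neighbor as a stand-in for the adaptive kernel. All function names and the toy data are hypothetical.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix for a list of coordinate tuples."""
    return [[math.dist(p, q) for q in points] for p in points]


def markov_matrix(dist, knn=1):
    """Gaussian affinities with a per-cell bandwidth, symmetrized and row-normalized."""
    n = len(dist)
    affinity = []
    for i in range(n):
        # Bandwidth: distance to the knn-th nearest neighbor (position 0 is the cell itself).
        sigma = sorted(dist[i])[knn] or 1e-12
        affinity.append([math.exp(-((d / sigma) ** 2)) for d in dist[i]])
    # The per-row bandwidth makes the affinity non-symmetric, so add its transpose.
    sym = [[affinity[i][j] + affinity[j][i] for j in range(n)] for i in range(n)]
    # Row normalization turns the symmetric affinities into transition probabilities.
    return [[v / sum(row) for v in row] for row in sym]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def diffuse_and_impute(embedding, data, knn=1, t=3):
    """Return (M^t) * data, where M is built from distances in the embedding."""
    m = markov_matrix(pairwise_distances(embedding), knn)
    m_t = m
    for _ in range(t - 1):
        m_t = matmul(m_t, m)
    return matmul(m_t, data)


# Toy example: two tight pairs of cells; one cell in each pair has a dropout (0.0).
toy_embedding = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
toy_counts = [[4.0, 0.0], [0.0, 0.0], [0.0, 6.0], [0.0, 0.0]]
toy_markov = markov_matrix(pairwise_distances(toy_embedding), knn=1)
toy_imputed = diffuse_and_impute(toy_embedding, toy_counts, knn=1, t=3)
```

Because each row of the Markov matrix sums to one, every imputed value is a weighted average of the observed values of that gene. In the toy data, the dropout in the second cell is filled in from its near neighbor, while the distant pair contributes nothing.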
Figure 2
<p><b>Comparative Analysis of Imputation Methods on Corrupted Microarray Data.</b> We corrupted the data by randomly assigning zeros to 80% of the values and compared the imputed data with the original data. We fixed <span class="html-italic">t</span> = 5 and <span class="html-italic">decay</span> = 15. We used Pearson, Spearman, and R<sup>2</sup> metrics to evaluate the imputed data compared to the original data. (<b>A</b>) Imputation based on <span class="html-italic">knn</span> = 30. (<b>B</b>) sc-PHENIX <span class="html-italic">knn</span> = 30 and MAGIC <span class="html-italic">knn</span> = 5. This comparison aims to identify the optimal scenarios for using MAGIC and sc-PHENIX: a low <span class="html-italic">knn</span> value for MAGIC and a high <span class="html-italic">knn</span> value for sc-PHENIX. (<b>C</b>) sc-PHENIX <span class="html-italic">knn</span> = 30 and MAGIC <span class="html-italic">knn</span> = 30. (<b>D</b>) Comparison of the performance of sc-PHENIX imputation using UMAP without initialization and with PCA-UMAP initialization. (<b>E</b>) 2D-UMAP plots of the microarray data visualizing the gene values from corrupted data, imputed data with sc-PHENIX and MAGIC, and the developmental time. (<b>F</b>) Observation of gene trends along the developmental time (left to right) with original values, imputed values, and points dropped out at 60%. (<b>G</b>) t-SNE and PCA-t-SNE initialization for sc-PHENIX on the 80% corrupted data, with the same metrics used in (<b>A</b>). (<b>H</b>). Explained Variance by PCA Components: The graph shows the explained variance ratio (blue bars) and cumulative explained variance (black dashed line) for the principal components (PCs) of the dataset. The vertical blue line indicates PC 71, where the cumulative explained variance reaches 70% (green dashed line). It demonstrates the number of components required to capture 70% of the total variance in the dataset. 
Note: The dotted lines represent the global mean of the metric for all samples, with orange for sc-PHENIX and blue for MAGIC.</p>
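The evaluation protocol described in this caption (zeroing a fixed fraction of values, then scoring imputed data against the original) can be sketched as follows. This is a minimal illustration with hypothetical function names and toy values. It covers only the Pearson and R<sup>2</sup> metrics, and the R<sup>2</sup> definition used here (one minus the ratio of residual to total sum of squares) is an assumption, since the caption does not specify it.

```python
import random
import statistics


def corrupt(values, fraction=0.8, seed=0):
    """Set a random subset of entries (the given fraction) to zero."""
    rng = random.Random(seed)
    n_drop = round(fraction * len(values))
    dropped = set(rng.sample(range(len(values)), n_drop))
    return [0.0 if i in dropped else v for i, v in enumerate(values)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def r_squared(truth, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_t = statistics.fmean(truth)
    ss_res = sum((a - b) ** 2 for a, b in zip(truth, predicted))
    ss_tot = sum((a - mean_t) ** 2 for a in truth)
    return 1.0 - ss_res / ss_tot


# Toy expression values (all nonzero), for illustration only.
original = [float(i) for i in range(1, 11)]
corrupted = corrupt(original, fraction=0.8, seed=0)
score_pearson = pearson(original, corrupted)
score_r2 = r_squared(original, corrupted)
```

In practice, the scores would be computed between the original data and the output of an imputation method applied to the corrupted data. Scoring the corrupted data directly, as done here, gives a no-imputation baseline.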
Figure 3
<p><b>Implications of the <span class="html-italic">Decay</span> Parameter and PCA Dimensionality on sc-PHENIX and MAGIC.</b> The parameters are the same as in <a href="#biology-13-00512-f002" class="html-fig">Figure 2</a>B, with the only difference being that the <span class="html-italic">decay</span> values increase as the PCA dimensionality increases. Additionally, we fixed <span class="html-italic">t</span> = 5, sc-PHENIX <span class="html-italic">knn</span> = 30, and MAGIC <span class="html-italic">knn</span> = 5. This comparison aims to identify the optimal scenarios for using MAGIC and sc-PHENIX: a low <span class="html-italic">knn</span> value for MAGIC and a high <span class="html-italic">knn</span> value for sc-PHENIX. The dotted lines represent the global mean of the metric for all samples, with orange for sc-PHENIX and blue for MAGIC.</p>
Figure 4
<p><b>MNIST dataset visualization: Multidimensional scaling.</b> (<b>A</b>). Three-dimensional MDS plot of the PCA manifold (500 PCs, <span class="html-italic">knn</span> = 30 and <span class="html-italic">t</span> = 5). (<b>B</b>). Three-dimensional MDS plot of the exponentiated Markov matrix (500 PCs as input, <span class="html-italic">knn</span> = 30 and <span class="html-italic">t</span> = 5). (<b>C</b>). Three-dimensional MDS plot of the exponentiated Markov matrix, <b><span class="html-italic">M</span><sup>t</sup></b> (500 PCs transformed into 60 UMAP components as input, <span class="html-italic">knn</span> = 50, <span class="html-italic">t</span> = 10). (<b>D</b>). Two-dimensional MDS plot of the exponentiated Markov matrix, <b><span class="html-italic">M</span><sup>t</sup></b> (500 PCs as input, <span class="html-italic">knn</span> = 30 and <span class="html-italic">t</span> = 5, MNIST), as MAGIC. (<b>E</b>). Two-dimensional MDS plot of the exponentiated Markov matrix, <b><span class="html-italic">M<sup>t</sup></span></b> (500 PCs transformed into 60 UMAP components as input, <span class="html-italic">knn</span> = 50, <span class="html-italic">t</span> = 10), as sc-PHENIX. (<b>F</b>). One branch of the digit 6 images in the PCA space (subsection of <a href="#biology-13-00512-f002" class="html-fig">Figure 2</a>D). Red lines were drawn to visualize the branch continuum. (<b>G</b>). Three branches of the digit 6 images in the PCA-UMAP space (subsection of <a href="#biology-13-00512-f002" class="html-fig">Figure 2</a>E).</p>
Figure 5
<p><b>Multidimensional scaling visualization of a neuronal scRNA-seq dataset.</b> (<b>A</b>) Two-dimensional MDS plot of the PCA manifold (500 PCs). (<b>B</b>). Two-dimensional UMAP plot (2 UMAP components). (<b>C</b>) Two-dimensional MDS plot of the exponentiated Markov matrix, <b>M<sup>t</sup></b> (500 PCs, <span class="html-italic">knn</span> = 30, <span class="html-italic">t</span> = 5). (<b>D</b>). Two-dimensional MDS plot of the exponentiated Markov matrix, <b>M<sup>t</sup></b> (500 PCs transformed into 60 UMAP components as input, <span class="html-italic">knn</span> = 30, and <span class="html-italic">t</span> = 5) of the adult mouse visual cortex cells dataset. Note: For the adult mouse visual cortex cells dataset, three main clusters are GABAergic (red-yellowish), glutamatergic (blueish), and non-neuronal (greenish) cell types.</p>
Figure 6
<p>Overview of the accuracy of the <span class="html-italic">knn</span> classifier model on the 2D-MDS space derived from MAGIC’s and sc-PHENIX’s exponentiated Markovian matrix. Each plot represents different combinations of diffusion parameters <span class="html-italic">t</span>, <span class="html-italic">knn</span> values, and PCA dimensions. The color gradient, using a rainbow palette, indicates the classifier accuracy, with red representing higher accuracy and blue representing lower accuracy. The top row corresponds to MAGIC, while the bottom row corresponds to sc-PHENIX. The results show how increasing <span class="html-italic">t</span>, <span class="html-italic">knn</span>, and PCA dimensions affect local structure preservation.</p>
Figure 7
<p><b>HDBSCAN clusters on the exponentiated Markov matrix of sc-PHENIX.</b> Clusters were assigned as letters (A, B, C, etc.). (<b>A</b>) MNIST samples distribution of different HDBSCAN clusters (PCA space). (<b>B</b>) Distribution of MNIST samples on different HDBSCAN clusters of the <b>M<sup>t</sup></b> (diffusion on PCA space, also known as MAGIC). (<b>C</b>) Distribution of MNIST samples on different HDBSCAN clusters of the <b>M<sup>t</sup></b> (diffusion on PCA-UMAP space, also known as sc-PHENIX). (<b>D</b>) Condensed tree plot (PCA space). (<b>E</b>) Condensed tree plot (diffusion on PCA space). (<b>F</b>) Condensed tree plot (diffusion on PCA-UMAP space). (<b>G</b>) Scheme of an inaccurate diffusion process. Diffusion in PCA space connects two distinct clusters (black and blue). This connection occurs in the proximate regions between different clusters (distinct cell phenotypes) separated by a small gap. Due to the diffusion process, this artifact includes spurious neighboring samples that do not share similar features. This occurs because all points (cells) are relatively close to each other in the multidimensional PCA space, and PCA does not provide sufficient separation. Note: In (<b>D</b>–<b>F</b>), the red circles indicate the most stable and persistent clusters identified by HDBSCAN. These clusters are highlighted because they exhibit higher stability, measured by the λ values at which points remain within them before splitting into smaller clusters, indicating their significance and robustness.</p>
Figure 8
<p><b>Imputation of the adult mouse visual cortex using MAGIC and sc-PHENIX.</b> (<b>A</b>) Flt1 expression recovered by MAGIC and sc-PHENIX is visualized on the UMAP projection of the adult mouse visual cortex cells dataset with different parameter combinations. (<b>B</b>) The non-imputed expression values of Flt1 are visualized on the UMAP projection of the adult mouse visual cortex cells dataset. (<b>C</b>) In the 2D UMAP projection of the adult mouse visual cortex cells dataset (without imputation), three main clusters are GABAergic (green circle), glutamatergic (orange circle), and non-neuronal cell types, with 21 cell phenotypes in total. Different parameters were used to see the effect of over-smoothing for MAGIC and sc-PHENIX methods.</p>
Figure 9
<p><b>Imputation performance of MAGIC and sc-PHENIX through increasing combinations of parameters.</b> Here, we show the precision, recall, and f1-score performance metrics for imputation using different combinations of parameters of <span class="html-italic">knn</span>, <span class="html-italic">t</span>, and PCA dimensions. We used <span class="html-italic">Flt1</span> for NonNeu_Endo and NonNeu_SMC cell types, <span class="html-italic">Chat</span> for GABA_Vip cell type, <span class="html-italic">Sst</span> for GABA_Sst cell type, and <span class="html-italic">Serpinb11</span> for Gluta_L6B cell types. The differential expression for the imputed gene markers was set to Fold Change = 2.0 and FWRD = 0.05 using Tukey’s HSD (honestly significant difference). Note: For increasing values of PCA dimensions, we set <span class="html-italic">knn</span> = 15 and <span class="html-italic">t</span> = 15. For increasing values of <span class="html-italic">t</span>, we set <span class="html-italic">knn</span> = 15 and <span class="html-italic">n_pca</span> = 30. For increasing values of <span class="html-italic">knn</span>, we set <span class="html-italic">t</span> = 15 and PCA = 30, similar to that shown in <a href="#biology-13-00512-f006" class="html-fig">Figure 6</a> and <a href="#app1-biology-13-00512" class="html-app">Figure S13</a>. Note: Gene markers: <span class="html-italic">Flt1</span> (vascular endothelial growth factor receptor 1), <span class="html-italic">Chat</span> (Choline Acetyltransferase), <span class="html-italic">Sst</span> (Somatostatin), and <span class="html-italic">Serpinb11</span> (Serpin Family B Member 11).</p>
Figure 10
<p><b>Characterization of MCF7 MCTS polarization phenotypes based on the recovered data by MAGIC and sc-PHENIX.</b> (<b>A</b>) Downstream analysis post imputation with MAGIC (diffusion on PCA). (<b>B</b>) Downstream analysis post imputation with sc-PHENIX (diffusion on PCA-UMAP). The downstream analysis comprises: (i) Differential expression by cluster: heatmap of differentially expressed genes (DEGs) using the EMD score (red and blueish colored), with DEGs as rows and HDBSCAN clusters as numbered columns. (ii) GSEA heatmap (greenish colored) of the normalized enrichment score (NES); statistical significance is taken at FDR <math display="inline"><semantics> <mrow> <mo>&lt;</mo> </mrow> </semantics></math> 0.05, and only positive enrichment values were considered for visualization. (iii) Only dense clusters are visualized in the 3D-PCA space of the recovered (imputed) data. Note: the 3D PCA plots here come from the interactive plots in <a href="#app1-biology-13-00512" class="html-app">Supplementary Material: Figure S12</a> (MAGIC) and <a href="#app1-biology-13-00512" class="html-app">Supplementary Material: Figure S13</a> (sc-PHENIX), which show more detail in a 3D plot context.</p>
Figure 11
<p><b>Overview of the characterization of extreme and transition state cell phenotypes from sc-PHENIX.</b> Overview of the characterization of cell phenotypes obtained from the imputed data with sc-PHENIX. Arrows indicate clusters that are in the extreme of clusters (archetypes) or extreme phenotypes in cell space in the 3D-PCA plot, and it is already known that extreme archetypes are samples that differ in gene expression. On the right: GSEA heatmap (greenish colored), the statistical significance is pronounced at FDR <math display="inline"><semantics> <mrow> <mo>&lt;</mo> </mrow> </semantics></math> 0.05, the normalized enrichment score (NES), only positive enrichment values were considered for visualization. There are 4 extreme clusters that we named (1) Proliferative state, where enriched pathways are related to a proliferation state. (2) Invasive state, where enriched pathways are related to a hypoxic phenotype in which inflammatory responses and angiogenesis pathways are present. (3) Necrotic state, where we already identified over-expression of VEGFA and MT GENES, low UMI counts, and diminished pathways. (4) Transition state, which shares HALLMARKS from the extreme clusters 4 and 5 but based on the REACTOME database (<a href="#app1-biology-13-00512" class="html-app">Supplementary Material: Table S3</a>), this cluster presents high values of NES of the mitophagy and starving signaling pathways in the extreme clusters 7 and 8.</p>
13 pages, 7339 KiB  
Article
Improving the Two-Color Temperature Sensing Using Machine Learning Approach: GdVO₄:Sm³⁺ Prepared by Solution Combustion Synthesis (SCS)
by Jovana Z. Jelic, Aleksa Dencevski, Mihailo D. Rabasovic, Janez Krizan, Svetlana Savic-Sevic, Marko G. Nikolic, Myriam H. Aguirre, Dragutin Sevic and Maja S. Rabasovic
Photonics 2024, 11(7), 642; https://doi.org/10.3390/photonics11070642 - 6 Jul 2024
Viewed by 587
Abstract
The gadolinium vanadate doped with samarium (GdVO₄:Sm³⁺) nanopowder was prepared by the solution combustion synthesis (SCS) method. After synthesis, in order to achieve full crystallinity, the material was annealed in air atmosphere at 900 °C. Phase identification in the post-annealed powder samples was performed by X-ray diffraction, and morphology was investigated by high-resolution scanning electron microscope (SEM) and transmission electron microscope (TEM). Photoluminescence characterization of emission spectrum and time-resolved analysis was performed using tunable laser optical parametric oscillator excitation and streak camera. In addition to samarium emission bands, a weak broad luminescence emission band of host VO₄³⁻ was also observed by the detection system. In our earlier work, we analyzed the possibility of using the host luminescence for two-color temperature sensing, improving the method by introducing the temporal dependence in line intensity ratio measurements. Here, we showed that further improvements are possible by using the machine learning approach. To facilitate the initial data assessment, we incorporated Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) clustering of GdVO₄:Sm³⁺ spectra at various temperatures. Good predictions of temperature were obtained using deep neural networks. Performance of the deep learning network was enhanced by data augmentation technique.
(This article belongs to the Special Issue Editorial Board Members’ Collection Series: Photonics Sensors)
Figure 1
<p>(<b>a</b>) TEM image of GdVO<sub>4</sub>-Sm nanoparticles with the inset showing the SAED pattern. (<b>b</b>) HRTEM image of a section of the nanoparticle highlighted in the red square in (<b>a</b>). The inset displays an interplanar distance of 4.77 Å, likely corresponding to the (101) lattice plane of tetragonal GdVO<sub>4</sub>.</p>
Figure 2
<p>The XRD pattern of the GdVO<sub>4</sub>-Sm nanopowder with respective Miller indices.</p>
Figure 3
<p>SEM image of GdVO<sub>4</sub>:Sm nanopowder.</p>
Figure 4
<p>Streak image of the photoluminescence spectrum of GdVO<sub>4</sub>:Sm<sup>3+</sup> nanophosphor (OPO excitation at 330 nm) at room temperature.</p>
Figure 5
<p>Streak images of the photoluminescence spectrum of GdVO<sub>4</sub>:Sm<sup>3+</sup> nanophosphor at two temperatures.</p>
Figure 6
<p>Temperature dependence of intensity ratios of luminescence peaks (<b>a</b>). Temperature dependence of luminescence lifetimes (<b>b</b>).</p>
Figure 7
<p>Scores on first two principal components of GdVO<sub>4</sub>:Sm<sup>3+</sup> spectra at different temperatures.</p>
Figure 8
<p>t-SNE component clustering of GdVO<sub>4</sub>:Sm<sup>3+</sup> spectra at different temperatures.</p>
Figure 9
<p>Uniform Manifold Approximation and Projection (UMAP) clustering of GdVO<sub>4</sub>:Sm<sup>3+</sup> spectra at different temperatures.</p>
Figure 10
<p>A schematic diagram illustrating the deep learning network used in this study. It is not possible to draw all arrows, and the shown arrows are merely symbolic. The pixel rows of the region of interest in the image are fed into the input layer as shown in the diagram.</p>
Figure 11
<p>Plot of predicted temperatures using the training set of 10 samples for each temperature. RMSEC refers to the root mean square error of calibration; RMSECV refers to the root mean square error of cross-validation.</p>
Figure 12
<p>Plot of predicted temperatures using the training set of 50 samples for each temperature. RMSEC refers to the root mean square error of calibration; RMSECV refers to the root mean square error of cross-validation.</p>
Figure 13
<p>The learning curves depicting the training progress of the ANNDL model.</p>
16 pages, 3790 KiB  
Article
iNP_ESM: Neuropeptide Identification Based on Evolutionary Scale Modeling and Unified Representation Embedding Features
by Honghao Li, Liangzhen Jiang, Kaixiang Yang, Shulin Shang, Mingxin Li and Zhibin Lv
Int. J. Mol. Sci. 2024, 25(13), 7049; https://doi.org/10.3390/ijms25137049 - 27 Jun 2024
Viewed by 1174
Abstract
Neuropeptides are biomolecules with crucial physiological functions. Accurate identification of neuropeptides is essential for understanding nervous system regulatory mechanisms. However, traditional analysis methods are expensive and laborious, and the development of effective machine learning models continues to be a subject of current research. Hence, in this research, we constructed an SVM-based machine learning neuropeptide predictor, iNP_ESM, by integrating protein language models Evolutionary Scale Modeling (ESM) and Unified Representation (UniRep) for the first time. Our model utilized feature fusion and feature selection strategies to improve prediction accuracy during optimization. In addition, we validated the effectiveness of the optimization strategy with UMAP (Uniform Manifold Approximation and Projection) visualization. iNP_ESM outperforms existing models on a variety of machine learning evaluation metrics, with an accuracy of up to 0.937 in cross-validation and 0.928 in independent testing, demonstrating optimal neuropeptide recognition capabilities. We anticipate improved neuropeptide data in the future, and we believe that the iNP_ESM model will have broader applications in the research and clinical treatment of neurological diseases.
(This article belongs to the Section Molecular Neurobiology)
Figure 1
<p>An overview of the iNP_ESM model. Initially, neuropeptide sequences are input into the protein language models ESM and UniRep, generating 1280D ESM features and 1900D UniRep features for each sequence. Subsequently, these features are combined to form a 3180D fused feature. This fused feature can be directly input into an SVM model. Alternatively, after dimensionality reduction through feature selection to 120 dimensions, the reduced feature can also be input into the SVM model. Following a series of optimizations and performance comparisons, the iNP_ESM model is finalized.</p>
Figure 2
<p>Comparison of 10-fold cross-validation metrics for the combination of six feature encoding methods and seven machine learning algorithms. Here, UniRep is represented in dark green, ESM in light green, SSA in light yellow, LM in dark yellow, BiLSTM in dark red, and TAPE_BERT in light red. The machine learning algorithms include (<b>A</b>) GNB, (<b>B</b>) KNN, (<b>C</b>) LDA, (<b>D</b>) LGBM, (<b>E</b>) LR, (<b>F</b>) RF, and (<b>G</b>) SVM.</p>
Figure 3
<p>Comparison of the average values from 10-fold cross-validation and an independent test between fused feature models and single feature models. Here, UniRep is represented in green, ESM in light yellow, and UniRep+ESM_F3180 in dark yellow. The machine learning algorithms include (<b>A</b>) LGBM (parameters: {‘num_trees’: 1300, ‘learning_rate’: 0.28}) and (<b>B</b>) SVM (parameters: {‘C’: 1.9306977288832496, ‘gamma’: ‘scale’}).</p>
Figure 4
<p>The variation in (<b>A</b>) accuracy and (<b>B</b>) Matthews correlation coefficient during the feature selection process for ESM+UniRep_F3180 with the number of features. Here, 10-fold cross-validation metrics are represented in green, independent test metrics in red, and the average of cross-validation and independent test metrics in yellow. LGBM Classifier parameters: {‘num_leaves’: 32, ‘n_estimators’: 888, ‘max_depth’: 12, ‘learning_rate’: 0.16, ‘min_child_samples’: 50, ‘random_state’: 2020, ‘n_jobs’: 8}.</p>
Figure 5
<p>UMAP visualization plots (parameters: {‘metric’: ‘correlation’, ‘n_neighbors’: 45, ‘min_dist’: 0.12}). (<b>A</b>) using the ESM training set; (<b>B</b>) using the UniRep training set; (<b>C</b>) using the ESM+UniRep_F3180 training set; and (<b>D</b>) using the ESM+UniRep_F120 training set. Neuropeptides are represented by orange dots, and non-neuropeptides are represented by blue dots.</p>
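The parameter set quoted in this caption could be passed to the third-party `umap-learn` package roughly as follows. This is a minimal sketch rather than the authors' code: the function name `embed_features`, the two-dimensional output, and the fixed `random_state` are assumptions made for illustration; only the three parameter values come from the caption.

```python
# Parameter values as quoted in the Figure 5 caption.
UMAP_PARAMS = {"metric": "correlation", "n_neighbors": 45, "min_dist": 0.12}


def embed_features(features, params=UMAP_PARAMS, random_state=0):
    """Project a (n_samples, n_features) array to 2D with UMAP.

    Hypothetical helper: the 2D target and the seed are assumptions,
    not values reported in the caption.
    """
    # Imported lazily so the module loads even without umap-learn installed.
    import umap

    reducer = umap.UMAP(n_components=2, random_state=random_state, **params)
    return reducer.fit_transform(features)
```

With `n_neighbors` set to 45, the input should contain comfortably more than 45 samples, which holds for a training set of peptide embeddings such as the 3180D fused features described for Figure 1.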
28 pages, 9412 KiB  
Article
Deciphering Abnormal Platelet Subpopulations in COVID-19, Sepsis and Systemic Lupus Erythematosus through Machine Learning and Single-Cell Transcriptomics
by Xinru Qiu, Meera G. Nair, Lukasz Jaroszewski and Adam Godzik
Int. J. Mol. Sci. 2024, 25(11), 5941; https://doi.org/10.3390/ijms25115941 - 29 May 2024
Viewed by 973
Abstract
This study focuses on understanding the transcriptional heterogeneity of activated platelets and its impact on diseases such as sepsis, COVID-19, and systemic lupus erythematosus (SLE). Recognizing the limited knowledge in this area, our research aims to dissect the complex transcriptional profiles of activated platelets to aid in developing targeted therapies for abnormal and pathogenic platelet subtypes. We analyzed single-cell transcriptional profiles from 47,977 platelets derived from 413 samples of patients with these diseases, utilizing Deep Neural Network (DNN) and eXtreme Gradient Boosting (XGB) to distinguish transcriptomic signatures predictive of fatal or survival outcomes. Our approach included source data annotations and platelet markers, along with SingleR and Seurat for comprehensive profiling. Additionally, we employed Uniform Manifold Approximation and Projection (UMAP) for effective dimensionality reduction and visualization, aiding in the identification of various platelet subtypes and their relation to disease severity and patient outcomes. Our results highlighted distinct platelet subpopulations that correlate with disease severity, revealing that changes in platelet transcription patterns can intensify endotheliopathy, increasing the risk of coagulation in fatal cases. Moreover, these changes may impact lymphocyte function, indicating a more extensive role for platelets in inflammatory and immune responses. This study identifies crucial biomarkers of platelet heterogeneity in serious health conditions, paving the way for innovative therapeutic approaches targeting platelet activation, which could improve patient outcomes in diseases characterized by altered platelet function.
(This article belongs to the Special Issue New Advances in Platelet Biology and Functions: 2nd Edition)
Graphical abstract
Figure 1
<p>PBMC profiling from healthy controls, sepsis, similar symptom hospitalized, COVID-19, and SLE patients. (<b>A</b>) Schematic outline depicting the workflow for data collection from published literature and subsequent integrated analysis. Created with biorender.com. (<b>B</b>–<b>F</b>) Bar plots depicting the percentage of different cell types under different disease severities. (<b>B</b>) Platelets, (<b>C</b>) T cells, (<b>D</b>) B cells, (<b>E</b>) monocytes, and (<b>F</b>) neutrophils. (<b>G</b>) DC under different outcome situations. The differences in percentages associated with adjusted <span class="html-italic">p</span>-values below 0.05, 0.01, 0.001, and 0.0001 are indicated as *, **, ***, and ****, respectively, and not significant ones are not shown. The significance analysis was performed using Wilcoxon tests. Standard error bars were also added. (<b>H</b>) Receiver operating characteristic (ROC) curves for the platelet to T cell ratio and other cell type percentages were used to distinguish non-survivors from survivors.</p>
Figure 2
<p>Deep neural networks and XGBoost modeling identify biomarkers of survival and fatal platelets. (<b>A</b>) Comparison of DNN and XGBoost model performance, DNN (represented in green) vs. XGB (represented in blue). (<b>B</b>) Venn diagram representing features from the Deep Neural Network (DNN) and XGBoost (XGB) models, including only those features that rank in the highest 5% in terms of their importance or gain metrics within each model. The diagram also incorporates features that are differentially expressed genes (DEG) with an absolute log2 fold change (log2fc) greater than 1. (<b>C</b>) Volcano plot depicting genes that are upregulated or downregulated when comparing platelets from survivors to those from fatal cases. The x-axis represents the log fold change, a measure of the change in expression levels between two conditions. A zero value indicates no change, positive values indicate upregulation, and negative values indicate downregulation in the condition of interest relative to a reference condition. The y-axis represents the negative log10 adjusted <span class="html-italic">p</span>-value. This transformation is used to amplify differences in <span class="html-italic">p</span>-values, where small <span class="html-italic">p</span>-values (which indicate statistical significance) result in larger values on the plot. The horizontal dashed line typically represents a threshold of significance (e.g., adjusted <span class="html-italic">p</span>-value of 0.05), above which the findings are considered statistically significant. (<b>D</b>,<b>E</b>) Bar charts illustrating the enrichment of biological pathways among genes up-regulated in survival/fatal cases. The color of the bars represents the level of statistical significance after adjustment for multiple comparisons, with darker colors indicating more statistically significant enrichment.</p>
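The volcano-plot coordinates described in panel (C) can be sketched as below. This is an illustrative example under stated assumptions, not the authors' pipeline: the function name `volcano_point` is hypothetical, and the cutoffs (adjusted p-value of 0.05, absolute log2 fold change of 1) are taken from the thresholds mentioned in the caption, which may differ from those used to draw the actual figure.

```python
import math


def volcano_point(log2_fc, adj_p, fc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, label) for one gene on a volcano plot.

    x is the log2 fold change; y is -log10 of the adjusted p-value,
    so smaller p-values land higher on the plot.
    """
    y = -math.log10(adj_p)
    significant = adj_p < p_cutoff and abs(log2_fc) > fc_cutoff
    if not significant:
        label = "not significant"
    elif log2_fc > 0:
        label = "up"
    else:
        label = "down"
    return log2_fc, y, label


# Hypothetical genes, for illustration only.
up_gene = volcano_point(2.0, 0.001)
down_gene = volcano_point(-1.5, 0.01)
flat_gene = volcano_point(0.2, 0.5)
```

A gene is labelled only when it clears both cutoffs, which is why a large fold change with a weak adjusted p-value still falls below the horizontal threshold line.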
Figure 3
<p>Differential expression of platelets affects endotheliopathy across disease severity states. (<b>A</b>) The expression of the ITGA2B gene in platelets across severity states. (<b>B</b>) The expression module GO: Cytoskeleton organization in platelets across severity states. Violin plots are ordered according to the decreasing average value of the expression. (<b>C</b>) The comparison of GO terms blood coagulation (GO:0007596), inflammatory response (GO:0006954), apoptotic process (GO:0006915), extracellular matrix disassembly (GO:0022617), and platelet activation (GO:0030168) expression. Heatmap coloring represents <span class="html-italic">z</span>-scored scores averaged across all cells in a given sample.</p>
Figure 4
<p>Clustered integrative analysis of platelets’ single-cell transcriptional landscape. (<b>A</b>,<b>B</b>) Cell cluster UMAP representation of all merged platelets (<b>A</b>) colored by data source; (<b>B</b>) colored by clusters. (<b>C</b>,<b>D</b>) stacked bar plots display platelet cluster proportion under (<b>C</b>) different disease severity, and (<b>D</b>) different outcome situations. (<b>E</b>) Violin plots showing expression levels of marker genes for each cluster in platelets. In (<b>A</b>), we gathered single-cell RNA-seq datasets of peripheral blood mononuclear cells (PBMCs) from COVID-19 [<a href="#B30-ijms-25-05941" class="html-bibr">30</a>,<a href="#B31-ijms-25-05941" class="html-bibr">31</a>,<a href="#B32-ijms-25-05941" class="html-bibr">32</a>,<a href="#B33-ijms-25-05941" class="html-bibr">33</a>,<a href="#B34-ijms-25-05941" class="html-bibr">34</a>,<a href="#B35-ijms-25-05941" class="html-bibr">35</a>,<a href="#B36-ijms-25-05941" class="html-bibr">36</a>,<a href="#B37-ijms-25-05941" class="html-bibr">37</a>,<a href="#B38-ijms-25-05941" class="html-bibr">38</a>], sepsis [<a href="#B12-ijms-25-05941" class="html-bibr">12</a>,<a href="#B39-ijms-25-05941" class="html-bibr">39</a>], and systemic lupus erythematosus (SLE) [<a href="#B40-ijms-25-05941" class="html-bibr">40</a>] patients.</p>
Figure 5
<p>Clustered platelets and their unique pathway expression changes. (<b>A</b>) Enrichment analysis for human hallmark gene sets for each platelet cluster. Expression level is the normalized enrichment score in the GSEA algorithm. (<b>B</b>–<b>E</b>) Bar plots of hallmark gene set expression among clusters: (<b>B</b>) Coagulation, (<b>C</b>) Apical Junction, (<b>D</b>) Epithelial Mesenchymal Transition, and (<b>E</b>) E2F Targets. (<b>F</b>) Heatmap displaying fatal cluster C9 up-regulated genes and enriched pathways in gene ontology (GO). (<b>G</b>) Gene-Concept network displaying fatal cluster C4 up-regulated genes and enriched pathways from Gene, Disease Features Ontology-based Overview System (gendoo) [<a href="#B41-ijms-25-05941" class="html-bibr">41</a>] and diseases category. (<b>H</b>) Tree plot displaying convalescence cluster C8 enriched pathways in GO.</p>
Figure 6
<p>Platelet module signatures in patients related to survival and the fatal dynamic trend (<b>A</b>,<b>B</b>) Pseudo-time plot of platelets from clusters C0, C1, C4, and C6 exhibiting trajectory fates. (<b>C</b>,<b>D</b>) Differential expression genes in (<b>C</b>) C4 vs. C0; (<b>D</b>) C0 vs. C1. The volcano plot displays the genes constantly up-or down-regulated in the direction of disease severity. Volcano plots were prepared with the R package EnhancedVolcano (v.1.13.2). In the volcano plot, genes with an absolute log2 fold change greater than 0.5 are represented by red dots, while those with a lower absolute log2 fold change are represented by blue dots. (<b>E</b>–<b>H</b>) Ridge plots showing the density of expression level of (<b>E</b>) Fatal module expression under platelet clusters, (<b>F</b>) Fatal module expression under disease severities, (<b>G</b>) Survival module expression under platelet clusters, and (<b>H</b>) Survival module expression under disease severities. Ridge plots are ordered in descending order. (<b>I</b>) Receiver operating characteristic (ROC) curves for each platelet cluster percentage from PBMC were used to distinguish non-survivors from survivors.</p>
Figure 7
<p>Platelet pathway expression among healthy controls, sepsis, similar symptom hospitalized, COVID-19, and SLE patients. (<b>A</b>,<b>B</b>) Venn diagrams describing changes in Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways from COVID-19, similar symptoms hospitalized (SSH), sepsis, and systemic lupus erythematosus (SLE) vs. healthy controls (HC): (<b>A</b>) Up-regulated pathways. (<b>B</b>) Down-regulated pathways. The pathways were filtered for those with an adjusted <span class="html-italic">p</span>-value under 0.05. (<b>C</b>) Heatmap illustration of non-survivor vs. survivor up- or down-regulated pathways. Colors are determined by the product of the COVID-19 and sepsis up/down-regulated enriched pathway log10 (adjusted <span class="html-italic">p</span> value). The GO terms were reduced to representative ones using R package rrvgo (v.1.8.0) [<a href="#B44-ijms-25-05941" class="html-bibr">44</a>] (the cutoffs were similarity &gt; 0.4) and then overlapped. (<b>D</b>) Heatmap illustration of diseases vs. healthy controls up- or down-regulated pathways. Colors are determined by the product of the COVID-19, SSH, SLE, and sepsis up- or down-regulated enriched pathway log10 (adjusted <span class="html-italic">p</span> value). The GO terms were reduced to representative ones using R package rrvgo (v.1.8.0) (the cutoffs were similarity &gt; 0.1) and then overlapped.</p>
Figure 8
<p>Alterations in platelet and other cell type interactions among healthy controls, sepsis, similar symptom hospitalized, COVID-19, and SLE patients. (<b>A</b>) Comparison of ligand-receptor interaction scores between platelets and other cell types. Heatmap coloration corresponds to z-scored, log-normalized mean interaction scores averaged across all cells from a specific sample. (<b>B</b>–<b>E</b>) Ligand and receptor interaction scores between platelets and (<b>B</b>) T cells across various disease severity levels, (<b>C</b>) T cells across different outcomes, (<b>D</b>) B cells across various disease severity levels, and (<b>E</b>) B cells across different outcomes. The circle size represents the <span class="html-italic">z</span>-scored interaction scores. (<b>F</b>,<b>G</b>) Pathway module scores across different outcome situations in platelets, including (<b>F</b>) T cell differentiation and (<b>G</b>) B cell proliferation. The differences in scores associated with adjusted <span class="html-italic">p</span>-values below 0.05, 0.01, 0.001, and 0.0001 are indicated as *, **, ***, and ****, respectively.</p>
19 pages, 20483 KiB  
Article
Subcellular Feature-Based Classification of α and β Cells Using Soft X-ray Tomography
by Aneesh Deshmukh, Kevin Chang, Janielle Cuala, Bieke Vanslembrouck, Senta Georgia, Valentina Loconte and Kate L. White
Cells 2024, 13(10), 869; https://doi.org/10.3390/cells13100869 - 18 May 2024
Viewed by 1350
Abstract
The dysfunction of α and β cells in pancreatic islets can lead to diabetes. Many questions remain on the subcellular organization of islet cells during the progression of disease. Existing three-dimensional cellular mapping approaches face challenges such as time-intensive sample sectioning and subjective cellular identification. To address these challenges, we have developed a subcellular feature-based classification approach, which allows us to identify α and β cells and quantify their subcellular structural characteristics using soft X-ray tomography (SXT). We observed significant differences in whole-cell morphological and organelle statistics between the two cell types. Additionally, we characterize subtle biophysical differences between individual insulin and glucagon vesicles by analyzing vesicle size and molecular density distributions, which were not previously possible using other methods. These sub-vesicular parameters enable us to predict cell types systematically using supervised machine learning. We also visualize distinct vesicle and cell subtypes using Uniform Manifold Approximation and Projection (UMAP) embeddings, which provides us with an innovative approach to explore structural heterogeneity in islet cells. This methodology presents an innovative approach for tracking biologically meaningful heterogeneity in cells that can be applied to any cellular system.
(This article belongs to the Special Issue Advanced Technology for Cellular Imaging)
Graphical abstract
Figure 1
<p>3D reconstruction and quantitative analysis of α and β cell morphology. (<b>A</b>) Orthoslice showing the XY plane through the soft X-ray tomogram of representative α and β cells (α_3 and β_6, respectively). Cell constituents and organelles are distinguished from one another based on their LAC values and are identified as follows: nucleus–blue arrowhead, mitochondria–pink arrowhead, glucagon vesicles–red arrowhead, and insulin vesicles–green arrowhead. The overall LAC value range of the orthoslice is between 0.15 and 0.4 μm<sup>−1</sup> to optimize contrast. Scale bar: 2 μm. (<b>B</b>) 3D reconstruction of the representative α and β cells (α_3 and β_6, respectively). In detail, the reconstruction shows the nucleus (blue), mitochondria (pink), glucagon vesicles ((<b>left</b>), in red), insulin vesicles ((<b>right</b>), in green), and plasma membrane (gray). (<b>C</b>) Cellular volume of both cell types, showing a significantly higher volume (*** <span class="html-italic">p</span> &lt; 0.001) for β cells (1191 ± 277 μm<sup>3</sup>) compared with α cells (579 ± 247 μm<sup>3</sup>). (<b>D</b>) Nuclear volume of both cell types showed no significant difference (<span class="html-italic">p</span> = 0.76). (<b>E</b>) Comparison between mean nuclear occupancy for each cell type, with a significant increase (*** <span class="html-italic">p</span> &lt; 0.001) in percentage occupancy of the nucleus for α cells (21 ± 5%) compared with β cells (10 ± 3%). (<b>F</b>) Number of insulin vesicles normalized by cytosolic volume indicating a significantly higher number of vesicles (* <span class="html-italic">p</span> = 0.03) per cytosolic μm<sup>3</sup> for α cells (3.3 ± 1.4 vesicles/μm<sup>3</sup>) compared with β cells (2 ± 0.6 vesicles/μm<sup>3</sup>). 
(<b>G</b>) Plot of mean vesicle diameters of α and β cell vesicles demonstrating a higher vesicle diameter (*** <span class="html-italic">p</span> &lt; 0.001) for α cell vesicles (212 ± 21 nm) compared with β cell vesicles (163 ± 13 nm). (<b>H</b>) Mean Vesicle LAC for secretory vesicles of α and β cells showing a significantly higher mean LAC (** <span class="html-italic">p</span> = 0.003) for α cell vesicles (0.37 ± 0.03 μm<sup>−1</sup>) compared with β cell vesicles (0.33 ± 0.02 μm<sup>−1</sup>). Error bars in each plot are representative of the standard deviation. Welch’s <span class="html-italic">t</span>-test was used as a statistical test. n = 8 for α cells (red) and n = 7 for β cells (green).</p>
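Welch's t-test, named in this caption, can be reproduced from summary statistics alone. The sketch below is an assumption-laden illustration, not the authors' analysis: the function name `welch_from_summary` is hypothetical, and it only computes the t statistic and the Welch-Satterthwaite degrees of freedom, leaving the p-value to a statistics library. The example inputs are the cellular volumes reported for panel (C), with the ± values read as standard deviations as the caption states.

```python
import math


def welch_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's unequal-variance t statistic and degrees of freedom."""
    # Squared standard errors of each group mean.
    se1 = sd1 ** 2 / n1
    se2 = sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Panel (C): beta cells 1191 +/- 277 (n = 7), alpha cells 579 +/- 247 (n = 8).
t_stat, dof = welch_from_summary(1191, 277, 7, 579, 247, 8)
```

For these inputs the statistic comes out at roughly 4.5 on about 12 degrees of freedom, which is in line with the small p-value the caption reports for cellular volume.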
Figure 2
<p>3D reconstruction and quantitative analysis of pooled insulin and glucagon vesicles. (<b>A</b>) (<b>left</b>) XY Orthoslice through the SXT of representative α and β cells. Glucagon vesicles (red arrowheads) and insulin vesicles (green arrowheads) can be identified based on their high LAC values. The overall LAC value in the orthoslice is thresholded between 0.15 and 0.40 μm<sup>−1</sup> (scale bar: 0.5 μm). (<b>right</b>) 3D reconstruction of a section of representative α and β cells (α_3 and β_6, respectively). In detail, the reconstruction shows glucagon vesicles ((<b>top</b>), in red) and insulin vesicles ((<b>bottom</b>), in green), and plasma membrane (gray). (<b>B</b>) Histogram showing the size distribution of glucagon and insulin vesicles. The vesicles for each cell type are pooled together and show a significantly higher diameter (**** <span class="html-italic">p</span> &lt; 0.0001) for insulin vesicles (194.2 ± 49 nm, green dotted line), compared with glucagon vesicles (157 ± 35 nm, red dotted line). (<b>C</b>) Histogram showing LAC distribution of glucagon and insulin vesicles demonstrating a significantly higher mean vesicle LAC values (**** <span class="html-italic">p</span> &lt; 0.0001) for insulin vesicles (0.37 ± 0.04 μm<sup>−1</sup>, red dotted line), compared with glucagon vesicles (0.33 ± 0.03 μm<sup>−1</sup>, green dotted line). (<b>B</b>,<b>C</b>) n = 10,694 for glucagon vesicles (red) and n = 14,690 for insulin vesicles (green). Welch’s <span class="html-italic">t</span>-test was used as a statistical test.</p>
Figure 3
<p>Description and comparison of LAC-based parameters between insulin and glucagon vesicles (<b>A</b>) (<b>top</b>) XY Orthoslice through SXT of representative α cell (α_3). Scale bar: 0.5 μm. (<b>middle</b>) 3D reconstruction of a representative glucagon vesicle (red). (<b>bottom</b>) Histogram displaying the LAC distribution of the glucagon vesicle picture in the top and middle panels showing a mean LAC value of 0.34 μm<sup>−1</sup> for the vesicle. (<b>B</b>) (<b>top</b>) XY Orthoslice through SXT of representative β-cell (β_6). Scale bar: 500 nm. (<b>middle</b>) 3D reconstruction of a representative insulin vesicle (green). (<b>bottom</b>) Histogram displaying the LAC distribution of the insulin vesicle picture in the (<b>top</b>) and (<b>middle</b>) panels, showing a mean LAC value of 0.318 μm<sup>−1</sup> for the vesicle. (<b>C</b>) (<b>top</b>) A comparison of vesicle LAC parameters (minimum LAC, 25th quantile LAC, mean LAC, 75th quantile LAC, maximum LAC) between glucagon vesicles (red) and insulin vesicles (green) showing significantly higher values (**** <span class="html-italic">p</span> &lt; 0.0001; one-way ANOVA with Bonferroni’s correction) for glucagon vesicles for all displayed parameters compared with insulin vesicles. (<b>bottom</b>) LAC histogram curve for a sample vesicle, with arrows indicating the value being compared in the (<b>top</b>) panel. (<b>D</b>) Plot showing a significantly higher (**** <span class="html-italic">p</span> &lt; 0.0001; Welch’s <span class="html-italic">t</span>-test) standard deviation for glucagon vesicles (red) compared with insulin vesicles (green). Error bars in all plots are representative of the standard deviation. n = 10,694 for glucagon vesicles (red) and n = 14,690 for insulin vesicles (green).</p>
Figure 4
<p>Overview of machine learning strategy. (<b>A</b>) Data pre-processing for insulin and glucagon vesicles from β and α cells is conducted. A final vesicle feature matrix, including group labels (denoting which cell a vesicle is from) for vesicles, is used as input for machine learning. (<b>B</b>) Train/test split for grouping vesicles from α and β cells. The process of model building is described, with Leave One Group Out cross-validation used to estimate the performance of predicting vesicle identity from unseen cells. Model building and testing are repeated over 56 combinations to understand variability in performance.</p>
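The grouped hold-out scheme in panel (B) can be sketched in plain Python. The caption does not say how the 56 combinations arise; the sketch assumes they are the 8 × 7 pairings of one held-out α cell with one held-out β cell, using the cell counts given in the Figure 1 caption. The function name and the cell identifiers are hypothetical.

```python
from itertools import product


def paired_holdout_splits(alpha_cells, beta_cells):
    """Yield (train_cells, test_cells) with one alpha and one beta cell held out.

    Every vesicle from a held-out cell would go to the test set, so the
    model is always evaluated on cells it has never seen.
    """
    all_cells = alpha_cells + beta_cells
    for held_alpha, held_beta in product(alpha_cells, beta_cells):
        test_cells = [held_alpha, held_beta]
        train_cells = [c for c in all_cells if c not in test_cells]
        yield train_cells, test_cells


# Assumed cell identifiers: 8 alpha cells and 7 beta cells.
alpha_cells = [f"alpha_{i}" for i in range(1, 9)]
beta_cells = [f"beta_{i}" for i in range(1, 8)]
splits = list(paired_holdout_splits(alpha_cells, beta_cells))
```

Splitting by cell rather than by vesicle keeps vesicles from the same cell out of both sides of a split, which is the point of using group labels in the feature matrix.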
Figure 5
<p>Representing vesicle feature importances in UMAP embeddings. (<b>A</b>) Representative feature importances from each ML model listed in order of accuracy. The radius of circles is scaled to the magnitude of the permutation feature importances. Since the LAC standard deviation in the XGBoost model had the highest overall importance magnitude, the other parameters are scaled to it. In general, LAC mean, LAC standard deviation, and diameter seem to be the most important representative parameters. (<b>B</b>) UMAP embedding of vesicles colored by vesicle identity. Semi-distinct clusters of insulin and glucagon vesicles can be observed. (<b>C</b>) Vesicles colored by cellular origin. The overall trends in the pooled vesicle UMAP space do not seem to be driven by cell-dependent effects. (<b>D</b>) Embeddings colored by vesicle feature values. Gradients of LAC mean, standard deviation, and diameter correspond to regions of insulin and glucagon vesicles. The grouping of heterogeneous vesicle subpopulations can also be visualized.</p>
16 pages, 1496 KiB  
Article
Identifying Novel Subtypes of Functional Gastrointestinal Disorder by Analyzing Nonlinear Structure in Integrative Biopsychosocial Questionnaire Data
by Sa-Yoon Park, Hyojin Bae, Ha-Yeong Jeong, Ju Yup Lee, Young-Kyu Kwon and Chang-Eop Kim
J. Clin. Med. 2024, 13(10), 2821; https://doi.org/10.3390/jcm13102821 - 10 May 2024
Cited by 1 | Viewed by 734
Abstract
Background/Objectives: Given the limited success in treating functional gastrointestinal disorders (FGIDs) through conventional methods, there is a pressing need for tailored treatments that account for the heterogeneity and biopsychosocial factors associated with FGIDs. Here, we considered the potential of novel subtypes of FGIDs based on biopsychosocial information. Methods: We collected data from 198 FGID patients utilizing an integrative approach that included the traditional Korean medicine diagnosis questionnaire for digestive symptoms (KM), as well as the 36-item Short Form Health Survey (SF-36), alongside the conventional Rome-criteria-based Korean Bowel Disease Questionnaire (K-BDQ). Multivariate analyses were conducted to assess whether KM or SF-36 provided additional information beyond the K-BDQ and its statistical relevance to symptom severity. Questions related to symptom severity were selected using an extremely randomized trees (ERT) regressor to develop an integrative questionnaire. For the identification of novel subtypes, Uniform Manifold Approximation and Projection and spectral clustering were used for nonlinear dimensionality reduction and clustering, respectively. The validity of the clusters was assessed using certain metrics, such as trustworthiness, silhouette coefficient, and accordance rate. An ERT classifier was employed to further validate the clustered result. Results: The multivariate analyses revealed that SF-36 and KM supplemented the psychosocial aspects lacking in K-BDQ. Through the application of nonlinear clustering using the integrative questionnaire data, four subtypes of FGID were identified: mild, severe, mind-symptom predominance, and body-symptom predominance. Conclusions: The identification of these subtypes offers a framework for personalized treatment strategies, thus potentially enhancing therapeutic outcomes by tailoring interventions to the unique biopsychosocial profiles of FGID patients.
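Of the cluster-validity metrics the abstract lists, the silhouette coefficient is simple enough to sketch directly. The version below uses the standard definition with Euclidean distance; the function name, the toy points, and the labels are illustrative assumptions, not data or code from the study.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so divide by len(own) - 1).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 2-D embedding: two tight, well-separated groups (illustrative only).
embedding = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_coefficient(embedding, [0, 0, 1, 1])
bad = silhouette_coefficient(embedding, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

Values near 1 indicate compact, well-separated clusters, while values near or below 0 indicate overlapping or mismatched assignments.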
(This article belongs to the Special Issue Clinical Innovations in Digestive Disease Diagnosis and Treatment)
Figure 1
<p>Overview of the whole analysis procedure. K-BDQ, Rome-criteria-based Korean Bowel Disease Questionnaire; KM, traditional Korean medicine diagnosis questionnaire for digestive symptoms; SF-36, 36-item Short Form Health Survey; CCA, Canonical Correlation Analysis; MLR, Multiple Linear Regression; FGID, functional gastrointestinal disorder; ERT, extremely randomized tree.</p>
Figure 2
<p>Exploring the similarity between each questionnaire. (<b>A</b>) Result of CCA between each pair of questionnaires, K-BDQ, KM, and SF-36. The width of the arrows represents the strength of the canonical correlation. (<b>B</b>) Distribution of adjusted <math display="inline"><semantics> <mrow> <msup> <mrow> <mi>R</mi> </mrow> <mrow> <mn>2</mn> </mrow> </msup> </mrow> </semantics></math> values from MLR models to explain KM and SF-36 variables with a combination of upper and lower gastrointestinal symptom variables in K-BDQ. CCA, Canonical Correlation Analysis; MLR, Multiple Linear Regression; K-BDQ, Rome-criteria-based Korean Bowel Disease Questionnaire; KM, traditional Korean medicine diagnosis questionnaire for digestive symptoms; SF-36, 36-item Short Form Health Survey.</p>
Figure 3
<p>Relevance of the nonlinear clustered structure and the symptom severity. Each dot corresponds to a patient. Each axis corresponds to the UMAP dimensions 1, 2, and 3. (<b>A</b>) Result of dimensionality reduction and clustering analysis after hyperparameter tuning using each questionnaire’s information. Each color represents the result of clustering analysis. Patients’ data from K-BDQ, KM, and SF-36 were divided into six (n_neighbors = 10, n_clusters = 6, trustworthiness = 0.87, silhouette coefficient = 0.48, accordance rate = 0.83), three (n_neighbors = 30, n_clusters = 3, trustworthiness = 0.83, silhouette coefficient = 0.43, accordance rate = 0.84), and two (n_neighbors = 10, n_clusters = 2, trustworthiness = 0.87, silhouette coefficient = 0.64, accordance rate = 0.88) clusters. (<b>B</b>) Data embedding of K-BDQ was color-coded with the clustering result of 2A (the clustered labels when using K-BDQ, KM, and SF-36 information). Each color indicates the label colors of the clustering analysis (the same color as 2A). (<b>C</b>) Relevance of the clustered structure and the symptom severity when using K-BDQ, KM, and SF-36 information, respectively. Color intensity indicates the symptom severity. K-BDQ, Rome-criteria-based Korean Bowel Disease Questionnaire; KM, traditional Korean medicine diagnosis questionnaire for digestive symptoms; SF-36, 36-item Short Form Health Survey; C1, cluster 1; C2, cluster 2; C3, cluster 3.</p>
Figure 4
<p>Subtype identification result of FGID patients using the integrative questionnaire information. (<b>A</b>) Result of dimensionality reduction and clustering analysis after hyperparameter tuning using the integrative questionnaire information. Patients’ data from the integrative questionnaire were divided into four (n_neighbors = 30, n_clusters = 4, trustworthiness = 0.84, silhouette coefficient = 0.39, accordance rate = 0.85) clusters. Each color indicates different subtypes. Each dot corresponds to a patient, and each axis corresponds to the UMAP dimensions 1, 2, and 3. (<b>B</b>) Receiver operating characteristic curves and corresponding area under the curve statistics for the prediction of subtype label based on the integrative questionnaire information. S1, subtype 1; S2, subtype 2; S3, subtype 3; S4, subtype 4.</p>
Figure 5
<p>Characteristics of each FGID subtype. Line plot indicates the normalized average score of 29 body-symptom-related questions and 21 mind-symptom-related questions. Dot plot represents the normalized average score of each question for each subtype. S1, subtype 1; S2, subtype 2; S3, subtype 3; S4, subtype 4.</p>