Introduction

Single-cell RNA sequencing (scRNA-seq) is a technique for sequencing and analysing the transcriptome at the single-cell level. This approach reveals heterogeneity and diversity among cell populations and allows the analysis of gene sequences and expression across the transcriptome1,2,3, which is crucial for investigating large-scale cell atlases4 and complex diseases5,6,7 and for characterizing cell types8,9.

scRNA-seq technology has been widely applied in various fields of biology and medicine10,11,12. Cell clustering is a crucial step in scRNA-seq data analysis: it groups similar cells into clusters, which helps identify cell types, subtypes, and states and thus facilitates a better understanding of cellular diversity and function. Moreover, effective identification of cell types affects the downstream analysis of scRNA-seq data1,2. Therefore, many clustering algorithms, such as spectral clustering13, k-means14,15, Celltree16, and Gaussian mixture models (GMMs), are used to identify cell types17. Transcriptional bursting refers to the process in which an activation signal drives a gene from a silent state to an active state, rapidly initiating transcription and generating a large amount of mRNA within a short period before the gene returns to a silent state18. When transcriptional bursting was first discovered, it was widely regarded as noise19. The single-cell transcriptional bursting effect arises from the randomness and noise of transcription: during transcription, gene expression undergoes random bursts of enhancement or suppression, resulting in transcriptional differences among cells. As a result, scRNA-seq data are sparse, and the majority of measurements are zero. The most prominent phenomenon is the dropout event, in which low RNA capture rates yield false zero or near-zero gene expression values in some cells20,21,22. In addition, scRNA-seq data exhibit a high degree of variability at the gene expression level, and substantial noise arises from biological and technical variation. Therefore, effective scRNA-seq data clustering algorithms are crucial.

In recent years, single-cell clustering algorithms have been proposed to address the challenges associated with scRNA-seq data. SIMLR23 uses a multikernel learning framework to capture the complex relationships among cells based on gene expression profiles. SC324,25 addresses cell heterogeneity by integrating multiple clustering results into a consensus clustering solution. CIDR26 is an ultrafast algorithm for clustering through imputation and dimensionality reduction that implicitly imputes zeros during distance computation to reduce the impact of dropout in scRNA-seq data. These methods yield good clustering performance but incur high computation and storage costs and suffer from limited scalability.

Due to the excellent performance of deep learning, numerous deep clustering techniques have been introduced for scRNA-seq data analysis. scDeepCluster20 exploits the representation learning ability of deep autoencoders to capture complex patterns in scRNA-seq data. By combining deep learning and clustering, such methods can handle high-dimensional data. ADCluster27 simultaneously performs anomaly detection and clustering analysis, aiming to identify outliers in the dataset while clustering normal samples. scCAEs28 is a scRNA-seq clustering algorithm that combines convolutional autoencoder embedding with soft k-means deep embedding to learn latent cell populations in the embedding space. scDCCA29 employs denoising autoencoders (DAEs) and a contrastive learning module to extract valuable features. scDASFK30 uses DAEs and self-attention mechanisms within a contrastive learning framework to improve its robustness and extract additional critical features. DREAM31 combines a variational autoencoder and a GMM to visually analyse scRNA-seq data while introducing a zero-inflated layer for dimensionality reduction. However, these autoencoder (AE)-based algorithms focus on the expression data alone and do not explicitly consider the relationships among cells or their intrinsic characteristics; consequently, they cannot learn discriminative features effectively.

Most existing methods rely primarily on gene expression information during representation learning and do not explicitly exploit topological information among cells. To capture the complex relationships among cells and their intrinsic properties, several algorithms based on graph neural networks (GNNs), such as scGNN32, scGAC33, GraphSCC34, and scDFC35, have been proposed. scGNN combines graph convolutional networks (GCNs)36 with single-cell clustering to capture complex relationships among cells. However, the constructed graphs may contain noisy edges connecting cells of different types, which can blur the boundaries between cell types and mislead the clustering results. scGAC overcomes these limitations by using a graph attentional autoencoder to learn a latent representation of cells. GraphSCC captures higher-order structural relationships among cells. scDFC combines attribute information with attention-based structural information to construct accurate cell-to-cell graphs for complex biological settings. However, existing GNN-based clustering methods often suffer from representation collapse and tend to map nodes of different categories to similar representations during the encoding of cell-gene expression, making them ineffective at distinguishing different types of cells.

To overcome these issues, we propose the dual correlation reduction network-based extreme learning machine (DCRELM). First, the scRNA-seq data are mapped through the ELM random mapping space to obtain low-dimensional and dense features. The ELM graph distortion module then performs data augmentation in both the feature space and the structural space to improve the robustness of the model. Second, in the autoencoder fusion module, a dynamic fusion mechanism combines the AE and the improved graph autoencoder (IGAE) to obtain consistent latent representations, and the dual information correlation reduction module filters redundant information and noise. Last, a triplet self-supervised learning mechanism is employed to further improve the clustering performance.

Materials and methods

Datasets

We conducted comparative experiments on 12 real scRNA-seq datasets to verify the effectiveness of our DCRELM. Detailed information on these datasets is shown in Table 1, where #Cell is the number of cells, #Genes is the number of genes, #Cell types is the number of cell subtypes, and #References is the source of the dataset. The datasets span small, medium, and large sample sizes, with feature dimensionality ranging from low to high.

Figure 1

Framework diagram of the DCRELM. The cell-gene expression matrix X from the scRNA-seq data is taken as the input. The ELM maps the original high-dimensional sparse X into a random mapping space to obtain a low-dimensional dense cell output matrix H. Using a siamese network framework, the attribute information of the cell output matrix H is augmented to obtain \({\widetilde{H}}^1\) and \({\widetilde{H}}^2\), while the graph structure information of the cell adjacency matrix A is augmented to obtain \(A^m\) and \(A^d\). Then, the fusion encoder in the autoencoder fusion module extracts latent features \(H^{\upsilon 1}\) and \(H^{\upsilon 2}\) from \({\widetilde{H}}^1\) and \({\widetilde{H}}^2\), and the dual information correlation reduction network removes noise and redundant feature information. High-quality cell-gene expression features are obtained by decoding with the fusion module. The KL loss function of the triplet self-supervised strategy is minimized to improve clustering performance and effectively identify cell types.

Figure 2

Flowchart of the autoencoder fusion module. The autoencoder fusion module obtains the attribute information \({\widetilde{H}}_{AE}^{1}\) and \({\widetilde{H}}_{AE}^{2}\) of cells and the graph structure information \({\widetilde{H}}_{IGAE}^{1}\) and \({\widetilde{H}}_{IGAE}^{2}\) among cells via AE and IGAE and fuses these two pieces of information to obtain more suitable feature representations \({\widetilde{H}}^*\) of cells.

Table 1 Characteristics of experimental datasets.
Table 2 Notation summary.

The Python package SCANPY37 is used to preprocess the scRNA-seq data. The scRNA-seq data form a single-cell gene expression matrix in which rows represent cells and columns represent genes, with every cell having the same number of genes. In the preprocessing step, genes with zero expression in more than 95% of cells are removed to reduce the impact of uninformative genes on model computation and clustering accuracy, and the data are normalized to zero mean and unit variance.
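As a concrete illustration, the preprocessing described above can be sketched with SCANPY as follows; the input file name and the exact filtering call are assumptions and may differ from the original pipeline.

```python
# Minimal preprocessing sketch (assumed input path and filtering threshold).
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")  # hypothetical input file

# Keep genes expressed in at least 5% of cells, i.e. drop genes that are zero
# in more than 95% of cells.
sc.pp.filter_genes(adata, min_cells=int(0.05 * adata.n_obs))

# Normalize every gene to zero mean and unit variance.
sc.pp.scale(adata)

X = adata.X  # cell-gene expression matrix used as the DCRELM input
```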

Framework of the DCRELM

The overall framework of the DCRELM is illustrated in Fig. 1. The DCRELM consists of five main modules: the ELM module, the ELM graph distortion module, the autoencoder fusion module, the dual information correlation reduction module, and the triplet self-supervision strategy clustering module. To address the high dimensionality and sparsity of scRNA-seq data, we first use the ELM to obtain low-dimensional and dense cell features. Second, the graph distortion module is used for data augmentation, while the dynamic autoencoder fusion mechanism fuses the attribute information of cells with the graph structure information among cells. Third, the dual information correlation reduction network is utilized to remove genes related to low-quality cells and genes with low expression. Last, different types of cells are effectively identified by minimizing the KL loss function of the triplet self-supervised strategy. Table 2 summarizes the notation used in this paper.

Extreme learning machine

The ELM38,39,40 is known for its universal approximation capability and the hidden space created by random nonlinear feature mapping. The ELM is a single-hidden-layer feedforward neural network (SLFN) that randomly assigns input weights \(\varphi _i\) and hidden layer biases \(\zeta _i\). The input cell-gene matrix is assumed to be \(X=\left[ x_1,x_2,...,x_i,...,x_N\right] \in R^{N \times M}\), where N is the number of cells and M is the number of genes. The ELM hidden layer output matrix is expressed as follows:

$$\begin{aligned} H_{N\times \widetilde{M}}=\left[ \begin{array}{ccc} \mathcal {T}(\varphi _1\cdot x_1+\zeta _1) & \cdots & \mathcal {T}(\varphi _{\widetilde{M}}\cdot x_1+\zeta _{\widetilde{M}})\\ \vdots & \ddots & \vdots \\ \mathcal {T}(\varphi _1\cdot x_N+\zeta _1) & \cdots & \mathcal {T}(\varphi _{\widetilde{M}}\cdot x_N+\zeta _{\widetilde{M}})\\ \end{array}\right] _{N\times \widetilde{M}} \end{aligned}$$
(1)

where \(\varphi _i=\left[ \varphi _{i1},\varphi _{i2},...,\varphi _{iM}\right] ^T\) is the input weight vector of the i-th hidden node, \(\zeta _i\) is the bias of the i-th hidden node, \(\widetilde{M}\) is the number of random mapping nodes, and \(\mathcal {T}(\cdot )\) is the activation function. In the high-dimensional and sparse feature space of scRNA-seq data, identifying cell clusters is challenging. We utilize the ELM to map the sparse features into a low-dimensional and dense space, alleviating this problem.
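The random mapping of Eq. (1) can be sketched in a few lines; the sigmoid activation and standard-normal weight initialization below are assumptions, since the text does not fix \(\mathcal {T}(\cdot )\).

```python
import numpy as np

def elm_hidden_output(X, n_hidden, seed=0):
    """Random ELM feature mapping of Eq. (1): H = T(X @ Phi + zeta)."""
    rng = np.random.default_rng(seed)
    phi = rng.standard_normal((X.shape[1], n_hidden))  # random input weights (fixed, not trained)
    zeta = rng.standard_normal(n_hidden)               # random hidden biases
    return 1.0 / (1.0 + np.exp(-(X @ phi + zeta)))     # dense (N, n_hidden) output H

# Example: H = elm_hidden_output(X, n_hidden=500)
```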

ELM graph distortion module

To further improve the generalizability and robustness of the DCRELM, we use the ELM graph distortion module to learn rich representations of cells in multiple ways. We consider two types of distortion on the cell graphs: feature destruction and edge perturbation. Feature destruction is attribute distortion, in which the entries of the noise matrix \(NO\in R^{N\times \widetilde{M}}\) follow the Gaussian distribution \(\mathcal {N}\left( 1,0.1\right) \). The corrupted result matrix \(\widetilde{H}\in R^{N\times \widetilde{M}}\) is given by \(\widetilde{H}=H\odot NO\), where \(\odot \) denotes the Hadamard product.

Structural distortion comprises two operations: edge removal based on the similarity between cells and graph diffusion. First, the pairwise cosine similarity of cells is calculated in the latent space. Second, the 10% of links with the lowest similarity are removed, generating a mask matrix \(MA\in R^{N\times N}\) based on the adjacency matrix A of cells. Last, A is normalized, i.e. \(A^m=D^{-\frac{1}{2}}\left( \left( A\odot MA\right) +I\right) D^{-\frac{1}{2}}\), where the degree matrix \(D=diag(d_1,d_2,...,d_N)\in R^{N\times N}\). In the graph diffusion step, we use the personalized PageRank (PPR) method to transform \(A^m\) into a graph diffusion matrix \(A^d\), computed as \(A^d=\tau \left( I-\left( 1-\tau \right) A^m\right) ^{-1}\), where \(\tau \) is a balance parameter. We employ a siamese network to obtain feature representations of cells from two views, which enhances the clustering performance of the DCRELM.
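The two distortions can be sketched as follows; the noise parameters follow the text, whereas the value of \(\tau \), the 10% drop fraction handling, and the dense-matrix implementation are assumptions made for illustration.

```python
import numpy as np

def distort_features(H, rng):
    """Attribute distortion: Hadamard product with Gaussian noise N(1, 0.1)."""
    NO = rng.normal(loc=1.0, scale=0.1, size=H.shape)
    return H * NO

def distort_structure(A, H, tau=0.2, drop_frac=0.10):
    """Edge removal plus PPR graph diffusion (tau=0.2 is an assumed value)."""
    # Pairwise cosine similarity between cells in the latent space.
    Hn = H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-12)
    S = Hn @ Hn.T
    # Mask the 10% of existing edges with the lowest similarity.
    edges = np.argwhere(A > 0)
    sims = S[edges[:, 0], edges[:, 1]]
    keep = sims >= np.quantile(sims, drop_frac)
    MA = np.zeros_like(A, dtype=float)
    MA[edges[keep, 0], edges[keep, 1]] = 1.0
    # Symmetric normalization with self-loops: A^m = D^{-1/2}((A*MA)+I)D^{-1/2}.
    A_masked = A * MA + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_masked.sum(axis=1)))
    A_m = d_inv_sqrt @ A_masked @ d_inv_sqrt
    # Personalized PageRank diffusion: A^d = tau (I - (1 - tau) A^m)^{-1}.
    A_d = tau * np.linalg.inv(np.eye(A.shape[0]) - (1 - tau) * A_m)
    return A_m, A_d
```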

Autoencoder fusion module

As shown in Fig. 2, the autoencoder fusion module obtains the attribute information of cells and the graph structure information among cells via the AE and the improved graph autoencoder (IGAE)41 and dynamically fuses them to obtain more suitable feature representations. The AE is a multilayer feedforward neural network with ReLU activation. The encoding and decoding of each layer are as follows:

$$\begin{aligned} {\widetilde{H}}_{AE}^{\nu _t}={AE}\left( {\widetilde{H}}^{\nu _t}\right) ,\quad {AE}\left( {\widetilde{H}}^{\nu _t}\right) ^{\left( \ell \right) }=\sigma _{ReLU}\left( \mathcal {P}_1^{\left( \ell \right) }{AE}\left( {\widetilde{H}}^{\nu _t}\right) ^{\left( \ell -1\right) }+\mathcal {B}_1^{\left( \ell \right) }\right) ,\quad {\hat{H}}_{AE}^{\left( j\right) }=\sigma _{ReLU}\left( \mathcal {P}_2^{\left( j\right) }{\hat{H}}_{AE}^{\left( j-1\right) }+\mathcal {B}_2^{\left( j\right) }\right) , \end{aligned}$$
(2)

where \(\nu _t\) denotes the \(\nu _t\)-th view, and \(\ell \) and j denote the \(\ell \)-th encoder layer and j-th decoder layer, respectively. \(\mathcal {P}_1\) and \(\mathcal {B}_1\) denote the encoding weights and biases, respectively, and \(\mathcal {P}_2\) and \(\mathcal {B}_2\) denote the decoding weights and biases, respectively. \(\sigma _{ReLU}\) is the ReLU activation function. \({AE}\left( {\widetilde{H}}^{\nu _t}\right) ^{\left( 0\right) }= {\widetilde{H}}^{\nu _t}\), \({\hat{H}}_{AE}^{\left( 0\right) }={\widetilde{H}}^*\), and \(\hat{H}_{AE}\) is the decoding of \({\widetilde{H}}^*\). To minimize the discrepancy between \({\hat{H}}_{AE}\) and H, the loss function of the AE is \(\mathfrak {T}_{AE}=\sum {\left\| {\hat{H}}_{AE}-H \right\| }_2^2\).
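A minimal PyTorch sketch of the AE branch is given below; the encode/decode split (with the decoder fed the fused representation \({\widetilde{H}}^*\)) follows the description above, while the specific layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    """Plain autoencoder over the ELM output (layer sizes are assumptions)."""
    def __init__(self, in_dim, hidden_dims=(512, 256), z_dim=20):
        super().__init__()
        dims = [in_dim, *hidden_dims, z_dim]
        enc = []
        for i in range(len(dims) - 1):
            enc += [nn.Linear(dims[i], dims[i + 1]), nn.ReLU()]
        self.encoder = nn.Sequential(*enc)
        rev = dims[::-1]
        dec = []
        for i in range(len(rev) - 1):
            dec.append(nn.Linear(rev[i], rev[i + 1]))
            if i < len(rev) - 2:
                dec.append(nn.ReLU())  # no activation on the reconstruction layer
        self.decoder = nn.Sequential(*dec)

    def encode(self, h_tilde):
        return self.encoder(h_tilde)   # attribute embedding of one view

    def decode(self, h_star):
        return self.decoder(h_star)    # reconstruction decoded from the fused H*

# AE reconstruction term: loss_ae = ((ae.decode(h_star) - H) ** 2).sum()
```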

The IGAE is a multilayer graph autoencoder with a nonlinear activation function \(\sigma \). The encoding and decoding of each layer are as follows:

$$\begin{aligned} \widetilde{H}^1_{IGAE}={IGAE}\left( {{\widetilde{H}}^1}\right) ,\ \widetilde{H}^2_{IGAE}={IGAE}\left( {{\widetilde{H}}^2}\right) ,\ {{\hat{H}}_{IGAE}}^{\left( {j }\right) }=\sigma \left( A^m{{\hat{H}}_{IGAE}}^{\left( {j}-1\right) }{\hat{\mathcal {W}}}^{\left( {j}\right) }\right) . \end{aligned}$$
(3)

\({IGAE}\left( {\widetilde{H}}^1\right) ^{\left( \ell \right) }=\sigma \left( A^m{IGAE}\left( {\widetilde{H}}^1\right) ^{\left( \ell -1\right) }\mathcal {W}^{\left( \ell \right) }\right) \) and \({IGAE}\left( {\widetilde{H}}^2\right) ^{\left( \ell \right) }=\sigma \left( A^d{IGAE}\left( {\widetilde{H}}^2\right) ^{\left( \ell -1\right) }\mathcal {W}^{\left( \ell \right) }\right) \), where \(\mathcal {W}^{\left( \ell \right) }\) and \({\hat{\mathcal {W}}}^{\left( j\right) }\) are the learnable parameters of the \(\ell \)-th encoder layer and j-th decoder layer, respectively, and \(\sigma \) is a nonlinear activation function. \({IGAE}\left( {\widetilde{H}}^1\right) ^{\left( 0\right) }={\widetilde{H}}^1\), \({IGAE}\left( {\widetilde{H}}^2\right) ^{\left( 0\right) }={\widetilde{H}}^2\), and \({\hat{H}}_{IGAE}^{\left( 0\right) }={\widetilde{H}}^*\). The IGAE employs a mixed loss function that minimizes the reconstruction errors of the weighted attribute matrix and the adjacency matrix, i.e. \(\mathfrak {T}_{IGAE}=\mathfrak {T}_m +\gamma \mathfrak {T}_n\), with \(\mathfrak {T}_m=\frac{1}{2N}{\left\| {A_{norm}}H-{\hat{H}}_{IGAE} \right\| }_F^2\), \(\mathfrak {T}_n=\frac{1}{2N}{\left\| {A_{norm}}-\hat{A} \right\| }_F^2\), and \(A_{norm}=D^{-\frac{1}{2}}{A}D^{-\frac{1}{2}}\in R^{N\times N}\). \(\hat{H}_{IGAE}\) is the decoding of \({\widetilde{H}}^*\), \(\hat{A}\) is the reconstructed adjacency matrix, and \(\gamma \) is a predefined hyperparameter.
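A single IGAE propagation layer of Eq. (3) can be sketched as follows; the LeakyReLU nonlinearity and the inner-product reconstruction of \(\hat{A}\) are assumptions, since the text only states that \(\sigma \) is nonlinear.

```python
import torch
import torch.nn as nn

class IGAELayer(nn.Module):
    """One IGAE layer: sigma(A_norm @ H @ W), as in Eq. (3)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_dim, out_dim))
        nn.init.xavier_uniform_(self.weight)
        self.act = nn.LeakyReLU(0.2)  # assumed choice of the nonlinearity sigma

    def forward(self, A_norm, H):
        return self.act(A_norm @ H @ self.weight)

# One possible reconstruction of the adjacency used in the T_n term (an assumption):
# A_hat = torch.sigmoid(Z @ Z.T) for a latent embedding Z.
```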

We adopt a dynamic fusion mechanism to integrate the attribute information \(H_{AE}\) of each cell and the graph structure information \(H_{IGAE}\) among cells, i.e. \(H_I=\tau H_{AE}+\left( 1-\tau \right) H_{IGAE}\), where \(H_{AE}=0.5\times \left( {\widetilde{H}}_{AE}^1+{\widetilde{H}}_{AE}^2\right) \) and \(H_{IGAE}=0.5\times \left( {\widetilde{H}}_{IGAE}^1+{\widetilde{H}}_{IGAE}^2\right) \). To fully consider the local and global relationships among cells, we first introduce the adjacency matrix into \(H_I\) to obtain the local structure-enhanced embedding \(H_L=A^mH_I\). Second, the normalized self-correlation matrix \(\mathcal {S}\), with \(\mathcal {S}_{ij}=\frac{e^{\left( H_LH_L^T\right) _{ij}}}{\sum _{k=1}^{N}e^{\left( H_LH_L^T\right) _{ik}}}\), is used to obtain the global correlation feature \(H_G=\mathcal {S}H_L\). Last, we combine the local structure-enhanced feature and the global correlation feature to extract the latent feature \({\widetilde{H}}^*=\beta {H_G}+H_L\), where \(\beta \) is a learnable parameter.
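The fusion steps can be written compactly as below; this is a sketch of the formulas above, with \(\tau \) and \(\beta \) passed in as parameters rather than fixed values.

```python
import torch
import torch.nn.functional as F

def dynamic_fusion(H_ae, H_igae, A_m, tau, beta):
    """Fuse attribute and structure embeddings into the latent feature H*."""
    H_I = tau * H_ae + (1.0 - tau) * H_igae   # dynamic combination
    H_L = A_m @ H_I                           # local structure enhancement
    S = F.softmax(H_L @ H_L.T, dim=1)         # normalized self-correlation matrix
    H_G = S @ H_L                             # global correlation feature
    return beta * H_G + H_L                   # latent feature H* = beta*H_G + H_L
```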

Dual information correlation reduction

We use a dual information correlation reduction (DICR) network to remove redundant information and improve the discriminative ability of the learned embedded features. Specifically, correlation is reduced at two levels: sample-level correlation reduction and feature-level correlation reduction.

First, we calculate the cross-view sample correlation matrix \(\mathcal {S}^\mathcal {C}\): \(\mathcal {S}_{ij}^\mathcal {C}=\frac{\left( H_i^{\nu _1}\right) \left( H_j^{\nu _2}\right) ^T}{\left\| H_i^{\nu _1} \right\| \left\| H_j^{\nu _2} \right\| },\forall i,j\in \left[ 1,N\right] \), where \(H^{\nu _1}\) and \(H^{\nu _2}\) are the node embeddings of the two views produced by the siamese network. The cross-view sample correlation loss is \(\mathfrak {T}_\mathcal {C}=\frac{1}{N^2}\sum \left( \mathcal {S}^\mathcal {C}-I\right) ^2\). Its purpose is to pull together the two views of the same sample and push apart the views of different samples.

Second, feature-level correlation reduction involves three steps. (1) The readout function \(\mathcal {R}\left( \cdot \right) \) maps the embeddings \(H^{\nu _1}\) and \(H^{\nu _2}\) from \(R^{N \times d}\) to \(R^{k \times d}\): \({\widetilde{H}}^{\nu _t}=\mathcal {R}\left( H^{\nu _t}\right) ,t=1,2.\) (2) The cosine similarity between the columns of \({\widetilde{H}}^{\nu _1}\) and \({\widetilde{H}}^{\nu _2}\) is calculated: \(\mathcal {S}_{ij}^\mathfrak {F}=\frac{\left( {\widetilde{H}}_{:i}^{\nu _1}\right) ^T\left( {\widetilde{H}}_{:j}^{\nu _2}\right) }{\left\| {\widetilde{H}}_{:i}^{\nu _1} \right\| \left\| {\widetilde{H}}_{:j}^{\nu _2} \right\| },\forall i,j\in \left[ 1,d\right] \), where \({\widetilde{H}}_{:i}^{\nu _1}\) denotes the i-th column of \({\widetilde{H}}^{\nu _1}\) and \({\widetilde{H}}_{:j}^{\nu _2}\) denotes the j-th column of \({\widetilde{H}}^{\nu _2}\). (3) The similarity matrix is normalized to pull together the two views of the same feature dimension and push apart different dimensions, i.e. \(\mathfrak {T}_\mathfrak {F}=\frac{1}{d^2}\sum \left( \mathcal {S}^\mathfrak {F}-\widetilde{I}\right) ^2.\) We then obtain the latent features \(H=\frac{1}{2}\left( H^{\nu _1}+H^{\nu _2}\right) \). Considering information reduction at both the sample and feature levels further removes redundant information.
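The two correlation-reduction losses can be sketched as follows; the identity readout used here is a simplification, since the text leaves \(\mathcal {R}\left( \cdot \right) \) unspecified.

```python
import torch
import torch.nn.functional as F

def dicr_losses(H1, H2):
    """Sample-level and feature-level correlation reduction for two views."""
    N, d = H1.shape
    # Sample level: cross-view cosine similarity between cells, pushed towards I.
    S_c = F.normalize(H1, dim=1) @ F.normalize(H2, dim=1).T          # (N, N)
    loss_c = ((S_c - torch.eye(N, device=H1.device)) ** 2).sum() / N ** 2
    # Feature level: column-wise cosine similarity after the readout
    # (here the readout R(.) is taken as the identity for simplicity).
    S_f = F.normalize(H1, dim=0).T @ F.normalize(H2, dim=0)          # (d, d)
    loss_f = ((S_f - torch.eye(d, device=H1.device)) ** 2).sum() / d ** 2
    # Fused latent feature passed to the clustering module.
    H = 0.5 * (H1 + H2)
    return loss_c, loss_f, H
```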

Clustering module

The DCRELM employs a triplet self-supervised strategy to enhance clustering performance, simultaneously using the target distribution to guide the AE and the IGAE.

We utilize the Student's t-distribution to compute the similarity between samples and clustering centres in the fusion embedding \({\widetilde{H}}^*\); this similarity measure captures the relationship between samples and clustering centres during clustering. The fusion embedding \({\widetilde{H}}^*\) integrates the AE and IGAE information and is used to generate a target distribution. The soft assignment is computed as \(q_{ij}=\frac{{\left( 1+{\left\| {{\widetilde{H}}^*}_i-\mu _j \right\| }^2/\upsilon \right) }^{-\frac{\upsilon +1}{2}}}{{\sum }_{j^\prime }{\left( 1+{\left\| {{\widetilde{H}}^*}_i-\mu _{j^\prime } \right\| }^2 / \upsilon \right) }^{-\frac{\upsilon +1}{2}}}\), where \(\upsilon \) is the degree of freedom of the Student's t-distribution and \(q_{ij}\) is the probability of assigning the i-th node to the j-th centre; this probability, referred to as a soft assignment, quantifies the likelihood of the i-th node belonging to the j-th centre. We normalize the frequency of each cluster based on \(q_{ij}\) to obtain the target distribution \(p_{ij}=\frac{{q^2}_{ij}/\sum _{i} q_{ij}}{\sum _{j^\prime }\left( {q^2}_{ij^\prime }/\sum _{i} q_{ij^\prime }\right) }\). The distribution \(q^\prime \) of \(H_{AE}\) and the distribution \(q^{\prime \prime }\) of \(H_{IGAE}\) are calculated in the same way as the distribution of \({\widetilde{H}}^*\). We adopt the KL divergence and define the triplet self-supervised clustering loss as:

$$\begin{aligned} \mathfrak {T}_{KL}=\sum _{i}\sum _{j}{p_{ij}log\frac{p_{ij}}{\left( q_{ij}+{q^\prime }_{ij}+{q^{\prime \prime }}_{ij}\right) /3}} \end{aligned}$$
(4)
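A sketch of the soft assignment, the target distribution, and the loss of Eq. (4) is given below; the numerical epsilon is an implementation detail added here for stability, not taken from the text.

```python
import torch

def soft_assign(Z, centers, v=1.0):
    """Student's t soft assignment q_ij between embeddings and cluster centres."""
    dist2 = torch.cdist(Z, centers) ** 2
    q = (1.0 + dist2 / v) ** (-(v + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened target distribution p_ij derived from q_ij."""
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

def triplet_kl_loss(p, q, q_ae, q_igae, eps=1e-8):
    """KL divergence between p and the average of the three soft assignments."""
    q_mix = (q + q_ae + q_igae) / 3.0
    return (p * torch.log(p / (q_mix + eps) + eps)).sum()
```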

Objective function

As shown in Eq. (5), the learning objective of the DCRELM comprises three main components: the reconstruction losses of the AE and IGAE, the DICR module, and the clustering module. These components collectively drive the learning process of the DCRELM. The DICR component includes the \(\mathfrak {T}_\mathcal {C}\), \(\mathfrak {T}_\mathfrak {F}\), and \(\mathfrak {T}_\mathcal {R}\) losses, where \(\mathfrak {T}_\mathcal {R}=JSD(H,\widetilde{A}H)\) is aimed at alleviating oversmoothing and \(JSD(\cdot )\) denotes the Jensen–Shannon divergence. \(\mathfrak {T}_{KL}\) is the clustering loss function, and \(\varepsilon \) and \(\lambda \) are hyperparameters.

$$\begin{aligned} \mathfrak {T}={\underbrace{{\mathfrak {T}_{AE}+\mathfrak {T}_{IGAE}}}_{Reconstruction}}+{\underbrace{{\mathfrak {T}_\mathcal {C}+\mathfrak {T}_\mathfrak {F}+\varepsilon \mathfrak {T}_\mathcal {R}}}_{DICR}}+{\underbrace{{{\lambda \mathfrak {T}}_{KL}}}_{Clustering}} \end{aligned}$$
(5)

Time complexity analysis

The DCRELM consists of five parts: the ELM module, the ELM graph distortion module, the dual information correlation reduction module, the autoencoder module, and the autoencoder fusion module. These parts have time complexities of \(O(N*M*\widetilde{M})\), \(O(N^2)\), \(O(N^2*d)\), \(O(N*M*d)\), and \(O(N^2*d)\), respectively, where N is the number of cells, M is the number of genes, \(\widetilde{M}\) is the number of random mapping nodes, and d is the embedding size. Therefore, the total time complexity of the DCRELM is \(O(N*M*\widetilde{M})+O(N^2*d)+O(N*M*d)\), where \(\widetilde{M}\) and d are much smaller than M. Overall, the DCRELM significantly reduces the dimensionality of the gene representation and can better handle larger-scale scRNA-seq datasets.

Implementation and parameter settings

All experiments are implemented with PyTorch in a Python 3.8 environment. The number of randomly mapped nodes is selected from \(\{200, 500, 1000, 1500, 2000\}\). The number of nodes in the first three layers of the AE encoder is selected from \(\left\{ 256, 512, 1024, 2048\right\} \), and the number of nodes in the last layer equals the number of randomly mapped nodes. The number of nodes in the first two layers of the IGAE encoder is selected from \(\left\{ 256, 512, 1024, 2048\right\} \), and the number of nodes in the last layer equals the number of randomly mapped nodes. The DCRELM is trained with Adam for 2000 epochs at a learning rate of 0.0001. All experiments are conducted on an NVIDIA A40 GPU (48 GB).
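The reported training configuration can be reproduced with a few lines of PyTorch; the model and loss below are placeholders standing in for the DCRELM network and the objective of Eq. (5).

```python
import torch
import torch.nn as nn

model = nn.Linear(500, 20)         # placeholder for the DCRELM network
H = torch.randn(128, 500)          # placeholder for the ELM output matrix

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for epoch in range(2000):          # 2000 epochs, as reported
    optimizer.zero_grad()
    loss = model(H).pow(2).mean()  # placeholder for the Eq. (5) objective
    loss.backward()
    optimizer.step()
```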

Evaluation metrics

We use three evaluation metrics, namely, the normalized mutual information (NMI), the adjusted Rand index (ARI), and the \(F_1\) score, to measure the clustering performance of the methods. The NMI measures the similarity of the clustering results by combining the concepts of information entropy and mutual information. The ARI quantifies the agreement between the predicted clusters and the true clusters. The \(F_1\) score measures the classification performance of the algorithms.
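These metrics can be computed as sketched below; because the exact \(F_1\) variant is not specified, the sketch matches predicted clusters to true labels with the Hungarian algorithm and reports a macro-averaged \(F_1\), which is one common choice.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, f1_score, normalized_mutual_info_score

def clustering_scores(y_true, y_pred):
    """NMI, ARI, and a cluster-matched macro F1 (assumed F1 convention)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    nmi = normalized_mutual_info_score(y_true, y_pred)
    ari = adjusted_rand_score(y_true, y_pred)
    # Map predicted cluster labels to true labels via the Hungarian algorithm.
    labels_true, labels_pred = np.unique(y_true), np.unique(y_pred)
    cost = np.zeros((labels_pred.size, labels_true.size))
    for i, lp in enumerate(labels_pred):
        for j, lt in enumerate(labels_true):
            cost[i, j] = -np.sum((y_pred == lp) & (y_true == lt))
    rows, cols = linear_sum_assignment(cost)
    mapping = {labels_pred[r]: labels_true[c] for r, c in zip(rows, cols)}
    y_mapped = np.array([mapping.get(p, labels_true[0]) for p in y_pred])
    f1 = f1_score(y_true, y_mapped, average="macro")
    return nmi, ari, f1
```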

Results and discussion

Comparison of algorithm clustering performance

In this section, we conduct clustering experiments on 12 real scRNA-seq datasets and compare the DCRELM with six state-of-the-art single-cell clustering methods, namely, scDeepCluster20, GraphSCC34, scGNN32, DREAM31, scDCCA29, and scDFC35. Furthermore, we employ three evaluation metrics, namely, the NMI, ARI, and \(F_1\), to assess the performance of each method.

Tables 3, 4 and 5 show the experimental results of the seven methods on the 12 real scRNA-seq datasets, with the best results highlighted in bold. As shown in Tables 3, 4 and 5, the DCRELM achieves the best clustering performance on most datasets. With the exception of three datasets, the DCRELM has the highest NMI and ranks second in terms of the ARI among all the algorithms. Although the DCRELM is not the best on the Kolo, WB, or CNIK datasets, it still ranks in the top three. In terms of \(F_1\), the DCRELM significantly outperforms all the other algorithms. On the Klein and Muraro datasets, the DCRELM exhibits significant improvements in the NMI and ARI compared with scGNN. Overall, the DCRELM outperforms the other methods.

Table 3 NMI of the DCRELM and six comparison methods on 12 datasets.
Table 4 ARI of the DCRELM and six comparison methods on 12 datasets.
Table 5 \(F_1\) of the DCRELM and six comparison methods on 12 datasets.

To visualize the clustering results of the seven clustering methods, we choose a smaller-scale real dataset, Lawlor, and a larger-scale real dataset, Klein, and use t-SNE42 to project the clustering results of each method into two-dimensional space. As shown in Fig. 3, the cell subtypes predicted by the DCRELM exhibit distinct boundaries, enabling different cell subtypes to be distinguished with only a few stray or mixed samples. In contrast, many of the cell clusters identified by the other six methods are not clearly separated and contain a greater mixture of different cell subtypes. This analysis indicates that the DCRELM can effectively reduce the distance among cells within clusters of the same class.

Figure 3

Visualization of the prediction results of seven clustering methods on the Klein and Lawlor datasets.

Model stability

To further verify the stability and robustness of the DCRELM, we conduct a dropout analysis on the Klein dataset by randomly selecting 20%, 30%, 40%, and 50% of the nonzero values in X and setting them to zero. We use two evaluation metrics, the NMI and ARI, to measure the clustering performance of the DCRELM and the six comparison methods. The experimental results in Fig. 4 show that all the algorithms are affected by noise. scDeepCluster and scDCCA are affected the most, resulting in significant degradation of their clustering performance. GraphSCC is relatively stable overall, but its performance is not optimal. The DCRELM is less affected by noise, demonstrating strong stability and robustness.

Figure 4

Change in the NMI and ARI for each method on the Klein dataset with 20%, 30%, 40%, and 50% dropout rates.

Parameter analysis

To obtain low-dimensional and dense cell gene expression features, we use the parameter \(\widetilde{M}\) to control the number of hidden layer nodes. The selection range of \(\widetilde{M}\) is \(\{100, 200, 500, 1000, 1500, 2000\}\). Figure 5 shows the effect of \(\widetilde{M}\) on the NMI, ARI, and \(F_1\) of the DCRELM on four datasets: Human, Yeo, Klein, and Muraro. The clustering performance of the DCRELM varies with \(\widetilde{M}\) across the four datasets; for example, the DCRELM is not very sensitive to \(\widetilde{M}\) on the Muraro dataset, whereas it is sensitive to \(\widetilde{M}\) on the Human dataset. Therefore, selecting an appropriate value of \(\widetilde{M}\) plays an important role in the clustering performance of the DCRELM.

Figure 5

Impact of the latent feature dimension \(\widetilde{M}\) on the clustering performance of the DCRELM.

To obtain effective attribute and graph structure information of cells, we use the embedding dimension to control the number of nodes in the network layers of the AE and IGAE. The selection range for the embedding dimension is \(\{128, 256, 512, 1024, 2048\}\). Figure 6 shows the impact of the embedding dimension on the clustering results of the DCRELM on the four datasets. Based on Fig. 6, for datasets with fewer than 1000 samples, the optimal embedding dimension for the AE and IGAE network layers is 256, whereas for datasets with more than 1000 samples, the optimal embedding dimension is 2048. Therefore, the appropriate embedding dimension is related to the number of samples in the dataset.

Figure 6

Impact of the embedding size on the clustering performance of the DCRELM across four datasets.

Ablation experiments

We conduct ablation experiments on four datasets: Human, Yeo, Klein, and Muraro. The dual information correlation reduction, dynamic autoencoder fusion, and graph distortion modules play important roles in improving the clustering performance of the DCRELM. To analyse the impact of each module, four variants of the DCRELM are constructed. DCRELM-CR is the variant in which the dual information correlation reduction module is removed. DCRELM-DF is the variant in which the IGAE is removed while the AE is retained. DCRELM-N is the variant in which feature destruction is removed from the graph distortion module. DCRELM-E is the variant in which edge perturbation is removed from the graph distortion module.

Figure 7

Comparative analysis of the clustering performance between the DCRELM and its four variants.

As shown in Fig. 7, because the dual information correlation reduction module is removed, DCRELM-CR cannot effectively remove low-quality cells or genes with low expression; therefore, its NMI, ARI, and \(F_1\) are lower than those of the DCRELM. Because the dynamic autoencoder fusion is removed, DCRELM-DF cannot effectively exploit the graph structure information among cells, so its NMI, ARI, and \(F_1\) are also lower than those of the DCRELM. Because feature destruction and edge perturbation are removed from the graph distortion module, DCRELM-N and DCRELM-E exhibit lower robustness than the DCRELM.

Conclusion

In this paper, we propose a new deep clustering method, the DCRELM, for scRNA-seq data. This method obtains low-dimensional and dense gene representations through an ELM random mapping space and then uses a graph distortion module to improve the robustness and generalizability of the model. The dynamic fusion of the dense cell-gene representations with cell attribute information and graph structure information helps establish connections among cells and among genes. We employ dual information correlation reduction to filter out redundant information and noise at both the cell level and the gene level. Additionally, we utilize a triplet self-supervised learning mechanism to further enhance the clustering performance. Extensive experiments demonstrate that the DCRELM outperforms the comparison methods. In the future, we will consider multimodal data clustering, integrating data from different levels to more comprehensively describe the heterogeneity of single cells.