

Haematology dimension reduction, a large scale application to regular care haematology data

Abstract

Background

The routine diagnostic process increasingly entails the processing of high-volume and high-dimensional data that cannot be directly visualised. This processing may pose scaling issues that limit the implementation of these types of data into research as well as integrated diagnostics in routine care. Here, we investigate whether we can use existing dimension reduction techniques to provide visualisations and analyses for a complete blood count (CBC) while maintaining representativeness of the original data. We considered over 3 million CBC measurements encompassing over 70 parameters of cell frequency, size and complexity from the UMC Utrecht UPOD database. We evaluated PCA as an example of a linear dimension reduction technique, and UMAP, TriMap and PaCMAP as non-linear dimension reduction techniques. We assessed their technical performance using quality metrics for dimension reduction, and their biological representation by evaluating the preservation of diurnal, age and sex patterns, cluster preservation, and the identification of leukemia patients.

Results

We found that, for clinical hematology data, PCA performs systematically better than UMAP, TriMap and PaCMAP in representing the underlying data. Biological relevance was retained for periodicity in the data. However, we also observed a decrease in the predictive performance on the reduced data for both age and sex, as well as an overestimation of the number of clusters within the reduced data. Finally, we were able to identify the diverging patterns of leukemia patients after use of dimension reduction methods.

Conclusions

We conclude that for hematology data, the use of unsupervised dimension reduction techniques should be limited to data visualization applications, as implementing them in diagnostic pipelines may lead to decreased quality of integrated diagnostics in routine care.


Introduction

The diagnostic process in a routine healthcare setting increasingly produces data in high volume, dimensionality and in multiple modalities, both structured and unstructured. Examples of these diagnostic data are ‘omics’ data such as transcriptomics, proteomics and metabolomics as well as imaging data, yet routine haemocytometer data of a complete blood count (CBC) can also be considered high-dimensional data. Visualisation of the data in a comprehensive way can be a challenge due to the high dimensionality. More importantly, to help healthcare professionals interpret these data for the benefit of individual patients, integration of the different types of data into integrated diagnostics models is warranted. One of the modelling challenges in the development and deployment of these models is the combination of vast data volumes and their high dimensionality, which may lead to computational performance issues. There is thus a need to ensure feasibility of integrated diagnostics models. One of the ways to achieve this is by using a low-dimensional representation of these data rather than the full dataset. Such a representation can be generated using dimension reduction techniques.

Dimension reduction has historically been performed using principal component analysis (PCA). This linear transformation technique assumes normally distributed variables and primarily establishes a dimension reduction that preserves the global structure. Global structure preservation aims to preserve the global patterns in the data, such as obvious clusters, whereas local structure preservation aims to preserve more intrinsic patterns, i.e. the neighbourhood of each point. Several more recent dimension reduction approaches aim to also preserve local structure. One way to do this is through (non-linear) manifold approximation, which is based on learning the underlying structure of the data, mostly from nearest neighbours. Examples of this type of method are Uniform Manifold Approximation and Projection (UMAP) [1], Pairwise Controlled Manifold Approximation (PaCMAP) [2], and TriMap, a triplet-based approach [3].

Applying these methods to high-dimensional biological data has been performed before, including flow cytometry workflows, transcriptomics data, RNA sequencing data, and protein structure analysis among others [4,5,6,7,8,9]. However, to the best of our knowledge, comparative work on robustness of dimension reduction on large haematological data has not been performed before.

A complete blood count (CBC), assessing red and white blood cells and platelets, is one of the most frequently performed diagnostic procedures. Haemocytometers, based on flow cytometry, use proprietary algorithms to combine cell characteristics such as size, granularity, lobularity and viability into clinically relevant parameters like hemoglobin levels or white blood cell differentiation patterns. However, next to these parameters currently reported to the clinic, each routine haematology measurement also encompasses research-only values and raw cell characteristics of red and white blood cells and platelets that are currently not used in clinical care. In the University Medical Center Utrecht (UMCU), Utrecht, the Netherlands, the raw hematology data of over 3 million samples measured on Abbott CELL-DYN Sapphire hematology analyzers have been stored in the Utrecht Patient Oriented Database (UPOD) since 2005. The full content and extent of the database is described elsewhere [10]. Previous UPOD research shows there is biologically and clinically relevant information hidden in the unreported hematology measurements of these samples [11,12,13,14,15,16]. Using dimension reduction methods to enable processing of raw CBC data and visualising or combining it into integrated diagnostics models may therefore eventually improve clinical practice.

Considering this vast amount of haematological data, and its high number of dimensions, we set out to find a robust approach to reducing the dimensions of the data, so that it can be better processed but also better visualised. By investigating the performance of the dimension reduction methods, we aim to ensure their usability on routine haematological data to improve clinical care, for example in diagnostic pipelines. As a dimension reduction should be a good representation of the original data, we not only compared the preservation of global and local data structure by several current dimension reduction techniques (PCA, UMAP, TriMap, PaCMAP, and Gaussian Random Projection as negative control), but also assessed their ability to preserve the clinical, diagnostic and biological relevance of the data.

Methods

Descriptives

We extracted all available CBC measurements from the Abbott CELL-DYN Sapphire from 2005 to 2020 from the UPOD. We filtered out samples for which a negative age was reported. We then applied rigorous quality control based on metadata retrieved from the CELL-DYN Sapphire machines and on in-house knowledge gained from clinical chemists and data managers. Examples of such quality control included the handling of erroneous measurements, or measurements that were otherwise suspicious. As some of the CBC measurements are only available if the sample was measured in reticulocyte mode, we imputed these missing variables using the miceforest package in Python, which implements the Multiple Imputation by Chained Equations (MICE) [17] approach with gradient boosting [18]. In our data, samples were measured in reticulocyte mode by default from 2013 onwards, providing the opportunity to impute missing data before 2013, since these data could be considered Missing At Random (MAR).
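A minimal sketch of this imputation step is shown below, assuming a recent miceforest version (constructor arguments differ between releases) and a pandas DataFrame loaded from a hypothetical file; the column contents and file name are illustrative, not the study's actual schema:

```python
import miceforest as mf
import pandas as pd

# cbc: DataFrame of CBC parameters; reticulocyte-mode columns are NaN for
# samples measured before 2013 (assumed MAR). File name is hypothetical.
cbc = pd.read_parquet("cbc_samples.parquet")

# miceforest implements MICE with gradient-boosted trees (LightGBM).
kernel = mf.ImputationKernel(cbc, num_datasets=1, random_state=42)
kernel.mice(3)  # run 3 MICE iterations over all variables with missing values
cbc_imputed = kernel.complete_data(dataset=0)
```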

Considering the possibility that extreme outliers would distort the overall quality of any dimension reduction model, we transformed white blood cell count parameters to log scale. Additionally, we clipped the bounds of each parameter to limit the effect of outliers while preserving the clinical relevance of the samples. A list of the analysed variables that required clipping thresholds can be found in Table S1. In addition, we applied z-score scaling to all variables, so that each has mean 0 and standard deviation 1.
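The preprocessing can be summarised in a short sketch; the column names and clipping bounds here are hypothetical stand-ins for the values listed in Table S1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

wbc_cols = ["c_b_wbc", "c_b_neu", "c_b_lym"]   # hypothetical WBC count columns
clip_bounds = {"c_b_wbc": (0.1, 100.0)}        # hypothetical bounds (see Table S1)

# Log-transform white blood cell counts to reduce the influence of extreme outliers.
cbc_imputed[wbc_cols] = np.log1p(cbc_imputed[wbc_cols])

# Clip each parameter to its bounds to limit outlier effects.
for col, (lo, hi) in clip_bounds.items():
    cbc_imputed[col] = cbc_imputed[col].clip(lo, hi)

# z-score scaling: mean 0, standard deviation 1 for every variable.
X = StandardScaler().fit_transform(cbc_imputed)
```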

Dimension reduction

Dimension reduction methods

Historically, one of the most frequently used dimension reduction models is PCA, which captures the data in linear combinations of the original variables using vector decomposition. It creates orthogonal (perpendicular) components, meaning that components are not correlated with each other; using this principle, PCA reduces the original data to a lower-dimensional space while explaining as much of the variance in the original data as possible. This method is very useful when working with collinear features, as these features will be captured in the same components, since they explain the same variance in the original data. To assess the performance of a PCA, the cumulative explained variance is often used, and this naturally increases as the number of components is increased. PCA assumes linear relationships between variables, and assumes normally distributed variables.
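For instance, with scikit-learn one can inspect how the cumulative explained variance grows with the number of components; the random matrix below is only a stand-in for the scaled CBC data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 70))  # stand-in for the scaled CBC matrix

pca = PCA(n_components=30).fit(X_demo)
cumulative = np.cumsum(pca.explained_variance_ratio_)  # grows with components
```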

Yet, as the original data may contain non-linear relationships, we also used manifold dimension reduction techniques, which are based on the theory that any space can be reduced to lower dimensions based on the shape of the data. To achieve this, each data point should be placed in a similar neighbourhood compared to the original space. This ensures that the local structure of the data is better preserved, i.e., that data that are similar in the original space are also similar in the reduced space. Examples of non-linear dimension reduction techniques include Uniform Manifold Approximation and Projection (UMAP), Triplets Manifold Approximation (TriMap), and Pairwise Controlled Manifold Approximation (PaCMAP). In addition to PCA, these methods were used in the current study to capture the large and complex CELL-DYN Sapphire dataset in a lower dimension. Finally, we used Gaussian Random Projection (GRP) as a negative control. We provide a brief overview of these techniques in this section.

Although UMAP, PaCMAP and TriMap are initialised with PCA by default, their individual components have no specific meaning, unlike those of PCA. For PCA, the additional explained variance diminishes as a higher number of components is used.

UMAP

UMAP estimates the shape of the data in the higher dimensionality using a weighted graph and then projects the graph onto the lower dimension for dimensionality reduction [1] (see Fig. 1). UMAP constructs a high-dimensional graph by extending branches from individual points with a radius r to connect the points to their neighbourhood in high dimension. These branches then become a graph of various shapes to be projected onto the lower dimension, irrespective of the distance between points. The number of nearest neighbours k within r can be set, where a low k preserves the local structure, and a higher k preserves the global structure of the original data. Finally, the high-dimensional graph is projected onto a lower dimension using a force-directed graph approach, pulling together points that are close and pushing apart points that are further away. This is done based on the weighted connectivity, meaning that points are drawn towards groups of points with which they have multiple connections, rather than points or clusters with singular connections. Clusters are formed based on some threshold, which also depends on the number of nearest neighbours. Increasing the number of nearest neighbours will result in larger groups of interconnected points, at the cost of increased computational complexity.

Fig. 1 Explanation of UMAP: A graph is constructed based on nearest neighbours in a radius (r(i)). The original data (D) is then projected onto a lower dimension (D’) by drawing closer points together, and putting disconnected points further apart
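In code, with the umap-learn package listed under Software and hardware, the reduction is a few lines; the parameter values reflect the settings eventually adopted in this study (50 neighbours, 6 components, Manhattan distance), applied to a scaled matrix `X` such as the one from the preprocessing sketch above:

```python
import umap

reducer = umap.UMAP(
    n_neighbors=50,      # larger k favours global structure, smaller k local
    n_components=6,      # dimensionality of the reduced space
    metric="manhattan",  # distance metric chosen in this study
    random_state=42,
)
X_umap = reducer.fit_transform(X)  # shape: (n_samples, 6)
```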

TriMap

TriMap is another manifold approach, primarily built around triplet constraints [3]. TriMap constructs triplets per point (i) and pairs it to \(n\_inliers\) (j) according to the distance metric used. For each of these pairings, \(n\_outliers\) are sampled (k), resulting in \(n\_inliers \times n\_outliers\) triplets per point (i, j, k). Additionally, \(n\_random\) triplets are constructed. TriMap then creates a low-dimensional representation of the data in which the ordering of the distances within these triplets is preserved (\(d(i,j) \le d(i,k)\)), by weighting the triplets according to the relative distance of j and k to i (Fig. 2).

Fig. 2 Explanation of TriMap: TriMap samples an inlier (j) and an outlier (k), and tries to project the distances from the original space (D) onto a lower dimension (D’) so that structure is preserved
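A corresponding sketch with the trimap package, using the 50 inliers and 15 outliers selected under Parameter tuning (parameter names as exposed by the package):

```python
import trimap

reducer = trimap.TRIMAP(
    n_dims=6,         # number of output dimensions
    n_inliers=50,     # nearest points j paired with each point i
    n_outliers=15,    # far points k sampled per (i, j) pair
    distance="manhattan",
)
X_trimap = reducer.fit_transform(X)
```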

PaCMAP

Similarly to TriMap, PaCMAP samples both neighbours and non-neighbours (Near Pairs and Further Pairs, respectively) in order to establish a low-dimensional representation of the original data. Contrary to TriMap, it also focuses on Mid-Near Pairs [2]. Near Pairs are the nearest neighbours based on a scaled distance metric. Mid-Near Pairs are established by sampling 6 points per observation and then selecting the second-closest point based on distance. The number of Mid-Near Pairs is set by the \(MN\_ratio\). Finally, Further Pairs are non-neighbours, and their number is set using the \(FP\_ratio\). After initialising with PCA, PaCMAP uses a weighted loss function to optimise the low-dimensional representation. The loss function is initially driven by the Near Pairs and Mid-Near Pairs, but gradually becomes mostly influenced by the Near Pairs. This means that the loss is highly increased if points that are close in the original space are placed further away in the reduced space.
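A sketch with the pacmap package, assuming the same settings as above (the \(MN\_ratio\) and \(FP\_ratio\) values shown are the package defaults, not values reported in this study):

```python
import pacmap

reducer = pacmap.PaCMAP(
    n_components=6,
    n_neighbors=50,  # Near Pairs per point
    MN_ratio=0.5,    # Mid-Near Pairs as a fraction of n_neighbors (default)
    FP_ratio=2.0,    # Further Pairs as a multiple of n_neighbors (default)
    distance="manhattan",
)
X_pacmap = reducer.fit_transform(X, init="pca")  # PCA initialisation, as described
```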

Gaussian random projection

Gaussian Random Projection (GRP) is a dimension reduction technique based on the Johnson-Lindenstrauss lemma, which states that any high-dimensional Euclidean space can be reduced onto a lower-dimensional Euclidean space with minimal distortion (at most \(1+\epsilon\)) of the pairwise distances [19], and on a result by Hecht-Nielsen [20], who showed that a random selection of vectors in a high-dimensional space can be considered an orthogonal projection. Gaussian Random Projection does this by projecting the original data on a randomly generated matrix with Gaussian-distributed entries. However, the accuracy of the projection and the number of components required for the dimension reduction are highly dependent on the number of samples and the permitted error (\(\epsilon\)), specifically \(n\_components \ge 4 \ln (n\_samples) / (\epsilon ^2/2 - \epsilon ^3/3)\) [21]. This means that GRP can require more components than available dimensions when the number of dimensions is sufficiently low and the number of observations is high. To that end, we included GRP as a negative control for the dimension reduction quality metrics, because we would expect this method to perform worst when reducing the data to a low number of dimensions (\(\le\)10) because of this constraint, since our data consist of over 3 million samples.
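This constraint can be checked directly with scikit-learn's implementation of the Johnson-Lindenstrauss bound; the eps value below is an illustrative choice:

```python
from sklearn.random_projection import (
    GaussianRandomProjection,
    johnson_lindenstrauss_min_dim,
)

# Minimum number of components needed to bound pairwise distortion by eps,
# per the Johnson-Lindenstrauss lemma, for ~3 million samples.
n_min = johnson_lindenstrauss_min_dim(n_samples=3_093_792, eps=0.5)
print(n_min)  # in the hundreds: far more than the ~70 original dimensions

# Used here only as a negative control: force a low output dimension anyway.
X_grp = GaussianRandomProjection(n_components=6, random_state=42).fit_transform(X)
```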

Parameter tuning

We tuned the number of neighbours used for UMAP, TriMap and PaCMAP (\(n\_neighbours\)). For UMAP and PaCMAP we were interested in the number of neighbours, but for TriMap we were interested in the number of outliers and inliers, since these govern the construction of triplets in TriMap. Neither PCA nor GRP requires tuning of nearest neighbours, since they are not neighbour-based. Additionally, we investigated the number of dimensions (\(n\_components\)) generated by all the dimension reduction methods, as this might increase the amount of information retained in the dimension reduction. For example, in PCA, the total variation explained increases as the number of components is increased. As computing numerous distinct dimension reductions and their performance is computationally expensive using a nearest-neighbours approach, we also investigated the number of samples we could use for dimension reduction purposes.

Distance metrics

One important step in the assessment of dimension reduction techniques is the choice of the distance metric with which we assess the distances between data points and with which we perform the dimension reduction for the manifold approaches. As mentioned above, the number of dimensions of the reduced data under the Euclidean distance depends on the number of samples and the permitted distortion (\(\epsilon\)). For a dataset with roughly three million samples and roughly one hundred dimensions, this means that we are not able to project the data to a lower-dimensional Euclidean space while keeping the distortion within \(0<\epsilon <1\). This practically excludes the Euclidean distance metric from the perspective of distance preservation, and a fractional distance metric is best suited for the description of distances in high dimensionality (\(d>30\)) [22]. We decided to pursue the Manhattan distance as the simplest expression of the fractional distance, defined as \(d(a,b) = \sum _{i=1}^{n}|a_i-b_i|\).
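As a concrete illustration (using SciPy, where the Manhattan distance is named "cityblock"), pairwise Manhattan distances for a small random matrix can be computed as:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 70))  # 5 hypothetical samples with 70 parameters

d_manhattan = cdist(a, a, metric="cityblock")  # sum of absolute differences
d_euclidean = cdist(a, a, metric="euclidean")  # for comparison
```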

Dimension reduction quality metrics

Dimension reduction quality metrics generally evaluate two aspects: global and local structure [2]. Local structure metrics evaluate neighbourhoods of points and how well these are preserved in the reduced data, while global structure metrics evaluate how well the reduced data preserve the relationships between groups of points. In this study, both global and local metrics were used to find a balanced representation of the CELL-DYN Sapphire data in lower dimension. The metrics are generally rank-based, since ranks are insensitive to scaling. One unifying framework for rank-based metrics is the co-ranking matrix (Q-matrix) [23]. The Q-matrix compares the pairwise ranks in the original data versus the reduced data, showing the preservation of local and global distances. Calculating a Q-matrix consists of two steps. First, a ranking of distances between points is calculated in both the original and reduced data. Thereafter, a single matrix is constructed combining both rankings, expressing rank preservation in the low-dimensional data.

For local preservation measures, we further used the proportion of neighbouring points being preserved (the neighbourhood-kept ratio) and the trustworthiness score. The neighbourhood-kept ratio [6] is computed using the \(k\)-nearest neighbours \(\textbf{N}(i)\) in high-dimensional space and the \(k\)-nearest neighbours \(\textbf{N}'(i)\) in low-dimensional space, where \(i\) is each data point. \(\textbf{N}(i)\) and \(\textbf{N}'(i)\) are compared to determine the intersection of the two neighbourhoods. The degree of overlap is divided by \(k\) to obtain a ratio for each \(i\), and these ratios are then averaged over all samples to obtain the average neighbourhood preservation. The trustworthiness score ranks neighbourhood points in accordance with how close they are to the observation \(i\) in the low- and high-dimensional spaces [24]. If the ranks of neighbourhood points are misaligned in the reduced space, the metric penalises these shifts, resulting in a lower score. A version of the trustworthiness score [24] was used in this study with help of the Q-matrix framework [25].
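A minimal sketch of both local metrics, assuming the matrices `X` and `X_umap` from the earlier sketches (the trustworthiness implementation shown is scikit-learn's, not the Q-matrix-based version the study used):

```python
import numpy as np
from sklearn.manifold import trustworthiness
from sklearn.neighbors import NearestNeighbors

def neighbourhood_kept_ratio(X_high, X_low, k=50, metric="manhattan"):
    """Average fraction of each point's k nearest neighbours that is preserved."""
    nn_high = NearestNeighbors(n_neighbors=k, metric=metric).fit(X_high)
    nn_low = NearestNeighbors(n_neighbors=k, metric=metric).fit(X_low)
    # Querying without X returns each training point's neighbours, self excluded.
    idx_high = nn_high.kneighbors(return_distance=False)
    idx_low = nn_low.kneighbors(return_distance=False)
    # Overlap between both neighbourhoods, divided by k, averaged over samples.
    overlaps = [len(set(h) & set(l)) / k for h, l in zip(idx_high, idx_low)]
    return float(np.mean(overlaps))

nkr = neighbourhood_kept_ratio(X, X_umap, k=50)
tw = trustworthiness(X, X_umap, n_neighbors=50, metric="manhattan")
```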

For global preservation measures, we used the random triplet score and the Spearman rank correlation. The random triplet score is calculated by retrieving sets of two points \((j,k)\) at random per \(i\) in the original data to form triplets \((i,j,k)\) [2]. It then takes the same triplets in the reduced space and calculates the distance from \(i\) to \(j\) (\(d_{ij}\)) and to \(k\) (\(d_{ik}\)) for both the original and the reduced data, and orders \(d_{ij}\) and \(d_{ik}\) by distance in both datasets. The degree of order preservation indicates global structure preservation by the dimension reduction method. Five triplets per \(i\) were used in this study. Finally, pairwise distances can be compared using the Spearman rank correlation to assess distance preservation in the reduced data. A further strength of this method is that the distance correlation is easily visualised in a graph (e.g., Figures S1 and S2) to assess the correlation of distances between the low- and high-dimensional spaces. To compare the different dimension reduction methods with regard to their quality metrics, we performed the quality assessments in 10-fold, and used a T-test for comparison.
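Both global metrics can be sketched as follows, again assuming `X` and `X_umap`; Manhattan distances are used throughout, and the Spearman correlation is computed on a subsample because pairwise distances scale quadratically with sample size:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def random_triplet_score(X_high, X_low, n_triplets=5, rng=None):
    """Fraction of random triplets whose distance ordering is preserved."""
    if rng is None:
        rng = np.random.default_rng(42)
    n = X_high.shape[0]
    kept, total = 0, 0
    for i in range(n):
        for _ in range(n_triplets):
            # For simplicity, the rare case where j or k equals i is ignored.
            j, k = rng.choice(n, size=2, replace=False)
            order_high = np.abs(X_high[i] - X_high[j]).sum() < np.abs(X_high[i] - X_high[k]).sum()
            order_low = np.abs(X_low[i] - X_low[j]).sum() < np.abs(X_low[i] - X_low[k]).sum()
            kept += order_high == order_low
            total += 1
    return kept / total

score = random_triplet_score(X, X_umap, n_triplets=5)

# Spearman rank correlation between pairwise Manhattan distances on a subsample.
sub = np.random.default_rng(0).choice(len(X), size=2000, replace=False)
rho, _ = spearmanr(pdist(X[sub], "cityblock"), pdist(X_umap[sub], "cityblock"))
```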

Preservation of biological representation

Cluster preservation

Because the biological relevance and meaning of the data should be maintained in the dimension reduction, we assessed the preservation of biological relevance from four different angles. As a first angle, we studied the preservation of clusters of similar patients in the reduced data. We analysed both the raw and the reduced data using HDBSCAN [26] and k-means clustering [27] and retrieved information on the preservation of clusters after dimension reduction. For this analysis, we were interested in the number of clusters extracted and the Normalised Mutual Information (NMI) and Adjusted Rand Index (ARI) scores (higher is better). The NMI and ARI scores report the extent of cluster preservation in the reduced data, taking the clusters in the original data as ground truth. K-means clustering retrieves a predefined number of clusters (k) based on the Euclidean distance towards a cluster centre, and tries to minimise the sum of squared distances over these k clusters. In practice, this can result in clusters that are of equal size and density, but are unintuitive to interpret. HDBSCAN assigns clusters based on the density of the data, and is therefore more suitable for retrieving clusters with varying densities. This increases the possibility of retrieving meaningful clusters. For our analysis, we used a pipeline with a z-score scaler, a dimension reduction method, and a clustering model (HDBSCAN). As a default we used the Manhattan distance, with 50 neighbours and a random selection of 100,000 samples from the haematology set for dimension reduction in this analysis.
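A minimal sketch of this pipeline, assuming the hdbscan package and the `cbc_imputed` frame from the earlier sketches (PCA is shown as the dimension reduction step; the other reducers can be swapped in):

```python
import hdbscan
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
sample = rng.choice(len(cbc_imputed), size=100_000, replace=False)
X_sample = StandardScaler().fit_transform(cbc_imputed.iloc[sample])

# Clusters on the original (scaled) data serve as the ground truth.
labels_orig = hdbscan.HDBSCAN(metric="manhattan").fit_predict(X_sample)

# Clusters on the reduced data (PCA shown; swap in UMAP/TriMap/PaCMAP).
X_red = PCA(n_components=6).fit_transform(X_sample)
labels_red = hdbscan.HDBSCAN(metric="manhattan").fit_predict(X_red)

print("NMI:", normalized_mutual_info_score(labels_orig, labels_red))
print("ARI:", adjusted_rand_score(labels_orig, labels_red))
```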

Diurnal patterns

The second, descriptive angle was studying diurnal patterns in the reduced dataset, as the size of the original dataset allowed us to investigate large-scale patterns within the data. One of the broad, well-established features of at least part of the hematology parameters is the presence of a diurnal pattern [28, 29]. We expected that dimension reduction algorithms preserve such broad qualitative features. We assessed the diurnal patterns in the reduced data with the use of a cosine fit, as implemented in the CosinorPy library [30]. We assessed the diurnal patterns with 100,000 random samples, and based the hour of day on the time of blood draw.
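A sketch of such a cosinor fit on synthetic data is shown below; the exact CosinorPy call signature and return values vary between versions, so this should be read as an illustration of the approach rather than the study's exact code:

```python
import numpy as np
from CosinorPy import cosinor

rng = np.random.default_rng(0)
hours = rng.uniform(0, 24, size=100_000)  # hour of blood draw (synthetic)
component = np.cos(2 * np.pi * hours / 24) + rng.normal(scale=0.5, size=hours.size)

# Fit a 24-hour cosine; the reported p-value tests whether the amplitude is zero.
results = cosinor.fit_me(hours, component, n_components=1, period=24, plot=False)
```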

Age and sex

The third angle was to assess biological relevance via two classification tasks that should be identifiable in the data: firstly, sex prediction in samples of patients between the ages of 20 and 50, as during this age range a clear difference in hemoglobin between men and women exists [31]. Secondly, prediction of samples of patients below 20 versus patients above 60 years old, as the haematological characteristics of young people are known to be distinct from those of older people [32]. For this purpose, we used a Gradient Boosting (GB) model to capture any non-linear associations. To assess the performance of the resulting models, we focused on the accuracy and the Matthews Correlation Coefficient (MCC). The accuracy is the number of correctly predicted positive and negative cases divided by the total number of positives and negatives, i.e. \(\frac{TP+TN}{TP+FP+TN+FN}\), where TP, FP, TN and FN are true and false positives, and true and false negatives, respectively. The MCC, or the \(\phi\) coefficient, is a measure of the quality of a binary classification model that takes into account true and false positives and negatives, i.e., it is a summary measure of the confusion matrix, comparable to the F1 metric. MCC is calculated as follows: \(\frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}}\). The data were analysed using 10-fold cross validation with an inner validation set (as a result of the folds) and a dedicated outer validation set. 170,000 random samples were used for training and 30,000 random samples for the dedicated validation set. Sampling was performed for computational reasons, with regard to the dimensionality of the original data. We assessed the significance of performance change using a T-test.
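A minimal sketch of one such evaluation, assuming a (hypothetical) reduced matrix `X_red` and a binary label vector `y` derived from patient age; the study's actual pipeline additionally used 10-fold cross validation:

```python
import xgboost as xgb
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split

# X_red: reduced data; y: 0/1 labels for young (<20) vs old (>60) patients.
X_train, X_val, y_train, y_val = train_test_split(
    X_red, y, train_size=170_000, test_size=30_000, random_state=42
)

clf = xgb.XGBClassifier(n_estimators=300)  # gradient boosting classifier
clf.fit(X_train, y_train)

pred = clf.predict(X_val)
print("accuracy:", accuracy_score(y_val, pred))
print("MCC:", matthews_corrcoef(y_val, pred))
```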

Identification of leukemia-like patients

As a final angle to assess the preservation of biological relevance, we investigated a specific population that is completely divergent from the general population in terms of CBC. To this end, we used samples from patients with chronic lymphocytic leukemia (CLL) that were identified together with clinical experts based on CBC characteristics, more specifically very high lymphocyte counts. If dimension reduction preserves biological relevance, these samples should be clearly distinguishable in the lower-dimensional representation. Failure to detect these patients would significantly impact the use of the dimension reduction methods in clinical practice. To detect potentially significant differences between the populations, we used an unpaired T-test, and considered a p-value below 0.001 to be significant.
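This comparison reduces to a two-sample test per dimension; a minimal sketch with SciPy, where the variable names are hypothetical:

```python
from scipy.stats import ttest_ind

# lymph_cll / lymph_rest: lymphocyte counts (or one reduced component) for
# CLL samples versus all other samples; hypothetical variable names.
t_stat, p_value = ttest_ind(lymph_cll, lymph_rest)  # unpaired T-test
print("significant:", p_value < 0.001)
```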

Software and hardware

All analyses were performed with the Python programming language (version 3.9). Imputation was performed using the miceforest package. Dimension reduction was done using the scikit-learn package for PCA and GRP. UMAP was performed using the umap-learn package, TriMap using the trimap package, and PaCMAP using the pacmap package. Sex and age classification was performed using the xgboost package. All calculations were performed on CPU, namely an Intel Xeon W-2125 at 4 GHz with 8 logical cores and 64 GB memory. The code for this project is available from https://github.com/UPOD-datascience/celldyn_embedder.

Results

Descriptives

In total, we extracted 3,093,792 samples from 358,614 unique patients. We used 70 different blood cell characteristics for this study, all of them continuous variables; no categorical variables were used in our embedding. The descriptives and missingness for each variable used in this study are described in Supplementary File 2. 52.8% of the samples were from male patients, and the median age at measurement was 51 (IQR: 27–66). After preprocessing the haematological data, we applied imputation to 1,107,049 samples for haematological parameters that were missing as a result of laboratory protocols (e.g., not using reticulocyte mode). The distribution of the samples per patient is shown in Fig. 3.

Fig. 3 The distribution of measurements per patient in our dataset

Dimension reductions

Parameter tuning results

Number of neighbours

First, we compared the number of nearest neighbours used for dimension reduction for UMAP, TriMap, and PaCMAP. The results are shown in Fig. 4 for UMAP, and in Figure S3 for PaCMAP. For an increasing sample size and number of neighbours, we observed an initial improvement followed by rapid stagnation (Fig. 4, Figure S3).

The neighbourhood-kept ratio ranged from 0.29 at 5000 samples and 5 nearest neighbours to 0.33 at 160,000 samples and 100 nearest neighbours for UMAP. However, the scores for any number of neighbours of 15 and above were similar with increasing sample size. For trustworthiness, using 5 nearest neighbours yielded worse results (ranging from 0.89 to 0.91) than using 15 or more neighbours. For all other numbers of neighbours, the trustworthiness was limited to 0.92. For the global distance preservation metrics, UMAP was stable, with the random triplet score ranging from 0.72 for 5 nearest neighbours at 5000 samples to 0.74 for all numbers of neighbours at 40,000 to 160,000 samples. The distance correlation increased from 0.64 at 5000 samples to 0.69 at 40,000 to 160,000 samples (Fig. 4).

For PaCMAP, a similar pattern was observed (Figure S3). Local distance preservation as measured through the neighbourhood-kept ratio ranged from 0.33 at 5000 samples and 5 nearest neighbours to 0.36 at 160,000 samples with 30, 50, or 100 nearest neighbours. Trustworthiness ranged from 0.88 at 5000 samples for 5 nearest neighbours to 0.90 at 160,000 samples for all other numbers of nearest neighbours. However, this performance was already reached at 10,000 samples when using 30, 50, or 100 nearest neighbours. Global distance preservation as measured by the random triplet score ranged from 0.73 to 0.74 for all numbers of neighbours. Distance correlation remained relatively stable, with scores ranging from 0.66 to 0.67.

Considering the results, we decided to limit the sample size to 40,000 for TriMap to find the number of in- and outlying neighbours, because increasing the sample beyond this point yielded similar results, yet dramatically increased computational costs (data not shown). The results of this tuning are found in Figure S4. We observed no large differences for the number of outliers used for TriMap. However, we did observe substantial increases in the global distance preservation metrics when increasing the number of inliers. The random triplet score increased from 0.75 (5 inliers) to 0.78 (100 inliers), and the distance correlation increased from 0.75 (5 inliers) to 0.81 (100 inliers). Nevertheless, we decided to move forward with 50 inliers and 15 outliers for TriMap, since increasing the number of neighbours was computationally not feasible for the entire data set of over 3 million samples (data not shown).

Fig. 4 Evaluating UMAP with quality metrics across different numbers of neighbours and sample sizes, along with the 95% Standard Error (SE) for each sample size and number of nearest neighbours

Number of components

As we used 50 nearest neighbours for the dimension reductions, we also used 50 nearest neighbours for calculating the dimension reduction quality metrics. We then increased the number of components for the final dimension reduction. We used 40,000 random samples, which were matched across the dimension reduction methods.

We compared 2, 4, 6, 8, 10, 20 and 30 components to get a rough estimate of the increase in performance for each of these models. Figure 5 shows the results for the neighbourhood-kept ratio, trustworthiness, random triplet score and distance correlation. For all scores, PCA performed best across all numbers of components (\(p<0.001\)) (Fig. 5). Additionally, the performances of UMAP, TriMap, and PaCMAP barely increased with the number of components.

Focusing on local distances, the neighbourhood-kept ratio increased from 0.27 (2 dimensions) to 0.89 (30 dimensions) for PCA, whereas it stagnated around 0.32 for UMAP, 0.36 for TriMap, and 0.35 for PaCMAP. GRP increased from 0.13 (2 dimensions) to 0.55 (30 dimensions). Trustworthiness was high for all dimension reduction methods, except for GRP at lower dimensions. PCA (range 0.92–0.97) had the highest scores, while UMAP and PaCMAP performed similarly (ranges 0.90–0.92 for UMAP; 0.91–0.93 for PaCMAP). TriMap performed better than the other manifold approaches (range 0.92–0.94). GRP performed worse at lower dimensions (range 0.73–0.94).

When it comes to global distances, PCA outperformed all other dimension reduction methods on both the random triplet score (0.78 to 0.98) and the distance correlation (range 0.81–0.93, with the maximum of 0.93 reached at 8 dimensions). The random triplet score remained stable for the three manifold approaches, scoring 0.74 for UMAP, 0.78 for TriMap, and 0.73 for PaCMAP. GRP increased from 0.66 at 2 dimensions to 0.86 at 30 dimensions. Distance correlation for the manifold approaches increased primarily at lower dimensions, from 0.90 for UMAP, 0.91 for PaCMAP and 0.92 for TriMap to 0.92 for UMAP, 0.93 for PaCMAP and 0.94 for TriMap at 4 components, after which it remained stable with increasing dimensions.

Although we observed an increase in performance for PCA and GRP with increasing numbers of components, we also observed a stagnation for the manifold approaches at 4 components. Considering the increasing computational complexity of the manifold approaches with increasing components, we decided to limit the number of components to 6 for all methods when reducing the entire data set of over 3 million samples.

Fig. 5 Dimension reduction metrics across different numbers of dimensions, along with the 95% SE for each number of components

Preservation of biological representation

Cluster preservation

Table 1 shows the performance of the clustering methods on the reduced data. We observed an excess of clusters with subsequently low values for the Normalised Mutual Information (NMI) score and Adjusted Rand Index (ARI), showing that the dimension reduction methods have a tendency to generate an excess of clusters in comparison with the original data. We identified 12 clusters in the original data, whereas we found 32, 31 and 12 clusters for PCA at 3, 6 and 12 components, respectively. For the manifold approaches, we found a large inflation of clusters: for UMAP we identified 115, 84, and 81 clusters; for TriMap 45, 44 and 53 clusters; and for PaCMAP 42, 43 and 54 clusters, all at 3, 6 and 12 components, respectively. Finally, for GRP we identified 30, 22 and 5 clusters at 3, 6 and 12 components, respectively.

Table 1 Comparison of cluster alignments, using a pipeline with a standard scaler, a dimension reduction method and a clustering model

Comparing the NMI score and ARI, we found that overall scores were low (\(\le\) 0.10 for both NMI and ARI) and did not improve when increasing the number of components, except for GRP, with an NMI of 0.01 at 3 components versus 0.12 at 12 components, and an ARI of −0.0003 at 3 components versus 0.19 at 12 components (Table 1).

Furthermore, we found that, in terms of cluster quality, UMAP stagnates at a value well under the optimum for increasing numbers of components, being on par with PCA only for smaller numbers of components (Figure S6). Additionally, we observed that all manifold approaches maintain a high level of cluster inflation for increasing numbers of reduced dimensions. Finally, we observed that for a low number of reduced dimensions, all tested dimension reduction techniques produced a considerably inflated number of clusters as detected by HDBSCAN compared to the baseline cluster detection on the original data (Figure S7, Table 1).

Diurnal patterns

Figure 6 shows the 6 UMAP dimensions. We observed a diurnal pattern for each of the components, primarily split between daytime care (6:00–18:00) and care during the night, with clear progression within the daytime period. The clearest diurnal patterns in the non-reduced data are obtained for the neutrophil and eosinophil fractions. For these fractions, and for all components of the dimension reduction techniques, we observed significant results for the cosine fit (Table S2). The p-value represents the probability of the amplitude being zero [30]. Additionally, Fig. 7 shows the retention of periodicity in the dimension reductions compared to the periodicity of the neutrophil fraction. We chose the neutrophil fraction as this parameter has a clear diurnal evolution.

Fig. 6 Intraday variation of UMAP dimensions showing a clear diurnal pattern, as expected from prior work showing diurnality of hematology parameters, as well as an offset between men/women in the \(3^{rd}\) and \(4^{th}\) dimensions

Fig. 7 Time evolution of (left) the first component for different reducers with \(95\%\) SE, (center) the neutrophil fraction, (right) the eosinophil fraction

Prediction performance

To assess the preservation of biological relevance, we compared the age (\(\le\) 20 versus \(\ge\) 60) and sex prediction performance on the original data to that on the reduced data. Results of the age-at-sampling predictions can be found in Fig. 8. We used data from 170,000 random samples, matched to their reduced data, for training, and 30,000 random samples for validation. We observed a significant (\(p <0.001\)) drop in performance when data from any dimension reduction method were used. We observed very stable performances across the 10-fold cross validation, resulting in small variation in the accuracy and MCC. While the original data showed higher performance (accuracy = 0.88, MCC = 0.74) for age classification, we observed a lower accuracy on the reduced data, ranging from 0.76 for GRP to 0.80 for the manifold methods (PCA = 0.79), and a lower MCC, ranging from 0.47 for GRP to 0.56 for TriMap (PCA = 0.55; UMAP = 0.55; PaCMAP = 0.56). This means that applying dimension reduction negatively impacted classification tasks. The same pattern was observed for sex prediction (Figure S5). The original data showed an accuracy of 0.76 and an MCC of 0.51. For the data in reduced space, the accuracy ranged from 0.61 for GRP to 0.70 for UMAP and TriMap (PCA = 0.68; PaCMAP = 0.69). The MCC ranged from 0.18 for GRP to 0.39 for UMAP (PCA = 0.34; TriMap = 0.38; PaCMAP = 0.36).

Fig. 8 Predictive performance for patients < 20 (‘young’) versus patients > 60 (‘old’) in the dedicated validation set. The prediction performances dropped significantly (\(p < 0.001\)) after applying dimension reduction techniques. n = 170,000 for training; n = 30,000 for validation; n folds = 10. ‘Celldyn’ refers to the original data. Circles represent outliers in performance

Identification of leukemia-like patients

In the original data (Fig. 9), we found significant differences between patients identified as having chronic lymphocytic leukemia (CLL) and our overall population for both white blood cell count and lymphocyte count. In total, we identified 3205 samples from patients with CLL, and compared these samples to all other samples in the data (n = 3,090,580). For all dimension reductions, we found similar results, with the CLL patients' data showing significantly different distributions (\(p<0.001\)) compared to the general population for a large portion of the dimensions (Figures S8 to S12).

Fig. 9 Boxplots showing the median and interquartile range of lymphocyte and leukocyte counts (\(\times 10^9\)) of patients with CLL (n = 3,205) versus patients without leukaemia (n = 3,090,580) in the original data

Discussion

In this study, we investigated the use of dimension reduction methods on a large set of routine CBC data from the Abbott CELL-DYN Sapphire haemocytometer. We compared PCA, UMAP, TriMap, and PaCMAP using multiple performance metrics (neighbourhood-kept ratio, trustworthiness, random triplet score, and distance correlation). Judged by these dimension reduction metrics, PCA performed best in comparison with UMAP, TriMap and PaCMAP. As the purpose of these dimension reductions lies in analysis and interpretation, we investigated whether the biological representation was correctly maintained. We found that diurnal patterns were maintained, but that predictive tasks (such as age and sex classification) performed significantly worse on the reduced data than on the original data, and that clustering tasks resulted in an overestimation of the number of clusters compared to the original data. We conclude that using dimension reductions will result in a loss of information compared to the original data, even in predictive tasks where subgroups should be readily apparent.

In the literature, UMAP and other (non-linear) dimensionality reduction techniques are evaluated as superior to PCA [2, 3, 33, 34]. However, the utility of UMAP and other nearest-neighbours-based dimension reduction methods is seemingly limited to very low-dimensional representations for the purpose of visualisation [2, 3]. In our study, we observe that for increasing dimensionality, the manifold techniques converge to dimension reduction scores that are far from optimal, whereas PCA reaches near-optimal scores well before it is able to explain 95% of the variance (n components = 30). This is likely the case for other global methods as well, but further research is needed to study this. We deem that this effect is partly due to the large sample size: the neighbourhood of a given sample becomes harder to define, or a larger number of neighbours is needed as the sample size increases. However, increasing the number of neighbours can result in computational issues, considering the pairwise nature of the dimension reduction techniques and performance measures.

When dealing with neighbourhood-based dimension reduction methods such as UMAP, TriMap and PaCMAP, there is a trade-off between the preservation of local and global characteristics. Possible mitigations are to increase the number of components [35] and the number of nearest neighbours. The appropriate number of dimensions and neighbours depends on the number of samples in the dataset. However, increasing the number of dimensions and neighbours increases the complexity of the dimension reduction. Furthermore, for an increasing number of samples of multiple modalities, the heterogeneity of the data can increase, and it then becomes more difficult to embed the data with sufficient accuracy, i.e., more samples do not inherently equate to a better dimension reduction. Finally, PCA becomes competitive in terms of dimension reduction performance when increasing the number of dimensions, since the amount of explained variance increases, while being orders of magnitude more efficient computationally, especially if one considers the availability of Incremental PCA, which has a constant memory complexity [36]. Another PCA-related approach would be the application of Kernel PCA for non-linear PCA. However, Kernel PCA has notable scaling issues with sample size, and is therefore not useful in our setting [37]. Lastly, the use of Independent Principal Component Analysis could be of interest, since it makes no assumption of Gaussian distributions of the input variables. Additionally, through the combination of Independent and Principal Component Analysis, more biologically meaningful components may be identified [38].

Another way to mitigate the issue with the trade-off between global and local characteristics, is to limit the number of samples used for the dimension reduction such that it contains enough samples per stratification, but not more. This requires enough information for the stratifications we are interested in, which in turn requires labelling. This is a known issue when using (routine) healthcare data, as the administrative start of a disease that is indicated by registration of a certain diagnosis does not coincide with the physical start of the disease. As the physical start of the disease may affect some or all parts of the CBC, labelling of disease presence at the time of blood draw is intrinsically difficult. Moreover, most patients that visit our tertiary care centre suffer from complex diseases and multiple comorbidities, further complicating labelling of our haematology data. Because of these issues, we were unable to retrieve clear labels for our samples.

One other mitigation of the problem with large sample sizes and neighbourhood-based dimension reduction methods that leads to improved tractability is the use of dimension reduction alignment, where we partition the dataset to create many dimension reductions that are subsequently aligned, using, e.g., the Procrustes transformation [39]. Another benefit of dimension reduction alignment is that adding new data to the dimension reduction is much faster.
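As an illustration of the alignment step, SciPy provides a standard Procrustes analysis; the two reductions below are synthetic stand-ins for reductions of different data partitions:

```python
import numpy as np
from scipy.spatial import procrustes

# Two hypothetical 6-D reductions of overlapping data partitions; the second
# is a distorted version of the first and gets aligned onto it.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(1000, 6))
emb_b = emb_a @ rng.normal(size=(6, 6)) + rng.normal(scale=0.01, size=(1000, 6))

mtx_a, mtx_b_aligned, disparity = procrustes(emb_a, emb_b)
print("disparity after alignment:", disparity)
```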

Biological performance

We investigated patterns in the data that are known to be present within haematology data. Indeed, we observed known diurnal patterns of white blood cells [29, 40, 41]. This pattern was also observed within the data after dimension reduction, showing preservation of intraday variation by dimension reduction methods.

With respect to the prediction of samples belonging to subgroups in the data, we observed significantly decreased performance on the reduced data. We deem that dimension reduction before prediction tasks on these data is not a preferable approach, since the loss of information or quality of the data representation is an apparent issue. Increasing the number of dimensions might mitigate this [35], but can lead to more complex dimension reduction processes, and we observed that the manifold approaches did not converge to an optimal data representation with increasing dimensionality (Fig. 5). Rather than using dimension reduction, more emphasis should be given to proper feature selection for analysis when the number of parameters is too high for the number of samples. This can, of course, be combined with dimension reduction [42]. In the literature, we found some beneficial results of using dimension reduction before prediction in different settings, since it can offer similar or better model performance compared to using the original data, at least in experimental circumstances [43,44,45], and dimension reduction can also be used for feature selection [46]. However, this requires a robust dimension reduction method, one which also preserves distances when applied. Considering our findings, the use of unsupervised dimension reduction techniques before modelling should be approached with caution or even refrained from.

Finally, we applied clustering to assess cluster preservation in the reduced data. We find that using dimension reduction results in an overestimation of the number of clusters when using HDBSCAN. This comes together with the loss of information, or quality, noted above. As mentioned before, this might be mitigable, but at increased computational cost and complexity. We do observe that CLL patients are still significantly different after applying dimension reduction methods. This means that, although dimension reduction methods do not completely preserve biological or clinical relevance, obvious extremes in the data are still apparent. However, the differences between CLL and non-CLL patients become less apparent after dimension reduction, resulting in limited clinical diagnostic applicability of dimension reduction methods. It must be noted that, although these patients were indeed CLL patients, not all data points from these patients necessarily overlap with disease (e.g., blood samples taken before CLL was present). This may also result in overlap in counts with the general population. We were able to retroactively identify patients with CLL, but had no information on the exact point in time at which CLL was diagnosed.

Further research

Considering we have limited our study to unsupervised non-parametric dimension reduction methods, a logical next step is to use supervised and/or parametric dimension reduction. An improvement of non-parametric UMAP is parametric UMAP, where a learnable parameterised model sits between the dimension reductions and the final loss, enabling the addition of, e.g., a global loss contribution [47]. Additionally, when dealing with large-volume data, benefit might be gained from fully parameterised dimension reduction methods such as differentiating dimension reduction networks, which are more interpretable than UMAP and t-SNE because of their parametric nature [48]. Finally, when it comes to the generalisability of dimension reduction results, and working towards a more holistic integrative approach to data analysis within healthcare, fully parameterised models such as variational autoencoders are interesting from the perspective of transfer learning, as they add flexibility to continue learning with incoming data and to transfer the resulting model to other institutions to continue training on their on-premise data, which can play a role in federated learning, see e.g. DynAE [49]. An interesting approach is the use of a contrastive loss function, as opposed to a reconstruction loss function, see e.g. [50, 51], or a hybrid of a reconstruction loss on the output representation plus a contrastive loss on the latent representation for autoencoder architectures.

In addition, further research could study the use of (semi-)supervised dimension reduction approaches. To ensure clinical relevance, sparsely available labels can be employed, and consequently semi-supervised UMAP/t-SNE or Multi-Class, Multi-Label dimension reduction can be deployed. Other variables of interest for this approach could include demographic data (e.g., sex or age), data on time of day, or other relevant variables such as in-/out-patient status, hospital department, or even length of stay.

Limitations

Our study is subject to some limitations. Most importantly, we lack a clear healthy control group, as our data come from tertiary care only. The data encompass some samples from healthy individuals (such as patients who were referred to the UMCU but whose diagnostic work-up did not confirm any disease), but because labels are not available, we cannot identify these samples definitively. Secondly, we cannot completely rule out differences between haematology analyzers, or differences over time due to software versioning. However, since the machines are used in clinical care, and are calibrated as such, we deem this effect to be limited. Thirdly, our work focused on the effect of dimension reduction techniques on downstream clinical tasks; the imputation method played a facilitatory role in allowing a comparison over all samples. The majority of samples showed no missingness, and the same imputed data were used for all dimension reduction approaches. In addition, our missing data could be considered MAR, and we therefore do not believe that a potentially sub-par imputation method would skew the results in favour of any particular dimension reduction approach.

Moreover, there are some limitations concerning the neighbourhood-based dimension reduction methods. One main limitation of UMAP is that the negative sampling process does not take into account the distance to the current point beyond the number of nearest neighbours surrounding each point. This inaccuracy becomes more pronounced as the number of samples relative to the number of nearest neighbours increases. The result is that points just outside the direct neighbourhood are placed incorrectly, further away in the reduced data. Additionally, UMAP is a greedy algorithm, essentially requiring a copy of the original data so that incoming data can be interpolated onto the low-dimensional manifold. Furthermore, at the time of writing, neither TriMap nor PaCMAP provides a clear opportunity to embed unseen data into the space of an existing dimension reduction. This makes it harder, for example, to share dimension reductions between healthcare institutions, which might be beneficial, since it would allow easier interpretation of haematology measurements in the context of the overall population. Another limitation of our study is that we did not use a topology preservation metric. Scoring based on topology metrics might result in a higher ranking for the manifold approaches, as these are specifically designed to preserve topology. Furthermore, we limited our research to PCA, GRP and three manifold approaches. Of course, many more methods are available. For example, self-organising maps have been used successfully on haematological data, specifically at the single-cell level [52].

Conclusion

When applying dimension reduction to high-dimensional, high-volume haematology data, we found that a global-statistics-based reduction technique such as PCA performs systematically better than much more recent non-linear minimum-distortion dimension reduction techniques in representing the underlying data. In general, the dimension reduction methods had limited biological performance, especially as a precursor for prediction tasks. Therefore, we advise that dimension reduction techniques be limited to data visualisation applications, e.g. for exploratory data analysis and research dissemination. The use of dimension reduction techniques as components in diagnostic pipelines may lead to decreased quality of integrated diagnostics in clinical care.

Data availability

The datasets generated and/or analysed during the current study are not publicly available due to privacy regulations but are available from the corresponding author on reasonable request.

References

  1. McInnes L, Healy J, Saul N, Großberger L. UMAP: Uniform Manifold Approximation and Projection. J Open Source Softw. 2018;3(29):861.


  2. Wang Y, Huang H, Rudin C, Shaposhnik Y. Understanding how dimension reduction tools work: an empirical approach to deciphering t-SNE, UMAP, TriMAP, and PaCMAP for data visualization. CoRR. 2020. arXiv:2012.04456.

  3. Amid E, Warmuth MK. TriMap: Large-scale Dimensionality Reduction Using Triplets. CoRR. 2019. arXiv:1910.00204.

  4. Rybakowska P, Van Gassen S, Quintelier K, Saeys Y, Alarcón-Riquelme ME, Marañón C. Data processing workflow for large-scale immune monitoring studies by mass cytometry. Comput Struct Biotechnol J. 2021;19:3160–75. https://doi.org/10.1016/j.csbj.2021.05.032.


  5. Stolarek I, Samelak-Czajka A, Figlerowicz M, Jackowiak P. Dimensionality reduction by UMAP for visualizing and aiding in classification of imaging flow cytometry data. iScience. 2022;25(10):105142. https://doi.org/10.1016/j.isci.2022.105142.

  6. Huang H, Wang Y, Rudin C, Browne EP. Towards a comprehensive evaluation of dimension reduction methods for transcriptomic data visualization. Commun Biol. 2022;5. https://doi.org/10.1038/s42003-022-03628-x.

  7. Moon K, Dijk D, Wang Z, Gigante S, Burkhardt D, Chen W, et al. Visualizing structure and transitions in high-dimensional biological data. Nat Biotechnol. 2019;37:1482–92. https://doi.org/10.1038/s41587-019-0336-3.


  8. Pang J, Maienschein-Cline M, Koh TJ. Monocyte/Macrophage Heterogeneity during Skin Wound Healing in Mice. J Immunol. 2022;209(10):1999–2011. https://doi.org/10.4049/jimmunol.2200365.


  9. Trozzi F, Wang X, Tao P. UMAP as a Dimensionality Reduction Tool for Molecular Dynamics Simulations of Biomacromolecules: A Comparison Study. J Phys Chem B. 2021;125(19):5022–5034. PMID: 33973773. https://doi.org/10.1021/acs.jpcb.1c02081.

  10. ten Berg MJ, Huisman A, van den Bemt PMLA, Schobben AFAM, Egberts ACG, van Solinge WW. Linking laboratory and medication data: new opportunities for pharmacoepidemiological research. Clin Chem Lab Med. 2007;45(1). https://doi.org/10.1515/CCLM.2007.009.

  11. Niemantsverdriet MS, Vrijsen BE, Visser’t Hooft T, Suijkerbuijk KP, van Solinge WW, Ten Berg MJ, et al. Added diagnostic value of routinely measured hematology variables in diagnosing immune checkpoint inhibitor mediated toxicity in the emergency department. Cancer Med. 2023;12(11):12462–9.

  12. Niemantsverdriet MS, de Hond TA, Hoefer IE, van Solinge WW, Bellomo D, Oosterheert JJ, et al. A machine learning approach using endpoint adjudication committee labels for the identification of sepsis predictors at the emergency department. BMC Emerg Med. 2022;22(1):208.


  13. Joosse HJ, van Oirschot BA, Kooijmans SA, Hoefer IE, van Wijk RA, Huisman A, et al. In-vitro and in-silico evidence for oxidative stress as drivers for RDW. Sci Rep. 2023;13(1):9223.


  14. Overmars LM, Mekke JM, van Solinge WW, De Jager SC, Hulsbergen-Veelken CA, Hoefer IE, et al. Characteristics of peripheral blood cells are independently related to major adverse cardiovascular events after carotid endarterectomy. Atheroscler Plus. 2023;52:32–40.


  15. Overmars LM, van Solinge WW, Ruijter HMd, van der Worp HB, Van Es B, Hulsbergen-Veelken CA, et al. Sexual dimorphism in peripheral blood cell characteristics linked to recanalization success of endovascular thrombectomy in acute ischemic stroke. J Thromb Thrombolysis. 2023;56(4):614–25.

  16. Joosse HJ, Huisman A, van Solinge W, Hietbrink F, Hoefer I, Haitjema S. Describing characteristics and differences of neutrophils in sepsis, trauma, and control patients in routinely measured hematology data. Biomedicines. 2022;10(3):633.

  17. van Buuren S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat Methods Med Res. 2007;16(3):219–42. PMID: 17621469. https://doi.org/10.1177/0962280206074463.

  18. Stekhoven DJ, Bühlmann P. MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012;28(1):112–8. https://doi.org/10.1093/bioinformatics/btr597.

  19. Johnson WB, Lindenstrauss J. Extensions of Lipschitz mappings into a Hilbert space. Contemp Math. 1984;26:189–206.

  20. Hecht-Nielsen R, et al. Context vectors: general purpose approximate meaning representations self-organized from raw data. Comput Intell Imitating Life. 1994;3(11):43–56.

  21. Dasgupta S, Gupta A. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct Algorithms. 2003;22(1):60–5.

  22. Aggarwal CC, Hinneburg A, Keim DA. On the surprising behavior of distance metrics in high dimensional space. In: Database Theory—ICDT 2001: 8th International Conference London, UK, January 4–6, 2001 Proceedings 8. Springer; 2001. pp. 420–434.

  23. Lee JA, Verleysen M. Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing. 2009;72(7–9):1431–43.

  24. Kaski S, Nikkilä J, Oja M, Venna J, Törönen P, Castrén E. Trustworthiness and metrics in visualizing similarity of gene expression. BMC Bioinformatics. 2003;4(48). https://doi.org/10.1186/1471-2105-4-48.

  25. Zhang Y, Shang Q, Zhang G. pyDRMetrics: a Python toolkit for dimensionality reduction quality assessment. Heliyon. 2021;7(2):e06199.

  26. Campello RJGB, Moulavi D, Sander J. Density-Based Clustering Based on Hierarchical Density Estimates. In: Pei J, Tseng VS, Cao L, Motoda H, Xu G, editors. Advances in Knowledge Discovery and Data Mining. Berlin, Heidelberg: Springer Berlin Heidelberg; 2013. pp. 160–172. https://doi.org/10.1007/978-3-642-37456-2_14.

  27. Lloyd S. Least squares quantization in PCM. IEEE Trans Inf Theory. 1982;28(2):129–37. https://doi.org/10.1109/TIT.1982.1056489.

  28. Pocock S, Ashby D, Shaper A, Walker M, Broughton P. Diurnal variations in serum biochemical and haematological measurements. J Clin Pathol. 1989;42(2):172–9.

  29. Sennels HP, Jørgensen HL, Hansen ALS, Goetze JP, Fahrenkrug J. Diurnal variation of hematology parameters in healthy young males: the Bispebjerg study of diurnal variations. Scand J Clin Lab Investig. 2011;71(7):532–41.

  30. Moškon M. CosinorPy: a python package for cosinor-based rhythmometry. BMC Bioinformatics. 2020;21:1–12.

  31. Beutler E, Waalen J. The definition of anemia: what is the lower limit of normal of the blood hemoglobin concentration? Blood. 2006;107(5):1747–50.

  32. Cohen NM, Schwartzman O, Jaschek R, Lifshitz A, Hoichman M, Balicer R, et al. Personalized lab test models to quantify disease potentials in healthy individuals. Nat Med. 2021;27(9):1582–91. https://doi.org/10.1038/s41591-021-01468-6.

  33. Becht E, McInnes L, Healy J, Dutertre CA, Kwok IWH, Ng LG, et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol. 2019;37(1):38–44. https://doi.org/10.1038/nbt.4314.

  34. Yang Y, Sun H, Zhang Y, Zhang T, Gong J, Wei Y, et al. Dimensionality reduction by UMAP reinforces sample heterogeneity analysis in bulk transcriptomic data. Cell Rep. 2021;36(4):109442.

  35. Gracia A, González S, Robles V, Menasalvas E. A methodology to compare Dimensionality Reduction algorithms in terms of loss of quality. Inf Sci. 2014;270:1–27. https://doi.org/10.1016/j.ins.2014.02.068.

  36. Ross DA, Lim J, Lin RS, Yang MH. Incremental learning for robust visual tracking. Int J Comput Vis. 2008;77:125–41.

  37. Marukatat S. Tutorial on PCA and approximate PCA and approximate kernel PCA. Artif Intell Rev. 2023;56(6):5445–77.

  38. Yao F, Coquery J, Lê Cao KA. Independent principal component analysis for biologically meaningful dimension reduction of large biological data sets. BMC Bioinformatics. 2012;13:1–15.

  39. Sireci SG. J Educ Meas. 2003;40(3):277–80. http://www.jstor.org/stable/1435131. Accessed 7 Dec 2023.

  40. Torge A, Haeckel R, Özcürümez M, Krebs A, Junker R. Diurnal variation of leukocyte counts affects the indirect estimation of reference intervals. J Lab Med. 2021;45(2):121–4.

  41. Ackermann K, Revell VL, Lao O, Rombouts EJ, Skene DJ, Kayser M. Diurnal rhythms in blood cell populations and the effect of acute sleep deprivation in healthy young men. Sleep. 2012;35(7):933–40.

  42. Xie H, Li J, Zhang Q, Wang Y. Comparison among dimensionality reduction techniques based on Random Projection for cancer classification. Comput Biol Chem. 2016;65:165–72. https://doi.org/10.1016/j.compbiolchem.2016.09.010.

  43. Bai Y, Sun Z, Zeng B, Long J, Li L, de Oliveira JV, et al. A comparison of dimension reduction techniques for support vector machine modeling of multi-parameter manufacturing quality prediction. J Intell Manuf. 2019;30(5):2245–56. https://doi.org/10.1007/s10845-017-1388-1.

  44. Reddy GT, Reddy MPK, Lakshmanna K, Kaluri R, Rajput DS, Srivastava G, et al. Analysis of Dimensionality Reduction Techniques on Big Data. IEEE Access. 2020;8:54776–88. https://doi.org/10.1109/ACCESS.2020.2980942.

  45. Cannings TI, Samworth RJ. Random-projection Ensemble Classification. J R Stat Soc Ser B Stat Methodol. 2017;79(4):959–1035. https://doi.org/10.1111/rssb.12228.

  46. Zhang Y, Zhao Z. Fetal state assessment based on cardiotocography parameters using PCA and AdaBoost. In: 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). 2017. pp. 1–6. https://doi.org/10.1109/CISP-BMEI.2017.8302314.

  47. Sainburg T, McInnes L, Gentner T. Parametric UMAP Embeddings for Representation and Semisupervised Learning. Neural Comput. 2021;33(11):2881–907.

  48. Robinson I. Interpretable visualizations with differentiating embedding networks. 2020. arXiv preprint arXiv:2006.06640.

  49. Mrabah N, Khan NM, Ksantini R, Lachiri Z. Deep clustering with a dynamic autoencoder: From reconstruction towards centroids construction. Neural Netw. 2020;130:206–28.

  50. van den Oord A, Li Y, Vinyals O. Representation learning with contrastive predictive coding. 2018. arXiv preprint arXiv:1807.03748.

  51. Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. In: International conference on machine learning. PMLR; 2020. pp. 1597–1607.

  52. Van Gassen S, Callebaut B, Van Helden MJ, Lambrecht BN, Demeester P, Dhaene T, et al. FlowSOM: Using self-organizing maps for visualization and interpretation of cytometry data. Cytometry A. 2015;87(7):636–45.

Acknowledgements

Not applicable.

Funding

No funding was received for this research.

Author information

Contributions

Conceptualization: HJJ, SH, BvE; Methodology: HJJ, CC, BvE; Software: HJJ, CC, BvE; Validation: HJJ, AH, WWvS, SH; Formal Analysis: HJJ, CC, BvE; Investigation: HJJ, CC, BvE; Resources: SH; Data Curation: HJJ, CC, SH, BvE; Writing - original draft preparation: HJJ; Writing - review and editing: all authors; Visualization: HJJ, CC, BvE; Supervision: AH, IEH, WWvS, SH.

Corresponding author

Correspondence to Bram van Es.

Ethics declarations

Ethics approval and consent to participate

This study was not subject to the Medical Research Involving Human Subjects Act (in Dutch: Wet Medisch-wetenschappelijk Onderzoek met mensen, WMO). The institutional review board (Medical Research Ethics Committee NedMec) therefore granted a waiver for study approval and waived the need for informed consent, as only pseudonymized data from a large patient sample were used. The study was conducted in accordance with the Declaration of Helsinki.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Joosse, HJ., Chumsaeng-Reijers, C., Huisman, A. et al. Haematology dimension reduction, a large scale application to regular care haematology data. BMC Med Inform Decis Mak 25, 75 (2025). https://doi.org/10.1186/s12911-025-02899-8

Keywords