pdf2106.11343.pdf 6
pdf2106.11343.pdf 6
5
Centre for Medical Imaging, University College London, London, UK
Abstract
Purpose: To develop a semi-automated, AI-assisted workflow for segmentation of inflammatory lesions on STIR MRI
of sacroiliac joints (SIJs) in adult patients with axial spondyloarthritis.
Methods: Baseline human performance in manual segmentation of inflammatory lesions was first established in eight
patients with axial spondyloarthritis recruited within a prospective study conducted between April 2018 and July 2019.
To improve readers’ consistency a semi-automated procedure was developed, comprising (1) manual segmentation of
‘normal bone’ and ‘disease’ regions (2) automatic segmentation of lesions, i.e., voxels in the disease region with outlying
intensity with respect to the normal bone, and (3) human intervention to remove erroneously segmented areas. Segmen-
tation of disease region (subchondral bone) was automated via supervised deep learning; 200 image slices (eight subjects)
were used for algorithm training with cross validation, 48 (two subjects) – for testing and 500 (20 subjects) – for evalua-
tion based on visual assessment. The data, code, and model are available at https://github.com/c-hepburn/Bone_MRI.
Human and model performance were assessed in terms of Dice coefficient.
Results: Intra-reader median Dice coefficients, evaluated from comparison of manual segmentation trials of inflamma-
tory lesions, were 0.63 and 0.69 for the two readers, respectively. Inter-reader median Dice was in the range of 0.53 to
0.56 and increased to 0.84 using the semi-automated approach. Deep learning model ensemble showed average Dice of
0.94 in subchondral bone segmentation.
Conclusions: We describe a semi-automated, AI-assisted workflow which improves the objectivity and consistency of
radiological segmentation of inflammatory load in SIJs.
1
impractical in clinical practice. The drawback of manual to consider supervised deep learning to automate segmenta-
segmentation of specific areas of bone [10] was addressed in tion [12]. Given the challenges in inflammation assessment,
a recent study, which also aimed to demonstrate computa- the goal of this study was to investigate the reliability of
tional feasibility of a fully automated segmentation of BME segmentation as means to provide a robust reference stan-
[11]. Ultimately, a tool that can automatically detect and dard for algorithm training. As a result, a semi-automated,
quantify inflammation in a reproducible, fast manner is de- AI-assisted workflow to improve observer’s precision and re-
sirable. With advances in artificial intelligence, it is natural duce workload is proposed.
2
30 subjects with confirmed axSpA, qualitative MRI (per subject)
• 30 pre-biological treatment scans
• 30 post-biological treatment scans
Scans to standardize procedures Based on visual Scans for studies
Manual segmentation guidelines (5 different subjects) assessment by reader 1 Manual segmentation of inflammatory lesions on STIR MRI (8 different subjects)
• 2 pre-biological treatment scans • 4 pre-biological treatment scans
• 3 post-biological treatment scans • 4 post-biological treatment scans
Final model training: 248 image slices evaluation (500 image slices; remaining 20 subjects (40 scans))
2.3 Supervised Deep Learning Segmenta- epoch. To reduce individual models’ errors, the optimal
tion of Disease Region on T1W MRI model was re-trained three times and performance of the
ensemble was evaluated on the test data. Finally, the model
248 T1W image slices were used for model training and test- was re-trained three times using all the data and evaluated
ing. The reference standard for algorithm training is de- further on the remaining 40 scans from 20 subjects. A pub-
scribed in the SupMat B. Data was partitioned at subject licly available implementation of the U-Net in Pytorch was
level at random into two sets: 200 image slices for training used (https://github.com/jvanvugt/pytorch-unet).
with four-fold cross validation [14] and 48 image slices for
testing. Further data partition into subsets for training and
folds for validation was performed at subject level to avoid 2.5 Evaluation Metrics
correlation between image slices, used to train and validate
a model. The final model averaging ensemble was further The similarity of a pair of binary segmentations was evalu-
evaluated, based on visual assessment, on the remaining 500 ated with the Dice coefficient, defined for the class of interest
image slices (20 subjects) (Figure 1). (abnormal or background) as the ratio of number of pixels
(voxels) having identical location in both segmentations to
the average of number of pixels (voxels) in each segmenta-
2.4 Model Training tion [20]:
A convolutional neural network with two-dimensional U-Net
architecture [15] (SupMat E) was trained on mini batches by 2|S1 \ S2 |
Dice = (1)
optimizing binary cross entropy loss [16] using the Adam op- |S1 | + |S2 |
timizer [17]. The architecture included batch normalization
to keep the distribution of convolution layers outputs’ fixed, Si,i2{1,2} represents a point set, containing pixel (voxel) co-
allowing faster convergence [18]. The network was trained ordinates, subscript refers to the first or second segmenta-
with pre-processed data augmented on the fly, which allowed tion and S1 \ S2 is the intersection of the sets. We refer to
a substantial increase of the diversity in the training sam- the Dice coefficient as the area or volume overlap, depending
ples (SupMat F). To keep the number of network parameters on whether two segmented areas or volumes were compared.
updates independent of the data set size, 350 augmentation To evaluate model performance during training, a di↵eren-
steps were performed at each training epoch [19]. At each tiable generalization of Dice was used, implemented as [21]:
step, a random batch was selected from the available pool, PN
augmented, and fed into the model. Data shu✏ing ensured 2 i ri pi
Dice = PN PN (2)
that the same batch contained di↵erent image slices every 2
i ri + i p2i
3
where summation runs over pixels of reference standard, of the compared volumes [20]:
ri 2 R and network probability map, pi 2 P . The simi-
larity between two binary segmentation volumes is defined ||S1 | |S2 ||
in terms of absolute volume di↵erence divided by the sum VS =1 (3)
|S1 | + |S2 |
Figure 2: 3D rendering of four manual segmentation trials from both readers for 3 subjects (top to bottom).
4
Figure 3: STIR MRI slices in oblique coronal plane for the same subjects as in the figure 2 (top to bottom). Sum of
manual segmentation trials (second column) indicates pixels that were labelled as abnormal once (dark blue), twice (light
blue), trice (yellow) and four times (red). Corrected automatic segmentations by the two readers (column 3 and 4) with
pixels, segmented at lower (yellow) and both (red) thresholds.
5
Figure 4: Volume overlap evaluated from comparison of di↵erent manual segmentation trials of the same reader for each
reader (a), di↵erent readers (b) and corrected automatic segmentations (thresholding at lower intensity limit), (c). Rij
stands for reader with the first subscript corresponding to the reader and second - to manual segmentation trial, when
applicable. Scatter plot of overlap vs volume similarity from comparison of some trials from both readers (d). Vertical
line corresponds to similarity value when one volume, V is twice larger than the other, V ⇤ .
6
3.2 Semi-automated Segmentation Sensi- pre- and post-treatment scans respectively, indicating a re-
tivity to Change in Inflammatory Load duction in inflammation. Regions of acute inflammation
showed a reduction in extent and intensity after treatment,
Figure 5 shows automatic segmentations corrected by the although there is a persistent, slight hyperintensity com-
second reader on pre- and post-biologic treatment scans pared to the normal interforaminal bone, which manifests
for one subject. Possible inflammatory volume ranges are as an increase in the proportion of inflammation captured
(106.3, 115.3) ⇥103 mm3 and (31.3, 46.3) ⇥103 mm3 for by the lower of the two thresholds.
Figure 5: STIR MRI slices in oblique coronal plane for the same subject (left), pre-biological treatment (top) and
post-biological treatment MRI scan (bottom). Super-imposed corrected by the second reader automatic segmentations
(middle), as well as 3D rendering (right) qualitatively show reduction in inflammation.
7
3.3 Automatic Segmentation of Disease area overlap was chosen. Model averaging ensemble yielded
Region 0.94 average area overlap on the test data, ranging from
0.85 to 0.98 (Figure 6c). Examples of automatically seg-
Optimal hyper-parameters were searched with cross valida- mented disease region and corresponding reference standard
tion procedure by varying network depth, number of kernels are shown in the SupMat K. Model averaging ensemble was
per convolution layer, kernel size and number of epochs. A further evaluated based on visual assessment by the second
batch size of four and a learning rate of 0.001 were kept reader, using the remaining 20 subjects (40 scans). Model
constant. Training was stopped when average area overlap failed or its performance was worse for three subjects with
(section 2.5), evaluated on training data, reached a plateau very abnormal bone, i.e., very high fat content or strong
(Figure 6a). The optimal model did not overfit; however, sclerosis; such features were absent in the training data. For
there were fluctuations in its performance on the fourth val- the rest of the scans, model performance was either perfect
idation fold (Figure 6b). This may be attributed to the fact or subject to minor corrections (SupMat K). Consequently,
that one scan was acquired with di↵erent angulation with three subjects with very abnormal bone were added to the
respect to the other scans. The optimal network was re- training data and model was retrained three times for public
trained three times using all training data (200 T1W image access.
slices). From each run a model yielding the highest average
Figure 6: Mean area overlap vs training epoch for di↵erent training data subsets (a) and validation folds (b). Each point
represents a score, averaged over classes (foreground & background), samples in a mini batch and 350 augmentation
steps. Area overlap from pair-wise comparison of reference standard and rounded prediction on the test data from model
averaging ensemble (three runs using all training data) (c).
8
to that of blood vessels or spinal fluid)” [7]; “At present it on reference standard provided by human observers is not
is not possible to give a more precise definition of the min- desirable. Therefore, a major advantage of the proposed ap-
imum size (area) of the BMO which is necessary for it to proach is that intensity-based judgements are removed from
be described as ‘positive’ ”[2]. The reader’s image interpre- the observer. The use of an intensity-based thresholds de-
tation relies on a subjective perception of brightness, which rived from normal marrow means that the choice of voxels is
is at the origin of variability either in manual segmentation primarily influenced by the physical properties of the tissue
(section 3.1) or scoring [24]. The idea of determining an (with increased signal in areas of inflammation) and allows
optimal threshold to quantify BME has been addressed in for a possible range of inflammatory load. In conclusion,
several studies [10], [25]. However, such analysis relied on we propose a workflow that could be turned into a tool to
manual segmentation and as our data suggest, using man- improve the objectivity and consistency of radiological as-
ual segmentation as a “gold standard” does not necessarily sessment of inflammation in the sacroiliac joints and in a
yield correct answers especially in cases when inflammation long-term goal, prepare a large high-quality dataset needed
is subtle or precise lesion boundary cannot be identified due for deep learning algorithm training to fully automate seg-
to intensity gradient. Furthermore, a recent study [11] re- mentation procedure with the potential to substantially im-
established the threshold value developed in earlier work prove decisions around diagnosis, disease monitoring and
[10], finding an optimal threshold of 1 compared to 1.5 in the treatment.
prior study. It seems logical that a threshold which depends
9
[18] Io↵e S, Szegedy C. Batch normalization: Accelerat- namic gadolinium-enhanced MR imaging. Radiology
ing deep network training by reducing internal covari- 1995;194(2):529–36
ate shift. 32nd Int Conf Mach Learn ICML 2015.
2015;1:448–56 [23] Bollow M, Fischer T, Reißhauer H, Backhaus M,
Sieper J, Hamm B, et al. Quantitative analyses of
[19] Billot B, Bocchetta M, Todd E, Dalca A V., Rohrer JD, sacroiliac biopsies in spondyloarthropathies: T cells and
Iglesias JE. Automated segmentation of the hypothala- macrophages predominate in early and active sacroili-
mus and associated subunits in brain MRI. Neuroimage. itis - Cellularity correlates with the degree of enhance-
2020;223 ment detected by magnetic resonance imaging. Ann
Rheum Dis. 2000;59(2):135–40
[20] Taha AA, Hanbury A. Metrics for evaluating 3D med-
ical image segmentation: Analysis, selection, and tool. [24] Landewé RBM, Hermann KGA, Van Der Heijde
BMC Med Imaging. 2015;15(1) DMFM, Baraliakos X, Jurik AG, Lambert RG, et
al. Scoring sacroiliac joints by magnetic resonance
[21] Milletari F, Navab N, Ahmadi SA. V-Net: Fully con- imaging. A multiple-reader reliability experiment. J
volutional neural networks for volumetric medical im- Rheumatol. 2005;32(10):2050–5
age segmentation. Proc - 2016 4th Int Conf 3D Vi-
sion;565–71 [25] Chronaiou I, Thomsen RS, Huuse EM, Euceda LR,
Pedersen SJ, Ho↵ M, et al. Quantifying bone mar-
[22] Bollow M, Braun J, Hamm B, Eggens U, Schilling row inflammatory edema in the spine and sacroiliac
A, König H, et al. Early sacroiliitis in pa- joints with thresholding. BMC Musculoskelet Disord.
tients with spondyloarthropathy: Evaluation with dy- 2017;18(1):1–8
10
Supplemental Material
A Manual Segmentation of Inflammatory Lesions on STIR MRI
Manual delineation and segmentation were performed by a consultant radiologist and a musculoskeletal radiology fellow
with over 25 and 5 years of musculoskeletal MRI experience, respectively. 13 subjects with pre- and post-biological
treatment scans were selected from the available pool of MRI by the consultant radiologist based on visual assessment to
represent a broad range of inflammatory lesions and varying lesion loads. To facilitate and standardize the segmentation,
five scans were used to establish guidelines for lesion identification, with suggestions for each of the potential difficulties
in interpretation [7] (Figure A.7). Eight scans were used for the study, for which each reader provided two independent
segmentation trials. A third non-radiologist reader checked all segmentations for a presence of empty or residual voxels
and made necessary adjustments. Readers focused on potential lesions in the subchondral bone marrow abutting the
sacroiliac joints, the joint space, and the adjacent lumbar vertebra (L5, mainly seen in the endplates and facet joints).
Lesions located above the L4/5 disc were not segmented as this bone was not consistently covered by FOV. We used ITK
SNAP software (version 3.8.0) for all segmentations ([13]).
Figure A.7: Summary of difficulties, related to manual segmentation, and suggestions to assist diagnostic decision making.
11
fibrocartilaginous (anterior) portion of the SIJs, excluded the central sacral bone and foramina, spinal fluid, intervertebral
discs, and partially the ligamentous part of the SIJs, i.e., irrelevant regions for inflammation assessment (Figure B.9). The
use of the T1W sequence facilitated morphological assessment of the images, especially the facet joints (the anatomy of
these regions was more difficult to evaluate on STIR images). In cases when identification of the border between vertebra
and discs were difficult, STIR images were consulted and segmentation was adjusted accordingly. T1W and STIR images
were acquired in the same plane with the same FOV, enabling usage of disease region mask for consequent segmentation
via thresholding on STIR image.
Figure B.8: STIR MRI slice in oblique coronal plane (left) with super-imposed manual segmentation of normal bone
region (middle). 3D rendering reveals the segmented structure (right).
Figure B.9: T1W MRI slices in oblique coronal plane for one subject (top) with corresponding super-imposed manual
segmentations of disease region (bottom). 3D rendering reveals the segmented structure (right).
12
C Estimation of two Thresholds from Normal Bone Intensity Distribution
Intensities corresponding to outliers and rare observations of the normal bone intensity distribution are attributed either
to noise or increased water content due to natural variation in tissue composition. Therefore, voxels at or above these
intensity levels are more likely to be abnormal. Distributions determined from di↵erent scans were skewed to varying
degrees (Figure C.10a). To standardize the procedure, an upper limit, Lupper was defined as the maximum intensity, Imax
of the distribution. A lower limit, Llower was computed as the sum of the upper quartile, QU and a multiple, n of the
inter-quartile range, IQR of the distribution: Llower = QU + n · IQR. The multiple n was determined automatically as
the empirical lower limit value of QU + 1.5 · IQR resulted in obvious over-segmentation in certain cases (Figure C.11,
Figure C.10b). Starting with the value of 1.5, the multiple n was incremented by 0.05 until the di↵erence between upper
and lower limits was less than the half of the interquartile range, 0 < Lupper Llower < IQR/2. If the condition was
initially satisfied, no incrementation was performed.
Figure C.10: Boxplot of normal bone STIR intensity for ten subjects, showing di↵erent degree of skewness (a). Histogram
of normal bone intensity for one subject (b). Vertical lines represent di↵erent threshold values; resulting segmentations
are presented in the Figure C.11.
Figure C.11: STIR MRI slice in oblique coronal plane for one subject (left) with super-imposed automatic segmentation
via three di↵erent thresholds: QU + 1.5 · IQR (blue), QU + 2.55 · IQR (orange) and maximum intensity (red) of the normal
bone intensity distribution (Figure C.10), with QU upper quartile and IQR inter-quartile range.
13
D Manual Correction of Automatic Segmentation of Inflammatory Lesions
Based on Morphology and Anatomical Location
Regions which were deemed non-inflammatory - for example due to the presence of vessels or artifact - were removed by
readers based on morphology and anatomical location. Lesions were either left in place or removed altogether, i.e., the
boundaries of lesions were not modified, except when the posterior part of the joint or foramen were segmented along with
a potential lesion. Lesions located above the L4/5 disc were removed in consistency with manual segmentations. T1W
images were used to assist readers in identification of anatomical structures and regions of increased fat content. The
two readers discussed and agreed upon the procedure using automatically segmented images from two subjects (Figure 1,
main text).
E U-Net Architecture
Network takes as input a mini batch, containing four di↵erent T1W image slices, and outputs a batch of corresponding
probability maps. Figure E.12 details the operations and illustrates how the size of an individual image slice or a feature
map changes, as the slice / map propagates through the network.
1 16 16 32 16 16 1
concatenation
3362
Convolution 1x1, stride 1 + sigmoid activation function
Input batch size 16 32 32 32 32 Output batch size
[4,1,336,336] [4,1,336,336]
1682
1682
32 64 64 128 64 64
842
842
64 128 128
422
Figure E.12: Schematic representation of the optimal U-Net architecture. Number of channels and matrix size are,
respectively, on the top and in the left corner of a grey shaded box, representing an individual input image slice or a
feature map.
14
F Data Pre-processing and Augmentation
Images were normalized by three standard deviation of image intensity distribution, allowing the same intensity scale
between subjects and consistency in intensity levels of voxels representing the same tissue across the whole image volume
for each subject (Figure F.13). Each pre-processed image and corresponding mask slices undergone elastic deformation
(https://github.com/gvtulder/elasticdeform/tree/v0.4.9), affine transformation (rotation, scaling, shearing) and
random flip with 0.5 probability, i.e., the order of image array elements along the left-right axis was reversed (Figure F.14).
To make the network robust against between-subject variations in the intensity level of bone voxels, intensities were raised
to a random power, altering contrast and brightness. All transformation parameters (rotation angle, scaling and shearing
factors, power) were randomly sampled from uniform distribution of pre-defined ranges.
Figure F.13: T1W image slice in oblique coronal plane for two subjects: raw (a) and normalized data (b) with corre-
sponding (slice) intensity histograms.
15
Figure F.14: Original (pre-processed) T1W image slice in oblique coronal plane with corresponding mask (a). Transformed
image and mask slices (b). Corresponding (slice) intensity histograms indicate change in intensity levels population after
transformation (c).
16
G Examples of Automatic Segmentation of Inflammatory Lesions with no
Manual Correction
Figure G.15: STIR image slices in oblique coronal plane without/with super-imposed automatic segmentation for six
subjects. Pixels, segmented at lower and both thresholds are marked in yellow and red, respectively.
17
H Examples of Manual Segmentation Trials with Disagreement on Lesion
Location
Figure H.16 illustrates specific examples of manual segmentation trials, for which readers agree more on load than location,
evaluated in terms of volume overlap and volume similarity.
Figure H.16: STIR image slices in oblique coronal plane with super-imposed manual segmentation trials from both
readers for two subjects (left). 3D rendering reveals segmented structures (right). Comparison of these trials yielded
volume overlap and similarity values: 0.59 and 0.97 (top), 0.53 and 0.89 (bottom).
18
I Case with Lower Agreement
Figure I.17: STIR image slices in oblique coronal plane for one subject with super-imposed automatic segmentation,
corrected by the two readers. The main source of disagreement is on the potential inflammation in the joint space of SIJs.
J Annotation Time
Following table summarizes time in minutes, taken to manually delineate or segment inflammatory lesions on STIR MRI,
time to remove residues on automatic segmentations and consequently, to score BME, using SPARCC approach.
19
K Examples of Automatic Segmentation of Disease Region
Figure K.18 shows examples of automatically segmented disease region and corresponding reference standard. Figure K.19
and Figure K.20 show automatic segmentation of disease region from seven subjects, for which the reference standard
was not available. In cases with very abnormal bone (Figure K.19) model missed some regions. Figure K.20 shows the
examples of segmentation nearly at the level of human performance.
Figure K.18: T1W MRI slices in oblique coronal plane for two ‘test’ subjects (top to bottom). Reference standard (middle)
and model averaging ensemble (rounded) prediction of disease region (right).
20
Figure K.19: T1W image slices in oblique coronal plane (top) for three subjects with super-imposed model averaging
ensemble rounded prediction (bottom). Subjects exhibit very abnormal bone: high fat content (hyperintense signal; left,
middle) and strong sclerosis (hypointense signal; right) were partially missed by the model ensemble.
Figure K.20: T1W image slices in oblique coronal plane (top) for four subjects with super-imposed model averaging
ensemble rounded prediction (bottom).
21