[go: up one dir, main page]

0% found this document useful (0 votes)
63 views21 pages

pdf2106.11343.pdf 6

Uploaded by

tungolcild
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
63 views21 pages

pdf2106.11343.pdf 6

Uploaded by

tungolcild
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 21

Towards Deep Learning-assisted Quantification of Inflammation in

Spondyloarthritis: Intensity-based Lesion Segmentation


Carolyna Hepburn1 , Hui Zhang1 , Juan Eugenio Iglesias1,2,3 , Alexis Jones4 , Alan Bainbridge5 , Timothy
J.P. Bray∗5 , and Margaret A. Hall-Craggs*5
1
Centre for Medical Image Computing, Department of Computer Science, University College London, UK
2
Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School, Boston, USA
3
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Boston, USA
4
Department of Rheumatology, University College London Hospitals NHS Foundation Trust, London, UK
arXiv:2106.11343v1 [eess.IV] 21 Jun 2021

5
Centre for Medical Imaging, University College London, London, UK

Abstract
Purpose: To develop a semi-automated, AI-assisted workflow for segmentation of inflammatory lesions on STIR MRI
of sacroiliac joints (SIJs) in adult patients with axial spondyloarthritis.
Methods: Baseline human performance in manual segmentation of inflammatory lesions was first established in eight
patients with axial spondyloarthritis recruited within a prospective study conducted between April 2018 and July 2019.
To improve readers’ consistency a semi-automated procedure was developed, comprising (1) manual segmentation of
‘normal bone’ and ‘disease’ regions (2) automatic segmentation of lesions, i.e., voxels in the disease region with outlying
intensity with respect to the normal bone, and (3) human intervention to remove erroneously segmented areas. Segmen-
tation of disease region (subchondral bone) was automated via supervised deep learning; 200 image slices (eight subjects)
were used for algorithm training with cross validation, 48 (two subjects) – for testing and 500 (20 subjects) – for evalua-
tion based on visual assessment. The data, code, and model are available at https://github.com/c-hepburn/Bone_MRI.
Human and model performance were assessed in terms of Dice coefficient.
Results: Intra-reader median Dice coefficients, evaluated from comparison of manual segmentation trials of inflamma-
tory lesions, were 0.63 and 0.69 for the two readers, respectively. Inter-reader median Dice was in the range of 0.53 to
0.56 and increased to 0.84 using the semi-automated approach. Deep learning model ensemble showed average Dice of
0.94 in subchondral bone segmentation.
Conclusions: We describe a semi-automated, AI-assisted workflow which improves the objectivity and consistency of
radiological segmentation of inflammatory load in SIJs.

1 Introduction observer to make binary decisions and are laborious. Fur-


thermore, visual assessment is a predominantly intensity-
based approach: ”the stronger the hyperintense signal the
Axial spondyloarthritis (axSpA) is a rheumatic disease that
more likely it reflects active inflammation” [7]. Absence
primarily a↵ects axial joints of the spine and causes pain,
of the intensity threshold between normal and abnormal
sti↵ness, and disability [1]. A hallmark of early disease is
bone makes it difficult to define subtle lesions or bound-
inflammation in the subchondral bone marrow of sacroil-
aries. This impacts on whether lesions meet the minimal
iac joints, referred to as bone marrow edema (BME) [2].
areas to be determined as abnormal [2] and causes variabil-
In current clinical practice inflammation is typically de-
ity in STIR MRI interpretation. One approach to address-
tected with short tau inversion recovery (STIR) magnetic
ing the quantification problem is to segment inflammatory
resonance imaging (MRI) [3]. Accurate and precise identi-
lesions and thus estimate inflammatory load. The use of
fication and quantification of inflammation is necessary for
thresholding as a segmentation method can assist the ob-
therapeutic decision making to facilitate e↵ective treatment,
server with intensity-based judgments, reducing subjectivity
control pain and prevent disability. At present, there are
in image interpretation [8]. A further benefit of segmenta-
several methods for assessing inflammation on STIR MRI
tion is that lesion location is specified. Recently proposed
and these are based on visual assessment. In clinical prac-
semi-automated methods for detecting and quantifying only
tice, the extent of inflammation is described verbally with no
BME also addressed these issues [9], [10]. However, these
numerical metric to enable quantification. Clinical research
approaches require the user either to identify lesion location
employs semi-quantitative scoring methods of BME [4], [5],
- with the potential that some lesions might be missed - or to
[6] however there are limitations to this approach. These
manually segment bone, which is time consuming and thus
methods evaluate only a limited number of slices, force the
∗ Joint senior authorship

1
impractical in clinical practice. The drawback of manual to consider supervised deep learning to automate segmenta-
segmentation of specific areas of bone [10] was addressed in tion [12]. Given the challenges in inflammation assessment,
a recent study, which also aimed to demonstrate computa- the goal of this study was to investigate the reliability of
tional feasibility of a fully automated segmentation of BME segmentation as means to provide a robust reference stan-
[11]. Ultimately, a tool that can automatically detect and dard for algorithm training. As a result, a semi-automated,
quantify inflammation in a reproducible, fast manner is de- AI-assisted workflow to improve observer’s precision and re-
sirable. With advances in artificial intelligence, it is natural duce workload is proposed.

2 Materials and Methods uated in a quantitative manner (section 3.1). To mitigate


variability a procedure was developed, comprising the fol-
2.1 Data lowing steps (see Figure 1 for study design overview):
Data were taken from a completed prospective longitudinal - Manual segmentation of interforaminal sacral bone,
study (16 females, 14 males; mean age 42.7 years). The which is typically spared from inflammation [5], re-
study was conducted at University College London Hospital ferred to as normal bone (SupMat B)
between April 2018 and July 2019 with the aim of evalu-
ating responsiveness and response prediction using quanti-
- Estimation of two thresholds from the normal bone
tative imaging biomarkers and performed with institutional
intensity distribution, allowing for a possible range of
review board approval (REC reference 15/LO/1475). Pa-
inflammatory loads and less abrupt lesion boundaries
tients diagnosed with axSpA according to 2009 ASAS crite-
(SupMat C)
ria [7] underwent pre- and post-biologic therapy MRI scans
of SIJs (n=30). Patients were excluded if they had a con-
traindication to MRI scanning. The protocol consisted of - Manual segmentation of the subcortical bone (iliac
STIR and T1-weighted (T1W) turbo spin echo sequences, and sacral) of the fibrocartilaginous (anterior) portion
acquired in an oblique coronal plane with the same FOV of the SIJs (SupMat B), the most inflamed bone in
on a 3T Philips Ingenia scanner (Philips Healthcare, Best, axSpA, referred to as the disease region. Segmenta-
Netherlands). MRI parameters are detailed below. tion was initially performed for three subjects on T1W
images, which were used to pre-train a deep learning
- STIR: repetition time (TR) 5316ms, echo time (TE) algorithm to facilitate further segmentation. The al-
60ms, inversion time 210ms, echo train length (ETL) gorithm was subsequently re-trained once all segmen-
21, flip angle 90deg, slice thickness 3mm, spacing be- tations had been corrected (section 2.3, section 3.3)
tween slices 3.3mm, pixel spacing 0.59 ⇥ 0.59 (mm),
image matrix 336 ⇥ 336, number of slices 23-25 - Thresholding within the disease region: voxels in the
mask were assigned labels of 0, 1 and 2 if voxel inten-
- T1-weighted TSE: TR 1012ms, TE 8ms, ETL 3, flip sity at the corresponding location in the STIR image
angle 90deg, slice thickness 3mm, spacing between was below, equal, or above lower and upper intensity
slices 3.3mm, pixel spacing 0.59 ⇥ 0.59 (mm), image limit.
matrix 336 ⇥ 336, number of slices 23-25
- Automatic removal of erroneously segmented regions,
2.2 Manual and Semi-automated Segmen- containing < 4 pixels (= 1.39mm2 ), on each slice, due
to noise and small vessels
tation of Inflammatory Lesions on
STIR MRI
- Manual correction of the final segmentation by a hu-
Human performance in manual segmentation of inflamma- man observer based on morphology and anatomical
tory lesions from two readers (SupMat A, [13]) was eval- location (SupMat D)

2
30 subjects with confirmed axSpA, qualitative MRI (per subject)
• 30 pre-biological treatment scans
• 30 post-biological treatment scans
Scans to standardize procedures Based on visual Scans for studies
Manual segmentation guidelines (5 different subjects) assessment by reader 1 Manual segmentation of inflammatory lesions on STIR MRI (8 different subjects)
• 2 pre-biological treatment scans • 4 pre-biological treatment scans
• 3 post-biological treatment scans • 4 post-biological treatment scans

Procedure to remove residues on automatic segmentations of Reader 1 Reader 2


inflammatory lesions (2 subjects: pre-biological treatment scans) Two independent trials (𝑅11 , 𝑅12 ) Two independent trials (𝑅21 , 𝑅22 )

Intra-Reader Variability Study Inter-Reader Variability Study


• 𝑅11 𝑣𝑠 𝑅12 • 𝑅11 𝑣𝑠 𝑅21
Scans for deep learning algorithm training and testing • 𝑅21 𝑣𝑠 𝑅22 • 𝑅11 𝑣𝑠 𝑅22
• 𝑅12 𝑣𝑠 𝑅21
Automating segmentation of disease region on T1W MRI (10 different subjects)
• 𝑅12 𝑣𝑠 𝑅22
*
* Data split at
training: 200 image slices test
subject level
subset 1
*
fold 1 Automatic segmentation of inflammatory lesions on STIR MRI
via two thresholds within disease region
Hyper-parameters fold 2
search fold 3 Reader 1 Reader 2
Removal of residues (𝑅1 ) Removal of residues (𝑅2 )
fold 4
subset 4 Inter-Reader Variability Study
• 𝑅1 𝑣𝑠 𝑅2
Optimal model training testing (48 image slices)

Final model training: 248 image slices evaluation (500 image slices; remaining 20 subjects (40 scans))

Figure 1: Dataflow and use in study.

2.3 Supervised Deep Learning Segmenta- epoch. To reduce individual models’ errors, the optimal
tion of Disease Region on T1W MRI model was re-trained three times and performance of the
ensemble was evaluated on the test data. Finally, the model
248 T1W image slices were used for model training and test- was re-trained three times using all the data and evaluated
ing. The reference standard for algorithm training is de- further on the remaining 40 scans from 20 subjects. A pub-
scribed in the SupMat B. Data was partitioned at subject licly available implementation of the U-Net in Pytorch was
level at random into two sets: 200 image slices for training used (https://github.com/jvanvugt/pytorch-unet).
with four-fold cross validation [14] and 48 image slices for
testing. Further data partition into subsets for training and
folds for validation was performed at subject level to avoid 2.5 Evaluation Metrics
correlation between image slices, used to train and validate
a model. The final model averaging ensemble was further The similarity of a pair of binary segmentations was evalu-
evaluated, based on visual assessment, on the remaining 500 ated with the Dice coefficient, defined for the class of interest
image slices (20 subjects) (Figure 1). (abnormal or background) as the ratio of number of pixels
(voxels) having identical location in both segmentations to
the average of number of pixels (voxels) in each segmenta-
2.4 Model Training tion [20]:
A convolutional neural network with two-dimensional U-Net
architecture [15] (SupMat E) was trained on mini batches by 2|S1 \ S2 |
Dice = (1)
optimizing binary cross entropy loss [16] using the Adam op- |S1 | + |S2 |
timizer [17]. The architecture included batch normalization
to keep the distribution of convolution layers outputs’ fixed, Si,i2{1,2} represents a point set, containing pixel (voxel) co-
allowing faster convergence [18]. The network was trained ordinates, subscript refers to the first or second segmenta-
with pre-processed data augmented on the fly, which allowed tion and S1 \ S2 is the intersection of the sets. We refer to
a substantial increase of the diversity in the training sam- the Dice coefficient as the area or volume overlap, depending
ples (SupMat F). To keep the number of network parameters on whether two segmented areas or volumes were compared.
updates independent of the data set size, 350 augmentation To evaluate model performance during training, a di↵eren-
steps were performed at each training epoch [19]. At each tiable generalization of Dice was used, implemented as [21]:
step, a random batch was selected from the available pool, PN
augmented, and fed into the model. Data shu✏ing ensured 2 i ri pi
Dice = PN PN (2)
that the same batch contained di↵erent image slices every 2
i ri + i p2i

3
where summation runs over pixels of reference standard, of the compared volumes [20]:
ri 2 R and network probability map, pi 2 P . The simi-
larity between two binary segmentation volumes is defined ||S1 | |S2 ||
in terms of absolute volume di↵erence divided by the sum VS =1 (3)
|S1 | + |S2 |

3 Results 1 and 2 and in the range 0.53-0.56, respectively. There were


occasions when the volumes from manual trials were dis-
3.1 Manual vs Semi-automated Seg- similar, a↵ecting Dice score (Figure 4d) and very similar,
mentation of Inflammatory Lesions: however, readers disagreed on lesion location (SupMat H).
Intra/Inter-Reader Variability Metric values depended on the trials compared. For the
semi-automated pipeline, the between-reader median vol-
Figure 2 shows the 3D rendering of some manual segmen- ume overlap was 0.84, representing an increase of 28-31%
tation trials of inflammatory lesions from both readers. It compared to pure manual segmentation (Figure 4c). There
demonstrates qualitatively di↵erences in lesion location, de- was one outlier where the agreement was reduced; review of
gree of fragmentation and load between trials. Figure 3 the images indicated that the disagreement mostly related to
shows the sum of manual trials, indicating the difficulties the presence of inflammation in the joint space, where blood
in image interpretation, namely, subtle inflammation and vessels can be misinterpreted (SupMat I). There was a time
unclear lesion boundaries, and its’ semi-automated coun- gain for the semi-automated pipeline of 13.34min on aver-
terpart. Examples of automatic segmentation via thresh- age for reader 1 and 3.12min on average for reader 2 com-
olding with no correction by readers are presented in the pared to pure manual delineation. Visual scoring of BME
SupMat G. Readers’ self-consistency and agreement, evalu- (SPARCC method [4]), using automatic segmentations with
ated in terms of volume overlap, are shown in Figure 4a, b. removed residues on pre-defined slices took 3 min on average
Overlap values ranged from 0.28 to 0.87. Intra- and inter- for reader 1 (SupMat J).
reader median overlap values were 0.63 and 0.69 for reader

Figure 2: 3D rendering of four manual segmentation trials from both readers for 3 subjects (top to bottom).

4
Figure 3: STIR MRI slices in oblique coronal plane for the same subjects as in the figure 2 (top to bottom). Sum of
manual segmentation trials (second column) indicates pixels that were labelled as abnormal once (dark blue), twice (light
blue), trice (yellow) and four times (red). Corrected automatic segmentations by the two readers (column 3 and 4) with
pixels, segmented at lower (yellow) and both (red) thresholds.

5
Figure 4: Volume overlap evaluated from comparison of di↵erent manual segmentation trials of the same reader for each
reader (a), di↵erent readers (b) and corrected automatic segmentations (thresholding at lower intensity limit), (c). Rij
stands for reader with the first subscript corresponding to the reader and second - to manual segmentation trial, when
applicable. Scatter plot of overlap vs volume similarity from comparison of some trials from both readers (d). Vertical
line corresponds to similarity value when one volume, V is twice larger than the other, V ⇤ .

6
3.2 Semi-automated Segmentation Sensi- pre- and post-treatment scans respectively, indicating a re-
tivity to Change in Inflammatory Load duction in inflammation. Regions of acute inflammation
showed a reduction in extent and intensity after treatment,
Figure 5 shows automatic segmentations corrected by the although there is a persistent, slight hyperintensity com-
second reader on pre- and post-biologic treatment scans pared to the normal interforaminal bone, which manifests
for one subject. Possible inflammatory volume ranges are as an increase in the proportion of inflammation captured
(106.3, 115.3) ⇥103 mm3 and (31.3, 46.3) ⇥103 mm3 for by the lower of the two thresholds.

Figure 5: STIR MRI slices in oblique coronal plane for the same subject (left), pre-biological treatment (top) and
post-biological treatment MRI scan (bottom). Super-imposed corrected by the second reader automatic segmentations
(middle), as well as 3D rendering (right) qualitatively show reduction in inflammation.

7
3.3 Automatic Segmentation of Disease area overlap was chosen. Model averaging ensemble yielded
Region 0.94 average area overlap on the test data, ranging from
0.85 to 0.98 (Figure 6c). Examples of automatically seg-
Optimal hyper-parameters were searched with cross valida- mented disease region and corresponding reference standard
tion procedure by varying network depth, number of kernels are shown in the SupMat K. Model averaging ensemble was
per convolution layer, kernel size and number of epochs. A further evaluated based on visual assessment by the second
batch size of four and a learning rate of 0.001 were kept reader, using the remaining 20 subjects (40 scans). Model
constant. Training was stopped when average area overlap failed or its performance was worse for three subjects with
(section 2.5), evaluated on training data, reached a plateau very abnormal bone, i.e., very high fat content or strong
(Figure 6a). The optimal model did not overfit; however, sclerosis; such features were absent in the training data. For
there were fluctuations in its performance on the fourth val- the rest of the scans, model performance was either perfect
idation fold (Figure 6b). This may be attributed to the fact or subject to minor corrections (SupMat K). Consequently,
that one scan was acquired with di↵erent angulation with three subjects with very abnormal bone were added to the
respect to the other scans. The optimal network was re- training data and model was retrained three times for public
trained three times using all training data (200 T1W image access.
slices). From each run a model yielding the highest average

Figure 6: Mean area overlap vs training epoch for di↵erent training data subsets (a) and validation folds (b). Each point
represents a score, averaged over classes (foreground & background), samples in a mini batch and 350 augmentation
steps. Area overlap from pair-wise comparison of reference standard and rounded prediction on the test data from model
averaging ensemble (three runs using all training data) (c).

4 Discussion the lower threshold. In practice, the observer could adjust


the lower intensity limit. This, however, would introduce
We first investigated the reliability of segmentation of in- further variability, that needs to be quantified. Secondly,
flammatory lesions on qualitative STIR MRI of adult pa- although the approach mitigates the variability, it does
tients with axSpA to provide a robust reference standard not eliminate it. Therefore, preparation of a high-quality
for deep learning algorithm training. Based on quantita- dataset for definitive algorithm training for use in clinical
tive evaluation of human performance in manual segmenta- practice requires radiologists’ consensus. Accurate detec-
tion we developed a semi-automated procedure to mitigate tion and quantification of inflammatory load on STIR MRI
between-reader variability with the following steps (1) seg- is difficult, in large part because there is no defined intensity
mentation of subchondral bone region, where inflammation threshold between normal and abnormal water content re-
typically occurs, and normal (sacral) bone region, spared sulting from increase in cellularity and/or expansion of the
from inflammation, (2) automatic segmentation of inflam- extracellular space [22], [23]. Furthermore, several sources
matory lesions within subchondral bone via two thresholds, influence the way we interpret the signal: small increases
determined from the intensity distribution of ‘normal bone’ in water content indistinguishable from natural variations
voxels, and (3) human intervention to remove erroneously in bone marrow composition, confounding factors such as
segmented areas. We showed that this procedure produced a sclerosis and fat metaplasia masking abnormal signal, qual-
substantial improvement in segmentation consistency com- ity of fat suppression, sensitivity of the receiving coil, noise
pared to manual segmentation. To ensure the practicality of etc. These give rise to ambiguity in definition of any type
the approach, we automated segmentation of the subchon- of inflammatory lesion visible on STIR MRI: “The stronger
dral bone region via supervised deep learning. One of the the hyperintense signal the more likely it reflects active in-
limitations of the proposed approach is the fixed value of flammation (intensity of the hyperintense signal is similar

8
to that of blood vessels or spinal fluid)” [7]; “At present it on reference standard provided by human observers is not
is not possible to give a more precise definition of the min- desirable. Therefore, a major advantage of the proposed ap-
imum size (area) of the BMO which is necessary for it to proach is that intensity-based judgements are removed from
be described as ‘positive’ ”[2]. The reader’s image interpre- the observer. The use of an intensity-based thresholds de-
tation relies on a subjective perception of brightness, which rived from normal marrow means that the choice of voxels is
is at the origin of variability either in manual segmentation primarily influenced by the physical properties of the tissue
(section 3.1) or scoring [24]. The idea of determining an (with increased signal in areas of inflammation) and allows
optimal threshold to quantify BME has been addressed in for a possible range of inflammatory load. In conclusion,
several studies [10], [25]. However, such analysis relied on we propose a workflow that could be turned into a tool to
manual segmentation and as our data suggest, using man- improve the objectivity and consistency of radiological as-
ual segmentation as a “gold standard” does not necessarily sessment of inflammation in the sacroiliac joints and in a
yield correct answers especially in cases when inflammation long-term goal, prepare a large high-quality dataset needed
is subtle or precise lesion boundary cannot be identified due for deep learning algorithm training to fully automate seg-
to intensity gradient. Furthermore, a recent study [11] re- mentation procedure with the potential to substantially im-
established the threshold value developed in earlier work prove decisions around diagnosis, disease monitoring and
[10], finding an optimal threshold of 1 compared to 1.5 in the treatment.
prior study. It seems logical that a threshold which depends

References Initial experience with a new method. Am J Roentgenol.


2004;182(6):1399–403
[1] Sieper J, Braun J, Dougados M, Baeten D. Axial
spondyloarthritis. Nat Rev Dis Prim. 2015;1(7):1-17 [9] Zarco P, Almodóvar R, Bueno Á, Molinero LM, Moreno
M, Juanola X, et al. Development and validation of
[2] Rudwaleit M, Jurik AG, Hermann KGA, Landewé R, SCAISS, a tool for semi-automated quantification of
Van Der Heijde D, Baraliakos X, et al. Defining active sacroilitis by magnetic resonance in spondyloarthritis.
sacroiliitis on magnetic resonance imaging (MRI) for Rheumatol Int. 2018;38(10):1919–26
classification of axial spondyloarthritis: A consensual
approach by the ASAS/OMERACT MRI group. Ann [10] Kucybala I, Tabor Z, Polak J, Urbanik A, Wo-
Rheum Dis. 2009;68(10):1520–7 jciechowski W. The semi-automated algorithm for
the detection of bone marrow oedema lesions in pa-
[3] Maksymowych WP. The role of imaging in the diag- tients with axial spondyloarthritis. Rheumatol Int.
nosis and management of axial spondyloarthritis. Nat 2020;40(4):625–33
Rev Rheumatol. 2019;15(11):657–72
[11] Rzecki K, Kucybala I, Gut D, Jarosz A, Nabaglo T, Ta-
[4] Maksymowych WP, Inman RD, Salonen D, Dhillon bor Z, Wojciechowski W. Fully automated algorithm for
SS, Williams M, Stone M, et al. Spondyloarthritis the detection of bone marrow oedema lesions in patients
Research Consortium of Canada magnetic resonance with axial spondyloarthritis – Feasibility study. Biocy-
imaging index for assessment of sacroiliac joint inflam- bernetics and Biomedical Engineering 2021;41(2):833-
mation in ankylosing spondylitis. Arthritis Care Res. 53
2005;53(5):703–9
[12] Lundervold AS, Lundervold A. An overview of deep
[5] Maksymowych WP, Dhillon SS, Chiowchanwisawakit learning in medical imaging focusing on MRI. Z Med
P, Pedersen SJ, Martinez B, Østergaard M et al. De- Phys. 2019;29(2):102–27
velopment and Validation of Web-based Training Mod-
[13] Yushkevich PA, Piven J, Hazlett HC, Smith RG,
ules for Systematic Evaluation of Active Inflammatory
Ho S, Gee JC, et al. User-guided 3D active con-
Lesions in the Spine and Sacroiliac Joints in Spondy-
tour segmentation of anatomical structures: Signifi-
loarthritis. J Rheumatol. 2009;84:48-57
cantly improved efficiency and reliability. Neuroimage
[6] Navallas M, Ares J, Beltrán B, Lisbona MP, Maymó 2006;31(3):1116–28
J, Solano A. Sacroiliitis associated with axial spondy-
[14] Casella G, Fienberg S, Olkin I. An Introduction to Sta-
loarthropathy: New concepts and latest trends. Radio-
tistical Learning. Springer Texts in Statistics 2013, 618
graphics. 2013;33(4):933–56
p.
[7] Sieper J, Rudwaleit M, Baraliakos X, Brandt J, Braun
[15] Ronneberger O, Fischer P, Brox T. U-net: Convo-
J, Burgos-Vargas R, et al. The Assessment of Spondy-
lutional networks for biomedical image segmentation.
loArthritis international Society (ASAS) handbook: A
MICCAI 2015;9351:234–41
guide to assess spondyloarthritis. Ann Rheum Dis.
2009;68(SUPPL. 2):ii1-ii44 [16] Bishop CM. Neural Network for Pattern Recognition.
Clarendon Press, Oxford, 1995
[8] Mayerhoefer ME, Breitenseher M, Hofmann S, Aigner
N, Meizer R, Siedentop H, et al. Computer-assisted [17] Kingma DP, Ba J. Adam: A Method for Stochastic Op-
quantitative analysis of bone marrow edema of the knee: timization. arXiv:1412.6980 [cs] 2017

9
[18] Io↵e S, Szegedy C. Batch normalization: Accelerat- namic gadolinium-enhanced MR imaging. Radiology
ing deep network training by reducing internal covari- 1995;194(2):529–36
ate shift. 32nd Int Conf Mach Learn ICML 2015.
2015;1:448–56 [23] Bollow M, Fischer T, Reißhauer H, Backhaus M,
Sieper J, Hamm B, et al. Quantitative analyses of
[19] Billot B, Bocchetta M, Todd E, Dalca A V., Rohrer JD, sacroiliac biopsies in spondyloarthropathies: T cells and
Iglesias JE. Automated segmentation of the hypothala- macrophages predominate in early and active sacroili-
mus and associated subunits in brain MRI. Neuroimage. itis - Cellularity correlates with the degree of enhance-
2020;223 ment detected by magnetic resonance imaging. Ann
Rheum Dis. 2000;59(2):135–40
[20] Taha AA, Hanbury A. Metrics for evaluating 3D med-
ical image segmentation: Analysis, selection, and tool. [24] Landewé RBM, Hermann KGA, Van Der Heijde
BMC Med Imaging. 2015;15(1) DMFM, Baraliakos X, Jurik AG, Lambert RG, et
al. Scoring sacroiliac joints by magnetic resonance
[21] Milletari F, Navab N, Ahmadi SA. V-Net: Fully con- imaging. A multiple-reader reliability experiment. J
volutional neural networks for volumetric medical im- Rheumatol. 2005;32(10):2050–5
age segmentation. Proc - 2016 4th Int Conf 3D Vi-
sion;565–71 [25] Chronaiou I, Thomsen RS, Huuse EM, Euceda LR,
Pedersen SJ, Ho↵ M, et al. Quantifying bone mar-
[22] Bollow M, Braun J, Hamm B, Eggens U, Schilling row inflammatory edema in the spine and sacroiliac
A, König H, et al. Early sacroiliitis in pa- joints with thresholding. BMC Musculoskelet Disord.
tients with spondyloarthropathy: Evaluation with dy- 2017;18(1):1–8

10
Supplemental Material
A Manual Segmentation of Inflammatory Lesions on STIR MRI
Manual delineation and segmentation were performed by a consultant radiologist and a musculoskeletal radiology fellow
with over 25 and 5 years of musculoskeletal MRI experience, respectively. 13 subjects with pre- and post-biological
treatment scans were selected from the available pool of MRI by the consultant radiologist based on visual assessment to
represent a broad range of inflammatory lesions and varying lesion loads. To facilitate and standardize the segmentation,
five scans were used to establish guidelines for lesion identification, with suggestions for each of the potential difficulties
in interpretation [7] (Figure A.7). Eight scans were used for the study, for which each reader provided two independent
segmentation trials. A third non-radiologist reader checked all segmentations for a presence of empty or residual voxels
and made necessary adjustments. Readers focused on potential lesions in the subchondral bone marrow abutting the
sacroiliac joints, the joint space, and the adjacent lumbar vertebra (L5, mainly seen in the endplates and facet joints).
Lesions located above the L4/5 disc were not segmented as this bone was not consistently covered by FOV. We used ITK
SNAP software (version 3.8.0) for all segmentations ([13]).

Figure A.7: Summary of difficulties, related to manual segmentation, and suggestions to assist diagnostic decision making.

B Manual Segmentation of Normal Bone on STIR and Disease Region on


T1W MRI
Manual segmentation of normal bone and disease region was performed by a non-radiologist reader for ten subjects on
STIR and three subjects on T1W images, respectively. A supervised deep learning algorithm was pre-trained with 75
image slices (three subjects) and used to segment disease region on the remaining T1W images from seven subjects. All
segmentations were verified and corrected by the two radiologists. The normal bone region was segmented on multiple
slices and included the largest possible area of normal interforaminal bone to minimise sampling error and more accurately
represent its intensity distribution (Figure B.8). The disease region, i.e., the subcortical bone (iliac and sacral) of the

11
fibrocartilaginous (anterior) portion of the SIJs, excluded the central sacral bone and foramina, spinal fluid, intervertebral
discs, and partially the ligamentous part of the SIJs, i.e., irrelevant regions for inflammation assessment (Figure B.9). The
use of the T1W sequence facilitated morphological assessment of the images, especially the facet joints (the anatomy of
these regions was more difficult to evaluate on STIR images). In cases when identification of the border between vertebra
and discs were difficult, STIR images were consulted and segmentation was adjusted accordingly. T1W and STIR images
were acquired in the same plane with the same FOV, enabling usage of disease region mask for consequent segmentation
via thresholding on STIR image.

Figure B.8: STIR MRI slice in oblique coronal plane (left) with super-imposed manual segmentation of normal bone
region (middle). 3D rendering reveals the segmented structure (right).

Figure B.9: T1W MRI slices in oblique coronal plane for one subject (top) with corresponding super-imposed manual
segmentations of disease region (bottom). 3D rendering reveals the segmented structure (right).

12
C Estimation of two Thresholds from Normal Bone Intensity Distribution
Intensities corresponding to outliers and rare observations of the normal bone intensity distribution are attributed either
to noise or increased water content due to natural variation in tissue composition. Therefore, voxels at or above these
intensity levels are more likely to be abnormal. Distributions determined from di↵erent scans were skewed to varying
degrees (Figure C.10a). To standardize the procedure, an upper limit, Lupper was defined as the maximum intensity, Imax
of the distribution. A lower limit, Llower was computed as the sum of the upper quartile, QU and a multiple, n of the
inter-quartile range, IQR of the distribution: Llower = QU + n · IQR. The multiple n was determined automatically as
the empirical lower limit value of QU + 1.5 · IQR resulted in obvious over-segmentation in certain cases (Figure C.11,
Figure C.10b). Starting with the value of 1.5, the multiple n was incremented by 0.05 until the di↵erence between upper
and lower limits was less than the half of the interquartile range, 0 < Lupper Llower < IQR/2. If the condition was
initially satisfied, no incrementation was performed.

Figure C.10: Boxplot of normal bone STIR intensity for ten subjects, showing di↵erent degree of skewness (a). Histogram
of normal bone intensity for one subject (b). Vertical lines represent di↵erent threshold values; resulting segmentations
are presented in the Figure C.11.

Figure C.11: STIR MRI slice in oblique coronal plane for one subject (left) with super-imposed automatic segmentation
via three di↵erent thresholds: QU + 1.5 · IQR (blue), QU + 2.55 · IQR (orange) and maximum intensity (red) of the normal
bone intensity distribution (Figure C.10), with QU upper quartile and IQR inter-quartile range.

13
D Manual Correction of Automatic Segmentation of Inflammatory Lesions
Based on Morphology and Anatomical Location
Regions which were deemed non-inflammatory - for example due to the presence of vessels or artifact - were removed by
readers based on morphology and anatomical location. Lesions were either left in place or removed altogether, i.e., the
boundaries of lesions were not modified, except when the posterior part of the joint or foramen were segmented along with
a potential lesion. Lesions located above the L4/5 disc were removed in consistency with manual segmentations. T1W
images were used to assist readers in identification of anatomical structures and regions of increased fat content. The
two readers discussed and agreed upon the procedure using automatically segmented images from two subjects (Figure 1,
main text).

E U-Net Architecture
Network takes as input a mini batch, containing four di↵erent T1W image slices, and outputs a batch of corresponding
probability maps. Figure E.12 details the operations and illustrates how the size of an individual image slice or a feature
map changes, as the slice / map propagates through the network.

1 16 16 32 16 16 1

concatenation

Padded convolution 3x3, stride 1 + batch normalization + ReLU activation function


Max pooling 2x2, stride 2
Bilinear interpolation + convolution 1x1, stride 1
3362

3362
Convolution 1x1, stride 1 + sigmoid activation function
Input batch size 16 32 32 32 32 Output batch size
[4,1,336,336] [4,1,336,336]
1682

1682

32 64 64 128 64 64
842

842

64 128 128
422

Figure E.12: Schematic representation of the optimal U-Net architecture. Number of channels and matrix size are,
respectively, on the top and in the left corner of a grey shaded box, representing an individual input image slice or a
feature map.

14
F Data Pre-processing and Augmentation
Images were normalized by three standard deviation of image intensity distribution, allowing the same intensity scale
between subjects and consistency in intensity levels of voxels representing the same tissue across the whole image volume
for each subject (Figure F.13). Each pre-processed image and corresponding mask slices undergone elastic deformation
(https://github.com/gvtulder/elasticdeform/tree/v0.4.9), affine transformation (rotation, scaling, shearing) and
random flip with 0.5 probability, i.e., the order of image array elements along the left-right axis was reversed (Figure F.14).
To make the network robust against between-subject variations in the intensity level of bone voxels, intensities were raised
to a random power, altering contrast and brightness. All transformation parameters (rotation angle, scaling and shearing
factors, power) were randomly sampled from uniform distribution of pre-defined ranges.

Figure F.13: T1W image slice in oblique coronal plane for two subjects: raw (a) and normalized data (b) with corre-
sponding (slice) intensity histograms.

15
Figure F.14: Original (pre-processed) T1W image slice in oblique coronal plane with corresponding mask (a). Transformed
image and mask slices (b). Corresponding (slice) intensity histograms indicate change in intensity levels population after
transformation (c).

16
G Examples of Automatic Segmentation of Inflammatory Lesions with no
Manual Correction

Figure G.15: STIR image slices in oblique coronal plane without/with super-imposed automatic segmentation for six
subjects. Pixels, segmented at lower and both thresholds are marked in yellow and red, respectively.

17
H Examples of Manual Segmentation Trials with Disagreement on Lesion
Location
Figure H.16 illustrates specific examples of manual segmentation trials, for which readers agree more on load than location,
evaluated in terms of volume overlap and volume similarity.

Figure H.16: STIR image slices in oblique coronal plane with super-imposed manual segmentation trials from both
readers for two subjects (left). 3D rendering reveals segmented structures (right). Comparison of these trials yielded
volume overlap and similarity values: 0.59 and 0.97 (top), 0.53 and 0.89 (bottom).

18
I Case with Lower Agreement

Figure I.17: STIR image slices in oblique coronal plane for one subject with super-imposed automatic segmentation,
corrected by the two readers. The main source of disagreement is on the potential inflammation in the joint space of SIJs.

J Annotation Time
Following table summarizes time in minutes, taken to manually delineate or segment inflammatory lesions on STIR MRI,
time to remove residues on automatic segmentations and consequently, to score BME, using SPARCC approach.

19
K Examples of Automatic Segmentation of Disease Region
Figure K.18 shows examples of automatically segmented disease region and corresponding reference standard. Figure K.19
and Figure K.20 show automatic segmentation of disease region from seven subjects, for which the reference standard
was not available. In cases with very abnormal bone (Figure K.19) model missed some regions. Figure K.20 shows the
examples of segmentation nearly at the level of human performance.

Figure K.18: T1W MRI slices in oblique coronal plane for two ‘test’ subjects (top to bottom). Reference standard (middle)
and model averaging ensemble (rounded) prediction of disease region (right).

20
Figure K.19: T1W image slices in oblique coronal plane (top) for three subjects with super-imposed model averaging
ensemble rounded prediction (bottom). Subjects exhibit very abnormal bone: high fat content (hyperintense signal; left,
middle) and strong sclerosis (hypointense signal; right) were partially missed by the model ensemble.

Figure K.20: T1W image slices in oblique coronal plane (top) for four subjects with super-imposed model averaging
ensemble rounded prediction (bottom).

21

You might also like