
1 Introduction

Evidence of the benefits of point-of-care ultrasound (POCUS) continues to grow. For instance, ultrasound provides emergency physicians with access to real-time clinical information that can help reduce time to diagnosis [1]. Time is always a precious resource in the emergency department. Fast and accurate ultrasound examinations, particularly of the heart, can help avoid severe complications or accelerate transfer of the patient to specialized departments for more thorough cardiac evaluation. Typically, emergency physicians perform a focused cardiac ultrasound (FOCUS) to assess the presence of pericardial effusion and tamponade [2], left ventricular ejection, ventricular equality, exit (aortic root diameter), and entrance (inferior vena cava diameter and respirophasic variation) [3]. Clinicians typically use one or more of three imaging windows and five views: parasternal long-axis (PLAX), parasternal short-axis (PSSA), apical four-chamber (A4C), subcostal long-axis (SCLA), and subcostal four-chamber (SC4C). Additionally, an apical two-chamber (A2C) view may be used to evaluate all parts of the myocardium. Due to time constraints in the emergency department, a diagnosis is usually made from two of the five views, if patient mobility and habitus allow. Finding these target views is particularly challenging for untrained physicians; it typically requires significant training and experience.

Scanning Assistant:

To assist less experienced physicians in performing rapid echocardiographic assessment and to improve the use of ultrasound in emergency care, we propose an acquisition guidance system (Fig. 1a–b) that enables accurate placement of the ultrasound probe at the correct position and orientation with respect to the heart anatomy. An intuitive user interface (Fig. 1a) provides acquisition assistance at all three commonly used imaging windows (apical, parasternal, and subcostal) and for the majority of target views specified by the FOCUS protocol [1], including PLAX, PSSA, A4C, A2C, and SC4C. Importantly, this navigation system is solely image-based and does not rely on any external tracking devices.

Fig. 1. User interface of the scanning assistant (a) tested on a human subject (b) using a commercially available mobile ultrasound system (Lumify, Philips). A key feature of the system is that it provides feedback in addition to probe motion guidance (c). For instance, during scanning the physician might lose acoustic coupling or position the probe directly on the rib cage; the system then informs the user that image quality needs to be improved before further guidance can be provided. Additionally, the scanning assistant detects the current imaging window to guide the user through the exam workflow and adjusts imaging settings, such as penetration depth, accordingly.

Deep Learning for Ultrasound:

Image-based guidance in transthoracic echocardiography is non-trivial due to the likely presence of reverberation clutter, acoustic shadowing, cardiac and respiratory motion, as well as patients' anatomical and physiological variability. Deep convolutional neural networks (CNNs) can be trained to extract high-level features with large spatial context, making them applicable to such complex problems. Consequently, deep learning has significant advantages over standard machine learning methods: previous methods developed for ultrasound images that required manual selection of features [4] were recently outperformed by deep learning in tasks such as view classification [5, 6] and segmentation [7].

Here we propose a fully end-to-end solution based on a multi-task CNN model, which (a) assesses whether the image quality is sufficient for guidance, (b) identifies one of three typical imaging windows (apical, parasternal, or subcostal), and (c) predicts the motion of the transducer towards the desired imaging plane (see Fig. 1c).
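The following minimal sketch (not the authors' implementation) illustrates how these three outputs could drive a single guidance step; the `model` object, the output ordering, and the placement of the low-quality class are assumptions for illustration only.

```python
import numpy as np

QUALITY_THRESHOLD = 0.5            # assumed inclusion threshold (cf. Fig. 3)
WINDOWS = ["apical", "parasternal", "subcostal"]

def guidance_step(model, frame):
    """Run one guidance iteration on a single pre-processed ultrasound frame."""
    rotation, translation, class_probs = model.predict(frame[np.newaxis, ...])
    # The last class is assumed here to be the low-quality (LQ) label.
    if class_probs[0, -1] > QUALITY_THRESHOLD:
        return {"status": "improve_image_quality"}   # e.g. poor coupling, rib shadow
    window = WINDOWS[int(np.argmax(class_probs[0, :-1]))]
    return {
        "status": "guide",
        "window": window,                        # used to adjust depth / workflow step
        "rotation_to_target": rotation[0],       # 3-DOF rotation towards the target view
        "translation_to_target": translation[0], # 3-DOF translation towards the target view
    }
```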

Key Contributions:

To the best of the authors' knowledge, this paper is the first to propose a solely image-based user guidance system for point-of-care transthoracic echocardiography that is ready to be deployed on commercial mobile ultrasound scanners, such as the Philips Lumify. This fully end-to-end solution uses a multi-task deep convolutional neural network to predict the relative motion of the transducer towards diagnostically relevant views, to assess image quality, and to identify one of the commonly used imaging windows. The key contributions include:

  • a new technique dedicated to transthoracic echocardiography to achieve entirely image-based navigation with millimeter-level accuracy,

  • a method that quantitatively guides the positioning of the transducer at five target views (PLAX, PSSA, A4C, A2C, and SC4C) from three different imaging windows (apical, parasternal, and subcostal),

  • a new light-weight multi-task deep convolutional neural network architecture that regresses both 3-DOF rotation and 3-DOF translation, and classifies ultrasound images by quality and imaging window,

  • a solution with potential clinical deployment on a mobile device or similar hardware with limited memory and computing capabilities.

2 Methods

2.1 Data Collection and Labeling

All datasets were obtained from healthy human subjects (N = 30) by a well-trained sonographer using a commercial handheld, mobile, USB-based ultrasound system (Lumify, Philips). Each loop consisted of a large number of frames. These frames were acquired at all three imaging windows (apical, parasternal, and subcostal) and covered all five views defined in the FOCUS protocol [3] (i.e. PLAX, PSSA, A4C, A2C, and SC4C); see Fig. 2c. Each cardiac ultrasound frame within a dataset was automatically labelled using a custom-made data acquisition system based on optical tracking. A schematic representation of our acquisition system is shown in Fig. 2a. For ease of annotation, each acquisition was started at one of three reference target views (A4C, PLAX, SC4C), which implicitly defined three standardized coordinate systems, one per acoustic window. Positions of the remaining frames were determined relative to these coordinate systems. For simplicity, guidance accuracy was evaluated only for these reference views. The remaining target views (A2C, PSSA) were identified by an expert echocardiographer. Importantly, optical tracking was used only to collect the ground-truth data and never during the application of the system.

Fig. 2. (a) A schematic representation of the acquisition system. A rigid probe marker consisting of retroreflective spheres is mounted and calibrated to the ultrasound probe. An additional patient marker is attached to the patient's chest using an adjustable belt to account for unexpected patient motion. Images from the portable ultrasound device are acquired and synchronized with the optical tracking system, which estimates the 6-DOF poses of both the probe and the patient marker; (b) Coordinate system of the ultrasound probe; (c) Target views defined by the FOCUS protocol towards which our algorithm can guide the user. From left: four-chamber (A4C) and two-chamber (A2C) views from the apical imaging window, short-axis (PSSA) and long-axis (PLAX) views from the parasternal imaging window, and the subcostal four-chamber (SC4C) view. An example of a low-quality (LQ) frame is also provided.

Two rigid markers, each consisting of four retroreflective spheres (NDI Medical), were attached to the ultrasound probe via a custom-made adapter and to the patient's chest using an adjustable belt (see Fig. 2b). The transformation between probe and image (\( ^{\text{probe}}T_{\text{image}} \)) was obtained using a custom-made wire-based ultrasound phantom, similar to the one described in [8]. A patient marker, which establishes the heart coordinate system, was used to account for unexpected motion of the heart with respect to the tracking system. The poses \( T \in SE(3) \) of both the probe (\( ^{\text{tracker}}T_{\text{probe}} \)) and the patient marker (\( ^{\text{tracker}}T_{\text{patient}} \)) were estimated via a stereoscopic optical camera (Polaris Vega, NDI Medical) and synchronized with the ultrasound images acquired with the portable ultrasound device. All images were then labelled with 3D rigid transformations calculated relative to the reference image in the heart coordinate system:

$$ ^{\text{patient}}T_{\text{image}} = \left( ^{\text{tracker}}T_{\text{patient}} \right)^{-1} \cdot {}^{\text{tracker}}T_{\text{probe}} \cdot {}^{\text{probe}}T_{\text{image}} $$
(1)
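As an illustration, Eq. (1) amounts to a single composition of homogeneous transforms per frame; the sketch below (not the authors' code) assumes 4x4 NumPy matrices whose names mirror the notation above.

```python
import numpy as np

def label_frame(T_tracker_patient, T_tracker_probe, T_probe_image):
    """Return patient_T_image = (tracker_T_patient)^-1 . tracker_T_probe . probe_T_image (Eq. 1)."""
    return np.linalg.inv(T_tracker_patient) @ T_tracker_probe @ T_probe_image
```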

An expert echocardiographer identified all low-quality (LQ) frames in each dataset. We considered LQ images to be those with poor acoustic coupling or containing organs other than the heart. The remaining frames, covering the three different imaging windows, were considered high-quality (HQ), i.e. sufficient for our algorithm to extract features and make predictions. Stored datasets were divided into two separate sets: (a) a development dataset (N = 27 subjects; 590,000 frames), from which 80% of cases were randomly chosen to train the weights of the CNN and 20% for validation, and (b) a test dataset (N = 3 subjects; 21,000 frames, including 10,000, 7,000, 1,500, and 2,500 frames for the apical, parasternal, subcostal, and LQ classes respectively) consisting of data points the model was not trained on. The accuracy of the algorithm was evaluated only on unseen test cases in order to determine the generalizability of the model. Tracking accuracy was evaluated only on HQ frames by calculating the average absolute angular error along each axis (rotation) and the mean absolute distance (translation) to the target. Classification performance was assessed by the area under the receiver operating characteristic (ROC) curve (AUC).
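One plausible implementation of these evaluation metrics is sketched below; computing the per-axis angular error via an Euler-angle decomposition of the residual rotation is an assumption rather than the authors' exact procedure.

```python
import numpy as np
from scipy.spatial.transform import Rotation
from sklearn.metrics import roc_auc_score

def angular_errors_deg(R_pred, R_true):
    """Absolute per-axis angular error (degrees) of the residual rotation."""
    residual = Rotation.from_matrix(R_pred.T @ R_true)
    return np.abs(residual.as_euler("xyz", degrees=True))

def translation_error_mm(t_pred, t_true):
    """Absolute Euclidean distance between predicted and true translation (mm)."""
    return np.linalg.norm(t_pred - t_true)

def per_class_auc(y_true, y_prob):
    """One-vs-rest AUC per class (cf. Fig. 3a); y_prob holds per-frame softmax outputs."""
    return [roc_auc_score((y_true == c).astype(int), y_prob[:, c])
            for c in range(y_prob.shape[1])]
```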

2.2 Model Development

The primary feature extractor was a SqueezeNet-style CNN with eight so-called fire modules followed by one convolutional layer and global average pooling [9]. This architecture was designed for limited-memory systems and provides high energy efficiency on mobile devices [10]. The primary CNN simultaneously predicts rotation and translation for all five target views and classifies the three acoustic windows, thus sharing features among all these tasks. For the rotation and translation tasks, we added two separate regression layers with a π·tanh activation function after the primary feature extractor, as described in [11, 12]. For the classification task, a softmax classification layer was added after global average pooling. The total loss function was defined as:

$$ \text{loss}_{\text{total}} = \lambda \cdot \text{loss}_{\text{rotation}} + \alpha \cdot \text{loss}_{\text{translation}} + \gamma \cdot \text{loss}_{\text{classification}} $$
(2)
$$ \text{loss}_{\text{rotation}} = \cos^{-1}\left[ \frac{\operatorname{tr}\left( \hat{R}^{T} R \right) - 1}{2} \right] $$
(3)
$$ \text{loss}_{\text{translation}} = \frac{1}{N}\sum\nolimits_{i = 1}^{N} \left( \hat{t}_{i} - t_{i} \right)^{2} \left| \left\langle \frac{\hat{t}}{\left\| \hat{t} \right\|}, \frac{t}{\left\| t \right\|} \right\rangle - 1 \right| $$
(4)
$$ \text{loss}_{\text{classification}} = - \sum\nolimits_{c = 1}^{M} y_{c} \log\left( p_{c} \right) $$
(5)

where \( \lambda, \alpha, \gamma \) are hyperparameters used to balance the rotation, translation, and classification losses respectively; \( R \in SO(3) \) is a rotation matrix; \( t = \left[ t_{1} \ldots t_{N} \right]^{T} \), with \( N = 3 \), is the translation vector; \( \left\langle \cdot , \cdot \right\rangle \) denotes the inner product of two vectors; and \( \text{loss}_{\text{classification}} \) is the cross-entropy loss for the multi-class classification task with \( M = 4 \) classes.
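A TensorFlow sketch of Eqs. (2)–(5) is given below as one possible reading of the loss terms; the conversion of the 3-DOF rotation output to a rotation matrix, and the clipping added for numerical safety, are assumptions.

```python
import tensorflow as tf

def rotation_loss(R_pred, R_true):
    # Geodesic distance between rotation matrices, Eq. (3).
    trace = tf.linalg.trace(tf.matmul(R_pred, R_true, transpose_a=True))
    cos_angle = tf.clip_by_value((trace - 1.0) / 2.0, -1.0, 1.0)  # numerical safety
    return tf.reduce_mean(tf.acos(cos_angle))

def translation_loss(t_pred, t_true):
    # Squared error weighted by the directional-mismatch term of Eq. (4).
    mse = tf.reduce_mean(tf.square(t_pred - t_true), axis=-1)
    cos_sim = tf.reduce_sum(tf.math.l2_normalize(t_pred, axis=-1) *
                            tf.math.l2_normalize(t_true, axis=-1), axis=-1)
    return tf.reduce_mean(mse * tf.abs(cos_sim - 1.0))

def total_loss(R_pred, R_true, t_pred, t_true, y_true, y_prob,
               lam=1.0, alpha=1.0, gamma=1.0):
    # Weighted sum of Eq. (2); the classification term is cross entropy, Eq. (5).
    classification = tf.reduce_mean(
        tf.keras.losses.categorical_crossentropy(y_true, y_prob))
    return (lam * rotation_loss(R_pred, R_true) +
            alpha * translation_loss(t_pred, t_true) +
            gamma * classification)
```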

The model was trained in TensorFlow using the RMSprop optimizer with a batch size of 32 and an initial learning rate of 0.0001, decayed every 4.7M iterations with an exponential rate of 0.5. Batch normalization, a weight decay of 0.0005, and early stopping were used as regularization techniques. All loss-function hyperparameters \( (\lambda, \alpha, \gamma) \) were set to 1. Ultrasound images were converted from Cartesian to polar space and randomly augmented during training using various ultrasound-specific techniques, including injection of reverberation clutter, alteration of penetration depth, changes of gain and aspect ratio, as well as post-modification of the TGC curve.
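The sketch below assembles the pieces described in this section into a Keras/TensorFlow 2.x model and training configuration; the layer widths, pooling placement, rotation parameterization, and the use of built-in losses as stand-ins for Eqs. (3)–(5) are illustrative assumptions, not the exact configuration used in this work.

```python
import math
import tensorflow as tf
from tensorflow.keras import layers, regularizers

L2 = regularizers.l2(0.0005)  # "weight decay of 0.0005"

def fire_module(x, squeeze, expand):
    s = layers.Conv2D(squeeze, 1, activation="relu", kernel_regularizer=L2)(x)
    e1 = layers.Conv2D(expand, 1, activation="relu", kernel_regularizer=L2)(s)
    e3 = layers.Conv2D(expand, 3, padding="same", activation="relu",
                       kernel_regularizer=L2)(s)
    return layers.Concatenate()([e1, e3])

def build_model(input_shape=(224, 224, 1), n_classes=4):
    inputs = tf.keras.Input(shape=input_shape)
    x = layers.Conv2D(64, 3, strides=2, activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(3, strides=2)(x)
    for squeeze, expand in [(16, 64)] * 2 + [(32, 128)] * 2 + \
                           [(48, 192)] * 2 + [(64, 256)] * 2:   # eight fire modules
        x = fire_module(x, squeeze, expand)
    x = layers.Conv2D(512, 1, activation="relu")(x)
    features = layers.GlobalAveragePooling2D()(x)

    # pi*tanh bounds the 3-DOF rotation output to (-pi, pi), as described above.
    rotation = layers.Dense(3, activation=lambda t: math.pi * tf.tanh(t),
                            name="rotation")(features)
    translation = layers.Dense(3, name="translation")(features)
    window = layers.Dense(n_classes, activation="softmax", name="window")(features)
    return tf.keras.Model(inputs, [rotation, translation, window])

model = build_model()
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4, decay_steps=4_700_000, decay_rate=0.5)
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=lr_schedule),
    # Built-in losses stand in here for the custom terms sketched above.
    loss={"rotation": "mse", "translation": "mse",
          "window": "categorical_crossentropy"},
    loss_weights={"rotation": 1.0, "translation": 1.0, "window": 1.0})
```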

3 Results

In the vicinity of the target views (see Fig. 2c), the average absolute angular accuracy was 2.5 ± 1.4°, 2.4 ± 1.8°, and 5.5 ± 5.0° around the x, y, z axes respectively (see Table 1).

Table 1. Accuracy of the system as a function of distance \( d \) and angle \( a_{i} \) around each axis (x, y, z), where \( i \in \left\{ {1 \ldots 3} \right\} \); \( \bar{a} \) represents the average absolute angular error, and \( \bar{d} \) represents the absolute translation error with respect to three target views: A4C, PLAX, and SC4C.

The average absolute translation accuracy was 2.0 ± 1.6 mm. The overall accuracy decreased as the distance to the target position increased. For instance, the predicted translational error measured at distances above 20 mm from the target view was significantly higher (p < 0.0001, unpaired two-tailed t-test) than below 5 mm (5.6 ± 4.7 mm vs. 2.0 ± 1.6 mm respectively). The average classification accuracy was 98% and 89% for imaging-window identification and low-quality frame detection respectively (see Fig. 3b–c). ROC curves with associated AUCs are shown in Fig. 3a.
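For reference, the reported comparison corresponds to a standard unpaired, two-tailed t-test; the sketch below uses randomly generated placeholder error arrays (parameterized by the reported means and standard deviations) purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
errors_near = rng.normal(2.0, 1.6, size=500).clip(min=0)  # within 5 mm (placeholder data)
errors_far = rng.normal(5.6, 4.7, size=500).clip(min=0)   # beyond 20 mm (placeholder data)

t_stat, p_value = stats.ttest_ind(errors_far, errors_near)  # unpaired, two-tailed by default
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```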

Fig. 3. (a) Pairwise comparison of receiver operating characteristic (ROC) curves for the classification tasks, i.e. each class is shown with respect to the other classes. The ROC curves are similar, with AUCs ranging from 0.98 to 0.99 for the apical (A), subcostal (S), and parasternal (P) windows (mean 0.97); (b) A confusion matrix for identification of low-quality (LQ) frames with respect to high-quality (HQ) frames shows accuracies ranging from 83% to 95% for LQ and HQ respectively, where HQ comprises the apical, parasternal, and subcostal views; (c) A confusion matrix for imaging-window classification shows accuracies ranging from 97% to 99% for the subcostal (S), parasternal (P), and apical (A) views respectively. The confusion matrices were normalized by the number of cases in each class. The inclusion threshold was set to 0.5.

4 Conclusion and Future Work

A solely image-based scan guidance system for point-of-care transthoracic echocardiography was developed and evaluated on unseen, independent in vivo datasets. Our deep learning-based algorithm was trained using a multi-task learning paradigm. A single neural network was used to (a) detect and exclude ultrasound frames whose quality is not sufficient for guidance, (b) identify one of three typical imaging windows (apical, parasternal, or subcostal) to guide the user through the exam workflow, and (c) predict 6-DOF motion of the transducer towards clinically relevant views, such as the four-chamber or long-axis views. Finding an optimal acoustic window to image the heart can be challenging, especially for technically difficult patients; our system could accelerate this phase of the examination by providing an objective measure of image quality, and herein we demonstrated 95% accuracy for high-quality image classification. Moreover, we demonstrated that the ultrasound probe could be guided to three pre-defined reference target views with an average rotational accuracy of 3.3 ± 2.6° when the probe was close to the target (<5 mm). The lowest rotational accuracy was observed around the z-axis, mostly because angles about this axis have the largest span, ranging from 0 to π. This accuracy may be sufficient to perform all assessments relevant in the acute/emergency setting, including detecting a pericardial effusion, assessing left ventricular ejection and ventricular equality, and recognizing cardiac arrest. We noticed that the overall system accuracy decreased with distance to the target: for instance, accuracy decreased to 5.4 ± 4.2° and 5.6 ± 4.7 mm for rotation and translation respectively when the distance to the target exceeded 20 mm. This behavior could be attributed to the smaller coverage of these regions by the training instances. Because the probe position is adjusted in a step-wise, iterative manner, this behavior is not considered a limitation of our approach. In the future, incorporating a series of previous predictions via recurrent layers, such as Long Short-Term Memory (LSTM) units, could further enhance accuracy away from the target location [6].

In addition, our CNN architecture had only 1.2M parameters and required 5 MB of storage memory. Hence, our method could be readily deployed on commercial portable ultrasound systems. Initial tests on a premium mobile device, with TensorFlow Lite and hardware acceleration enabled, demonstrated an average frame rate of 25 Hz.
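As an illustration of such a deployment path, the sketch below converts a trained Keras model to TensorFlow Lite; `model` refers to the trained network (e.g. from the sketch in Sect. 2.2) and the output file name is a placeholder.

```python
import tensorflow as tf

# `model` is the trained Keras model; the file name below is a placeholder.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # e.g. post-training quantization
tflite_model = converter.convert()
with open("cardiac_guidance.tflite", "wb") as f:
    f.write(tflite_model)
```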

Despite promising results, the main limitations of this study are the small training dataset size and the inclusion of only healthy subjects, which may limit the performance of the algorithm for technically difficult patients or patients with abnormal physiological conditions. Further work will include adding data from a larger number of subjects, including patients with impaired cardiac function.