U-Net Based Dual-Pooling Segmentation of Bone Metastases in Thoracic SPECT Bone Scintigrams
1. Introduction
Bone metastasis is a prevalent complication of malignant tumors, commonly observed in solid tumors such as breast, prostate, and lung cancers, with an incidence ranging from 30% to 75% [1]. During metastasis, tumor cells interact with osteoblasts, osteoclasts and bone stromal cells and disrupt bone tissue, potentially leading to pain, fractures, dysfunction, and psychological distress that significantly impair patients’ quality of life [2]. However, early symptoms of bone metastasis are often subtle, and the onset of pain often signifies that the optimal window for treatment has passed. At present, patients can mitigate the risk of mortality only through screening, early diagnosis and prompt intervention [3].
Single Photon Emission Computed Tomography (SPECT) is a cost-effective functional imaging modality with high detection sensitivity and is commonly used clinically as a screening tool for bone metastases. However, due to the disparities between imaging modalities and the human body, SPECT images suffer from drawbacks such as poor spatial resolution, low specificity, and susceptibility to noise [4]. Although the fidelity and quantitative accuracy of SPECT images have markedly improved with advancements in imaging technologies and innovations in imaging agents, differences in the skill of medical personnel and various subjective and objective factors may still lead to incorrect diagnoses. To enhance the efficiency and accuracy of doctors’ diagnoses, researchers are developing computer-aided diagnosis systems that automate medical image analysis, in which machine learning algorithms, particularly deep learning methods, play a pivotal role. Convolutional Neural Networks (CNNs) can learn features automatically without manual feature design, and their efficacy has been demonstrated in several studies [5] [6] [7] [8] [9].
Medical image segmentation is the application of image segmentation to the medical field. By delineating regions of interest, it serves as the foundation for extracting pathological regions, conducting clinical trials, measuring specific tissues, and facilitating three-dimensional reconstruction. Researchers have applied image segmentation techniques to lesion segmentation, with numerous achievements across modalities such as CT [10] [11], MRI [12] [13], and ultrasound [14]. However, studies on SPECT images, which are characterized by poor imaging quality and limited manipulable objects, remain relatively scarce and mostly focus on binary classification or binary segmentation tasks [15] [16] [17] [18] [19]. Lin et al. [18] devised a Res-U-Net model by integrating residual modules [20] into the U-Net [21] base model and conducted experiments on an extended dataset of 2280 SPECT lung cancer thorax images, achieving CPA, Recall, and IoU values of 0.7721, 0.6788, and 0.6103, respectively. In another study, Cao et al. [19] addressed the poor imaging quality of SPECT images by incorporating data fusion techniques to enhance data quality. They also introduced dense connectivity and deep supervision into the model, resulting in improved segmentation performance.
It is noteworthy that most current research on SPECT images builds on existing networks developed for natural images or other modalities. Consequently, few designs are tailored specifically to SPECT images. Building on these observations, our work focuses on adapting existing networks to SPECT and combining them with existing methodologies. First, we analyzed the clinical diagnostic pattern of nuclear medicine physicians, who compare hotspots with their periphery to screen suspicious hotspots, and used a dual-pooling module to balance highlighting the foreground against preserving background information, which brings the network closer to clinical diagnostic practice. Second, we addressed the multiscale nature of bone metastatic lesions by borrowing the solution of CE-Net [22] and redesigning the receptive field configuration. Finally, we evaluated the segmentation performance of the constructed network on clinical data. The primary contributions of this work are outlined as follows:
Firstly, we tackle the research problem of lesion segmentation in SPECT images, recognizing significant research potential in this direction. Secondly, we endeavor to enhance the segmentation performance of the network by improving its adaptability to SPECT images. Finally, we use a set of previously unseen clinical SPECT images of lung cancer patients as a test set to evaluate the model’s performance.
2. Materials and Methods
2.1. Materials
All data used in the experiments were clinical data acquired during bone metastasis examinations of patients at the Department of Nuclear Medicine, Gansu Cancer Hospital, from January 2014 to December 2019. A total of 410 images of 205 patients were collected. The 99mTc-MDP bone imaging agent was injected into patients without allergic reactions, and after several hours of metabolism, anterior and posterior views of each patient were acquired with a Siemens multispectral gamma camera. The result for each view is a 16-bit unsigned matrix of size 1024 × 256, where each element represents the radiation intensity at that position and the pixel spacing is 2.16 mm. We selected 306 of the 410 images and extracted the patients’ thoracic regions.
2.2. Method
CNNs have dominated the field of computer vision and spawned a series of semantic segmentation models. Among them, the U-Net network stands out for its effectiveness in addressing the challenge of medical segmentation with small sample sizes, and it has gradually become a standard architecture in the field. Figure 1 illustrates the proposed model architecture based on U-Net in this work.
On the whole, U-Net is an encoder-decoder architecture, consisting of a contracting path on the left and an expansive path on the right. The contracting path extracts image features by stacking multiple combinations of convolution and pooling. Similarly, the expansive path stacks multiple combinations of convolution and up-sampling operations to restore the original size of the image, and fuses the corresponding features of the two paths through skip connections to retain more location information.
Figure 1. The overall architecture diagram of the proposed model.
2.2.1. Dual-Pooling and Feature Fusion
A CNN is a special type of deep neural network (DNN). Generally, a pooling layer is inserted after each convolution block. The pooling layer provides a degree of invariance for the network and, by eliminating some connections between convolution layers, reduces the computational cost of subsequent layers and helps control overfitting.
In deep neural networks, pooling can be implemented in various ways, with max pooling and average pooling being the most prevalent in CNNs. Both divide the input into rectangular pooling regions: average pooling down-samples by computing the mean of each region, while max pooling takes the maximum of each region. The difference between the two is that feature maps obtained by max pooling are more sensitive to texture information, whereas feature maps obtained by average pooling are more sensitive to background information. At present, semantic segmentation networks such as U-Net and U-Net++ [23] usually use max pooling for down-sampling. These networks were not originally designed for SPECT images.
However, in the routine diagnosis of SPECT images, the determination of hotspots needs to take the background into account. For example, the target-to-background ratio, which compares the radioactivity of tumor tissue with that of designated normal tissue, is a semi-quantitative indicator for identifying lesions. Average pooling in the down-sampling process is therefore better suited to SPECT images. At the same time, with small samples, the texture features retained by max pooling can reduce the model’s demand for data. We therefore down-sample the network feature maps with max pooling and average pooling in parallel and fuse the two results to provide richer features for the model.
Classical feature fusion generally uses either addition or channel concatenation. The former adds two features element by element, similar to a superposition of information. The latter concatenates the two feature maps to increase the feature dimension, while the information contained in each channel remains unchanged. However, because the two pooling results come from the same feature maps and therefore share a certain degree of spatial consistency, we base the fusion on the idea of coordinate attention [24] and modify it into a dual-input module. Specifically, coordinate attention decomposes channel attention into two one-dimensional feature encoding processes, which aggregate features along the two spatial directions. In this way, long-range dependencies can be captured along one spatial direction, while accurate location information is preserved along the other. To combine the two inputs effectively, we encode the max pooling results along the two directions and multiply the resulting weights with the average pooling results. The modified feature fusion module is shown in Figure 2.
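The parallel down-sampling described above can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: the function names `pool2d` and `dual_pool`, the 2 × 2 window, and the (H, W, C) layout are hypothetical choices, not the authors' implementation.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling over an (H, W, C) feature map (assumed layout)."""
    h, w, c = x.shape
    h2, w2 = h // size, w // size
    # Crop so H and W are divisible by `size`, then split into size x size blocks.
    blocks = x[:h2 * size, :w2 * size].reshape(h2, size, w2, size, c)
    if mode == "max":
        return blocks.max(axis=(1, 3))   # keeps the strongest response (hotspot)
    return blocks.mean(axis=(1, 3))      # keeps the local background level

def dual_pool(x, size=2):
    """Return the max-pooled and average-pooled maps of the same input."""
    return pool2d(x, size, "max"), pool2d(x, size, "avg")

# A single hot pixel on a low background: max pooling keeps the peak,
# average pooling reflects how the peak compares with its surroundings.
patch = np.full((2, 2, 1), 10.0)
patch[0, 0, 0] = 200.0
peak, background_aware = dual_pool(patch)
print(peak[0, 0, 0], background_aware[0, 0, 0])
```

Both outputs have the same spatial size, which is what allows them to be fused in the next step.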
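A minimal NumPy sketch of such a dual-input fusion is given below. It is a simplified reading of the description, not the module in Figure 2: the function name `dual_input_fusion` is hypothetical, the matrices `w_h` and `w_w` stand in for learned 1 × 1 convolutions, and the shared transform and normalization layers of the original coordinate attention design are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dual_input_fusion(f_max, f_avg, w_h, w_w):
    """Weight the average-pooled map by directional attention from the max-pooled map.

    f_max, f_avg: (H, W, C) maps from max and average pooling of the same input.
    w_h, w_w:     (C, C) matrices, hypothetical stand-ins for learned 1x1 convolutions.
    """
    # Encode the max-pooled map along each spatial direction.
    z_h = f_max.mean(axis=1, keepdims=True)   # (H, 1, C): aggregate over width
    z_w = f_max.mean(axis=0, keepdims=True)   # (1, W, C): aggregate over height
    # Turn the two 1D encodings into attention weights in (0, 1).
    a_h = sigmoid(z_h @ w_h)                  # (H, 1, C)
    a_w = sigmoid(z_w @ w_w)                  # (1, W, C)
    # Broadcasting applies a row weight and a column weight to every position.
    return f_avg * a_h * a_w

rng = np.random.default_rng(0)
f_max = rng.random((8, 8, 4))
f_avg = rng.random((8, 8, 4))
fused = dual_input_fusion(f_max, f_avg, rng.normal(size=(4, 4)), rng.normal(size=(4, 4)))
print(fused.shape)
```

Compared with addition or concatenation, this form lets the texture-sensitive branch decide where the background-sensitive branch is emphasized, without increasing the channel count.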
2.2.2. Multi-Scale Feature Extraction
Bone metastatic lesions exhibit considerable variability in size. We introduced modules designed in CE-Net and redesigned the size of the receptive field.
DAC module: The dense atrous convolution (DAC) module improves the convolution operation by using four cascading branches with receptive fields of 3, 7, 9 and 19 to encode the high-level semantic feature maps, and finally adds the original features through a skip connection. Generally, convolutions with large receptive fields extract more abstract features for large objects, while convolutions with small receptive fields are better suited to small objects. By combining dilated (atrous) convolutions with different dilation rates, DAC blocks can extract features of objects of different sizes.
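The receptive fields quoted above can be checked with a short calculation. For stride-1 convolutions stacked in series, the receptive field is 1 plus the sum of (kernel size − 1) × dilation over the layers. The branch layout below (3 × 3 kernels with dilation rates 1, 3 and 5, each multi-layer branch ending in a 1 × 1 convolution) is an assumption taken from the CE-Net design; the text here only states the resulting receptive fields.

```python
def receptive_field(layers):
    """Receptive field of stride-1 convolutions applied in series.

    layers: list of (kernel_size, dilation) pairs.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf

# Assumed branch layout, following the CE-Net DAC block.
DAC_BRANCHES = {
    "branch 1": [(3, 1)],
    "branch 2": [(3, 3), (1, 1)],
    "branch 3": [(3, 1), (3, 3), (1, 1)],
    "branch 4": [(3, 1), (3, 3), (3, 5), (1, 1)],
}

for name, layers in DAC_BRANCHES.items():
    print(name, receptive_field(layers))
```

Under this assumed layout the four branches give 3, 7, 9 and 19, matching the values stated for the module.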
Figure 2. Schematic diagram of the fusion module structure.
RMP module: One of the challenges in medical image segmentation is the large variation in target size. For example, middle- or late-stage tumors may be much larger than early tumors. Residual multi-kernel pooling (RMP) encodes global context information through multiple pooling branches of different sizes. In this experiment, we used 2 × 2, 3 × 3, 4 × 4 and 5 × 5 pooling to construct the module. To reduce the dimensionality of the weights and the computational cost, each pooling operation is followed by a 1 × 1 convolution that reduces the number of channels to 1. The low-dimensional feature maps are then up-sampled to the size of the original feature map. These results are concatenated with the original features along the channel dimension to generate feature maps with more information.
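The RMP data flow (pool, reduce to one channel, up-sample, concatenate) can be sketched in NumPy as follows. Several details are assumptions made for illustration: max pooling with stride equal to the kernel size, nearest-neighbour up-sampling, and the hypothetical names `rmp`, `max_pool` and `upsample_nearest`; the `(C, 1)` weight matrices stand in for the learned 1 × 1 convolutions.

```python
import numpy as np

def max_pool(x, k):
    """k x k max pooling with stride k over an (H, W, C) map (assumed pooling type)."""
    h, w, c = x.shape
    h2, w2 = h // k, w // k
    return x[:h2 * k, :w2 * k].reshape(h2, k, w2, k, c).max(axis=(1, 3))

def upsample_nearest(x, out_h, out_w):
    """Nearest-neighbour up-sampling to (out_h, out_w); a simplification."""
    h, w, _ = x.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[rows][:, cols]

def rmp(x, weights, sizes=(2, 3, 4, 5)):
    """Concatenate the input with one single-channel context map per pooling size.

    weights: one (C, 1) matrix per pooling size, standing in for a 1x1 convolution.
    """
    h, w, _ = x.shape
    outputs = [x]
    for k, w_1x1 in zip(sizes, weights):
        pooled = max_pool(x, k)              # coarse context at scale k
        reduced = pooled @ w_1x1             # 1x1 convolution: C channels -> 1
        outputs.append(upsample_nearest(reduced, h, w))
    return np.concatenate(outputs, axis=-1)  # (H, W, C + number of branches)

x = np.random.default_rng(0).random((20, 20, 8))
weights = [np.full((8, 1), 1.0 / 8) for _ in range(4)]
print(rmp(x, weights).shape)
```

With four pooling sizes, the module adds only four channels to the bottleneck features, which keeps the extra cost small.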
In the experiment, we add the above two modules to the bottleneck of the U-Net network without making other major adjustments to the encoding and decoding parts. It is worth noting that the DAC module uses many dilated convolutions to expand the receptive field, which may cause gridding artifacts and a loss of information continuity, both of which are harmful to pixel-level classification tasks. For this reason, the DAC has been redesigned, as shown in Figure 3(a) and Figure 3(b). In the former, the 1 × 1 convolution in the last layer is directly replaced by a 3 × 3 convolution. In the latter, the combination of convolution kernels is redesigned to give receptive fields of 3, 5, 7 and 9, and the result is compared with the previous two designs.
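The gridding issue mentioned here can be made concrete in one dimension: a stack of dilated convolutions only reads input positions that are reachable as sums of dilated kernel offsets, so some positions inside the receptive field may never be sampled. The sketch below counts those positions; the function names `touched_offsets` and `coverage` are hypothetical, and the dilation rates in the example are illustrative rather than the authors' configuration.

```python
def touched_offsets(dilations, kernel_size=3):
    """Input offsets (1D) that contribute to one output of stacked dilated convolutions."""
    half = kernel_size // 2
    offsets = {0}
    for d in dilations:
        offsets = {o + t * d for o in offsets for t in range(-half, half + 1)}
    return sorted(offsets)

def coverage(dilations, kernel_size=3):
    """Fraction of positions inside the receptive field that are actually sampled."""
    offs = touched_offsets(dilations, kernel_size)
    span = offs[-1] - offs[0] + 1
    return len(offs) / span

for rates in ([3], [3, 3], [1, 3], [1, 3, 5]):
    print(rates, touched_offsets(rates), round(coverage(rates), 2))
```

A coverage below 1 means holes in the sampling pattern, which is the source of the checkerboard-like loss of continuity; mixing in a dense (rate 1) layer fills the holes in these examples.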
3. Experimental Results and Discussion
In this section, we present the overall results of the experiments. To explore the reasons for the performance improvement, we visualize the final results wherever possible.
3.1. Experimental Setup
Figure 3. The updated DAC modules used in the experiment.
The experimental evaluation metrics used in the experiment include CPA, DSC (Dice Similarity Coefficient) and Recall, which are defined in Equations (1)-(3).
$$\mathrm{CPA} = \frac{TP}{TP + FP}, \tag{1}$$
$$\mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN}, \tag{2}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{3}$$
where TP = True Positive, FP = False Positive, and FN = False Negative.
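As a worked illustration, the three metrics can be computed from a pair of binary masks using the standard pixel-level definitions CPA = TP/(TP + FP), DSC = 2TP/(2TP + FP + FN) and Recall = TP/(TP + FN). The function name `segmentation_scores` and the convention of returning 0 when a denominator is zero are assumptions of this sketch.

```python
import numpy as np

def segmentation_scores(pred, truth):
    """Pixel-level CPA, DSC and Recall for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))    # lesion pixels correctly predicted
    fp = int(np.sum(pred & ~truth))   # background pixels predicted as lesion
    fn = int(np.sum(~pred & truth))   # lesion pixels that were missed

    def ratio(num, den):
        # Assumed convention: return 0.0 when the denominator is zero.
        return num / den if den else 0.0

    return {
        "CPA": ratio(tp, tp + fp),
        "DSC": ratio(2 * tp, 2 * tp + fp + fn),
        "Recall": ratio(tp, tp + fn),
    }

pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
print(segmentation_scores(pred, truth))
```

In this toy example TP = 2, FP = 1 and FN = 1, so all three scores equal 2/3.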
The initial data set was randomly divided into a training set and a test set at a ratio of 8:2: 224 images were used to train the model, and the remaining 62 images were used to test its performance.
The parameter settings of the segmentation models are provided in Table 1. The experiments were run on the TensorFlow 2.0 platform on an Intel Core i7-9700 PC with 32 GB RAM running Windows 10.
The trained segmentation model was run 5 times on the test set to reduce the effects of randomness as much as possible. For each evaluation metric defined above, the reported result is the average over the 5 runs.
3.2. Results
Table 1. Parameter settings used in the experiments.
Table 2. Experimental scores of defined evaluation metrics obtained by different methods.
The experimental results are shown in Table 2. Compared to the baseline model, the proposed method shows a large improvement in DSC and Recall, by 3.28% and 6.25%, respectively. The marked improvement in Recall indicates that the method is better at detecting metastatic lesions, which further enhances its reliability. In addition, networks that incorporate background information are consistently superior to the network that uses only the max pooling layer. This enhancement is particularly noticeable in the Recall metric, which improves by 3.81% on average. The improved performance can be attributed to the inclusion of background information, which furnishes reference features crucial for hotspot determination and aligns more closely with the diagnostic practices of clinical physicians.
Additionally, models that incorporated a DAC module had a larger receptive field, which enhanced their ability to detect lesions of multiple sizes. DAC_2, the design with no checkerboard effect and reduced redundancy, exhibits superior performance. This outcome offers a valuable reference point for optimizing future bone imaging models.
3.3. Discussion
In this section, we illustrate the efficacy of the methods through several case studies, demonstrating their capability to enhance segmentation performance. Furthermore, we conduct an analysis to elucidate the factors contributing to instances of relatively low segmentation performance by the models.
3.3.1. The Effect of Average Pooling
Figure 4 depicts the impact of average pooling on the segmentation performance in SPECT images. As shown in Case #1 in Figure 4, a large area of tracer concentration has formed in the spine region, but the predictions of the original U-Net model contain many missed detections, such as the lesions on the upper left, which is not ideal for clinical auxiliary diagnosis. After average pooling is introduced, the model takes background information into account when extracting features, which improves its ability to predict lesions that are difficult to observe. At the same time, however, the model’s high sensitivity to SPECT data is also easily amplified, resulting in more false positives, as shown in Case #2 in Figure 4. In these examples, the U-Net model based on max pooling predicts the location and contour of the lesions better, but after the pooling strategy is changed, some local hot spots are also predicted as lesions, which increases the false alarm rate of the model.
Figure 4. Illustrations of how average pooling affects the segmentation performance, with the green and red circles indicating the ground truth (lesions) and model’s predictions.
3.3.2. The Effect of New DAC Modules
Figure 5 compares the prediction results of models with different DAC modules after feature fusion. The redesigned dilated convolution combination achieves the best results. In the prediction results, the vertebral body region is a hot area, which makes prediction more difficult for the model. The redesigned dilated convolution combination improves the model’s prediction ability in this area, which is important because the vertebrae are among the sites most prone to bone metastasis. In addition, the number of scattered, speckle-like predictions is relatively reduced, which helps ensure the continuity of the predicted regions. However, the redesigned dilated convolution still falls short of the previous two designs in receptive field range, and a large receptive field helps the model predict large-scale regions.
Figure 5. Illustrations of how different DAC modules affect the segmentation performance, from left to right are none, raw DAC, DAC_1, and DAC_2.
4. Conclusion
To improve the automatic segmentation of bone metastases in SPECT images of lung cancer patients, this paper reconsiders the down-sampling strategy in the traditional U-Net network, draws on the clinical diagnostic patterns of nuclear medicine physicians, and establishes an analogous mapping between these patterns and the sampling method. In the evaluation on clinical SPECT data, the average DSC reached 0.64143, which shows the research significance of average pooling and enhanced information retention in SPECT image segmentation. In the near future, we plan to extend this work in the following directions. First, we will establish a more complete SPECT image data set and further expand the data available for training through data augmentation, in order to explore feasible development directions for deep semantic segmentation models. Second, we will address the large number of false positives produced by the model to improve its overall performance. Finally, dilated convolution has significant advantages in segmentation, and it remains to be investigated how to design the receptive field configuration of the network efficiently for bone imaging.
Ethics Approval
The study was approved by the Ethics Committee of Gansu Provincial Tumor Hospital (Approval No.: A202106100014).