ECD-Net: An Effective Cloud Detection Network for Remote Sensing Images
1. Introduction
Optical remote sensing technology is increasingly utilized in Earth science research, yet cloud cover continues to pose a significant challenge. It is estimated that around 66% of the Earth’s surface is persistently covered by clouds [1], which severely affects the accuracy of remote sensing (RS) data and restricts the broader application of RS technology in various fields. As a result, the precise identification of clouds in RS images has become a critical task for improving image clarity and ensuring the reliability of the data.
As RS technology continues to advance, cloud detection techniques have seen substantial progress. The field has moved from an early reliance on traditional image processing techniques, such as threshold-based classification methods [2] [3], to a transformation driven by the emergence and rapid advancement of deep learning in recent years. Early traditional algorithms had significant limitations in processing capability and adaptability, struggling with complex cloud formations and variable environmental conditions. With the rapid progress of deep learning, deep learning-based cloud detection methods [4]-[16] have gradually become the dominant approach. Cloud detection can essentially be viewed as a semantic segmentation task, and as image segmentation techniques have improved, cloud detection methods have made corresponding breakthroughs. By automatically learning features, deep learning methods have not only significantly improved the accuracy of detecting complex cloud formations but have also far surpassed traditional methods in generalization ability and processing efficiency.
Recently, deep neural networks have shown impressive performance in image segmentation tasks because of their outstanding feature extraction abilities, leading to their widespread use in cloud detection within remote sensing imagery. Among the cloud detection methods based on the U-Net framework, several innovative approaches have made significant advancements. For example, CloudFCN [4] enhances the U-Net architecture by incorporating the Inception module, enabling multi-scale feature extraction that significantly improves cloud detection performance and outperforms traditional machine learning and threshold-based techniques. RS-Net [11] optimizes the U-Net structure by adjusting the number of embedding channels, reducing computational complexity while maintaining comparable performance and delivering outstanding segmentation results. For cloud detection in RS thumbnails, CDNet [9] introduces edge refinement techniques and a feature pyramid structure, effectively improving cloud detection accuracy in low-resolution images. MSCFF enhances cloud detection in high-resolution imagery through multi-scale feature fusion. CDNetV2 [10] further advances cloud detection by maintaining high accuracy even in complex scenes where clouds and snow coexist, laying a solid foundation for the continued development of cloud detection technology. Boundary net [13] delves deeply into multi-scale cloud and cloud mask boundary refinement, combining multi-scale feature fusion modules with a differentiable boundary refinement network; although the model is more complex, it offers significant advantages in segmentation accuracy. AMCD-Net [14], accounting for the variability and complexity of clouds, integrates multi-level features and various attention mechanisms on top of RS-Net, enabling more precise cloud detection in complex scenarios.
The U-Net-based Convolutional Neural Network (CNN) approach has achieved certain success in local feature extraction for RS image cloud detection tasks. However, it still faces significant challenges in establishing long-range dependencies and capturing global information. These limitations hinder the model’s ability to handle complex scenes, particularly in distinguishing clouds from the surface and identifying multi-scale cloud structures.
In recent years, various models based on the Transformer and Mamba (a selective state space model) architectures have been proposed and successfully applied to image segmentation tasks. Notable examples include Swin-UNet [17], SegFormer [18], EfficientViT [19], U-MixFormer [20], U-Mamba [21], and CM-UNet [22], all of which show superior performance, particularly in capturing global information and long-range dependencies, far surpassing traditional CNN models.
Despite the strong performance of these approaches in segmentation tasks, their application to RS image cloud detection still faces several challenges. Specifically, the diversity and complexity of cloud layers, especially the similarity between thin clouds and the ground surface, make cloud mask generation a particularly difficult task. Moreover, RS image cloud detection requires multi-scale feature extraction of cloud layers, and accurate recognition of cloud structures at different scales remains an urgent problem.
To address these challenges and improve model performance in RS image cloud detection, this study proposes an enhancement to the classic U-Net framework. The architecture of its modules is restructured, with a particular focus on the multi-scale features of clouds in RS imagery. A Multi-Scale Dilated Attention (MSDA) [23] module is introduced to effectively incorporate multi-scale information and model long-range dependencies across different scales, significantly improving the model’s ability to recognize clouds at various scales. Additionally, a Multi-Head Self-Attention (MHSA) [24] [25] mechanism is incorporated into the lower-level semantic feature extraction process, enhancing the model’s ability to capture finer details, particularly in distinguishing thin clouds from the ground surface.
Building on this, the study also proposes a multi-path supervision mechanism to comprehensively supervise the cloud mask generation process. This ensures that the model learns cloud features at different scales and produces more accurate cloud masks in the output. The multi-path supervision not only improves the model’s adaptability to multi-scale cloud structures but also enhances its robustness in complex scenarios, thereby significantly improving the accuracy of distinguishing thin clouds from the surface.
In conclusion, the proposed method in this study, by redesigning the network structure and incorporating techniques such as multi-scale feature extraction, long-range dependency modeling, and multi-path supervision, provides a more effective solution for RS image cloud detection tasks. The approach demonstrates strong potential for practical application.
2. Method
In this section, ECD-Net is proposed to address cloud detection in RS images across various complex scenarios. Built on the U-Net framework, ECD-Net incorporates an MSDA module to effectively capture multi-scale feature representations of different cloud structures. Meanwhile, the network leverages the advantages of the MHSA module during low-level feature learning to further extract high-level semantic information about clouds, thereby enhancing the model's ability to discriminate features. With this design, ECD-Net is capable of generating more accurate cloud masks.
2.1. Overall Architecture
The complete framework of ECD-Net is illustrated in Figure 1. It is designed based on the classic U-Net architecture, incorporating several enhancements to improve its performance. Specifically, ECD-Net retains the fundamental encoder-decoder structure of U-Net, while introducing additional modules to better capture spatial and contextual information. These modifications enable the network to achieve more precise feature extraction and segmentation results.
The encoder comprises five stages that progressively extract features while reducing spatial resolution and enriching semantic information. Stage 1 applies a 3 × 3 convolutional layer (Conv 3 × 3), followed by batch normalization (BN) and the ReLU activation function, to extract initial features while preserving the input resolution. Stage 2 introduces downsampling with a Conv 3 × 3 (stride = 2) to capture deeper feature representations, complemented by additional convolutional and normalization layers for feature refinement. Stages 3 and 4 incorporate MSDA modules, enabling the model to effectively capture multi-scale cloud features and extract comprehensive global semantics. In Stage 5, an MHSA module is employed to model global dependencies, extracting the high-level semantic features that serve as the foundation for the decoder.
Figure 1. The architecture of the ECD-Net.
The decoder mirrors the encoder's structure but focuses on progressively recovering the spatial dimensions of the feature maps to generate the cloud mask. Starting from the output of the encoder's final stage, Stage 5 of the decoder uses transposed convolution (ConvTrans + BN, K = 2, S = 2) for upsampling and incorporates an MHSA module to integrate global semantic information. Stages 4 and 3 continue upsampling through transposed convolution, leveraging MSDA modules to refine features. Skip connections fuse features from the corresponding encoder stages, ensuring information completeness and preserving spatial details. Stage 2 further restores the feature resolution toward the original image size. Finally, Stage 1 applies a 1 × 1 convolutional layer to produce the cloud mask, which is normalized to a probability distribution in the range [0, 1] using a Sigmoid function.
The attention modules in the encoder and decoder share the same depth configuration of [2, 2, 1], i.e., two blocks in each MSDA stage and one in the MHSA stage; a structural sketch follows.
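For concreteness, the stage layout described above can be sketched in PyTorch as follows. This is a structural sketch only, not the authors' implementation: the channel widths, the additive skip fusion, and the placement of downsampling within Stages 3-5 are assumptions, and the attention modules are Identity placeholders standing in for the MSDA and MHSA blocks.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout, stride=1):
    """Conv 3x3 + BN + ReLU, the basic unit of Stages 1 and 2."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class ECDNetSkeleton(nn.Module):
    """Structural sketch of ECD-Net; channel widths are assumptions."""
    def __init__(self, ch=(32, 64, 128, 256, 512)):
        super().__init__()
        self.enc1 = conv_bn_relu(3, ch[0])                  # Stage 1: full resolution
        self.enc2 = conv_bn_relu(ch[0], ch[1], stride=2)    # Stage 2: downsample
        self.enc3 = nn.Sequential(conv_bn_relu(ch[1], ch[2], 2), nn.Identity())  # + 2 MSDA blocks
        self.enc4 = nn.Sequential(conv_bn_relu(ch[2], ch[3], 2), nn.Identity())  # + 2 MSDA blocks
        self.enc5 = nn.Sequential(conv_bn_relu(ch[3], ch[4], 2), nn.Identity())  # + 1 MHSA block
        # Decoder: ConvTranspose (K=2, S=2) upsampling, mirroring the encoder.
        self.up5 = nn.ConvTranspose2d(ch[4], ch[3], 2, stride=2)
        self.up4 = nn.ConvTranspose2d(ch[3], ch[2], 2, stride=2)
        self.up3 = nn.ConvTranspose2d(ch[2], ch[1], 2, stride=2)
        self.up2 = nn.ConvTranspose2d(ch[1], ch[0], 2, stride=2)
        self.head = nn.Conv2d(ch[0], 1, 1)                  # 1x1 conv -> cloud mask

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)
        e5 = self.enc5(e4)
        d4 = self.up5(e5) + e4      # skip connections fuse encoder features
        d3 = self.up4(d4) + e3
        d2 = self.up3(d3) + e2
        d1 = self.up2(d2) + e1
        return torch.sigmoid(self.head(d1))  # probability mask in [0, 1]

# e.g. mask = ECDNetSkeleton()(torch.rand(1, 3, 384, 384))  # -> (1, 1, 384, 384)
```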
2.2. MSDA Module and MHSA Module
This paper introduces the MSDA [23] module and the MHSA [24] [25] module to obtain multi-scale and global features. As shown in Figure 2(a), the input first passes through Conditional Positional Encoding (CPE) [26] to incorporate positional information, followed by Layer Normalization (LayerNorm) to stabilize training. MSDA is then applied to capture semantic information at different scales, and its output is added back to the input via a residual connection to preserve the original information. Layer normalization is applied again, and an MLP enhances the feature representation. Finally, another residual connection completes the module's output. The MHSA module (Figure 2(b)) follows the same layout, using the MHSA mechanism to model long-range dependencies within the input features.
Figure 2. (a) MSDA module, (b) MHSA module.
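The block layout in Figure 2 can be sketched as the wrapper below. Realizing CPE as a depthwise convolution follows [26], while the MLP expansion ratio and the use of PyTorch's nn.MultiheadAttention for MHSA are assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class MHSA(nn.Module):
    """Plain multi-head self-attention over flattened spatial tokens."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, x):                     # x: (B, N, C)
        out, _ = self.mha(x, x, x, need_weights=False)
        return out

class AttentionBlock(nn.Module):
    """CPE -> LN -> attention -> residual -> LN -> MLP -> residual.
    `attention` is pluggable, so the same wrapper serves MSDA and MHSA."""
    def __init__(self, dim, attention, mlp_ratio=4):
        super().__init__()
        self.cpe = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # CPE [26] as depthwise conv
        self.norm1 = nn.LayerNorm(dim)
        self.attn = attention
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                     # x: (B, C, H, W)
        x = x + self.cpe(x)                   # inject positional information
        B, C, H, W = x.shape
        t = x.flatten(2).transpose(1, 2)      # (B, HW, C) token layout
        t = t + self.attn(self.norm1(t))      # attention + residual
        t = t + self.mlp(self.norm2(t))       # MLP + residual
        return t.transpose(1, 2).view(B, C, H, W)

# e.g. block = AttentionBlock(dim=256, attention=MHSA(256))
```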
As shown in Figure 3, within MSDA the inputs Q, K, and V are first generated through linear layers. The channels are then divided into n distinct attention heads, and SWDA (Sliding Window-based Sparse Attention) [23] is applied to each head with a different dilation rate.
Figure 3. The structure of MSDA.
The operation of SWDA can be described as follows: given the three matrices Q, K, and V as inputs, SWDA computes attention scores for each query vector $q_{ij}$ at position $(i, j)$ by selecting sparse key and value vectors within a sliding window centered at that point, with the sparsity controlled by the dilation rate $r$:

$$x_{ij} = \operatorname{Attention}\left(q_{ij}, K_r, V_r\right), \quad 1 \le i \le H,\ 1 \le j \le W \tag{1}$$

where $K_r$ and $V_r$ denote the keys and values selected from the sliding window around position $(i, j)$ with dilation rate $r$.
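For intuition, the windowed, dilated key/value gathering can be sketched with PyTorch's F.unfold. This single-head version omits the per-head linear projections and the multi-scale head split of the full MSDA in [23], and the 3 × 3 window size is an assumption.

```python
import torch
import torch.nn.functional as F

def swda(q, k, v, window=3, dilation=1):
    """Single-head sliding-window dilated attention (simplified sketch).

    q, k, v: (B, C, H, W). Each query attends only to a window x window
    neighborhood of keys/values sampled at the given dilation rate.
    """
    B, C, H, W = q.shape
    pad = dilation * (window - 1) // 2
    # Gather each position's dilated neighborhood: (B, C * window^2, H*W).
    k_win = F.unfold(k, window, dilation=dilation, padding=pad)
    v_win = F.unfold(v, window, dilation=dilation, padding=pad)
    k_win = k_win.view(B, C, window * window, H * W)
    v_win = v_win.view(B, C, window * window, H * W)
    q = q.view(B, C, 1, H * W)
    # Scaled dot-product scores over the window positions at every location.
    attn = (q * k_win).sum(dim=1, keepdim=True) / C ** 0.5   # (B, 1, w^2, H*W)
    attn = attn.softmax(dim=2)
    out = (attn * v_win).sum(dim=2)                          # (B, C, H*W)
    return out.view(B, C, H, W)

# Heads with different dilation rates (e.g. 1, 2, 3) realize the multi-scale design:
# x = torch.rand(1, 16, 32, 32); y = swda(x, x, x, dilation=2)
```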
2.3. Loss Function
This paper adopts a multi-supervision approach to train ECD-Net, focusing on improving the model's performance. During training, the loss function uses BCE Loss [27], which calculates the pixel-wise binary classification error between the predicted results and the ground-truth labels, effectively optimizing the model parameters. To further enhance training, we design a multi-supervision loss function ($\mathcal{L}_{ms}$) based on BCE Loss. This function provides supervision along multiple paths, helping the model learn richer feature representations and improving the network's ability to adapt to complex scenarios.
Specifically, $\mathcal{L}_{ms}$ computes the loss at each stage and combines them through a weighted sum, allowing the model to integrate feedback from different layers and avoid over-reliance on features from any single layer. With this multi-level, multi-scale supervision mechanism, the model can better learn both fine-grained details and global semantics in the image, further improving the accuracy and robustness of cloud detection.
The specific definition of $\mathcal{L}_{ms}$ is as follows:

$$\mathcal{L}_{ms} = \sum_{i} w_i \, \mathcal{L}_{BCE}\!\left(M_i, G\right) \tag{2}$$

where $M_i$ represents the cloud mask generated at stage $i$, $G$ is the ground-truth mask, and $w_i$ is the weight assigned to each supervised path.
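As a minimal sketch of Equation (2): the uniform default weights and the bilinear resizing of auxiliary masks to the label resolution are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn.functional as F

def multi_path_loss(stage_masks, gt, weights=None):
    """Weighted sum of per-stage BCE losses over the predicted cloud masks.

    stage_masks: list of (B, 1, h_i, w_i) sigmoid outputs, one per supervised path.
    gt: (B, 1, H, W) binary ground-truth mask (float tensor).
    """
    if weights is None:
        weights = [1.0] * len(stage_masks)   # uniform weights (assumption)
    total = 0.0
    for w, mask in zip(weights, stage_masks):
        if mask.shape[-2:] != gt.shape[-2:]:
            # Auxiliary masks are resized to label resolution (assumption).
            mask = F.interpolate(mask, size=gt.shape[-2:], mode='bilinear',
                                 align_corners=False)
        total = total + w * F.binary_cross_entropy(mask, gt)
    return total
```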
3. Dataset and Experimental Setup
3.1. Dataset
The performance of ECD-Net was evaluated using the GF1-WHU dataset [28] and the SPARCS dataset [29]. The GF1-WHU dataset includes 86 satellite images for training and 22 for testing, covering diverse cloud conditions and land types. To simplify the cloud detection task, cloud shadows in the cloud mask images were treated as background. RGB image patches with a resolution of 384 × 384 pixels were generated through cropping, resulting in 2012 training patches and 516 testing patches. During training, the dataset was randomly split into training and validation subsets in an 8:2 ratio. The SPARCS dataset consists of 80 RS images with a resolution of 1000 × 1000 pixels. These images were manually divided into 64 training images and 16 testing images based on different land surface types. The images were further cropped into patches of 384 × 384 pixels, yielding 576 patches for the training set and 144 patches for the testing set. During training, the training data was randomly split into training and validation subsets in a 9:1 ratio.
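Patch extraction can be sketched as strided cropping. The paper does not state the crop stride, and the SPARCS counts (9 patches per 1000 × 1000 image) appear to imply overlapping crops, so the stride is left as a parameter here.

```python
import numpy as np

def crop_patches(image, patch=384, stride=384):
    """Crop an (H, W, C) image into patch x patch tiles with the given stride.

    stride=384 gives non-overlapping tiles; a smaller stride yields the
    overlapping crops implied by the reported patch counts.
    """
    h, w = image.shape[:2]
    return [image[i:i + patch, j:j + patch]
            for i in range(0, h - patch + 1, stride)
            for j in range(0, w - patch + 1, stride)]

# e.g. 9 overlapping patches from a 1000 x 1000 SPARCS image with stride 308:
# len(crop_patches(np.zeros((1000, 1000, 3)), stride=308)) == 9
```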
3.2. Implementation Details
In this study, the proposed ECD-Net was trained using the PyTorch framework [30] and optimized with the AdamW optimizer [31]. A cosine annealing scheduler with linear warm-up was employed to adjust the learning rate. During training, the batch size was set to 24, with an initial learning rate of 4e-4 and a weight decay of 1e-3. The experiments ran for a total of 150 epochs. All experiments were conducted on a Windows 11 operating system and executed on an NVIDIA GeForce RTX 4090 GPU. This paper uses accuracy [14], the Jaccard index (Jaccard) [32], and the F1-score [33] to evaluate the proposed model.
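These settings map directly onto PyTorch. In the sketch below, the warm-up length and the placeholder model are assumptions, while the metrics follow their standard confusion-matrix definitions.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Conv2d(3, 1, 3, padding=1)   # placeholder standing in for ECD-Net
optimizer = AdamW(model.parameters(), lr=4e-4, weight_decay=1e-3)
total_epochs, warmup_epochs = 150, 5          # warm-up length is an assumption

def lr_lambda(epoch):
    if epoch < warmup_epochs:                 # linear warm-up
        return (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * t))  # cosine annealing to zero

scheduler = LambdaLR(optimizer, lr_lambda)    # call scheduler.step() once per epoch

def binary_metrics(pred, gt, thresh=0.5):
    """Accuracy, Jaccard index, and F1-score for binary cloud masks."""
    p = (pred > thresh).float()
    tp, tn = (p * gt).sum(), ((1 - p) * (1 - gt)).sum()
    fp, fn = (p * (1 - gt)).sum(), ((1 - p) * gt).sum()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    jaccard = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy.item(), jaccard.item(), f1.item()
```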
4. Experiments
4.1. Results from the Quantitative Evaluation on the GF1-WHU Dataset
According to the quantitative comparison results presented in Table 1 and Figure 4, ECD-Net excels in the key evaluation metrics of accuracy (97.37%), Jaccard index (87.09%), and F1-score (93.10%), demonstrating high precision and superior segmentation performance in cloud detection tasks. Despite its higher computational cost (55.711 GFLOPs), its outstanding performance makes it the model of choice for tasks requiring high accuracy. In comparison, AMCD-Net, at 37.187 GFLOPs, strikes a better balance between performance and computational efficiency; while it slightly trails ECD-Net in accuracy and F1-score, it still delivers excellent results. U-Net, as a traditional architecture, has the highest computational complexity (89.955 GFLOPs) but still performs well in accuracy (96.92%) and F1-score (91.92%), making it a widely used baseline for cloud detection tasks. RS-Net, with a reduced computational load (38.183 GFLOPs), excels in computational efficiency; its accuracy (96.98%) and F1-score (91.93%) are similar to U-Net's, making it a more efficient alternative.
Table 1. The results of different methods.
| Method | Params (M) | GFLOPs | FPS | Accuracy (%) | Jaccard (%) | F1-Score (%) |
| --- | --- | --- | --- | --- | --- | --- |
| U-Net | 17.263 | 89.955 | 160.26 | 96.92 | 85.04 | 91.92 |
| RS-Net | 9.389 | 38.183 | 177.45 | 96.98 | 85.07 | 91.93 |
| AMCD-Net | 10.025 | 37.187 | 102.86 | 97.24 | 86.53 | 92.78 |
| ECD-Net (ours) | 13.558 | 55.711 | 88.58 | 97.37 | 87.09 | 93.10 |
Figure 4. Performance of different methods.
However, ECD-Net's inference speed (88.58 FPS) is noticeably lower than that of the other methods. This can be attributed to the computationally intensive self-attention mechanisms and the multi-path supervision strategy employed in the model. While these techniques effectively enhance detection accuracy, they also increase the computational burden, resulting in slower inference.
4.2. Visualization Results of Different Methods on the GF1-WHU Dataset
This paper compares the visualization results of the four models across three different scenes, as shown in Figure 5. In the thin-cloud regions over the water scene, all methods exhibit varying degrees of missed and false detections. Among them, ECD-Net performs most accurately, with clear boundaries between the cloud layers and the background and minimal missed and false detections. Although AMCD-Net is slightly inferior to ECD-Net, it still maintains high segmentation accuracy, with well-defined boundaries and fewer errors. In contrast, U-Net and RS-Net show larger areas of missed detections.
Figure 5. Visualization results of different methods.
In the land scene, where clouds and background are distinctly different in color and texture, all methods perform well in cloud detection. However, RS-Net’s performance is somewhat lacking, with less precise segmentation of the clouds.
In the grass scene, U-Net and RS-Net show significant areas of missed detections, while AMCD-Net experiences only a few missed detections. ECD-Net delivers the best segmentation results in this scene, accurately identifying most of the cloud layers.
Overall, ECD-Net performs the best across all scenes, followed by AMCD-Net. U-Net and RS-Net show some limitations in both accuracy and detail handling.
4.3. Results from the Quantitative Evaluation on the SPARCS Dataset
As Table 2 shows, consistent with its performance on the GF1-WHU dataset, ECD-Net achieves the best results in accuracy (96.32%), Jaccard index (81.75%), and F1-score (89.96%), fully demonstrating its exceptional performance in cloud detection tasks. However, despite having parameters and GFLOPs similar to the other models, the inference speed of ECD-Net (34.70 FPS) is significantly lower.
Table 2. The results of different methods.
| Method | Params (M) | GFLOPs | FPS | Accuracy (%) | Jaccard (%) | F1-Score (%) |
| --- | --- | --- | --- | --- | --- | --- |
| U-Net | 17.263 | 89.955 | 61.43 | 96.11 | 80.93 | 89.46 |
| RS-Net | 9.389 | 38.183 | 63.61 | 96.07 | 80.56 | 89.23 |
| AMCD-Net | 10.025 | 37.187 | 49.45 | 96.14 | 81.22 | 89.64 |
| ECD-Net (ours) | 13.558 | 55.711 | 34.70 | 96.32 | 81.75 | 89.96 |
Overall, while ECD-Net excels in accuracy and detection capability, its slower inference speed may limit its applicability in real-time or high-efficiency scenarios. Therefore, future work should focus on optimizing the model for lightweight design and faster inference to enhance its practicality further.
4.4. Visualization Results of Different Methods on the SPARCS Dataset
Figure 6. Visualization results of different methods.

As shown in Figure 6, all methods accurately detect the cloud regions in the first row, though ECD-Net and AMCD-Net exhibit a small number of false positives. In the second row, ECD-Net demonstrates the highest precision in capturing the boundaries and shapes of the cloud regions. In the third row, the edge-detail detection of U-Net, RS-Net, and AMCD-Net appears somewhat coarse, while ECD-Net more clearly restores the distribution details of the clouds. The fourth row illustrates a scene with sparse, isolated thin clouds, where ECD-Net accurately captures the thin-cloud edges but shows false positives in the cloud-free background. Overall, ECD-Net demonstrates superior performance in detecting edge details and thin clouds.
4.5. Ablation
To evaluate the effectiveness of multi-path supervision in ECD-Net, we compared supervising only the cloud mask generated by the final layer with BCE loss against the proposed multi-path supervision. The data in Table 3 reveal that integrating multi-path supervision yielded a 0.18% higher F1-score and a 0.28% higher Jaccard index compared to training with the final-layer BCE loss alone. This improvement suggests that multi-path supervision provides more comprehensive guidance during training, significantly enhancing the model's performance in cloud detection tasks.
Table 3. Impact of loss function on the GF1-WHU dataset (%).
| BCE (final layer) | Multi-path ($\mathcal{L}_{ms}$) | Accuracy | Jaccard | F1-Score |
| --- | --- | --- | --- | --- |
| ✓ | ✗ | 97.32 | 86.81 | 92.92 |
| ✓ | ✓ | 97.37 | 87.09 | 93.10 |
5. Conclusion
This study proposes an advanced cloud detection network based on the U-Net architecture to address cloud identification in RS imagery. The method enhances cloud detection in complex scenarios by introducing the MSDA and MHSA modules. Additionally, the designed multi-path supervision mechanism further improves the accuracy of cloud mask generation at multiple scales. Experimental findings on the GF1-WHU and SPARCS datasets demonstrate that the proposed model performs exceptionally well in complex scenarios, significantly improving cloud detection accuracy and showcasing strong potential for practical applications. Looking ahead, we aim to extend this method to other RS image datasets and explore its broad applications in cloud detection and cloud layer estimation. We also intend to investigate lightweight network architectures and apply knowledge distillation techniques to further reduce computational costs.
Acknowledgements
This research was supported by the Gansu Provincial Science and Technology Program (22YF7FA166).