Unsupervised Classification of Sea Surface Temperature (SST) in the Tropical Atlantic Using Spatial and Functional Data Analysis
1. Introduction
Nowadays, various extreme environmental events such as droughts, floods, and fires are being observed in different parts of the Earth, significantly impacting millions of people worldwide [1]-[3]. These events lead to the destruction of fauna and pose a threat to marine life by causing a decrease in oxygen concentration [4]-[11]. The decline in oxygen levels in seawater is primarily attributed to climate change, which is a critical concern. This phenomenon of deoxygenation poses a severe threat to marine life and undermines the benefits that humans derive from marine ecosystems [7] [12]-[14]. Although the prevention of such events is not currently feasible, their prediction across various time and spatial scales can help mitigate potential damages stemming from their occurrence [15].
Sea surface temperature (SST), in conjunction with pollution and climate change, serves as a robust indicator of marine resource productivity [16]-[18]. SST refers to the temperature of a significant layer near the sea surface, playing a crucial role in the development of meteorological systems as well as the biomass of diverse marine organisms at different depths. This includes vital organisms like phytoplankton and pelagic fish. Additionally, SST facilitates energy exchanges between the sea and the atmosphere, making it an essential parameter to monitor and understand. Hence, understanding SST holds significance for weather prediction, offering insights into the prospective development of systems and aquatic organisms [18]-[21]. In the specific study area under consideration (West African region, with a particular focus on Benin), there exists a notable dearth of research on environmental challenges, despite the abundant potential issues concerning the enhancement of the quality of aquatic and agricultural resources, which hold immense importance in the lives of the population. Due to its influence on the growth and spatial distribution of species, SST anomalies have the potential to impose stress on fish populations [19] [22]-[24].
To enhance the monitoring of fisheries resources, sea surface temperature (SST) is modeled in relation to other climatic variables and fish abundance, within the context of climate change. Oceanographers have devoted significant attention to modeling SST as a climatic variable using an ecosystem-based approach. Linear inverse models have been employed to predict SST, as demonstrated by [25] in the Niño 3 regions and by [26] in the tropical Atlantic. The authors of [27] presented various modeling methods (interpolation, spectral analysis, filtering estimation, gradient, regression model, etc.) to analyze, among other aspects, how SST responds to the damage caused by climate change. Additionally, [28]-[30] have used supervised machine learning tools to predict microbial diversity and composition in response to SST.
However, the methods mentioned above do not fully consider the spatial and temporal information inherent in SST data. Analyzing interactions within oceanological systems in marine ecosystems also necessitates the consideration of air-ocean interactions. Extensive and complex data with dynamic spatial and/or temporal components have been generated to study interactions within oceanological systems in marine ecosystems (refer to [31] [32]). Such data are abundant across various fields, particularly in the description of oceanological systems. Understanding the relationships between variables represented as high-dimensional vectors and/or functional components is crucial for comprehending the functioning of natural systems.
Therefore, robust methods capable of harnessing the wealth of information contained within such big data are of paramount importance in enhancing the monitoring of SST’s response to the effects of climate change. Functional Data Analysis (FDA) presents a suitable methodology for studying such SST data.
FDA pertains to the analysis and theory of data represented as functions, curves, images, shapes, or even more intricate mathematical objects, conceived as smooth realizations of stochastic processes. Functional data possess an intrinsic, infinite dimensionality. The notable high dimensionality of these data presents challenges in both theoretical understanding and computational handling, with the nature of these challenges varying based on how the functional data were sampled. Functional data can be observed within temporal as well as spatial/spatio-temporal contexts.
FDA utilizes statistical tools to tackle various inquiries, including prediction tasks [33]-[35], estimation of relationships between a primary variable and other variables, and the classification of diverse sets of curves through unsupervised methods or discrimination rules [36]-[39].
Over the past decade, Functional Data Analysis (FDA) has experienced substantial growth across a diverse spectrum of scientific domains. Notably, fields such as medicine [40] [41], ecology and marine biology [33] [35] [42]-[45], as well as environmental sciences and oceanography [46]-[52], have witnessed its profound development. FDA techniques have proven valuable in monitoring networks concerning weather and pollutants (e.g. [53]-[56]), as well as in gas, oil, and petroleum sciences [57] [58], among others.
As alluded to earlier, the application of FDA tools has extended to spatial settings, where data exhibit spatial dependence. Recent research in this area is exemplified by the studies in [59]-[61]. Recognizing the need for advances in spatially correlated functional data, [62] extended the spatial autoregressive model and the spatial moving average model to stochastic processes taking values in Hilbert spaces. The use of the eigenfunction basis of the autocovariance operator for projection purposes has been demonstrated in works such as [63] and [64]. In a different vein, [50] expanded hierarchical classification approaches to account for spatial functional correlation, while others have measured similarity between curves using variograms, incorporating spatial correlation through mode and density, as exemplified in [65]. Various methodologies for spatial functional data clustering are also presented in the recent monograph [66].
The objective of this study is to analyze Sea Surface Temperature (SST) through unsupervised classification using an FDA methodology grounded in Functional Principal Component and clustering analyses. This approach aims to reveal potential heterogeneity in SST across the tropical Atlantic Ocean. The structure of this work is as follows: the spatial functional data analysis and clustering methodology are detailed in Section 2. Section 3 presents the SST data from the tropical Atlantic Ocean and the application of the methodology to this dataset. Finally, Section 4 is dedicated to the conclusion and discussion of the findings.
2. Methodology
We consider a measurable spatial process $\{X_s, s \in \mathbb{R}^N\}$, $N \geq 1$, defined over some probability space $(\Omega, \mathcal{A}, P)$ and observed on some spatial region $\mathcal{D} \subset \mathbb{R}^N$ of cardinal $n$. We assume that for each location $s \in \mathcal{D}$, the random variable $X_s$ takes its values in a semi-metric space $(\mathcal{E}, d)$. The space $\mathcal{E}$ is infinite-dimensional and the random variables $X_s$, $s \in \mathcal{D}$, are locally identically distributed. This means that when a spatial location $s$ is sufficiently close to another one $s'$, the variables $X_s$ and $X_{s'}$ have identical or similar distributions. This hypothesis is less restrictive than strict stationarity. It is motivated by the idea that variables located at neighbouring sites may be similar and share a local distribution that differs from the local distribution of variables at other locations. In the classical framework of FDA, $\mathcal{E}$ is a space of functions, typically the space of square-integrable functions defined on some finite interval $\mathcal{T}$. Let $X$ denote the set of the $n$ curves $X_{s_1}, \ldots, X_{s_n}$ (renamed in an arbitrary way as $X_1, \ldots, X_n$ in the following).
2.1. Model-Based Clustering for Spatial Functional Data
In this section, we apply the model-based clustering developed by [66] to the SST data described in Section 3. Clustering is an unsupervised learning technique that aims to identify clusters with homogeneous characteristics. Within the clustering framework, model-based techniques assume the existence of a latent categorical random variable $Z$ defining $K$ clusters of data. This variable $Z$ leads to a probability distribution of the data that is a mixture of cluster distributions. Let $f$ denote the probability distribution of $X$ and $f_k$ the probability distribution of $X$ given $Z = k$. The mixture model is then expressed as
$$f(x) = \sum_{k=1}^{K} \pi_k f_k(x), \tag{1}$$
where $\pi_k = P(Z = k)$ represents the prior probability of cluster $k$.
In the context of spatial dependency, the model given in Equation (1) has been extended to incorporate the location $s$ into the prior probabilities of clusters. This modification transforms the mixture model into:
$$f(x; s) = \sum_{k=1}^{K} \pi_k(s)\, f_k(x), \tag{2}$$
where $\pi_k(s) = P(Z = k \mid s)$ represents a parametrization of the spatial prior. Consequently, given the cluster $k$, the distribution of observations within the cluster is independent of location. All spatial dependencies are accounted for by the priors $\pi_k(s)$. This concept is used in [67] for clustering spatio-temporal data. Multinomial logistic regression is adopted as a model for the $\pi_k(s)$:
$$\pi_k(s; \beta) = \frac{\exp(\beta_{k0} + \beta_k^\top s)}{\sum_{l=1}^{K} \exp(\beta_{l0} + \beta_l^\top s)}. \tag{3}$$
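As a minimal sketch of a multinomial logistic spatial prior, the function below computes cluster probabilities at one location as a softmax of linear scores in the coordinates. The function name `spatial_priors` and all coefficient values are hypothetical; they are not estimates from the SST data.

```python
import math

def spatial_priors(s, beta0, beta):
    """Multinomial-logistic prior probability of each cluster at location s.

    s     : location coordinates, e.g. (longitude, latitude)
    beta0 : list of K intercepts (hypothetical values)
    beta  : list of K coefficient vectors, one per cluster
    """
    scores = [b0 + sum(bj * sj for bj, sj in zip(b, s))
              for b0, b in zip(beta0, beta)]
    m = max(scores)                        # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative coefficients for K = 3 clusters (not estimated from the SST data)
beta0 = [0.0, 0.5, -0.5]
beta = [(0.0, 0.0), (0.02, -0.05), (-0.03, 0.04)]
priors = spatial_priors((-20.0, 5.0), beta0, beta)
print(priors)
```

Because the probabilities depend on the coordinates only through linear scores, nearby sites receive similar priors, which is how the spatial dependence enters the mixture.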
Within a parametric framework, the conditional distribution $f_k$ depends on parameters $\theta_k$. For instance, in the Gaussian model, $\theta_k = (\mu_k, \Sigma_k)$ represents the mean and the covariance matrix of cluster $k$. Let $\theta$ denote the set of all parameters, which also encompasses the coefficients $\beta$ defining the $\pi_k(s)$. As a result, the model becomes:
$$f(x; s, \theta) = \sum_{k=1}^{K} \pi_k(s; \beta)\, f_k(x; \theta_k). \tag{4}$$
In a finite-dimensional context, the multivariate probability density function serves as the primary tool for estimating such a model using the EM algorithm. However, for functional random variables, the concept of a probability density is not well defined because the data are infinite-dimensional. To address this challenge, [66] employs the expansion coefficients of $X$ with respect to a finite basis of functions. This approach enables the derivation of a well-defined probability density function based on these coefficients. The use of functional principal component analysis helps define an approximation of the probability density for functional data.
Assuming a spatial autoregressive dynamic for the random effect, [66] introduces a functional classification criterion to identify local spatially homogeneous regions. In what follows, we assume that given $Z = k$, $X$ follows a Gaussian process. Within cluster $k$, a pseudo-density is employed [68]:
$$f_k(x; \theta_k) = \prod_{j=1}^{q_k} f_{k,j}\big(c_{j,k}(x); \lambda_{j,k}\big) \prod_{j=q_k+1}^{R} f_{k,j}\big(c_{j,k}(x); b_k\big), \tag{5}$$
where $f_{k,j}$ represents the probability density of the $j$-th principal component $C_{j,k}$ of $X$ within cluster $k$, and $c_{j,k}(x)$ is its value for the curve $x$. The random variables $C_{j,k}$, $j \leq q_k$, are independent Gaussian variables with zero mean and variances equal to the eigenvalues $\lambda_{j,k}$ of the covariance operator of $X$. Similarly, the random variables $C_{j,k}$, $j > q_k$, are independent Gaussian variables with zero mean and common variance $b_k$, the mean of the remaining eigenvalues $\lambda_{j,k}$, $j > q_k$, of the covariance operator of $X$. Consequently, the parameters $q_k$ and $R$ must be appropriately chosen.
Indeed, the surrogate density can be regarded as an actual density when the functional data belong to a finite-dimensional space of functions spanned by a basis $\{\psi_1, \ldots, \psi_R\}$, i.e. $X(t) = \sum_{r=1}^{R} a_r \psi_r(t)$. Hence, we choose $R$ as the dimension of the basis used for data smoothing. In this scenario, the principal components $C_{j,k}$ of the functional PCA can be derived by conducting PCA on the expansion coefficients of $X$ in the metric $W$ defined by the inner products of the basis functions, $W_{rr'} = \int \psi_r(t)\, \psi_{r'}(t)\, dt$.
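The surrogate density of this section can be sketched as a product of univariate Gaussian densities on the principal component scores. The version below assumes the form described in the text: the first `q` scores have their own eigenvalue as variance, and the remaining scores share the mean of the remaining eigenvalues. The function name and the numbers in the example are illustrative only.

```python
import math

def log_pseudo_density(scores, eigenvalues, q):
    """Log of a Gaussian surrogate density for one curve in one cluster.

    scores      : the R principal component scores of the curve
    eigenvalues : the R eigenvalues of the cluster, in decreasing order
    q           : number of leading components kept with their own variance
    """
    R = len(scores)
    # Components beyond q share the mean of the remaining eigenvalues
    b = sum(eigenvalues[q:]) / (R - q) if q < R else None
    log_f = 0.0
    for j, c in enumerate(scores):
        var = eigenvalues[j] if j < q else b
        log_f += -0.5 * (math.log(2.0 * math.pi * var) + c * c / var)
    return log_f

# Illustrative values: R = 4 scores, q = 2 leading components
value = log_pseudo_density([1.2, -0.4, 0.1, 0.05], [4.0, 1.0, 0.2, 0.1], q=2)
print(value)
```

Working on the log scale avoids underflow when many components are multiplied together.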
2.2. The Expectation-Maximization (EM) Algorithm
Let us now outline the EM algorithm for estimating $\theta$ and, consequently, the cluster memberships. As in the finite-dimensional setting, and based on Equation (5), the likelihood of the sample of curves $x = (x_1, \ldots, x_n)$ observed at locations $s_1, \ldots, s_n$ is:
$$L(\theta; x) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k(s_i; \beta)\, f_k(x_i; \theta_k). \tag{6}$$
A common approach for maximizing the likelihood when data are missing (such as the variable $Z$) is the iterative EM algorithm. It is used here to maximize the likelihood (6), and is modified to update the principal component scores of each group as well as the parameters $\beta$ defined in (3).
The algorithm involves maximizing the approximate completed log-likelihood. Let $z_{ik}$ denote the indicator random variable for cluster $k$ at location $s_i$. The completed log-likelihood is then:
$$\ell_c(\theta; x, z) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \left[\log \pi_k(s_i; \beta) + \log f_k(x_i; \theta_k)\right]. \tag{7}$$
This version is known to be easier to maximize than its incomplete counterpart. Let $\theta^{(h)}$ represent the estimated value of the parameter at iteration $h$ of the algorithm [66].
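E step:
Since the groups to which the curves $x_i$ belong are unknown, the E step involves calculating the conditional expectation of the approximated completed log-likelihood:
$$Q(\theta; \theta^{(h)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik}^{(h)} \left[\log \pi_k(s_i; \beta) + \log f_k(x_i; \theta_k)\right], \tag{8}$$
where $\tau_{ik}^{(h)}$ represents the probability that the curve $x_i$ belongs to cluster $k$ given $\theta^{(h)}$ at iteration $h$:
$$\tau_{ik}^{(h)} = \frac{\pi_k(s_i; \beta^{(h)})\, f_k(x_i; \theta_k^{(h)})}{\sum_{l=1}^{K} \pi_l(s_i; \beta^{(h)})\, f_l(x_i; \theta_l^{(h)})}. \tag{9}$$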
M step:
The M step involves maximizing the conditional expectation $Q(\theta; \theta^{(h)})$ of the completed log-likelihood with respect to $\theta$:
$$\theta^{(h+1)} = \arg\max_{\theta} Q(\theta; \theta^{(h)}). \tag{10}$$
Observe that $\beta^{(h+1)}$ is obtained as the solution of a weighted logistic regression. The EM algorithm starts from an initial random partition of the data $x$ into $K$ clusters. Note that in homoscedastic models, the update of $\theta_k$ is modified. Further details can be found in [66].
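The E step can be sketched as follows, assuming the membership probabilities are proportional to the spatial prior times the cluster density, as in a standard mixture model. The computation is done on the log scale with the log-sum-exp trick, since pseudo-densities of long curves can be extremely small. The function name `e_step` and the input values are hypothetical.

```python
import math

def e_step(log_priors, log_densities):
    """Posterior cluster membership probabilities for each curve.

    log_priors[i][k]    : log prior of cluster k at the location of curve i
    log_densities[i][k] : log (pseudo-)density of curve i under cluster k
    """
    tau = []
    for lp_row, ld_row in zip(log_priors, log_densities):
        joint = [lp + ld for lp, ld in zip(lp_row, ld_row)]
        m = max(joint)                                   # log-sum-exp trick
        log_norm = m + math.log(sum(math.exp(v - m) for v in joint))
        tau.append([math.exp(v - log_norm) for v in joint])
    return tau

# Two curves, two clusters (illustrative numbers)
log_priors = [[math.log(0.5), math.log(0.5)], [math.log(0.9), math.log(0.1)]]
log_densities = [[-1000.0, -1001.0], [-3.0, -3.0]]
tau = e_step(log_priors, log_densities)
print(tau)
```

In a full implementation these probabilities would then serve as weights in the M step, both for the weighted logistic regression and for the cluster-wise principal component analysis.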
2.3. Selection Method
To determine the number of clusters $K$ when the $q_k$ are known, we suggest maximizing the Bayesian Information Criterion (BIC), defined as:
$$\mathrm{BIC} = \ell(\hat{\theta}) - \frac{\nu}{2} \log n, \tag{11}$$
where $\ell(\hat{\theta})$ is the maximized log-likelihood, $\nu$ is the number of parameters in the model (including spatial mixing proportions, cluster means, principal scores and variances) and $n$ represents the number of points involved. When the values $q_k$ are unknown, they can also be determined by maximizing the BIC. This can be achieved through the following modified M step, which maximizes the conditional expectation of the BIC:
$$\max_{\theta,\, q_1, \ldots, q_K} \left\{ Q(\theta; \theta^{(h)}) - \frac{\log n}{2} \sum_{k=1}^{K} \nu_{q_k} \right\}, \tag{12}$$
where $\nu_{q_k}$ represents the additional number of parameters needed for the model with $q_k$ principal components, as discussed in [66].
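Model selection by BIC can be sketched as below. The sketch assumes the "larger is better" form, log-likelihood minus half the number of parameters times the log of the number of points, which is consistent with the text's instruction to maximize the criterion. The log-likelihoods and parameter counts in `fits` are hypothetical placeholders, not results from this study.

```python
import math

def bic(log_likelihood, n_params, n_points):
    """BIC in the 'larger is better' form: log-likelihood minus a complexity penalty."""
    return log_likelihood - 0.5 * n_params * math.log(n_points)

def select_n_clusters(fits, n_points):
    """fits maps K -> (maximized log-likelihood, number of free parameters)."""
    scores = {K: bic(ll, nu, n_points) for K, (ll, nu) in fits.items()}
    best_K = max(scores, key=scores.get)
    return best_K, scores

# Hypothetical fit summaries for K = 2, 3, 4 (not results from the SST data)
fits = {2: (-5200.0, 25), 3: (-5000.0, 40), 4: (-4990.0, 55)}
best_K, scores = select_n_clusters(fits, n_points=4309)
print(best_K, scores)
```

With these placeholder numbers, the small likelihood gain from a fourth cluster does not offset its extra parameters, which is the trade-off the criterion is meant to capture.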
2.4. Determination of the Number of Clusters
In functional data analysis, applying cluster analysis directly to the raw observations is often not recommended, for two reasons: the observations may be recorded at irregular intervals, and the measurement intervals may differ from one functional observation to another. Consequently, conducting cluster analysis directly on such data can present challenges. To address this, a practical approach is to carry out cluster analysis on the leading functional principal component (FPC) scores [69].
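K-centers functional clustering (KCFC) is a method grounded in the computation of principal components. In this approach, each curve is assigned to the cluster whose first principal components approximate it best. The method starts from the Karhunen-Loève expansion of each curve:
$$X_i(t) = \mu(t) + \sum_{j=1}^{\infty} \xi_{ij}\, \phi_j(t), \tag{13}$$
where $\mu$ is the mean function, the $\phi_j$ are the eigenfunctions of the covariance operator and the $\xi_{ij}$ are the FPC scores. The covariance function is defined as follows:
$$G(t, t') = \mathrm{Cov}\big(X(t), X(t')\big) = \sum_{j=1}^{\infty} \lambda_j\, \phi_j(t)\, \phi_j(t'). \tag{14}$$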
From (13) and (14), we obtain $E(\xi_{ij}) = 0$ and $\mathrm{Var}(\xi_{ij}) = \lambda_j$.
Given that the data of interest are functional in nature, dimension reduction is necessary for efficient computation. A common approach involves expanding each curve using a finite number of principal components [66] [70].
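A discretized version of this dimension reduction can be sketched with NumPy, under the assumption that all curves are evaluated on a common, regular time grid. The function `fpca` and the synthetic curves are illustrative; quadrature weights for the grid spacing are ignored, so eigenvalues and scores are correct only up to a constant scale.

```python
import numpy as np

def fpca(curves, explained=0.95):
    """Discretized FPCA for curves sampled on a common, regular grid.

    curves    : array of shape (n_curves, n_times)
    explained : fraction of variance the kept components must reach
    """
    mean = curves.mean(axis=0)
    centered = curves - mean
    cov = centered.T @ centered / (len(curves) - 1)    # sample covariance on the grid
    eigvals, eigvecs = np.linalg.eigh(cov)             # returned in ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    eigvals = np.clip(eigvals, 0.0, None)              # remove tiny negative round-off
    ratio = np.cumsum(eigvals) / eigvals.sum()
    n_comp = min(int(np.searchsorted(ratio, explained)) + 1, len(eigvals))
    scores = centered @ eigvecs[:, :n_comp]            # FPC scores of each curve
    return mean, eigvals[:n_comp], eigvecs[:, :n_comp], scores

# Synthetic curves built from two modes of variation (not SST data)
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
a = rng.normal(0.0, 3.0, size=(200, 1))
b = rng.normal(0.0, 1.5, size=(200, 1))
curves = 26.0 + a * np.sin(2 * np.pi * t) + b * np.cos(2 * np.pi * t)
mean, eigvals, eigvecs, scores = fpca(curves, explained=0.95)
print(scores.shape)
```

The returned scores are the finite-dimensional representation on which a standard clustering method such as K-means can then be run.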
2.5. Clustering Using the Principal Component Scores
Clustering and supervised classification are valuable tools in traditional multivariate data analysis, but they present challenges in the context of functional data analysis. Clustering involves grouping a dataset into configurations where data within clusters are more similar to each other than across clusters, based on a defined metric. In contrast, supervised classification assigns an individual to a predefined group or class using labeled observations.
In machine learning terms, functional data clustering is an unsupervised learning process, while supervised classification employs a discriminant function or classifier to assign new data to predetermined groups. Functional classification typically uses training data with functional predictors and associated multi-class labels for each data point.
In the application, it is essential to fix the percentage of variance to be explained and then determine the number $K_c$ of principal components required. Equation (13) is then truncated as follows:
$$X_i(t) \approx \mu(t) + \sum_{j=1}^{K_c} \xi_{ij}\, \phi_j(t). \tag{15}$$
To avoid making additional distributional assumptions, the cluster membership for an observation $X_i$ is determined by:
$$c^*(X_i) = \arg\min_{c \in \{1, \ldots, K\}} \big\| X_i - \tilde{X}_i^{(c)} \big\|, \tag{16}$$
which selects the cluster that represents the observation with the smallest error; here $\tilde{X}_i^{(c)}$ is the prediction of $X_i$ under the model of cluster $c$. Before grouping, it is essential to estimate the moments, eigenfunctions, eigenvalues, and functional principal component (FPC) scores. The KCFC algorithm builds upon an initial cluster assignment based on the FPC scores $\xi_i = (\xi_{i1}, \ldots, \xi_{iK_c})$, $i = 1, \ldots, n$. One common approach is to use a standard classification procedure such as K-means clustering, with $K_c$ representing the number of principal components considered. Once the initial clustering is established, the algorithm operates as follows:
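Suppose $c_i^{(m)}$ is the cluster membership of the $i$-th observation at the $m$-th iteration. For all the clusters $c = 1, \ldots, K$, we have:
1) Choose $i$ and calculate $\hat{\mu}^{(c)}$ and $\hat{\phi}_j^{(c)}$, $j = 1, \ldots, K_c$, based on the observations $l$ with $c_l^{(m)} = c$ and $l \neq i$.
2) Calculate the $i$-th predicted observation for cluster $c$:
$$\tilde{X}_i^{(c)}(t) = \hat{\mu}^{(c)}(t) + \sum_{j=1}^{K_c} \hat{\xi}_{ij}^{(c)}\, \hat{\phi}_j^{(c)}(t). \tag{17}$$
3) Observation number $i$ is assigned to the closest cluster, following (16).
4) Steps 1 to 3 are repeated until there is no further reclassification.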
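The iterative reassignment above can be sketched as follows for curves on a common grid. This is a simplified variant, not the exact procedure: each cluster model is fitted on all of its current members rather than leaving out the curve being predicted, which is cheaper but only an approximation. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def cluster_model(curves, n_comp):
    """Mean curve and leading eigenvectors of one cluster (discretized FPCA)."""
    mean = curves.mean(axis=0)
    _, _, vt = np.linalg.svd(curves - mean, full_matrices=False)
    return mean, vt[:n_comp].T

def kcfc(curves, labels, n_clusters, n_comp, max_iter=50):
    """Simplified K-centers functional clustering (no leave-one-out refit)."""
    labels = np.asarray(labels).copy()
    for _ in range(max_iter):
        errors = np.full((len(curves), n_clusters), np.inf)
        for c in range(n_clusters):
            members = curves[labels == c]
            if len(members) <= n_comp:           # too few curves to fit the subspace
                continue
            mean, phi = cluster_model(members, n_comp)
            centered = curves - mean
            predicted = mean + (centered @ phi) @ phi.T    # truncated expansion
            errors[:, c] = ((curves - predicted) ** 2).sum(axis=1)
        new_labels = errors.argmin(axis=1)                 # closest cluster
        if np.array_equal(new_labels, labels):             # no reclassification
            break
        labels = new_labels
    return labels

# Two synthetic groups with different means and modes of variation (not SST data)
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 30)
group_a = rng.normal(size=(40, 1)) * np.sin(2 * np.pi * t)
group_b = 10.0 + rng.normal(size=(40, 1)) * np.cos(2 * np.pi * t)
curves = np.vstack([group_a, group_b]) + rng.normal(scale=0.05, size=(80, 30))
initial = (curves.mean(axis=1) > 5.0).astype(int)   # crude initial partition
final = kcfc(curves, initial, n_clusters=2, n_comp=1)
print(np.bincount(final))
```

In practice the initial partition would come from K-means on the FPC scores, as described in the text, rather than from the simple threshold used here.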
3. Unsupervised Classification of SST in the Tropical African Zone
3.1. Data Description
The data come from NCDC/NOAA (National Climatic Data Center/National Oceanic and Atmospheric Administration): https://psl.noaa.gov/data/gridded/data.noaa.ersst.v4.html. They are monthly measurements of sea surface temperature (SST) off the tropical African zone from January 1, 1854, to February 29, 2020. This area of interest ($\mathcal{D}$), see Figure 1, spans longitudes −70˚ to 20˚ and latitudes −26˚ to 24˚.
Figure 1. Study area.
This study area includes most of the countries of West Africa, Central Africa and especially the coastal countries.
This area is divided into 4309 geographical points. At each of these points, monthly sea surface temperatures are recorded from January 1, 1854, to February 29, 2020.
Let $s_i$, $i = 1, \ldots, 4309$, be the locations. From January 1, 1854, to February 29, 2020, we consider the monthly SST at each location $s_i$. The temporal index is then $t = 1, \ldots, T$, with $t$ counted in months. The 4309 series recorded at these measurement sites are transformed into functional objects using B-splines (Figure 2); see [71]-[73] for more details.
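The conversion of one monthly series into a functional object can be sketched as an unpenalized least-squares fit on a clamped cubic B-spline basis. The number of basis functions, knot placement and any roughness penalty used in the study are not stated, so `n_interior` and the synthetic series below are assumptions for illustration only.

```python
import numpy as np

def bspline_basis(t, knots, degree=3):
    """B-spline design matrix via the Cox-de Boor recursion."""
    t = np.asarray(t, dtype=float)
    knots = np.asarray(knots, dtype=float)
    n_int = len(knots) - 1
    # Degree 0: indicator of each half-open knot interval
    B = np.zeros((len(t), n_int))
    for j in range(n_int):
        B[:, j] = (knots[j] <= t) & (t < knots[j + 1])
    # Put the right end point into the last non-empty interval
    last = np.nonzero(np.diff(knots) > 0)[0][-1]
    B[t == knots[-1], last] = 1.0
    for d in range(1, degree + 1):
        B_new = np.zeros((len(t), n_int - d))
        for j in range(n_int - d):
            left = knots[j + d] - knots[j]
            right = knots[j + d + 1] - knots[j + 1]
            if left > 0:
                B_new[:, j] += (t - knots[j]) / left * B[:, j]
            if right > 0:
                B_new[:, j] += (knots[j + d + 1] - t) / right * B[:, j + 1]
        B = B_new
    return B

def smooth_curve(t, y, n_interior=20, degree=3):
    """Least-squares B-spline fit of one series; returns coefficients and fitted values."""
    t = np.asarray(t, dtype=float)
    interior = np.linspace(t.min(), t.max(), n_interior + 2)
    knots = np.concatenate([[t.min()] * degree, interior, [t.max()] * degree])
    B = bspline_basis(t, knots, degree)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return coef, B @ coef

# Toy monthly series (synthetic, not the ERSST data)
months = np.arange(1, 241)
sst = 26.0 + 0.002 * months + np.random.default_rng(0).normal(0.0, 0.3, months.size)
coef, fitted = smooth_curve(months, sst, n_interior=10)
print(coef.shape)
```

The coefficient vector returned for each site is the finite-dimensional representation on which functional PCA and the clustering steps can operate.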
Figure 2. Smoothing of SST observations for all curves using B-splines.
We have 4309 sites where SST measurements were taken. Figure 2 shows that the SST curves do not all overlap: the temporal variation in temperature differs across sites, indicating spatial temperature heterogeneity. On average, the curves share a similar shape, and that shape suggests some periodicity.
Figure 3. Average sea surface temperature (SST) of the tropical African zone.
Figure 3 displays a heterogeneous spatial distribution of SST in the tropical zone, with higher temperatures observed in the central area and lower temperatures at the eastern extremes. It corroborates the spatial heterogeneity observed in Figure 2.
(a) Average sea surface temperature for March 1970 (b) Average sea surface temperature for March 1971
(c) Average sea surface temperature for March 2001 (d) Average sea surface temperature for March 2002
(e) Average sea surface temperature for March 2018 (f) Average sea surface temperature for March 2019
Figure 4. Average sea surface temperature off the tropical African zone for March of 1970 (a), 1971 (b), 2001 (c), 2002 (d), 2018 (e) and 2019 (f).
Panels (a), (b), (c), (d), (e), and (f) of Figure 4 depict distinct SST patterns for March of 1970, 1971, 2001, 2002, 2018 and 2019, respectively. Comparing March across these six years shows that the distribution of SST over the sub-zones off tropical Africa varies in spatial extent. Notably, the spatial configuration in March 2018 differs from that in March 2019. The clustering method outlined in Section 2 is subsequently applied to the SST spatial functional data (shown in Figure 2) to characterize the heterogeneity of SST.
3.2. Results
In each step of the EM algorithm, and for each value of $K$, the BIC is computed using Equation (12).
While the curves appear to share the same shape, Figure 5 depicts three distinct classes of curves. An analysis of this figure reveals that the clustering of sea surface temperatures (SST) in the tropical Atlantic consists of three groups: one distinct cluster and a combination of two clusters. To give a clearer view of these classes, they are extracted and represented separately.
(a) First class temperature curves
(b) Second class temperature curves (c) Third class temperature curves
Figure 5. Clustering with three clusters.
Figure 5 displays the outcomes of the unsupervised classification into three groups, portraying the spatial and temporal structure of SST in the tropical Atlantic.
In panel (a) of Figure 5, most curves vary between 24˚ and 30˚. In panel (b), most curves vary between 22˚ and 30˚. In panel (c), most curves again vary between 24˚ and 30˚.
Figure 6 illustrates the spatial distribution of the measurement sites for the three temperature classes.
The average curves of the three classes (Figure 7) show three distinct phases in the SST, each characterized by abrupt changes. Notably, during the initial phase, the red and blue classes are intermingled, whereas in the subsequent phases, they are clearly separated. Furthermore, the red class is the dominant class, followed by the blue class as the intermediate class, and the green class as the least prominent class.
The first phase of the red curve spans from 1854 to August 1897 (at t = 500). The second phase, marked by a sharp SST decline, extends from September 1897 to April 1939. The final phase, characterized by an SST increase, covers the period from May 1939 to February 2020.
Figure 6. Scatter plot of locations by three clusters.
Figure 7. Cluster mean curves for the 3 groups clustering.
The three phases of SST variation in the green curve align with those of the red curve. A slight distinction is observed in the phases of the blue curve: its first phase is longer than the first phase of the other two classes (red and green curves), extending until the year 1900. Overall, the curves suggest that global warming might have commenced around 1939. In summary, the descriptive analysis of Figure 6 and Figure 7 reveals the spatial distribution of measurement sites across the three distinct classes: a very hot zone (red), a moderately hot zone (blue), and a relatively less hot zone (green).
A more detailed analysis of the differences in SST curves could be beneficial through a grouping of SSTs that enables the clear differentiation of two classes (Figure 8 and Figure 9). In each class, sites with similar SST curves are grouped together. Furthermore, by considering the average curves within the classes, these can be divided into two categories: the hot class and the non-hot class (Figure 10 and Figure 11).
Figure 8. Clustering with two clusters.
An analysis of Figure 8 reveals that the SST fluctuates between 16˚ and 28˚. As in the classification into three classes, the SST is clearly heterogeneous. To make the two classes easier to see, they are presented separately in two panels (Figure 9).
(a) First class temperature curves
(b) Second class temperature curves
Figure 9. Clustering with two clusters.
Figure 9 illustrates the outcomes of unsupervised classification using 2 groups to represent the spatial and temporal structure of SST of the tropical Atlantic.
Figure 10 and Figure 11 present two clearly discernible clusters, highlighting the heterogeneity of SST across both spatial and temporal scales within the tropical zone.
Figure 10. Scatter plot of locations by two clusters.
Figure 11. Mean cluster curves for the 2 groups clustering.
The comprehensive analysis of the two curves in Figure 11 reveals three distinct phases of sea surface temperature (SST) change. The first phase spans from 1854 to August 1897 (t = 500). The second phase exhibits a sudden SST drop and covers the period from September 1897 to April 1939. The final phase extends from May 1939 to the end of February 2020. Throughout these phases, the two SST classes (represented by the red and blue curves) exhibit clear separation. The warmer class corresponds to the blue curve, while the cooler class corresponds to the red curve.
4. Conclusion and Discussion
This contribution introduces a novel technique, unsupervised classification, to analyze spatial functional data and delve into the spatial and temporal dynamics of Sea Surface Temperature (SST) off tropical Africa. Considering the range of applications involving multivariate methods and machine learning in oceanic data analysis, it is evident that unsupervised classification has transformed the traditional manual approach to SST data analysis. It has not only enhanced the efficiency of spatial functional data analysis but also provided tailored solutions for specific scientific research questions within this field.
This new method is particularly useful for identifying possible anomalies in the ocean, using SST as an indicator of irregularities in physical or environmental parameters. It captures both the temporal dynamics and the spatial variation of SST off the tropical Atlantic, setting the proposed approach apart from conventional multivariate space-time series analyses. The outcomes presented in Figures 4-11 depict distinct SST anomalies, highlighted by the temporal and spatial variations of SST from 1854 to February 2020. These anomalies might be attributed to the influence of climate change. However, it is crucial to characterize the different phases noted in the temporal evolution of SST.
This study has revealed that the sea surface temperature from January 1854 to February 2020 can be delineated into three distinct phases. The first phase spans from 1854 to August 1897, followed by a decline in temperature observed from September 1897 to April 1939. The third phase, extending from May 1939 to February 2020, represents the most significant upward trend, signifying the contemporary climate warming. This result suggests that global warming commenced following the Second World War.
Given the significance and complexity of the results obtained, alongside the ongoing advances in machine learning and ocean observation technology, it would be prudent in the very near future to expand this study to encompass the whole African coast. This expansion could involve employing supervised classification methods while considering the local specifics of each country.