Outlier Detection of Air Quality for Two Indian Urban Cities Using Functional Data Analysis
1. Introduction
Modern lifestyles invariably involve energy use and its consequences. Anthropogenic sources of air pollution include the combustion of straw, coal, and gasoline, as well as emissions from industry, vehicles, aircraft, and aerosol cans. Every day, hazardous pollutants such as CO, CO2, particulate matter (PM), NO2, SO2, O3, NH3, and Pb are released into the environment. The chemical compounds and particles in polluted air affect the physical well-being of humans, animals, and ecosystems. Pneumonia, lung cancer, coronary artery disease, and influenza are among the serious ailments that people may develop as a result of exposure to air pollution. Poor air quality contributes to smog, aerosol formation, reduced visibility, rising temperatures, acid rain, premature death, and other contemporary environmental problems. Machine learning has been used to forecast air pollution in Indian cities [1] [2] . In one multi-city campaign, monitoring was carried out at eight sites covering rural, semi-urban, and metropolitan environments. Throughout the campaign, the semi-urban location (Sirsa) had the greatest average concentration of the pollutants of concern. According to this multi-city study, semi-urban areas have the worst air quality during rabi crop residue burning and require special attention to resolve air quality concerns in the Indo-Gangetic plain region [3] . Delhi’s indoor environment is deteriorating day by day as a result of vehicle pollution, and the effect of vehicular frequency on Delhi’s air quality remains unknown [4] . Ahmad M. et al. [5] conducted an outlier study of the Yamuna River using statistical process control (SPC) and classical statistics. Haque M. S. and Singh R.B. [6] performed a case study of the effects of air pollution on human health in Kolkata, India. Only 39.3% of respondents reported that their health was negatively impacted by outdoor (air) pollution, despite the fact that the pollution level has been classified as critical. 
An outlier identification case study using functional data analysis, which also considered the impact of the COVID-19 outbreak on air quality, was conducted in Gijon, Spain [7] . The branch of statistics examined here is functional data analysis (FDA), and this research uses its techniques to model this kind of data. It was chosen to address the inefficiency of traditional approaches for finding outliers in vectorial data, which became apparent when comparing the findings from box plots and statistical control charts, both of which were used in this research to test for the presence of outliers. Environmental engineering [8] [9] [10] [11] [12] , industrial processes [13] [14] , sensors [15] [16] , and medical research [17] are just a few of the many fields in which FDA is used today. The advantage of functional analysis is that it allows the identification problem to be investigated from a time-related viewpoint. This is done by turning a set of discrete measurements that change over time into mathematical functions. Additionally, the best method in this investigation was determined to be the functional outlier indicator based on directional outlyingness proposed by Dai et al. [18] . This technique can detect outliers with greater precision and robustness because it uses two quantities: the mean directional outlyingness, which captures how far a curve is shifted relative to the rest, and the variation of directional outlyingness, which captures how a curve’s shape differs from the rest. Although there are various approaches for detecting outliers, such as the Grubbs test [19] and the Jäntschi test [20] , these generally operate on a vectorial data set. In this instance, both conventional and functional methodologies were applied, and their results were compared to identify the most effective course of action. Few researchers have expressed an interest in finding outliers in Indian air quality data. 
The goal of this study was to detect outliers in particulate matter (PM2.5 and PM10) and trace gases (NO2, NOx, SO2, and O3) in Delhi and Kolkata, India. Kolkata is located in eastern India, while Delhi is located in northern India. The functional approach to data analysis was effectively used in this study. Using this functional technique, the dataset was treated as continuous measurements over time rather than as discrete values. In order to assess the trend and periodicity of the data, this approach yields functional outliers and does not treat isolated discrete measurement errors as outliers. Additionally, unlike most outlier detection criteria based on a discrete approach, the functional method does not require a normally distributed data set. As a result, further processing is not required. In this study, functional data analysis was used and compared to the traditional technique. The paper is organized as follows: Section 2 describes the materials and methods, Section 3 presents the results, and Section 4 gives the key conclusions drawn from them.
2. Materials and Methods
2.1. Materials
Data
The current investigation aims to find outliers in air quality data from two Indian cities, Delhi and Kolkata. Delhi is located in northern central India and has a population of 32.94 million people. Its area is 1483 km2. It lies much farther inland on the Asian continent than Kolkata. Because no large water bodies border Delhi, there is little moderation of its temperature. As a result, Delhi endures extreme temperatures in both the summer and winter seasons. Temperatures in Delhi often fluctuate between 2 and 45 degrees Celsius. Overall, Delhi’s climate is a combination of semi-arid conditions and monsoon-influenced humid subtropical conditions, with notable differences between summer and winter temperatures as well as variability in precipitation. On the other hand, Kolkata, which is in eastern India and has a population of 15.33 million people and an area of 1886.67 km2, is nearer to the sea and hence benefits from the moderating effect of the water bodies. Kolkata therefore has more moderate temperature changes, and the transition from summer to winter there is gradual. Delhi, by contrast, lies at the other extreme in terms of temperature range.
Data for this study come from air quality monitoring station records. The investigated data consist of daily station records over a period of two years, starting on January 1, 2018, and ending on December 31, 2019. The resulting series has 720 days, since every month was linearly interpolated to have 30 days in order to meet the requirements for converting discrete data to functional data. Both cities’ PM2.5, PM10, NO2, NOx, O3, and SO2 levels were examined. Each variable is measured in μg/m3. The data were obtained from India’s Central Pollution Control Board (CPCB).
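As an illustration of the 30-day regularization described above, the following minimal Python sketch resamples one month of daily values onto 30 equally spaced points by linear interpolation. The function name `resample_month` and the toy inputs are hypothetical, and the authors' actual preprocessing may differ in detail.

```python
import numpy as np

def resample_month(values, n_days=30):
    """Linearly interpolate one month of daily values onto n_days equally spaced points."""
    values = np.asarray(values, dtype=float)
    src = np.linspace(0.0, 1.0, num=values.size)  # original day positions rescaled to [0, 1]
    dst = np.linspace(0.0, 1.0, num=n_days)       # target positions on the same scale
    return np.interp(dst, src, values)

# Hypothetical months of different lengths (values are placeholders, not CPCB data)
january = np.arange(31, dtype=float)   # 31-day month
february = np.arange(28, dtype=float)  # 28-day month
months = np.vstack([resample_month(january), resample_month(february)])
print(months.shape)  # (2, 30)
```

Stacking the 24 resampled months of one pollutant in this way would give a 24 × 30 array, one row per curve to be smoothed.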
2.2. Methods
2.2.1. Classical Statistics
The objective of the conventional quantitative analysis is to examine the trend of each pollutant, determine whether any of them exceeds the limit, and track changes in air quality using descriptive statistics such as the mean, quartiles (Q1, Q2, Q3), time series, and box plots. A boxplot, a standardized technique based on a five-number summary (“minimum”, “Q1”, “median”, “Q3”, and “maximum”), is used to display the distribution of the data and allows conclusions to be drawn regarding outlying values. Some distributions or data sets require more detail than measures of central tendency can provide: we also need to understand the fluctuation and dispersion of the data. A boxplot is an illustration that effectively depicts the distribution of the values contained in the data.
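A minimal Python sketch of the boxplot summary follows. The paper does not state the whisker rule; the 1.5 × IQR fences used here are the common Tukey convention and are an assumption, as are the function name `boxplot_outliers` and the toy PM2.5 values.

```python
import numpy as np

def boxplot_outliers(values, whisker=1.5):
    """Return quartiles, fences, and the points lying outside the fences."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]               # ignore missing readings
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower = q1 - whisker * iqr                       # assumed Tukey lower fence
    upper = q3 + whisker * iqr                       # assumed Tukey upper fence
    mask = (values < lower) | (values > upper)
    return {
        "q1": q1, "median": median, "q3": q3,
        "lower_fence": lower, "upper_fence": upper,
        "outliers": values[mask],
    }

# Hypothetical daily PM2.5 readings in μg/m3 (not taken from the CPCB data)
pm25 = np.array([40, 42, 45, 47, 50, 52, 55, 58, 60, 300.0])
summary = boxplot_outliers(pm25)
print(summary["outliers"])
```

Because each reading is judged on its own against the fences, this rule ignores the order of the observations in time, which is the limitation the functional approach is meant to address.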
2.2.2. Functional Data Analysis
FDA is a collection of methods for studying curves and functions in order to analyze data over time [21] . The first step is to convert vector samples into functional samples. The starting points, which are the discrete values generated by the study, are used to create the curves. Smoothing is the process of transforming these discrete points into a continuous function over time. This data representation is valuable in air pollution research since it treats all of the values from a given time interval, such as a day, as a single unit. For example, a day whose NO2, NOx, SO2, O3, PM2.5, and PM10 values show unusual variability may have an average identical to that of the other days, so the vectorial approach would fail to detect it as an outlier. Such days would be identified as possible outliers by the functional analysis. For outlier detection in these types of investigations, functional techniques have frequently been shown to be superior.
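Let $x(t_j)$ represent the initial observations, where $t_j \in \mathbb{R}$ signifies the time steps and $p$ represents the number of observations ($j = 1, 2, \dots, p$). Each observation can be regarded as a discrete value of a function $x(t) \in \chi \subset F$, where $F$ is a functional space. The functional space $F = \operatorname{span}\{\phi_1, \dots, \phi_{n_b}\}$ is used to estimate $x(t)$, where $\{\phi_k\}$ is the set of basis functions ($k = 1, 2, \dots, n_b$) and $n_b$ is the number of basis functions necessary to generate a functional sample. In statistics, there are various types of bases, and the Fourier basis is among the most commonly employed. Furthermore, for periodic data like the ones we have in our study, the Fourier basis is the best option [22] . The smoothing problem is

$$\min_{x \in F} \sum_{j=1}^{p} \left( z_j - x(t_j) \right)^2 + \lambda \Gamma(x) \qquad (1)$$

where $z_j = x(t_j) + \varepsilon_j$ is the result of observing $x$ at the point $t_j$, $\varepsilon_j$ is random noise with zero mean, $\lambda$ is the level of regularization, and $\Gamma$ is the penalizing operator. The function is expanded in the basis as

$$x(t) = \sum_{k=1}^{n_b} c_k \phi_k(t) \qquad (2)$$

where $c_k$ is the coefficient that multiplies the basis function $\phi_k$. We can write the smoothing problem as

$$\min_{c} \left\{ (z - \Phi c)^{T} (z - \Phi c) + \lambda c^{T} R c \right\} \qquad (3)$$

where $z = (z_1, \dots, z_p)^{T}$ is the vector of observations, $c = (c_1, \dots, c_{n_b})^{T}$ is the vector of expansion coefficients, $\Phi$ is a $(p \times n_b)$-matrix whose elements are $\Phi_{jk} = \phi_k(t_j)$, and $R$ is an $(n_b \times n_b)$-matrix whose elements, for the usual second-derivative roughness penalty, are:

$$R_{kl} = \langle D^2 \phi_k, D^2 \phi_l \rangle_{L^2(T)} = \int_T D^2 \phi_k(t) \, D^2 \phi_l(t) \, dt \qquad (4)$$

The problem can be solved with $c = (\Phi^{T} \Phi + \lambda R)^{-1} \Phi^{T} z$.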
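The penalized least-squares solution $c = (\Phi^{T}\Phi + \lambda R)^{-1}\Phi^{T}z$ can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes an unnormalized Fourier basis over one full period, a second-derivative roughness penalty (for which $R$ is diagonal), and hypothetical values for the number of basis functions, $\lambda$, and the synthetic data. The names `fourier_basis` and `smooth_fourier` are invented for the example.

```python
import numpy as np

def fourier_basis(t, n_basis, period):
    """Design matrix Phi (len(t) x n_basis) and the angular frequency of each column."""
    t = np.asarray(t, dtype=float)
    phi = np.empty((t.size, n_basis))
    omega = np.zeros(n_basis)
    phi[:, 0] = 1.0                                  # constant term
    for col in range(1, n_basis):
        k = (col + 1) // 2                           # harmonic number: 1, 1, 2, 2, ...
        omega[col] = 2.0 * np.pi * k / period
        if col % 2 == 1:
            phi[:, col] = np.sin(omega[col] * t)
        else:
            phi[:, col] = np.cos(omega[col] * t)
    return phi, omega

def smooth_fourier(t, z, n_basis=7, lam=1e-2, period=30.0):
    """Solve c = (Phi^T Phi + lam R)^{-1} Phi^T z and return (c, fitted values)."""
    phi, omega = fourier_basis(t, n_basis, period)
    # Assumed penalty: integral of squared second derivatives over one period.
    # For unnormalized sines and cosines this gives a diagonal R with omega^4 * period / 2.
    R = np.diag(omega ** 4 * period / 2.0)
    c = np.linalg.solve(phi.T @ phi + lam * R, phi.T @ np.asarray(z, dtype=float))
    return c, phi @ c

t = np.arange(30.0)                                   # 30 days in one regularized month
clean = 50.0 + 10.0 * np.sin(2.0 * np.pi * t / 30.0)  # synthetic signal, not CPCB data
rng = np.random.default_rng(0)
z = clean + rng.normal(scale=2.0, size=t.size)        # synthetic noisy observations
coef, fitted = smooth_fourier(t, z)
print(np.corrcoef(z, fitted)[0, 1])                   # agreement between data and fit
```

Larger values of `lam` shrink the high-frequency coefficients more strongly and give a smoother curve, while `lam=0` reduces to ordinary least squares on the basis.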
The functional data allow us to determine whether or not different time intervals, such as days, weeks, or months, are higher than the mean function and how far they differ from it. They also enable the removal of outliers that are not real but are caused by system failure. The notion of depth allows a collection of data in Euclidean space to be ordered by how close each point is to the centre of the sample. The concept of depth emerged in multivariate analysis, where it was developed to measure the centrality of a point within a cloud of points. Over the years, this idea has been incorporated into functional data analysis. In this setting, depth measures the centrality of a given curve xi, and the centre of the sample is the mean curve. The two depth measures most commonly used for functional data are the Fraiman-Muniz depth (FMD) and the H-modal depth (HMD) [22] .
Through the estimation of depths, it is also possible to identify outliers with a functional approach. In this case, elements whose behaviour patterns differ from the rest are taken into account. Instead of summarizing the curve observations into a single point, such as the average, the notion of depth makes it possible to work with observations defined over a given interval in the form of curves. Depth is used for outlier identification as follows: an element that is distant from the centre of the sample will have a low depth. Thus, functional outliers are the curves with the least depth.
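First, let $F_{n,t}(x_i(t))$ be the empirical cumulative distribution function of the values of the curves $\{x_i(t)\}_{i=1}^{n}$ at a given time $t \in [a, b]$. It can be defined as:

$$F_{n,t}(x_i(t)) = \frac{1}{n} \sum_{k=1}^{n} I\left( x_k(t) \le x_i(t) \right) \qquad (5)$$

where $I(\cdot)$ is an indicator function. Next, the FMD for curve $x_i$ is calculated as:

$$FMD_n(x_i) = \int_a^b D_n(x_i(t)) \, dt \qquad (6)$$

where $D_n(x_i(t)) = 1 - \left| \frac{1}{2} - F_{n,t}(x_i(t)) \right|$. The functional mode in HMD, on the other hand, is the element or curve that is most densely surrounded by the other curves in the dataset. HMD is written as:

$$HMD_n(x_i, h) = \sum_{k=1}^{n} K\left( \frac{\| x_i - x_k \|}{h} \right) \qquad (7)$$

in a functional space, with a kernel function $K: \mathbb{R}^{+} \to \mathbb{R}^{+}$, a bandwidth parameter $h$, and $\| \cdot \|$ as the norm. In the vast majority of cases, the $L^2$ norm is used, expressed as:

$$\| x_i - x_k \|_2 = \left( \int_a^b \left( x_i(t) - x_k(t) \right)^2 dt \right)^{1/2} \qquad (8)$$

There are also several choices for the kernel function $K(\cdot)$. A popular one is the truncated Gaussian kernel, which can be expressed as:

$$K(t) = \frac{2}{\sqrt{2\pi}} \exp\left( -\frac{t^2}{2} \right), \quad t > 0 \qquad (9)$$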
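The two depth measures can be sketched in Python for curves sampled on a common, equally spaced grid. This is a hedged illustration rather than the authors' code: integrals are replaced by grid averages or Riemann sums, the bandwidth is taken as the 15th percentile of distances between distinct pairs of curves (an assumption that avoids a zero bandwidth for small samples), and the function names and synthetic curves are hypothetical.

```python
import numpy as np

def fraiman_muniz_depth(curves):
    """FMD for an (n_curves, n_points) array; the integral is approximated by a grid average."""
    curves = np.asarray(curves, dtype=float)
    # F[i, t] = fraction of curves k with x_k(t) <= x_i(t)
    F = (curves[None, :, :] <= curves[:, None, :]).mean(axis=1)
    D = 1.0 - np.abs(0.5 - F)                 # pointwise univariate depth
    return D.mean(axis=1)                     # proportional to the integral over [a, b]

def h_modal_depth(curves, dt=1.0, percentile=15):
    """HMD with the truncated Gaussian kernel and the L2 norm; returns (depths, h)."""
    curves = np.asarray(curves, dtype=float)
    diff = curves[:, None, :] - curves[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2) * dt)      # Riemann-sum L2 distances
    pairs = dist[np.triu_indices(curves.shape[0], k=1)]
    h = np.percentile(pairs, percentile)              # assumed bandwidth rule
    kernel = (2.0 / np.sqrt(2.0 * np.pi)) * np.exp(-0.5 * (dist / h) ** 2)
    return kernel.sum(axis=1), h

# Synthetic example: nine similar curves and one strongly shifted curve (index 9)
t = np.linspace(0.0, 1.0, 50)
base = np.sin(2.0 * np.pi * t)
offsets = np.linspace(-0.2, 0.2, 9)
curves = np.vstack([base + o for o in offsets] + [base + 3.0])

fmd = fraiman_muniz_depth(curves)
hmd, h = h_modal_depth(curves, dt=t[1] - t[0])
print(int(np.argmin(fmd)), int(np.argmin(hmd)))  # index of the least deep curve per measure
```

In both cases a low value marks a curve far from the bulk of the sample, which is why the least deep curves are the candidates for functional outliers.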
A functional sample set may have elements that, although not containing errors, exhibit characteristics that are distinct from the rest of the set. Instead of only comparing the mean values over the measurement time interval, the depth measures mentioned above allow sets of observations over time, fitted to curves, to be compared in order to find outliers in functional samples. An outlier in a functional sample will therefore have significantly lower depth, because depth and outlyingness are inverse notions. In order to find functional outliers, the curves with the least depth are sought. When the HMD was used to create the outlier selection criterion, the bandwidth $h$ was chosen as the 15th percentile of the empirical distribution of $\{ \| x_i - x_k \|_2 : i, k = 1, \dots, n \}$ [23] . The cut-off $C$ was chosen so that around 1% of correct observations were incorrectly classified as outliers (type I error) [24] :

$$P\left( D_n(x_i) \le C \right) = 0.01, \quad i = 1, \dots, n \qquad (10)$$

where $D_n(x_i)$ denotes the selected functional depth of curve $x_i$.
Unfortunately, the distribution of the selected functional depth is unknown, necessitating an estimate of C. Among the several approaches available for estimating this value, we selected a method based on bootstrapping [24] [25] [26] the curves of the original set with a probability proportional to their depth. In outline, the bootstrapping strategy is as follows:
1) A new sample is drawn from the original sample by sampling with replacement (each element is returned after extraction so it can be chosen again). In addition, order 10 has been chosen for resampling.
2) The population parameter of interest is estimated using this new sample as a basis to generate a statistic.
3) Repeat the steps above a large number of times.
4) Finally, determine the empirical distribution of the statistic.
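The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure of the cited references: it resamples curves with probability proportional to their depth, takes the 1% quantile of the depths in each bootstrap sample as the statistic, and uses the median of those quantiles as the cut-off. The choice of the median, the number of resamples, the compact Fraiman-Muniz depth helper, and the synthetic curves are all assumptions made for the example.

```python
import numpy as np

def fm_depth(curves):
    """Compact Fraiman-Muniz depth on a common grid (grid average instead of integral)."""
    F = (curves[None, :, :] <= curves[:, None, :]).mean(axis=1)
    return (1.0 - np.abs(0.5 - F)).mean(axis=1)

def bootstrap_cutoff(curves, depth_fn=fm_depth, n_boot=200, alpha=0.01, seed=0):
    """Estimate the cut-off C as the median of bootstrap alpha-quantiles of the depth."""
    rng = np.random.default_rng(seed)
    curves = np.asarray(curves, dtype=float)
    n = curves.shape[0]
    depths = depth_fn(curves)
    prob = depths / depths.sum()              # sampling probability proportional to depth
    cutoffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.choice(n, size=n, replace=True, p=prob)       # step 1: resample
        cutoffs[b] = np.quantile(depth_fn(curves[idx]), alpha)  # step 2: statistic
    return float(np.median(cutoffs))          # steps 3 and 4: summarize the distribution

def flag_outliers(curves, cutoff, depth_fn=fm_depth):
    """Indices of curves whose depth falls below the cut-off."""
    return np.flatnonzero(depth_fn(np.asarray(curves, dtype=float)) < cutoff)

# Synthetic example: 24 monthly curves of 30 points, one of them shifted upward
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 30)
curves = (np.sin(2.0 * np.pi * t)
          + rng.normal(scale=0.1, size=(24, 1))
          + rng.normal(scale=0.02, size=(24, 30)))
curves[5] += 2.0                              # hypothetical atypical month
cutoff = bootstrap_cutoff(curves)
print(cutoff, flag_outliers(curves, cutoff))
```

Weighting the resampling by depth makes atypical curves less likely to enter the bootstrap samples, so the estimated cut-off reflects mostly the typical part of the sample.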
3. Results
The findings were obtained and assessed in two stages. In the first stage, box plots were used to visually examine each air quality variable in the database. In the second stage, the functional chart was constructed, with daily measurements preferred over monthly ones. The daily data were gathered from the CPCB for a 24-month period (from January 1, 2018 to December 31, 2019). Our research therefore focuses on daily NO2, NOx, SO2, O3, PM2.5, and PM10 data in Delhi and Kolkata. The boxplots in Figure 1(a) and Figure 1(b) present the NO2, NOx, SO2, O3, PM2.5, and PM10 concentrations by quartiles. The upper line of the box represents the third quartile (Q3), the middle line is the median, the bottom line represents the first quartile (Q1), and the red dots represent the outliers for both Delhi and Kolkata.
For each variable selected in this study, our sample $\{x_{ij}\}$ corresponded to the 24 months (January 2018 to December 2019), where $x_{ij}$ is the emission measurement for day $i$ of month $j$, with $i = 1, \dots, 30$ and $j = 1, \dots, 24$. Following the smoothing described above, a sample $\{x_j(t)\}_{j=1}^{24}$ is created, where each $x_j(t)$ is now a function, taking into account a set of 1000 basis elements. This procedure yields a correlation of 99% between the discrete values and the values of each function at the corresponding points; in this sense, the functional sample reproduces the discrete sample closely. Figure 2 shows the functional representation of NO2, NOx, O3, SO2, PM2.5, and PM10 (left side) and the functional outlier detection (right side). The outliers obtained using the functional approach are listed in Table 1, which shows that Delhi has more outliers than Kolkata. These curves were identified as outliers owing to weather conditions. Outliers also arise from several other causes, such as traffic pollution, gasoline consumption, fireworks during Diwali, stubble burning, and road dust. In certain conditions, the conventional technique finds a very large number of outliers, even flagging more than 50% of the data. This issue occurs regardless of which station or compound is being reviewed. The number of outliers detected by functional analysis is substantially smaller in both Delhi and Kolkata (NO2, NOx, O3, SO2, PM2.5, and PM10), as shown in Figure 2.
Figure 1. Box plot representation of Delhi (a) and Kolkata (b) NO2, NOx, SO2, O3, PM2.5, and PM10.
Table 1. Functional Data Analysis outlier results from the collected dataset (Delhi and Kolkata).
Figure 2. Functional representation of data (left) and functional outliers (right) [a-x] for Delhi and Kolkata NO2, NOx, SO2, O3, PM2.5, and PM10. The black dotted curves represent the outliers.
4. Conclusion
Because the environment changes continuously and unpredictably, and because contaminants vary in location and time, forecasting air quality is a difficult task. Continuous air quality monitoring and research are required, particularly in developing countries, due to the harmful effects of air pollution on people, animals, plants, historical sites, the climate, and the environment. Nevertheless, researchers have shown little interest in AQI prediction for India. This study employs both a traditional approach and a functional technique that treats the data as functions. Compared to the vector technique, which examines means and cannot take temporal fluctuations into account, the functional strategy has the benefit of extracting more detail from the data. Functional data analysis may be used to discover outliers in non-normal data sets, where it offers significant benefits over conventional methods in detecting particular kinds of variability. Outlier identification using functional data analysis can therefore be used to assess the quality of the air in metropolitan areas. Pollutant emissions often rise along with economic and population expansion and fall during economic contractions. In this study, a novel approach for detecting outliers in gas emission data was devised. While other cities with comparable or dissimilar sources of pollution can use these strategies, it is always important to consider the unique characteristics of each place.
Funding
This work was supported by the National Natural Science Foundation of China (Grant Number 11801019) and Beijing Natural Science Foundation (Grant Number Z190021).
Data Availability Statement
The data has been made publicly available by the Central Pollution Control Board: https://cpcb.nic.in/ which is the official portal of Government of India. They also have a real-time monitoring app: https://app.cpcbccr.com/AQI_India/.
Acknowledgements
The authors would like to thank the editor as well as the reviewers for their many insightful and helpful comments and suggestions that greatly helped the paper. I would like to convey my heartfelt gratitude to Weihu Cheng for his insightful and helpful advice during the inception and implementation of this research project. The excellent efforts and support of Mohammad Nuruzzama and Zhao Xu were likewise greatly appreciated.