Describing Fuzzy Membership Function and Detecting the Outlier by Using Five Number Summary of Data
1. Introduction
Fuzzy logic is used to describe fuzziness, which is characterized by a membership function. Put simply, a membership function represents the degree of truth in fuzzy logic. The membership function was first introduced by Lotfi A. Zadeh in his paper “Fuzzy sets” [1]. The most common fuzzy membership functions are the impulsive, triangular, trapezoidal and Gaussian fuzzy membership functions [2]. Membership functions can be regarded as a technique for solving practical problems by experience rather than knowledge. Fuzzy logic is used to reduce uncertainty in many sectors such as agriculture [3], medicine [4], power systems [5], production [6] and transportation [7], and the membership function plays an important role in those implementations. It also has a great impact on solving various problems in fuzzy mathematics [8] [9].
There are several ways to define membership functions [10] [11]. Some of them are horizontal method of membership estimation, vertical method of membership estimation, pairwise-comparison method of membership function estimation, problem specification based membership determination, membership estimation via fuzzy clustering, artificial neural network, and genetic algorithm [10].
The number of articles is increasing in the field of cognitive systems and artificial intelligence, where neural networks and reasoning systems such as fuzzy systems are used. In these cases, the membership functions are generally tuned in a cyclic fashion and are closely tied to their associated rule structure [11]. Research on determining membership functions with genetic algorithms is comparatively scarce [12], although genetic algorithms have become much more capable of finding optimum solutions. Both the neural network and the genetic algorithm approaches to determining membership functions generally make use of associated rules in the knowledge base.
Every method has its own merits and demerits depending on the scheme. Our main aim is to describe the linguistic variables in a proper way. Membership functions of fuzzy sets attempt to capture the concepts of linguistic variables. Several membership functions representing linguistic concepts such as low, medium, high, and so on are often employed to define the states of a variable. Such a variable is usually called a fuzzy variable [13]. Methods for building a membership function are inconsistent for several reasons, chiefly because the terminology used in defining the membership function varies from person to person. Membership functions can either be chosen by the user arbitrarily, based on the user’s experience, or be designed using machine learning methods (e.g., artificial neural networks, genetic algorithms, etc.). In this paper, a new approach is introduced to formulate five-state fuzzy membership functions by using the five number summary of the data. The linear representations of these membership functions have semi-trapezoidal, trapezoidal and triangular shapes, and they also help to identify the outliers in a data set. All the graphs of the membership functions are plotted using MATLAB. The method is easy to understand and to apply in any outlier detection problem.
An outlier is an observation whose value lies far from the values of the other observations in the sample, perhaps three or four standard deviations away from the mean value of all the observations [14]. Many inferential procedures are based on the assumption that the population distribution is normal (a certain type of bell curve) [15]. Even a single extreme outlier in the sample warns the investigator that such procedures may be unreliable, and sometimes the presence of several mild outliers conveys the same message [15]. Hence, a researcher should check whether any outliers exist before analyzing the data. The most common way is the box plot, also called the box and whisker plot [16] [17]. Outliers can also be detected by fuzzy clustering [18] and fuzzy discriminant analysis [19], but these methods are complex and time consuming.
Therefore, in this paper, a different approach is developed to detect outliers by using membership functions of the data. First, fuzzy membership functions are constructed from six points: the lower outer fence, lower inner fence, first quartile, third quartile, upper inner fence and upper outer fence. In this process five-state fuzzy membership functions are developed. Outliers can be detected from the degree of membership in the first and the last state. The results obtained by this process match the results obtained by the box plot. The novelty of this method is that it simultaneously creates membership functions of the data and identifies any outliers. The procedure is simple and requires less computation than neural networks and genetic algorithms.
The remaining parts of the paper are organized as follows. Section 2 contains the preliminary definitions of fuzzy set and membership function. In Section 3, the algorithm of the five number summary and the box plot is discussed. The proposed method is introduced in Section 4. Implementations of the new approach on several real life examples are presented in Section 5. The conclusion of the paper is given in the last section.
2. Fuzzy Set and Membership Function
The word “fuzzy” means “vagueness (ambiguity)”. Fuzziness occurs when the boundary of a piece of information is not well-defined [13]. For example, words like young, tall, good or high are fuzzy. There is no single quantitative value which defines the term young. For some people, age 25 is young and for others, age 35 is young. The concept young has no clean boundary. Fuzzy set theory is an extension of classical set theory where elements have a degree of membership between 0 and 1. In traditional set theory, an element is either in or not in a set A, that is $x \in A$ or $x \notin A$; this kind of set is called a crisp set [2]. A fuzzy set is a set that is characterized by a fuzzy membership function $\mu_A(x)$. If $\mu_A(x) = 1$, it implies that $x \in A$. On the other hand, if $\mu_A(x) = 0$ then $x \notin A$ [2].
2.1. Definition
Two distinct notations are most commonly employed in the literature to denote membership functions. In one of them, the membership function of a fuzzy set “A” is denoted by $\mu_A$; that is $\mu_A : X \to [0, 1]$. In the other one, the membership function is denoted by $A$ and has, of course, the same form $A : X \to [0, 1]$. A membership function for a fuzzy set “A” on the universe of discourse X is defined as $\mu_A : X \to [0, 1]$, where each element of X is mapped to a value between 0 and 1 [13].
2.2. Definition
The membership function fully defines the fuzzy set. Suppose X is a universal set. Then a fuzzy set “A” can be defined as the set of ordered pairs such that $A = \{(x, \mu_A(x)) : x \in X\}$ [20]. The nonempty set of objects X is called the referential set, [0, 1] (the unit interval) is called the valuation set, and $\mu_A : X \to [0, 1]$; $\mu_A(x)$ represents the grade of membership of x [13].
2.3. Example
We consider three fuzzy sets
and
that represent the concepts of a young, middle-aged, and old person respectively. A reasonable expression of these concepts by trapezoidal membership functions
and
is shown in Figure 1. These functions are defined on the interval
, where X is the set of ages of human beings such that
Young-aged,
Middle-aged,
Old-aged,
Linear representation of these membership functions is given in Figure 1.
Figure 1. Membership functions representing the concepts of a young, middle-aged, and old person.
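As an illustration of how trapezoidal membership functions of this kind can be evaluated, the following is a minimal Python sketch. The breakpoints (20, 35, 45, 60) and the age range 0 to 80 are hypothetical values chosen only for illustration and are not taken from Figure 1; the helper names `trapezoid`, `young`, `middle_aged` and `old` are likewise assumptions.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 1 on [b, c], linear on (a, b) and (c, d), 0 elsewhere."""
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0


# Hypothetical breakpoints on an assumed age range [0, 80].
def young(age):
    return trapezoid(age, 0, 0, 20, 35)      # full membership up to 20, none from 35


def middle_aged(age):
    return trapezoid(age, 20, 35, 45, 60)    # full membership on [35, 45]


def old(age):
    return trapezoid(age, 45, 60, 80, 80)    # none up to 45, full membership from 60


if __name__ == "__main__":
    for age in (18, 25, 40, 52, 70):
        print(age, young(age), middle_aged(age), old(age))
```

With these assumed breakpoints, an age of 25 belongs partly to the young set and partly to the middle-aged set, which is the gradual transition that a crisp set cannot express.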
3. Five Number Summary and Box Plot
The five number summary gives a rough idea of what a data set looks like. It consists of five items of a data set [21]:
The minimum,
$Q_1$ (the first quartile, or 25th percentile),
The median,
$Q_3$ (the third quartile, or 75th percentile),
The maximum.
The box plot, also called the box and whisker plot [16] [17], is a visual representation of a data set based on the five number summary [16] [21] or a six number summary [22]. It allows for the identification of outliers [15] [23] [24] [25] [26]. A box plot is constructed by the following steps [16] [23] [26]:
Step 1: Sort the data on a primary attribute.
Step 2: Calculate the median, the quartiles $Q_1$ (25th percentile) and $Q_3$ (75th percentile), and the inter-quartile range $IQR = Q_3 - Q_1$.
Step 3: Calculate the points that are 1.5 × IQR below $Q_1$ and 1.5 × IQR above $Q_3$. These two points are called the lower and the upper inner fences, respectively.
Step 4: A central box extends from the 25th to the 75th percentiles. This box is divided into two compartments at the median value of the data set.
Step 5: The line segments projecting out from the box extend in both directions to the adjacent values, that is, the most extreme observations lying within 1.5 times the length of the box beyond either quartile. All data points outside this range are represented individually by little circles; these are considered to be outliers or extreme observations that are not typical of the rest of the data.
The observations that fall outside the two inner fences are called outliers. These outliers can be classified into two kinds: mild and extreme outliers. To do so, we define two outer fences: a lower outer fence at 3 × IQR below the first quartile and an upper outer fence at 3 × IQR above the third quartile. If an observation is outside either of the two inner fences but within the two outer fences, it is called a mild outlier. An observation that is outside either of the two outer fences is called an extreme outlier [26].
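The steps above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: it assumes that the quartiles are the medians of the lower and upper halves of the sorted data (the middle observation is left out of both halves when n is odd), which is only one of several common quartile conventions. The function names and the sample data are hypothetical.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the sorted data; the middle value is excluded when n is odd.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


def fences(q1, q3):
    """Inner fences at 1.5 x IQR and outer fences at 3 x IQR beyond the quartiles."""
    iqr = q3 - q1
    return {
        "lower_outer": q1 - 3 * iqr,
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "upper_outer": q3 + 3 * iqr,
    }


def classify(x, f):
    """Label an observation as 'extreme', 'mild' or 'none' relative to the fences f."""
    if x < f["lower_outer"] or x > f["upper_outer"]:
        return "extreme"
    if x < f["lower_inner"] or x > f["upper_inner"]:
        return "mild"
    return "none"


if __name__ == "__main__":
    sample = [2, 4, 4, 5, 6, 7, 8, 9, 17, 30]  # hypothetical data
    minimum, q1, med, q3, maximum = five_number_summary(sample)
    f = fences(q1, q3)
    for x in sample:
        print(x, classify(x, f))
```

Other quartile conventions (for example linear interpolation between order statistics) can shift the fences slightly, so borderline observations may be classified differently.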
4. The New Approach
In a fuzzy membership function, there can be multiple states depending on the domain of the data set. The states of a fuzzy variable are fuzzy sets representing linguistic concepts: very low, low, medium, high, very high, etc. We now define five-state fuzzy membership functions by using the five number summary of a data set discussed in the previous section. First we execute the algorithm of the five number summary to obtain a preliminary description of the data set. From the summary we select six points: the lower outer fence, lower inner fence, first quartile, third quartile, upper inner fence and upper outer fence. Using these points we construct the five states of the membership function as follows. We also draw the graphs of the membership functions using MATLAB and identify the outliers from this process.
If
is a set of n observations, and
,
,
,
and
are five fuzzy sets defined on X representing the concept of the smallest, small, medium, large and the largest value respectively then the proposed five states fuzzy membership functions are given as follows:
Smallest,
Small,
Medium,
Large,
Largest,
where $Q_1$ is the first quartile, $Q_3$ is the third quartile, and $IQR$ is the inter-quartile range.
The defined fuzzy membership function follows two properties [13] [20]:
• Cross over point: for all states cross over point is 0.5 which indicates these sets are symmetric.
• Height of the fuzzy set is
.
Membership functions allow us to graphically represent a fuzzy set. The x axis represents the universe of discourse, whereas the y axis represents the degrees of membership in the $[0, 1]$ interval. Graphs of these functions have trapezoidal, semi-trapezoidal and triangular shapes, which are the most common in current applications. For the outliers we consider the membership functions
and
only. Since outliers are the extremely small or extremely large values of a data set, we define the following two conditions for outliers:
•
; If
or
then x is an outlier.
•
; If
or
then x is a mild outlier.
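To make the construction concrete, the following Python sketch builds five membership functions from the six points named above. The exact piecewise formulas are not reproduced in this text, so the shapes here are an assumption: each function is taken to be piecewise linear between consecutive points (semi-trapezoidal for the two end states, triangular for the second and fourth, trapezoidal for the middle one). The outlier rule in `outlier_status` is likewise an assumed reading of the two conditions: full membership in an end state marks an outlier and partial membership marks a mild outlier. All names are hypothetical, and the sketch requires IQR > 0.

```python
def five_state_memberships(x, q1, q3):
    """Membership grades of x in five states built from Q1, Q3 and the fences.

    Assumption: linear pieces between consecutive breakpoints
    (lower outer fence, lower inner fence, Q1, Q3, upper inner fence,
    upper outer fence). Requires q3 > q1.
    """
    iqr = q3 - q1
    lof, lif = q1 - 3 * iqr, q1 - 1.5 * iqr   # lower outer / inner fences
    uif, uof = q3 + 1.5 * iqr, q3 + 3 * iqr   # upper inner / outer fences

    def rising(v, a, b):
        """0 at a, 1 at b, clamped to [0, 1]."""
        return min(1.0, max(0.0, (v - a) / (b - a)))

    def falling(v, a, b):
        """1 at a, 0 at b, clamped to [0, 1]."""
        return 1.0 - rising(v, a, b)

    return {
        "smallest": falling(x, lof, lif),                             # semi-trapezoidal
        "small": min(rising(x, lof, lif), falling(x, lif, q1)),       # triangular
        "medium": min(rising(x, lif, q1), falling(x, q3, uif)),       # trapezoidal
        "large": min(rising(x, q3, uif), falling(x, uif, uof)),       # triangular
        "largest": rising(x, uif, uof),                               # semi-trapezoidal
    }


def outlier_status(x, q1, q3):
    """Assumed rule: grade 1 in an end state -> outlier; grade in (0, 1) -> mild outlier."""
    m = five_state_memberships(x, q1, q3)
    end_grade = max(m["smallest"], m["largest"])
    if end_grade == 1.0:
        return "outlier"
    if end_grade > 0.0:
        return "mild outlier"
    return "none"


if __name__ == "__main__":
    # Hypothetical quartiles Q1 = 4 and Q3 = 9.
    for value in (6, 17, 30):
        print(value, five_state_memberships(value, 4, 9), outlier_status(value, 4, 9))
```

Under this assumed construction, neighbouring states cross at a grade of 0.5 and every state reaches a grade of 1, and an observation has a positive grade in an end state exactly when it lies beyond an inner fence.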
5. Implementation
Three real life problems are discussed in this section to show the capability of the proposed method. To evaluate the proposed method, real world data sets are used. The list of variables used in this section is given in Table 1. Data sets with both even and odd numbers of observations are taken from Bangladesh University of Business and Technology and the Bangladesh Meteorological Department [27]. First the five number summary of the data set is extracted, and then the fuzzy membership functions of the data are defined from it. Outliers are identified by plotting the membership functions on a graph. Box plots are also drawn for each problem to show that the proposed method identifies the same outliers.
5.1. Height of People
Below is a list of the heights (in centimeters) of 42 male first-year students in 2020 at Bangladesh University of Business and Technology:
170.18, 179.83, 161.38, 172.72, 175.26, 175.26, 170.18, 170.18, 182.88, 173.00, 152.40, 170.69, 179.83, 167.64, 175.26, 172.72, 163.83, 172.72, 170.18, 201.17, 177.80, 160.02, 152.40, 172.72, 177.80, 167.64, 167.64, 160.02, 172.72, 155.45, 170.18, 181.61, 167.64, 142.54, 165.10, 175.26, 167.64, 172.72, 175.26, 177.80, 167.64, 172.72.
Our aim is to identify the shortest, medium and tallest students in this data set by using the new approach. Here the number of observations, $n = 42$, is even. The five number summary of the given data is estimated by the usual statistical procedure [16] [23] [26]. We get,
The minimum value,
First quartile,
Median,
Table 1. List of variables used for the evaluation of proposed method.
Third quartile,
The maximum value,
Inter Quartile Range,
Inner fence:
and
.
Outer fence:
and
.
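As a check on the arithmetic, the fences can be recomputed from the listed heights. The worked values below assume that the quartiles are the medians of the lower and upper halves of the sorted data (21 observations each); the results may differ slightly if another quartile convention is used.

```latex
\begin{align*}
Q_1 &= 167.64, \qquad Q_3 = 175.26, \qquad IQR = Q_3 - Q_1 = 7.62 \\
\text{Lower inner fence} &= Q_1 - 1.5 \times IQR = 167.64 - 11.43 = 156.21 \\
\text{Upper inner fence} &= Q_3 + 1.5 \times IQR = 175.26 + 11.43 = 186.69 \\
\text{Lower outer fence} &= Q_1 - 3 \times IQR = 167.64 - 22.86 = 144.78 \\
\text{Upper outer fence} &= Q_3 + 3 \times IQR = 175.26 + 22.86 = 198.12
\end{align*}
```

Under these assumed fences, 142.54 lies below the lower outer fence and 201.17 lies above the upper outer fence.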
The box plot of this summary, with the outliers marked, is given in Figure 2.
Now
; where X is the data set of heights of the given students, if
,
,
,
and
represent the fuzzy sets of the shortest, short, medium, tall and the tallest students respectively; then by using our proposed method the five states fuzzy membership functions are given as follows:
Shortest,
Short,
Medium,
Tall,
Tallest,
Figure 2. Box plot for height of the students.
Figure 3. Membership functions for height of the students and the outliers.
The graphical representation of these membership functions, plotted using MATLAB with the outliers marked, is shown in Figure 3.
The box plot shows that there are two outliers in the data set X, namely 142.54 and 201.17. The membership grades of these two observations are:
and
. So both of these are outliers in terms of our condition. On the other hand, box plot shows that there is a suspected outlier at
and membership grade of it is
, which meets the conditions for being a mild outlier. Thus the new method of detecting the outlier shows the same result as the box plot.
5.2. Weight of People
Below is a list of the weights (in lbs.) of 15 female fifth-semester students in 2020 at Bangladesh University of Business and Technology:
105.8, 187.4, 132.3, 132.3, 105.8, 127.4, 114.6, 99.2, 88.2, 119.0, 110.2, 125.7, 108.0, 143.3, 108.0.
We need to select the thinnest students for a 100 meter running competition. For this, we apply our new approach to construct five-state membership functions of the thinnest, thin, average, heavy and heaviest students. First we arrange the data in ascending order of magnitude; the set of observations is then:
88.2, 99.2, 105.8, 105.8, 108.0, 108.0, 110.2, 114.6, 119.0, 125.7, 127.4, 132.3, 132.3, 143.3, 187.4.
The five number summary of these given data is estimated by usual statistical procedure [16] [23] [26]. We get,
The minimum value,
First quartile,
Median,
Third quartile,
The maximum value,
Inter Quartile Range,
Inner fence:
and
.
Outer fence:
and
.
The box plot of this summary, with the outliers marked, is given in Figure 4.
Now
; where X represents the weights of the given female students, if
,
,
,
and
are the fuzzy sets of the thinnest, thin, average, heavy and the heaviest students respectively; then by using our proposed method the five states fuzzy membership functions are given as follows:
Thinnest,
Thin,
Average,
Heavy,
Heaviest,
The graphical representation of these membership functions, plotted using MATLAB with the outliers marked, is shown in Figure 5.
Figure 4. Box plot for weight of the students.
Figure 5. Membership functions for weight of the students.
The box plot shows that there is a suspected outlier at
and the membership grade of it is
, which meets the conditions for being a mild outlier. So here too the new method and the box plot are showing the same result.
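The box plot side of this comparison can be reproduced with a short Python check on the listed weights. It is a sketch under one assumption: the quartiles are taken as the medians of the lower and upper halves of the sorted data, with the middle observation excluded because n = 15 is odd. The variable names are hypothetical.

```python
from statistics import median

# Weights (in lbs.) as listed in this subsection.
weights = [105.8, 187.4, 132.3, 132.3, 105.8, 127.4, 114.6, 99.2,
           88.2, 119.0, 110.2, 125.7, 108.0, 143.3, 108.0]

xs = sorted(weights)
half = len(xs) // 2                  # n = 15 is odd, so xs[half] is the median
q1 = median(xs[:half])               # median of the 7 smallest values (assumed convention)
q3 = median(xs[half + 1:])           # median of the 7 largest values (assumed convention)
iqr = q3 - q1

lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr
lower_outer, upper_outer = q1 - 3 * iqr, q3 + 3 * iqr

# Extreme outliers lie beyond an outer fence; mild outliers lie between
# an inner fence and the corresponding outer fence.
extreme = [x for x in xs if x < lower_outer or x > upper_outer]
mild = [x for x in xs
        if (x < lower_inner or x > upper_inner) and x not in extreme]

print("Q1 =", q1, "Q3 =", q3)
print("mild outliers:", mild)
print("extreme outliers:", extreme)
```

Under this convention the only flagged observation is 187.4, and it falls between the upper inner and upper outer fences.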
5.3. Intensity of Daily Temperature
The intensity of heat is not the same for all objects. Water begins to boil at a temperature of 100˚C (Celsius), whereas the melting point of uranium is 1132˚C. So it is sometimes difficult to determine the warmest or the coldest day of a month from the temperature alone. A membership function can reduce the complexity of this problem.
The daily maximum temperature (in ˚C) for the month of May 2020 of Dhaka city is given below [27]:
28, 30, 31, 34, 35, 30, 33, 33, 35, 36, 35, 36, 33, 34, 34, 35, 35, 37, 33, 29, 29, 32, 34, 33, 33, 32, 29, 31, 33, 32, 29.
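Using the new approach, we want to check whether the days of May are warm or temperate according to the temperature. First we arrange the data in ascending order of magnitude; the set of observations is then:
28, 29, 29, 29, 29, 30, 30, 31, 31, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 35, 35, 35, 35, 35, 36, 36, 37.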
Here the number of observations, $n = 31$, is odd. The five number summary of these data is estimated by the usual statistical procedure [16] [23] [26]. We get,
The minimum value,
First quartile,
Median,
Third quartile,
The maximum value,
Inter Quartile Range,
Inner fence:
and
.
Outer fence:
and
.
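As a check on the arithmetic, the fences can be recomputed from the listed temperatures. The worked values below assume that the quartiles are the medians of the lower and upper halves of the sorted data (15 observations each, with the middle observation excluded because n is odd); another quartile convention may give slightly different values.

```latex
\begin{align*}
Q_1 &= 31, \qquad Q_3 = 35, \qquad IQR = Q_3 - Q_1 = 4 \\
\text{Lower inner fence} &= Q_1 - 1.5 \times IQR = 31 - 6 = 25 \\
\text{Upper inner fence} &= Q_3 + 1.5 \times IQR = 35 + 6 = 41 \\
\text{Lower outer fence} &= Q_1 - 3 \times IQR = 31 - 12 = 19 \\
\text{Upper outer fence} &= Q_3 + 3 \times IQR = 35 + 12 = 47
\end{align*}
```

The smallest and largest observations, 28 and 37, both lie inside the inner fences under this assumption.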
The visual representation of this summary is given in Figure 6.
Figure 6. Box plot of daily temperature.
Now if we define fuzzy sets
,
,
,
and
representing the temperature of very low, low, medium, high and very high respectively, then
; the five states membership functions are given as follows:
Very low,
Low,
Medium,
High,
Very high,
By using MATLAB the graphical representation of these membership functions is shown in Figure 7.
The graph of the membership functions does not display any outliers, and neither does the box plot. This means that in May 2020 there was no day in Dhaka city on which the temperature was excessively high or low.
Figure 7. Membership function for daily temperature.
6. Conclusion
An efficient method for generating membership functions and identifying outliers is discussed in this paper. By applying it, the solutions of some real life problems have also been highlighted. Each problem is presented graphically and its results are described. The results agree with the box plot in every example and provide adequate information about the data set. The sets of human heights and weights have been successfully described by using it. Linguistic variables can be easily captured by defining their membership functions in this process. Five-state membership functions of linguistic variables such as tiny, short, medium, tall and giant can be identified by this approach. In this procedure triangular, semi-trapezoidal and trapezoidal membership functions have been constructed; these play an important role in the identification of outliers. There is no difference between the outliers obtained by this process and the outliers obtained by the box plot method. So this new method is effective for simultaneously creating membership functions of data and detecting outliers. The proposed method is also computationally less expensive and less time consuming than common methods such as neural networks, fuzzy clustering, and genetic algorithms. Although only triangular and trapezoidal membership functions have been described in this paper, there are other membership functions such as sigmoidal, Gaussian, z-shape and s-shape functions. Our future aim is to define these types of functions using the proposed method.