Materialized Views Selection Problem in Decision Supporting Systems: Issues and Challenges ()
1. Introduction
Presenting data in coherent and integral situation reflecting its current state remains a challenge of traditional information systems called transactional systems using OLTP process (On-Line Transactional Processing). However, these systems are still insufficient for the Decision Support Systems (DSS). As long as they do not keep historical information on the data and the evolution of data is not available and they need to build specific systems to simplify the process of online analysis of OLAP (On-Line Analytical Processing) data. These systems, called decision support systems, are information systems dedicated to decision-supporting applications [1] [2] [3]. Figure 1 presents an example of Decision Support Systems. Such systems are centralized around what is called a “Datawarehouse” data warehouse defined according to two basic approaches: the first designates the data warehouse as a collection of integrated data, subject-oriented, non-volatile, historical, summarized and available for interrogation [3]; the second defines the warehouse as a set of Datamart data stores, each store of which represents an extract from the warehouse dealing with a given theme [1] [4].
There are several research works dealing with the design and modeling of data warehouses classified according to the type of database used for storage of data warehouse: works dealing with relational modeling [5] [6] [7] and others based on multidimensional modeling [8] [9]. Indeed, the nature, variety and volume of data in data warehouses are constantly increasing on the one hand and the need for analysts and queriers to optimize response time, storage cost and update time is becoming more and more of a challenge for developers of business intelligence systems. Thus the appearance of spatial warehouses [10] [11] [12] [13], Bigdata [14] - [19] and Datawarehouse in the cloud environment [20] [21] on the other hand.
The purpose of this paper is to present the state of the art of works based on multidimensional modeling using the hyper-cube as a unit of presentation of data stores. The hyper-cube is defined as a presentation of the data to be analyzed in a multidimensional form whose values represent the measures to be analyzed and the dimensions represent the axes of analysis [22]. Indeed, the volume of data stored in warehouses is becoming more and more important, generating hypercubes that are complex to analyze and whose updates are also becoming very difficult. There are several approaches to updating the data cube in a data warehouse. Among these approaches are those that allow the total construction of data cubes [23] and those that deal with incremental updating, which
Figure 1. Architecture of decision system support (DSS).
consists of updating only the data that undergoes changes in the data sources [5] [6] [22]. Before starting the studies on the update of hypercubes, we will first deal with different approaches to the optimization algorithms for the selection of materialized views and their construction.
The rest of this paper is organized as follows: in the first section we define the global context of materialized views selection problem and give a brief overview of some approaches dealing with materialized views, the second section describes the different operators of hypercube manipulation, and the third section presents a retrospective of existing update methods of hypercube. Finally, in the last section, we give some perspectives and a conclusion.
2. A Global Context of the Materialized View Selection
2.1. Materialized View Selection
In the decision-making systems, the size of data become more and more constantly increasing therefore the response time of analytical queries become also very high as illustrated in Figure 3. To solve this situation, the obvious idea is to analyze all analytical queries used and stock its result expressly in data warehouse but this solution is very expansive in term of storage and the time of their computation and update. To deal with the situation the reach area has been proposed to select a set of queries that can response to others, hence the birth of materialized views selection. In fact, for the rest of the document we consider the star schema of Figure 2 like base structure of our Datawarehouse and taking for example the following query:
The hardware specifications of the machine and software description used in the rest of our experimental study are listed below.
Hardware specifications: Processor: Intel(R) Core(TM) i5-2520M CPU @ 2.50 GHz (4 CPUs), ~2.5 GHz, Memory: 6144 MB RAM, Operating System: Windows 10 Professional 64-bit.
Software specifications: Oracle 11G Release 11.2.0.4.0, Oracle Sql Developer Version 18.2.0.183.
Figure 3: Illustrates the performance evaluation measured by the response time increasing the percent size of the dataset in the fact table “Fact_sales” beginning with 200k tuples until 400k tuples with the step of 10% until 100%.
According to the experimental results based on Datawarehouse with the initial size described in Table 1, Figure 3 shows that response time grows when the
Figure 2. Star Schema base structure of our datawarehouse.
Figure 3. Query response time increasing size of data as number of tuples.
Table 1. Size of our datawarehouse.
percent of dataset increase also as described above. Hence it is necessary to develop approaches to cope with this degradation of performance with the evolution of data warehouses in terms of volume and variety of data with the appearance of Spatial and Big Data Warehouse. In fact, Figure 4 show the result of performance experiment among the use of materialized views technics with two view maintenance methods like complete computation and refresh method compared with the response time of the request without materialized views process.
The results in Figure 4 show that the response time for materialized views is negligible compared to the response time for the same query without using materialized views despite the increase in data size. The results are extracted without taking into account the update time of the materialized views. However, the deployment of the materialized view technique requires additional storage space and update time as shown in Figure 5.
Hence the appearance of optimization algorithms based on materialized view techniques with the major challenge of minimizing response time under storage space constraints. The following section gives a brief study of various algorithms dealing with the materialized view selection problem.
2.2. Overview Materialized View Selection Approaches
In fact, a great deal of the research works dealing with materialized view and
Figure 4. Query response time increasing dataset size using materialized view process.
Figure 5. Materialized view with complete computation time increasing dataset size.
classed as NP-complete Problem and several algorithms are been developed in the literature which considers all possible views as an efficient representation structure to select a proper set of views for materialization [19]. The proposed structures to represent the views in the area of data warehouse are: (A)—Data cube lattice, (B)—AND-OR DAG (Directed Acyclic Graph), (C)—The MVPP (Multiple Views Processing Plan). For simplicity raisons to explain this variety of structure we used the request of example 2. Moreover, under this example the corresponding structures should be presented as shown in the following Figure 6.
For simplicity reasons, other information in the figure has been omitted for each node such as: query frequency fv (frequency of the queries on view v), space Sv, update-frequency gv(frequency of updates on v), and reading cost Rv(cost incurred in reading v). The interested researchers could refer to [24] [25] [26] [27] [28] for further details.
Among these algorithms we find:
Harinarayan, V., et al. (1996) in [6] proposed the Greedy algorithm based on space costs defined by the number of rows in the view associated with each view v. The principal of the algorithm is to select some set S of k views that maximizing the benefit B(v, S) defined as follows.
1) For each w ≤ v, define the quantity BW by:
(a) Let u be the view of least cost in S such that
w ≤ u. Note that since the top view is in S,
There must be at least one such view in S.
(A)(B)(C)
Figure 6. (A) Data cube lattice structure; (B) AND-OR DAG (Directed Acyclic Graph), structure; (C) MVPP structure.
(b) If C(v) < C(u), then BW = C(v) − C(u).
Otherwise, BW = 0.
2) Define
.
Table 2 describes the Greedy Algorithm.
Lin, W.-Y. and Kuo I.-C. (2004) in [29] proposed the Greedy Genetic Algorithm by merging the genetic algorithm presented in Table 3 below with the greed algorithm described below. The algorithm is based on two metrics: the Query Cost describing the cost to evaluate query and Maintenance Cost specifying the cost to update cube where the maintenance technique adopted is precomputation from scratch of materialized cubes. Let La lattice of N Cubes
from table Fact R, A Set of Query
, each query has its own execution frequency which is shown as Query Frequency Values
, Update Frequency of cubes
, Constraint Space S and ῼ = (L, C, R, Q, F, G, S). The objective of the algorithm is to find a solution ῼ who is a subset of C, say M, that can minimize the following cost function defined such fitness function μ(·) under the constraint that
:
(1)
First Member is cost to evaluate query qi and second member is the cost to update cube c.
Necir, H. and Drias H. (2011) in [30] proposed A Distributed Clustering with Intelligent Multi Agents System for Materialized Views Selection; the approach is divided into three phases: the pre-processing of queries, generation of views candidates and selection of a final configuration. The goal is to use multi-agents generated with k-means clustering algorithm that use coordinate agent COA to dispatch the information’s between the regional cluster CA. The pre-processing based on the Extractor-Agent that extract all restriction and joint predicates to build the predicates usage matrix of which each row and column represents one query and its predicates where the value in {0, 1} according to query qi if uses the predicate pi or no. Moreover, in the generation of views candidate’s phase the regional clustering has been created using the last predicates matrix formed in the above phase. Finally, performing the configuration having views which executing query with respect to the storage constraint and reducing the total query processing cost.
Vijay Kumar, T. and Kumar, S. (2012) in [31] proposed the Genetic Algorithm Materialized view selection using genetic algorithm, this approach is similar to Lin, W.-Y. and Kuo, I.-C. proposed in [29] but the differences are appeared on the fitness function and the probability added to the crossover and mutation function. The fitness function is defined by a total view evaluation cost (TVEC) that calculate summation of size for materialized view and a size of smallest ancestor for not materialized views described as follow:
(2)
Roy, S., et al. (2015) in [32] proposed Materialized view construction based on clustering technique. The approach based on attribute similarity measured by Jaccard Index presented for sets A and B as bellow and formation of clusters using weighted connected graphs:
(3)
For example
and
, the Jaccard similarity can be calculated as
, because
and
.
The authors calculate the Attribute similarity matrix for each couple of attributes in all queries such as for M queries with n attributes and formed an Attribute Usage Matrix (m × n). In the attribute similarity matrix, all values that are greater than or equal to an average similarity value are taken into account when constructing graphs to form clusters. This clustering technique receive as parameters the Relation R with n attributes
, the set off queries
and Cut-off Average Similarity, generate the Attribute Usage Matrix and Attribute similarity Matrix and return the cluster that present the set of selected of Materialized Views. The entire algorithm is shown in Table 4.
Anjana Gosaina, Heenaa Procedia Computer Science 79 (2016) in [33] proposed a linear cost model for materialized view selection problem based on the Particle Swarm Optimization algorithm described in Table 5, the model is initialized by an arbitrary population of N particles candidate. Each particle has characterized by certain speed and position in a space with D dimensions and continually changes their speed and position according to the best position found by itself -particle best (-pbest) and by the all population -global best (-gbest) so far. Refers to [33] for more details. The authors evaluate query processing cost under the space constraint and compared with the results provided by the genetic algorithm. Cost of answering the query defined by the number of rows to be accessed in the corresponding cube for the query. Furthermore the algorithm starts with similarity parameters that are used by Lin, W.-Y. and Kuo, I.-C. (2004) in [29] using the genetic algorithm. But the difference between these two techniques is fitness function: Anjana Gosaina use the first member of the function used by W.-Y. and Kuo, I.-C.
Table 5. Particle swarm optimization algorithm.
Azgomi, H. and Sohrabi, M. K. (2018) in [25] proposed the A game theory-based framework for materialized view selection in data warehouses considering the two opposite players of game are processing cost and view maintenance cost. The proposed algorithm described as follows:
Game: The game is a step-by-step situation to solve the problem of selecting materialized views.
First player: The first player wants to select the views with the highest maintenance costs.
Second player: The second player is greedy to choose views with higher query processing costs.
Strategy: There are two strategies possible for the game depending to the used method for materialized view selection: Greed approach for pure strategy or random method for mixed strategy. Moreover, both players start the game with all possible views as the input of their strategies.
Payoff function: PofP1 = Cvm − Cqp; and PofP2 = Cqp − Cvm.
Where Cvm = view maintenance cost and Cqp = query processing cost of a view.
Equilibrium: Equilibrium defined as a state of the game in which a player cannot improve by changing the conditions of his game, otherwise, when a payoff does not increase in a single step. Obviously, reducing the payoff value of a player or a part of the players does not mean reaching the equilibrium.
Type of game: Since the decision of the players, doesn’t depend on the previous decisions, the total time of the solution is reduced by players using simultaneous selection of views, in this case the game can be considered as a static game. In addition, the game is assumed as a non-zero-sum game in which the total benefits and losses of the players are more or less than zero.
Azgomi, H. and Sohrabi, M. K. (2019) in [8] proposed “A novel coral reefs optimization algorithm for materialized view selection in data warehouse environments. The purpose method used MVPP structure to present all views, and considered bit-string with length equal to the number of views of the MVPP is the initial population of the algorithm. Each case of the bit-string takes1 or 0 depending on if the view is a part of the solution or no. the process of the algorithm is divided on six steps described as follow in Figure 7.
Prakash, J. and Kumar, T. V. (2021) in [34] proposed the A multi-objective approach for materialized view selection using on a non-Pareto based genetic algorithm MVSVEGA, that selects Top-K views for materialization from a multidimensional lattice that minimizing simultaneously the two costs CMV and CNMV of TVEC given below:
Figure 7. A novel coral reefs optimization algorithm.
(4)
where
(5)
and:
(6)
Azgomi, H. and Sohrabi, M. K. (2021) in [19] proposed a map-reduce-based approach for creating Multiple Views Processing Plan MVPP structure in a reasonable time in data warehouses for big data applications using the Map-Reduce Model and the hashing technique that used to reduce the search space of a problem. The model map-reduce model consists of two main functions, MAPfunction and REDUCEfunction. The MAP function processes a pair of (key, value) inputs to create a set of intermediate (key, value). The REDUCEfunction integrates all intermediate data related to each key, and produces the result. The structure of the MVPP was presented formed n MVPPs from n! possible MVPPs, and selected the MVPP that had the lowest cost as the final solution, these methods was O(n), where n was the number of queries. The total cost of the MVPP in this method is calculated by the summation of the query processing cost and the view maintenance cost of all views.
3. Comparative Study of the Materialized View Approach
As already defined in the previous section, the goal of all approaches developed in the literature to resolve a problem of materialized view selection is to minimize the query processing cost under storage constraint. To achieve this aim, each approach describes its own process using heuristic algorithms, genetic algorithms, game theory or clustering techniques. These processes differ from each other by their starting parameters and the main function that allows reaching the main objectives. Table 6 presented the comparative study and summarizes the differences between these approaches.
Table 6. Comparative table of materialized view selection algorithms.
4. Conclusion
In this paper, an overview of algorithms dealing with the materialized view selection problems has been given. Starting with the determination of the global context of the materialized view defining why the query needed to be materialized, followed by the comparison of the approaches developed in the literature. In conclusion, we have found that most algorithms prefer to use the multidimensional aspect to benefit from their advantages in solving multivariate problems defined in our case by the cost of reading and maintaining the materialized views. In order to minimize this cost in data warehouse environment, each approach defines its strategy to achieve this goal with a minimal cost and a reasonable complexity level. Our research axis focuses on the multidimensional model whose success has already been demonstrated by the algorithms used and we will propose our optimal solution.