Artificial Intelligence Technique in Hydrological Forecasts Supporting for Water Resources Management of a Large River Basin in Vietnam
1. Introduction
An accurate hydrological forecast plays an important role in flood and water supply management of a river delta [1] [2]. Traditionally, physical-based models have been applied to simulate and forecast the flow in a river delta, which is usually a very complicated system with a dense network of rivers and canals [3] [4]. However, this approach poses two difficulties. First, the long computation time shortens the effective forecast lead time, which affects the end users who act on this information. Second, developing the model requires a large amount of data and information, which raises costs [5] [6] [7].
Nowadays, AI techniques can mitigate the problems that physical-based models pose by using machine learning or deep learning algorithms [8] [9] [10] [11]. This kind of technique can be classified as a data-driven model, in which the relationship between inputs and outputs is built through sets of neurons arranged in different layers, the so-called hidden layers. The parameters of the model are tuned to their optimal values during the training process. After that, the trained model can be used to predict the flow at a control section of the river with a very fast response; it usually takes only a few minutes to estimate the output. Because of this advantage, many researchers have developed machine learning or deep learning algorithms for hydrology.
The radial basis function (RBF) network has been applied to forecast the river flow at the High Aswan Dam on the Nile River [12]. The authors used a genetic algorithm to select the inputs fed to the RBF network to predict the flow. The results are good: the coefficient of determination (R^{2}) is around 0.9 in all cases, and the forecast flows agree well with the observed ones. Another application developed a new AI-based Flow Difference Model (FDM) to forecast the flow in a cold-region river. Its results were compared with those of a Regression Model (RM) and a Base Difference Model (BDM), and all evaluation criteria, such as R^{2}, the Nash coefficient, and RMSE, gave good values. A physical-based model and a machine learning model were also compared by forecasting the flow in the Bogota River basin in Colombia using HEC-RAS and ANN [13]. The comparison favored ANN when the forecast levels were compared with the observed ones in terms of R^{2}, MAE, and RMSE.
The Red River Delta is a dense and complex system of rivers and canals feeding 13 integrated irrigation districts [14]. The dynamics of the system are driven by the inflows released from three big reservoirs, the natural flow from some uncontrolled rivers upstream, and the tide coming from the East Sea [15]. The water level of the Red River in Hanoi is usually used to monitor the water supply, navigation and flood control of the delta. Therefore, it needs to be forecast in advance to inform stakeholders of the current state of the system and help them make the right decisions on water use.
Besides, at the basin management level, finding the best policy for the basin requires running an optimization algorithm, which in turn generates a set of suitable policies for decision-makers and stakeholders to choose from [16]. In this case, a physical-based model poses a big computational problem, because it cannot simulate all candidate policies within the time available for policy selection. Therefore, an emulator in the form of a data-driven model is very helpful and reasonable to use.
The main objective of this study was to develop an emulator that could be used for short-term river level forecasting at Hanoi in the Red River Delta. A machine learning algorithm (SANN) and a deep learning algorithm (LSTM) were tested and compared with the observed data and with the results of the physical-based model MIKE 11 of the system. In addition, this research uses the Interactive Input Selection (IIS) algorithm to select the candidate inputs for the network.
2. Case Study and Methodology
2.1. Case Study
The Red River basin is the second biggest transboundary basin in Vietnam. Its area is about 169,000 km^{2}, covering parts of China, Laos and Vietnam before the river reaches the East Sea (Figure 1).
Figure 1. Red river basin from HydroSHEDS.
Vietnam’s part is located downstream in the Red-Thai Binh basin and accounts for about 51% of its total area. This area has a dense river network fed by three big tributaries: the Da River, the Thao River and the Lo-Gam River. Among other reservoirs located in this area, four big strategic reservoirs (Hoa Binh, Son La, Thac Ba, Tuyen Quang) have the role of managing the water resources for the entire delta.
In the delta, there are 13 irrigation districts and 11 provinces including Hanoi. Therefore, water resources are one of the biggest issues in this region. The water resources of the delta can be evaluated via the water level in the Red River at the Hanoi hydrological station (see Figure 2). The flow dynamics at this section are influenced mainly by the inflow from the upper part (from left to right in Figure 2: Hoa Binh Reservoir, Thao River, Thac Ba Reservoir, Lo River, Tuyen Quang Reservoir), the flow in the Thai Binh river system connected by the Duong River, and the tides from 9 estuaries of the East Sea.
2.2. Methodology
In this study, a Shallow Artificial Neural Network (SANN) and a Long Short-Term Memory (LSTM) network were applied to forecast the water levels at Hanoi on the Red River. The performance of these models was also compared with the results of the physical-based model MIKE 11, which has been set up for the whole Red River Delta.
Figure 2. The diagram of the Red River system in Vietnam part.
SANN is a simple machine-learning algorithm inspired by the structure and function of the human brain. It is composed of one input layer, one hidden layer and one output layer. Each layer contains a number of neurons (nodes), which are the core of the system. Each neuron in the hidden layer forms a linear combination of the inputs through the function:
${n}_{j}={\displaystyle {\sum}_{i=1}^{k}{w}_{ij}{x}_{i}}+{b}_{j}$ (1)
In which:

${x}_{i}$ is the i^{th} input in the input layer;

${n}_{j}$ is the j^{th} neuron of the hidden layer;

${w}_{ij}$ is the weight of the i^{th} input to the j^{th} neuron;

${b}_{j}$ is the bias of the j^{th} neuron.
Then, ${n}_{j}$ passes through an activation function that represents the nonlinear relationship between input and output and gives the estimated output. In this case, the tansig function has been applied.
$y=\text{tansig}\left(n\right)=\frac{2}{1+\mathrm{exp}\left(-2n\right)}-1$ (2)
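As a minimal sketch of Equations (1) and (2), the forward pass of a network with one hidden layer can be written in plain Python. The function names, the argument layout and the linear output layer are assumptions made for illustration; the text does not state how the output neuron combines the hidden activations.

```python
import math

def tansig(n):
    """Eq. (2): 2 / (1 + exp(-2n)) - 1, mathematically identical to tanh(n)."""
    return 2.0 / (1.0 + math.exp(-2.0 * n)) - 1.0

def sann_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer network (illustrative only).

    x        : list of k inputs
    w_hidden : k x m nested list, w_hidden[i][j] links input i to hidden neuron j
    b_hidden : list of m hidden-layer biases
    w_out    : list of m output weights (linear output layer is an assumption)
    b_out    : output bias
    """
    k, m = len(x), len(b_hidden)
    # Eq. (1): weighted sum of the inputs plus the bias, for each hidden neuron j
    n = [sum(w_hidden[i][j] * x[i] for i in range(k)) + b_hidden[j]
         for j in range(m)]
    # Eq. (2): nonlinear activation of each hidden neuron
    y = [tansig(n_j) for n_j in n]
    # Assumed linear read-out producing the single estimated output
    return sum(w_out[j] * y[j] for j in range(m)) + b_out
```

In the setting described later in the paper, `x` would hold the selected inputs and `m` the number of hidden neurons; the weights themselves come from training, which this sketch does not cover.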
In hydrology, an input is a sequence of values of a variable, and such a sequence is usually autocorrelated with its previous values. For this reason, SANN includes the previous values of the target sequence in the input vector to keep the memory of the sequence. However, this raises the challenge of optimizing the number of lagged time steps. That is why, in this research, Long Short-Term Memory (LSTM) was also tested and compared with SANN to evaluate the importance of long-term memory in hydrological sequences. LSTM is a deep learning algorithm that carries the memory of previous time steps of the input and output sequences. In the LSTM architecture, three gates, namely the input gate, the forget gate, and the output gate, work together in one LSTM cell to produce the output of the cell, as shown in Figure 3.
In this case, the gate activation function is the logistic sigmoid function:
$\sigma \left(x\right)=\frac{1}{1+\mathrm{exp}\left(-x\right)}$
and the output activation function is the hyperbolic tangent function
$c\left(x\right)=h\left(x\right)=\mathrm{tanh}(\; x\; )$
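To make the role of the three gates concrete, the following sketch computes one time step of a single-unit LSTM cell using the standard LSTM update equations, with the logistic sigmoid for the gates and the hyperbolic tangent for the cell candidate and the output. The function names and the parameter dictionary `p` are hypothetical; this is not the implementation used in the study.

```python
import math

def sigmoid(x):
    """Logistic sigmoid used by the input, forget and output gates."""
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM cell (illustrative only).

    x      : list of inputs at the current time step
    h_prev : hidden state from the previous time step
    c_prev : cell state from the previous time step
    p      : dict mapping 'i', 'f', 'g', 'o' to (w, r, b), where w is a list
             of input weights, r the recurrent weight and b the bias
    """
    def pre_activation(name):
        w, r, b = p[name]
        return sum(w_k * x_k for w_k, x_k in zip(w, x)) + r * h_prev + b

    i = sigmoid(pre_activation("i"))    # input gate: how much new information enters
    f = sigmoid(pre_activation("f"))    # forget gate: how much old cell state is kept
    g = math.tanh(pre_activation("g"))  # candidate value for the cell state
    o = sigmoid(pre_activation("o"))    # output gate: how much of the cell is exposed
    c = f * c_prev + i * g              # updated cell state (long-term memory)
    h = o * math.tanh(c)                # updated hidden state (cell output)
    return h, c
```

Calling `lstm_step` repeatedly over a daily sequence, passing each returned `(h, c)` into the next call, is what lets the cell carry information across time steps.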
Figure 3. LSTM architecture modified from Matlab documentation.
The number of input nodes was defined by an input selection algorithm, namely Interactive Input Selection (IIS) [17]. This algorithm chooses a first set of input candidates based on a nonlinear statistical measure of significance. The correlation of each candidate with the target is then tested by means of a Single-Input Single-Output model to choose a narrower set of inputs, which is in turn fed into a Multi-Input Single-Output model to select the final set of inputs for modeling.
To evaluate the performance of the algorithms, the following statistical indicators were used: Root Mean Square Error (RMSE), Mean Error (ME), Absolute Mean Error (AME), Maximum Absolute Error (MAE), and the coefficient of determination (R^{2}). In addition, with σ denoting the standard deviation of the recorded values, two further indicators were computed: the assurance level of the forecast (P), which is the share of forecasts that match the actual values within the allowable error Scp = 0.2σ, and the forecast bias, estimated as the ratio of RMSE to the standard deviation of the recorded values,
$\frac{S}{\sigma}$ , as shown in Table 1.
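The indicators listed above can be sketched as follows. Since the formulas of Table 1 are not reproduced in the text, several details here are assumptions: the error is taken as forecast minus observation, σ is computed as the population standard deviation of the observations, R^{2} is computed as the squared Pearson correlation, and P is expressed in percent. The function and key names are hypothetical.

```python
import math

def forecast_metrics(obs, pred, tol_factor=0.2):
    """Forecast indicators for paired observed and predicted values."""
    n = len(obs)
    err = [p - o for o, p in zip(obs, pred)]  # assumed sign: forecast - observed
    mean_obs = sum(obs) / n
    mean_pred = sum(pred) / n
    ss_obs = sum((o - mean_obs) ** 2 for o in obs)
    ss_pred = sum((p - mean_pred) ** 2 for p in pred)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    sigma = math.sqrt(ss_obs / n)             # std. deviation of the record (assumed population form)
    rmse = math.sqrt(sum(e * e for e in err) / n)
    s_cp = tol_factor * sigma                 # allowable error Scp = 0.2 * sigma
    hits = sum(1 for e in err if abs(e) <= s_cp)
    return {
        "RMSE": rmse,
        "ME": sum(err) / n,
        "AME": sum(abs(e) for e in err) / n,
        "MAE": max(abs(e) for e in err),
        "R2": cov ** 2 / (ss_obs * ss_pred),  # assumed: squared Pearson correlation
        "P": 100.0 * hits / n,                # percent of forecasts within Scp
        "S_over_sigma": rmse / sigma,
    }
```

For example, with observations `[1, 2, 3, 4]` and forecasts `[1, 2, 3, 5]`, three of the four forecasts fall within the allowable error, so P is 75%.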
2.3. Data Set
The data were collected from 1994, when the Hoa Binh reservoir started operation, to 2019. The Thac Ba reservoir came into operation in the 1970s, the Tuyen Quang reservoir in late 2008 and the Son La reservoir in late 2012. As a result, the water level has been very low since 2011 (Figure 4). The training dataset was therefore created carefully so that it includes “almost” all the situations that occur under the “current condition”. To generate the datasets, the last three years (2017-2019) were exchanged with “three more natural years” (1997-1999). We then chose the first 17 years for training, the next 5 years for validation and the last three years for testing the designed network.
Figure 4. The Hanoi water level (a) and discharge (b) from 1994 to 2019.
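The year exchange and the 17/5/3 split can be sketched as below. The function name and arguments are hypothetical, and the year list in the usage lines is an assumption: it starts in 1995 only so that the 17 + 5 + 3 split covers the list exactly, because the text does not say how the split maps onto the full record.

```python
def swap_and_split(years, block_a, block_b, n_train, n_val, n_test):
    """Exchange two blocks of years, then split the reordered list in sequence."""
    order = list(years)
    # Swap each year of block_a with the matching year of block_b
    for a, b in zip(block_a, block_b):
        ia, ib = order.index(a), order.index(b)
        order[ia], order[ib] = order[ib], order[ia]
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Hypothetical 25-year list; the start year is an assumption (see the note above)
years = range(1995, 2020)
train, val, test = swap_and_split(
    years, [2017, 2018, 2019], [1997, 1998, 1999], n_train=17, n_val=5, n_test=3
)
```

Under these assumptions the most recent years end up in the training set, while the test set consists of the three "more natural" years.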
Table 1. The evaluated indicators for forecast assessment.
3. Result and Discussion
3.1. Input Variable Selection
There are fifty-four input candidates for the system. They are the three releases from three strategic reservoirs in the upstream part of the system (Hoa Binh, Thac Ba and Tuyen Quang); the inflows from other “natural tributaries” such as the Thao, Lo, Pho Day, Cau, Thuong and Luc Nam rivers; the discharge in the delta at Son Tay and Ha Noi at the current and previous states; the tides from the East Sea via the 9 estuaries of the Red-Thai Binh system; and 40 lateral flows going into the system. Applying the input variable selection algorithm (IIS) [17] shows that the effect of the 40 lateral flows on the water level in Hanoi is minor. Therefore, as a first step, we chose the first 20 remaining candidates and fed them to LSTM to train the system. A further filter then reduced them to 7 candidates with a very high correlation with the output. These candidates are the inputs to the SANN algorithm. The evaluation of the IIS algorithm is postponed until the results of all AI algorithms tested on the system have been presented.
3.2. SANN
Based on the number of input candidates, the form of SANN applied in this study is the following
${h}_{t+1}^{HN}=f\left({h}_{t}^{HN},{R}_{t}^{HB},{R}_{t}^{TQ},{R}_{t}^{TB},{Q}_{t}^{YB},{Q}_{t}^{HY},{\tau}_{t}\right)$
where
${h}_{t}^{HN}$ : the water level at the Hanoi section of the Red River on day t;
${R}_{t}^{HB}$ : the release from the Hoa Binh reservoir on the Da River on day t;
${R}_{t}^{TQ}$ : the release from the Tuyen Quang reservoir on the Gam River on day t;
${R}_{t}^{TB}$ : the release from the Thac Ba reservoir on the Lo River on day t;
${Q}_{t}^{YB}$ : the discharge at Yen Bai on the Thao River on day t;
${Q}_{t}^{HY}$ : the discharge at Ham Yen on the Gam River on day t;
${\tau}_{t}$ : the maximum tide level at Balat on day t.
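The one-day-ahead form of the model implies that each training sample pairs the seven inputs on day t with the Hanoi water level on day t+1. A minimal sketch of that pairing is given below; the series names and the function are hypothetical labels for the variables defined above.

```python
# Hypothetical keys for the seven inputs, in the order of the model equation
INPUT_NAMES = ["h_HN", "R_HB", "R_TQ", "R_TB", "Q_YB", "Q_HY", "tide"]

def make_samples(series, target="h_HN"):
    """Build (X, y) pairs for one-day-ahead forecasting.

    series : dict mapping each name in INPUT_NAMES to a list of daily values,
             all of the same length
    Returns X, where X[t] holds the inputs on day t, and y, where y[t] is the
    target on day t + 1.
    """
    n_days = len(series[target])
    X = [[series[name][t] for name in INPUT_NAMES] for t in range(n_days - 1)]
    y = [series[target][t + 1] for t in range(n_days - 1)]
    return X, y
```

The last day of the record has no following observation, so a record of N days yields N - 1 samples.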
The total number of inputs and outputs is 8; therefore, 10 neurons were used in the hidden layer. The performance reported in Table 2 and Figure 5 shows that the model can predict the water level at the Hanoi section with a lead time of 1 day.
Figure 5. The Hanoi water level estimated by SANN compared with the observed one in the test dataset: (a) hydrograph and (b) scatter plot.
Table 2. The evaluated indicators of SANN model.
3.3. LSTM
Taking advantage of LSTM, 19 inputs were used. They are the releases from three reservoirs, two inflows from the Thao and Lo rivers, the water level at Hanoi at the previous time step, 9 tides, and three lateral flows to the Thai Binh river system, namely the inflows to the Cau River (Gia Bay gauging station), the Thuong River (Cau Son gauging station) and the Luc Nam River (Chu gauging station). LSTM can carry both long-term and short-term memory of previous states through the hidden units (neurons) to the current state. Therefore, it can take advantage of all available data and decide which information goes further and which is dropped at the forget gate. We used 30 neurons for the network with 19 inputs and one output. The final network was selected as the one from the training iteration with the lowest validation loss. This option gave a “perfect” validation result, as shown in Figure 6.
The training and testing results are better than those of SANN (Table 3 and Figure 7). Note that LSTM seems to be more sensitive to the dataset: changing the normalization method gives results ranging from “bad” to “good”. In this study, we recommend z-score normalization, which gave us the best LSTM network for predicting the water level at Hanoi.
Figure 6. The Hanoi water level estimated by LSTM compared with the observed one in the validation dataset: (a) hydrograph and (b) scatter plot.
Figure 7. The Hanoi water level estimated by LSTM compared with the observed one in the test dataset: (a) hydrograph and (b) scatter plot.
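Selecting the network from the iteration with the lowest validation loss amounts to keeping a copy of the best model seen so far during training. The loop below is a generic sketch of that idea; `train_step` and `val_loss` are hypothetical callables standing in for one training update and one evaluation on the validation set, and nothing here reflects the actual training software used in the study.

```python
import copy

def train_keep_best(model, train_step, val_loss, n_iter):
    """Run n_iter training iterations and return the model state with the
    lowest validation loss, together with its iteration index and loss."""
    best_loss = float("inf")
    best_state = None
    best_iter = -1
    for it in range(n_iter):
        train_step(model)        # one training update (modifies model in place)
        loss = val_loss(model)   # loss on the validation dataset
        if loss < best_loss:
            best_loss, best_iter = loss, it
            best_state = copy.deepcopy(model)  # snapshot of the best network so far
    return best_state, best_iter, best_loss
```

Because the snapshot is taken whenever the validation loss improves, later iterations that overfit the training data do not affect the network that is finally returned.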
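Z-score normalization rescales each variable to zero mean and unit standard deviation. The sketch below shows the transformation and its inverse; the function names are hypothetical, the population form of the standard deviation is assumed, and fitting the statistics on the training period only is a common practice that the text does not explicitly confirm.

```python
import math

def zscore_fit(values):
    """Return the mean and (population) standard deviation of a series."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, std

def zscore_apply(values, mean, std):
    """Normalize a series with previously fitted statistics."""
    return [(v - mean) / std for v in values]

def zscore_invert(z_values, mean, std):
    """Map normalized values (for example network outputs) back to original units."""
    return [z * std + mean for z in z_values]
```

In use, `zscore_fit` would be called once per variable on the training data, `zscore_apply` on the training, validation and test data, and `zscore_invert` on the predicted water levels.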
Table 3. The evaluated indicators of LSTM model.
3.4. Comparison and Discussion
The main purpose of this study is to compare three types of models: a physical-based model (MIKE 11), a machine learning model (SANN) and a deep learning model (LSTM). The results show that the best of the three is LSTM. Since LSTM can carry the states from previous steps, it has the advantage of simulating the current situation while looking back at the history. That is why the estimated water level matches the observed one almost perfectly over the whole simulation period. The second best is SANN, whose estimated profiles also match the recorded ones. However, due to the form of the network, it cannot carry memory from the distant past to the current situation. This leads to a delay of one time step at almost all high peaks, as shown in Figure 8.
Finally, the physical-based model MIKE 11 seems to simulate the system very well in recent years but not in the past (see the purple line in Figure 9). In the first years, from 1994 to 2008, MIKE 11 underestimates the water level. This is because the cross-section dataset collected in 2010 cannot represent the system in the past, which actually had a different set of cross sections. In this case study, after the big reservoirs went into operation, river bed incision occurred at a very high rate, which made the water level drop sharply. Therefore, the results of MIKE 11 in the years near 2010 are better and comparable with those of the other models.
Therefore, in similar situations, data-driven models, especially machine learning and deep learning models, have an advantage in simulating a changing system and give a better forecast, while the physical-based model needs more data and time to update its representation of the system for the prediction problem. After this test, we recommend using LSTM in hydrological forecasting, since it can exploit the nonlinear correlations among the dependent variables of a changing system.
Figure 8. Comparison between SANN and LSTM.
Figure 9. Comparison among LSTM, SANN and MIKE 11 versus observation.
4. Conclusions
The process of flow routing in a dense river delta always needs to be monitored and forecast for better management of water resources in such areas. However, this process is a complex problem for hydraulic modeling. Changes in the river bed and morphology degrade the performance of physical-based and conceptual models, as in the case analysed in this study, where R^{2} reached only 0.85 and the forecast accuracy level P only 23.6%. Recently, machine learning and deep learning have been applied to various simulation and prediction problems in hydrology. In this study, we tested the simple SANN and a popular deep learning algorithm, LSTM, to forecast the water level at the control station of the Red River system, the second biggest basin in Vietnam, and compared them against the results of the physical-based model MIKE 11. In general, the LSTM model is the best of the three, ahead of SANN and MIKE 11, owing to its long memory, which suits highly correlated systems such as those usually found in hydrology. However, the SANN model still performs well and much better than the physical-based model in this case study, where the upstream reservoir system changed the riverbeds downstream and caused the water level to fall over time. This is likely to happen in almost all regulated river basins in the world and poses a challenge to hydrologists who simulate or forecast the hydrological variables there. It is the advancement of AI-based techniques that has created a revolution in hydrology [18].
The results show that LSTM predicts correctly in over 99% of the test cases, and SANN in 98% of the training cases and 97% of the test cases, while MIKE 11 reaches only 23.5%. Other evaluation indicators such as RMSE, ME and AME of both the SANN and LSTM models are smaller than 10% of the prediction and mean; and R^{2} is nearly 1 in both models, compared with 0.85 for MIKE 11.
In conclusion, we recommend using LSTM and other deep-learning techniques in hydrology. LSTM can use all types of data and has the advantage of exploiting all the relevant information in them to train the network for a given problem. In future work, we would like to try it in more complex situations, such as forecasting with a longer lead time, and in other hydrological and hydraulic systems.
Acknowledgements
This research received funding from the Vietnam Ministry of Natural Resources and Environment under the project entitled “Applying Artificial Intelligence (AI) for flow routing to support water resources allocation in the river basin, testing on the Red River basin”, Code number: TNMT.2021.04.05. The author would like to express their sincere thanks to the anonymous reviewers for their helpful comments and review of the manuscript.