Asymptotic Evaluations of the Stability Index for a Markov Control Process with the Expected Total Discounted Reward Criterion

Abstract

In this work, a numerical estimate of the stability index is obtained for a consumption-investment control process with the discounted reward optimization criterion. Using explicit formulas for the optimal stationary policies and for the value functions, the stability index is calculated explicitly, and its asymptotic behavior as the discount coefficient approaches 1 is investigated through statistical techniques (using numerical experiments). The results define the conditions under which an approximate optimal stationary policy can be used to control the original process.

Martínez-Sánchez, J. (2021) Asymptotic Evaluations of the Stability Index for a Markov Control Process with the Expected Total Discounted Reward Criterion. American Journal of Operations Research, 11, 62-85. doi: 10.4236/ajor.2021.111004.

1. Introduction

In a standard way (see [1] [2] for definitions), let $M$ be a discrete-time Markov control process with infinite horizon (also called a Markov decision process) and let $\tilde{M}$ be its approximation. We use the performance criterion (objective function) known as the expected total discounted reward. Suppose that the optimal control problem for $\tilde{M}$ has a solution, that is, we can find an optimal policy $\tilde{f}$ for the approximate process $\tilde{M}$. If for some reason (some of these causes are discussed later) it is not possible to find an optimal policy for the original process $M$, we could use the policy $\tilde{f}$ to control $M$. Using such an approximation reduces the total discounted reward; this reduction is measured by the stability index $\Delta$ (see [3] [4] [5] for the definition). The stability index is important because it allows us to assess whether $\tilde{f}$ can reasonably be used to control the original process $M$.

Clearly, if the stability index is very high ($\Delta \to \infty$), it is not advisable to use the policy $\tilde{f}$ to control the process $M$; on the other hand, if the stability index is low ($\Delta \to 0$), the use of this approximation is valid.

In the available literature, the study and calculation of the stability index have been carried out from a theoretical standpoint in different ways: by applying contractive operators (see, for example, [6] [7] [8]); by using certain ergodicity conditions (see [9] [10] [11]); and by applying different probability metrics (see [12] for definitions of the various kinds of probability metrics). For example, the total variation metric is used in [9], the Kantorovich metric in [6], and the Prokhorov metric in [7] and [8].

The results obtained in all the papers mentioned above are an upper bound for the stability index, which is a function of certain parameters and some probabilistic metric, that is

$$\Delta \le K\,\mu(\cdot,\cdot), \tag{1}$$

where K is an explicit constant and μ is a certain probability metric.

Clearly, the discount factor $\alpha$ involved in the optimization criterion also enters the explicit constant $K$ in inequality (1). Our goal is to determine the behavior of the stability index as a function of $(1-\alpha)$ when the discount factor tends to 1 ($\alpha \to 1$).

Unlike the theoretical study of the stability index as presented in inequality (1), in this work, the stability index will be studied with a more applied perspective.

In this work, a Markov control process for consumption-investment is presented (with the expected total discounted reward criterion), for which the stability index is obtained explicitly; we then study its asymptotic behavior when the discount factor tends to 1. These asymptotic evaluations of the stability index are carried out using statistical techniques; as mentioned above, our goal is to measure the sensitivity of the stability index as a function of $(1-\alpha)$ when $\alpha \to 1$.

To achieve the above, instead of using inequality (1), we will use statistical techniques to estimate the following model:

$$\Delta(1-\alpha) = W(1-\alpha)^{\kappa}, \quad W \in \mathbb{R},\ \kappa \le -1, \tag{2}$$

where $W$ and $\kappa$ are the (unknown) parameters of the model, which can be estimated from simulated data over values of the discount factor $\alpha$ using simple linear regression analysis. From Equation (2), we will say that the stability index is of order $\kappa$ with respect to $(1-\alpha)$, and we will express this as $\Delta(1-\alpha) \sim M(1-\alpha)^{\kappa}$.

Now, if $\alpha \to 1$, then for large values of $|\kappa|$ the stability index given in Equation (2) will clearly tend to increase rapidly, which indicates that it is not advisable to use the policy $\tilde{f}$ to control the original process $M$.

The numerical experiments carried out in this work aim to estimate the sensitivity $\kappa$ of the stability index given in Equation (2) when $\alpha \to 1$. These asymptotic evaluations give us information on whether the approximate policy can be used to control the original process. In the rest of this document, we refer to this sensitivity ($\kappa$) and to the order of $\Delta$ interchangeably.

As far as our literature review goes, we found no numerical studies or simulations that use statistical techniques to evaluate the order of the stability index with respect to the discount factor.

The results obtained in this work with the simple linear regression model depend on the value of a parameter of the discounted reward function. Nevertheless, they clearly show that when $\alpha \to 1$ the stability index, as a function of $(1-\alpha)$, tends to increase rapidly, so it is not recommended to use an approximate optimal policy $\tilde{f}$ to control the original model $M$. The results also suggest that the choice of the parameter of the reward function and of the discount factor is very important for validating the use of the optimal policy $\tilde{f}$ to control $M$. Among the estimates of $\kappa$, the largest value was $-1.75$, i.e., $\Delta(1-\alpha) \sim M(1-\alpha)^{-1.75}$, although from Equation (2) it would seem natural that the best possible order should be at most $M(1-\alpha)^{-1}$.

Finally, we would like to comment on the reasons why we propose the model given in (2) for the asymptotic study of the stability index.

In [6] [7] and [8], the stability index is studied under the expected total discounted cost criterion, and the results are stability inequalities such as the one given in (1). Furthermore, in all cases the constant $K$ in inequality (1) is an explicit function that is inversely proportional to a power of $(1-\alpha)$; for example, in [6] it is found that $\Delta \sim M(1-\alpha)^{-2}$ using the Kantorovich metric, in [7] it is obtained that $\Delta \sim M(1-\alpha)^{-2}$ with the total variation metric, and [8] shows a result in which $\Delta \sim M(1-\alpha)^{-3}$ using the Prokhorov metric. So, given that this work studies a control process with the expected total discounted reward criterion, and based on the aforementioned results, it seems natural to propose the model given in Equation (2) for the asymptotic evaluation of the stability index. In [9] [10] and [11] there are also stability inequalities like inequality (1), but for the average-cost criterion; in these papers, however, the stability index has an order of the form $M(1-\delta)^{\gamma}$, where $\delta$ is the ergodicity parameter and $\gamma$ is the exponent.

This work is organized as follows. Section 2 gives a brief description of Markov control models (also called Markov decision processes) as well as some well-known results for the discounted optimal control problem with bounded reward; Section 2.1 presents the problem of estimating the stability index as well as the assumptions that guarantee the existence of optimal solutions for the original process ($M$) and the approximate process ($\tilde{M}$). Section 3 presents the control process under study (consumption-investment), Section 3.1 derives its stability index explicitly, and Section 3.2 presents the results of the asymptotic evaluation of the stability index. Finally, Section 4 presents the conclusions of this work as well as some proposals for future research.

2. The Discounted Reward Criterion

For a topological space $(X, \tau)$, $\mathcal{B}(X)$ denotes the Borel σ-algebra generated by the topology $\tau$, and measurability will always mean Borel measurability. Moreover, $M(X)$ is the class of measurable functions on $X$, whereas $M_b(X)$ is the subspace of bounded measurable functions endowed with the supremum norm $\|u\| = \sup_{x \in X} |u(x)|$, $u \in M_b(X)$. The subspace of bounded continuous functions is denoted by $C_b(X)$. For a subset $B \subseteq X$, $I_B$ stands for the indicator function of $B$, i.e., $I_B(x) = 1$ for $x \in B$ and $I_B(x) = 0$ for $x \notin B$. A Borel space $Y$ is a measurable subset of a complete separable metric space endowed with the inherited metric.

Let

$$M = (X, A, \{A(x) : x \in X\}, r, Q) \tag{3}$$

be the standard Markov control model (see [1] [13] for definitions). It is thought of as a model of a controlled stochastic process $\{(x_n, a_n)\}$, where the state process $\{x_n\}$ takes values in the Borel space $X$ and the control process $\{a_n\}$ takes values in the Borel space $A$. The controlled process evolves as follows: at each time $n \in \mathbb{N}_0 = \{0, 1, 2, \dots\}$, the controller observes the system in some state $x_n = x$ and chooses a control $a_n = a$ from the admissible control set $A(x)$, which is assumed to be a Borel subset of $A$. It is also assumed that the set of admissible pairs $\mathbb{K} := \{(x, a) : x \in X, a \in A(x)\}$ belongs to $\mathcal{B}(X \times A)$. Then the controller receives a reward $r(x, a)$, where $r$ is a real-valued Borel measurable function defined on $\mathbb{K}$. Moreover, the controlled system moves to a new state $x_{n+1} = x'$ according to the distribution $Q(\cdot \mid x, a)$, where $Q$ is a stochastic kernel on $X$ given $\mathbb{K}$; that is, $Q(\cdot \mid x, a)$ is a probability measure on $X$ for each pair $(x, a) \in \mathbb{K}$, and $Q(B \mid \cdot, \cdot)$ is a Borel measurable function on $\mathbb{K}$ for each Borel subset $B$ of $X$. Then the controller chooses a new control $a_{n+1} = a' \in A(x')$, receiving a reward $r(x', a')$, and so on.

Let $\mathbb{H}_n := \mathbb{K}^n \times X$ for $n \in \mathbb{N}$ and $\mathbb{H}_0 = X$. Observe that a generic element of $\mathbb{H}_n$ has the form $h_n = (x_0, a_0, x_1, a_1, \dots, x_{n-1}, a_{n-1}, x_n)$, where $(x_k, a_k) \in \mathbb{K}$ for $k = 0, 1, \dots, n-1$ and $x_n \in X$. A control policy is a sequence $\pi = \{\pi_n\}$, where $\pi_n(\cdot \mid \cdot)$ is a stochastic kernel on $A$ given $\mathbb{H}_n$ satisfying the constraint $\pi_n(A(x_n) \mid h_n) = 1$ for all $h_n \in \mathbb{H}_n$, $n \in \mathbb{N}_0$. Now, let $\mathbb{F}$ be the class of all measurable functions $f : X \to A$ such that $f(x) \in A(x)$ for each $x \in X$. A control policy $\pi = \{\pi_n\}$ is said to be (deterministic) stationary if there exists $f \in \mathbb{F}$ such that the measure $\pi_n(\cdot \mid x)$ is concentrated at $f(x)$ for each $x \in X$ and $n \in \mathbb{N}_0$. Following a standard convention, the stationary policy $\pi$ is identified with the selector $f$. The class of all policies is denoted by $\Pi$, and the class of all stationary policies is identified with the class $\mathbb{F}$.

Let $\Omega := (X \times A)^{\infty}$ be the canonical sample space and $\mathcal{F}$ the product σ-algebra. For each policy $\pi = \{\pi_n\} \in \Pi$ and “initial” state $x_0 \in X$ there exists a probability measure $P_x^{\pi}$ on the measurable space $(\Omega, \mathcal{F})$ that governs the evolution of the controlled process $\{(x_n, a_n)\}$.

The expected total discounted reward criterion is given as

$$R_\alpha(x, \pi) := E_x^{\pi} \sum_{t=0}^{\infty} \alpha^t r(x_t, a_t), \tag{4}$$

where the discount factor $\alpha \in (0, 1)$ is fixed and $E_x^{\pi}$ denotes the expectation operator with respect to the probability measure $P_x^{\pi}$.
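To make criterion (4) concrete, the following minimal Python sketch evaluates a truncated discounted sum along one fixed reward trajectory and compares the truncation error with the tail bound $b\,\alpha^N/(1-\alpha)$, which holds whenever $|r| \le b$. The function name `discounted_sum`, the constant reward trajectory and the bound `b = 1` are illustrative assumptions, not part of the paper.

```python
def discounted_sum(rewards, alpha):
    """Return sum_t alpha**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    weight = 1.0  # holds alpha**t, updated incrementally
    for reward in rewards:
        total += weight * reward
        weight *= alpha
    return total


alpha = 0.9
b = 1.0            # assumed bound on |r|
N = 50             # truncation horizon
rewards = [b] * N  # hypothetical trajectory: constant reward b

partial = discounted_sum(rewards, alpha)
limit = b / (1.0 - alpha)                  # infinite-horizon sum for constant reward
tail_bound = b * alpha**N / (1.0 - alpha)  # bound on the discarded tail
print(partial, limit, tail_bound)
```

For a constant reward the truncation error equals the tail bound, which shows why the bound deteriorates as $\alpha \to 1$: the factor $1/(1-\alpha)$ grows without limit.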

The optimal control problem is to find a control policy $\pi^* \in \Pi$ (if it exists) such that

$$R_\alpha^*(x) := R_\alpha(x, \pi^*) = \sup_{\pi \in \Pi} R_\alpha(x, \pi), \tag{5}$$

for all $x \in X$.

The policy $\pi^*$ is called a discounted optimal policy, while $R_\alpha^*$ is called the discounted optimal value function. Later we will impose conditions that guarantee the finiteness of the value function $R_\alpha^*$ and the existence of an optimal policy $\pi^*$.

2.1. The Stability Index and the Problem of Its Estimation

The problem of (quantitative) stability estimation (“continuity” or “robustness”) arises when there is uncertainty about the stochastic kernel $Q(\cdot \mid x, a)$ defined in the standard Markov control model $M$ (see model (3)). The “original” task of the controller consists in searching for the optimal policy $\pi^*$ (see Equation (5)). In many applications this task cannot be fulfilled directly for one of the following reasons:

1) Frequently $Q(\cdot \mid x, a)$ or some of its parameters are unknown to the controller, and this stochastic kernel is estimated using statistical procedures. These estimates generate another stochastic kernel $\tilde{Q}(\cdot \mid x, a)$, which is interpreted as an accessible approximation to the unknown $Q(\cdot \mid x, a)$.

2) There are situations where $Q(\cdot \mid x, a)$ is known but too complicated to have any hope of solving the control policy optimization problem. In such cases, $Q(\cdot \mid x, a)$ is sometimes replaced by a “theoretical approximation” $\tilde{Q}(\cdot \mid x, a)$, which results in a controllable process with a simpler structure.

We assume that $Q(\cdot \mid x, a)$ is not available to the controller and that it is replaced by a given approximating stochastic kernel $\tilde{Q}(B \mid x, a)$, $x \in X$, $a \in A(x)$ and $B \in \mathcal{B}(X)$. The “approximating” Markov process governed by $\tilde{Q}$ will be denoted by $\{\tilde{x}_t\} \equiv \{\tilde{x}_t, t = 0, 1, \dots\}$, i.e., let

$$\tilde{M} = (X, A, \{A(\tilde{x}) : \tilde{x} \in X\}, \tilde{r}, \tilde{Q}) \tag{6}$$

be the “approximate” model for the Markov control model given in (3).

Replacing $x_t$ by $\tilde{x}_t$ in Equation (4), we get $\tilde{R}_\alpha(x, \pi)$, the discounted reward criterion for the approximate process $\tilde{M}$. Now, suppose that it is possible (at least theoretically) to find an optimal policy $\tilde{\pi}^*$ for the process $\tilde{M}$, i.e.,

$$\tilde{R}_\alpha^*(\tilde{x}) := \tilde{R}_\alpha(\tilde{x}, \tilde{\pi}^*) = \sup_{\pi \in \Pi} \tilde{R}_\alpha(\tilde{x}, \pi). \tag{7}$$

The control policy $\tilde{\pi}^*$ defined in Equation (7) is used as an approximation to the inaccessible optimal policy $\pi^*$ (assuming it exists). In other words, the policy $\tilde{\pi}^*$ is used to control the original process $M$ instead of the policy $\pi^*$.

The reduction in reward caused by such an approximation is measured by the following stability index (see [3] [4] [5]):

$$\Delta R_\alpha(x) := R_\alpha(x, \pi^*) - R_\alpha(x, \tilde{\pi}^*) \ge 0, \quad x \in X. \tag{8}$$

The stability estimation problem consists of searching for inequalities of the following type:

$$\Delta R_\alpha(x) \le K(x)\,\psi[\mu(Q, \tilde{Q})], \quad x \in X, \tag{9}$$

where $K(x)$ is a function with explicitly calculated values; $\psi : \mathbb{R}_+ \to \mathbb{R}_+$ is a real continuous function such that $\psi(s) \to 0$ as $s \to 0$; and $\mu$ is a probability metric on the space of probability measures.

The results obtained in [6] - [11] provide inequalities of the form (9).

In this paper, we consider a particular example of a Markov control process for which optimal stationary policies can be calculated explicitly. The explicit form of these stationary policies $\pi^*$ (for the “original” process $M$) and $\tilde{\pi}^*$ (for the “approximate” process $\tilde{M}$) makes it possible to calculate the stability index $\Delta R_\alpha$ explicitly. The goal of this work is to study the asymptotic behavior of $\Delta R_\alpha$ when $\alpha \to 1$. Using direct calculations and numerical approximations, we will show that the stability index (see Equation (8)) can be expressed as a function that depends on $(1-\alpha)$ and has order $\kappa$, i.e.,

$$\Delta R_\alpha \equiv \Delta R_\alpha(1-\alpha) = W(1-\alpha)^{\kappa}, \quad W \in \mathbb{R},\ \kappa \le -1, \tag{10}$$

where the (unknown) parameters $W$ and $\kappa$ will be estimated using statistical techniques; note the analogy with Equation (2).

To finish this section, we state the assumptions that guarantee the existence of the stationary optimal control policies ($\pi^*$ and $\tilde{\pi}^*$) for the optimal control problems given in Equations (5) and (7), respectively:

Assumption 2.1. (Existence)

1) The function $r(\cdot, \cdot)$ is bounded by a constant $b > 0$;

2) $A(x)$ is a non-empty compact subset of $A$ for each $x \in X$ and the mapping $x \mapsto A(x)$ is continuous;

3) $r(\cdot, \cdot)$ is a continuous function on $\mathbb{K}$;

4) $Q(\cdot \mid \cdot, \cdot)$ is weakly continuous on $\mathbb{K}$, that is, the mapping

$$(x, a) \mapsto \int_X u(y)\, Q(dy \mid x, a) \tag{11}$$

is continuous for each function $u \in C_b(X)$.

The second set of assumptions guarantees that the discounted reward criterion is both well defined and finite.

Assumption 2.2. (Finiteness)

The following holds for each $x \in X$:

1) The function $r(\cdot, \cdot)$ is bounded by a constant $b > 0$;

2) $A(x)$ is a non-empty compact subset of $A$;

3) $r(x, \cdot)$ is a continuous function on $A(x)$;

4) $Q(\cdot \mid x, \cdot)$ is strongly continuous on $A(x)$, that is, the mapping

$$a \mapsto \int_X u(y)\, Q(dy \mid x, a) \tag{12}$$

is continuous for each function $u \in M_b(X)$.

For more information see [2] [14] [15]. Now, let $C(X)$ denote either $C_b(X)$ or $M_b(X)$, depending on whether Assumption 2.1 or 2.2 is being used, respectively. Then, under either Assumption 2.1 or 2.2, the dynamic programming operator

$$T u(x) := \sup_{a \in A(x)} \left[ r(x, a) + \alpha \int_X u(y)\, Q(dy \mid x, a) \right], \quad x \in X, \tag{13}$$

is a contraction operator from the Banach space $(C(X), \|\cdot\|)$ into itself with contraction factor $\alpha$ (see [2]).

Remark 2.3. Under Assumptions 2.1 and 2.2, there is a solution to the optimal control problem given in Equation (5); which is unique and the value function does not depend on the initial state of the process. For a proof, see [2] or [13].
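The objects of this section can be illustrated on a finite toy model. The sketch below is a minimal Python example, assuming a hypothetical two-state, two-action model: it iterates the operator $T$ of Equation (13) to obtain optimal selectors for an "original" kernel `Q` and an "approximate" kernel `Q_tilde`, then evaluates the approximate selector on the original model to obtain the stability index of Equation (8). All numbers and function names are invented for illustration, and the reward is taken to be the same in both models.

```python
def bellman_operator(u, r, Q, alpha):
    """Apply T u(x) = max_a [ r[x][a] + alpha * sum_y Q[x][a][y] * u[y] ].

    Returns the updated function T u and a maximizing selector.
    """
    new_u, selector = [], []
    for x in range(len(r)):
        values = [r[x][a] + alpha * sum(q * uy for q, uy in zip(Q[x][a], u))
                  for a in range(len(r[x]))]
        best = max(range(len(values)), key=values.__getitem__)
        new_u.append(values[best])
        selector.append(best)
    return new_u, selector


def value_iteration(r, Q, alpha, tol=1e-10):
    """Iterate T from u = 0 until successive iterates differ by less than tol."""
    u = [0.0] * len(r)
    while True:
        new_u, selector = bellman_operator(u, r, Q, alpha)
        if max(abs(a - b) for a, b in zip(new_u, u)) < tol:
            return new_u, selector
        u = new_u


def evaluate_policy(selector, r, Q, alpha, tol=1e-10):
    """Discounted reward of a stationary policy under kernel Q."""
    u = [0.0] * len(r)
    while True:
        new_u = [r[x][selector[x]]
                 + alpha * sum(q * uy for q, uy in zip(Q[x][selector[x]], u))
                 for x in range(len(r))]
        if max(abs(a - b) for a, b in zip(new_u, u)) < tol:
            return new_u
        u = new_u


# Hypothetical toy data: r[x][a] is the reward, Q[x][a] the next-state law.
r = [[1.0, 0.0], [2.0, 0.5]]
Q = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
Q_tilde = [[[0.6, 0.4], [0.3, 0.7]], [[0.7, 0.3], [0.2, 0.8]]]
alpha = 0.9

v_star, f_star = value_iteration(r, Q, alpha)         # optimal for the original model
_, f_tilde = value_iteration(r, Q_tilde, alpha)       # optimal for the approximation
v_tilde_on_M = evaluate_policy(f_tilde, r, Q, alpha)  # approximate policy on the original
delta = [vs - vt for vs, vt in zip(v_star, v_tilde_on_M)]
print(f_star, f_tilde, delta)
```

Because `f_star` is optimal for the original model, every entry of `delta` is non-negative up to the iteration tolerance, in agreement with Equation (8).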

3. A Markov Control Consumption-Investment Process and Its Approximation

This example is presented in [1]. Consider the following Markov control process:

Let $X = [0, \infty)$; $A = [0, \infty)$; $A(x) = [0, x]$, $x \in X$. The dynamics of the “original” process ($M$) is given by:

$$x_t = a_t \xi_t, \quad t = 1, 2, \dots; \tag{14}$$

and for the “approximate” process ($\tilde{M}$)

$$\tilde{x}_t = \tilde{a}_t \tilde{\xi}_t, \quad t = 1, 2, \dots; \tag{15}$$

where $\{\xi_t, t \ge 1\}$ and $\{\tilde{\xi}_t, t \ge 1\}$ are two sequences of independent and identically distributed (i.i.d.) non-negative random variables with distributions $F_\xi$ and $F_{\tilde{\xi}}$, respectively. Clearly, $F_\xi$ and $F_{\tilde{\xi}}$ belong to the space of all distributions on $(X, \mathcal{B}(X))$.

In this model, $x_{t-1}$ is interpreted as the current capital. The amount $a_t \in [0, x_{t-1}]$ represents what is invested in assets (such as stocks, bonds, etc.), which generate a profit or loss given by $a_t \xi_t$. The rest of the capital, $x_{t-1} - a_t$, is dedicated to consumption, and the satisfaction (or benefit) from this consumption is measured by the utility function $(x_{t-1} - a_t)^p$, where $0 < p < 1$ is a given parameter.

The reward function per unit of time is given by

$$r(x_{t-1}, a_t) = (x_{t-1} - a_t)^p, \quad t = 1, 2, \dots; \quad 0 < p < 1. \tag{16}$$

Assumption 3.1. (Only for this example)

The i.i.d. random variables $\{\xi_t\}_{t \ge 1}$ and $\{\tilde{\xi}_t\}_{t \ge 1}$ given in Equations (14) and (15), respectively, satisfy the following (for details, see [1]):

$$\lambda := E\xi^p < \frac{1}{\alpha}; \qquad \tilde{\lambda} := E\tilde{\xi}^p < \frac{1}{\alpha}. \tag{17}$$

Now, for an “initial” state $x \in [0, \infty)$, the optimal control problem (see Equation (5)) for this Markov control consumption-investment process is

$$R_\alpha(x, \pi^*) := \sup_{\pi \in \Pi} E_x^{\pi} \sum_{t=1}^{\infty} \alpha^{t-1} (x_{t-1} - a_t)^p, \tag{18}$$

and analogously, for the “approximate” process, we have

$$\tilde{R}_\alpha(\tilde{x}, \tilde{\pi}^*) := \sup_{\pi \in \Pi} E_{\tilde{x}}^{\pi} \sum_{t=1}^{\infty} \alpha^{t-1} (\tilde{x}_{t-1} - \tilde{a}_t)^p, \tag{19}$$

where $\tilde{x} \in [0, \infty)$ is an “initial” state for the “approximate” process.

Under these conditions, it is shown in [1] that the processes given in Equations (14) and (15) satisfy both Assumptions 2.1 and 2.2 and that the following hold:

1) The optimal stationary policy for Equation (18) is the selector

$$f(x) = (\alpha\lambda)^{\frac{1}{1-p}}\, x, \quad x \in [0, \infty). \tag{20}$$

2) The value function given in Equation (18) is

$$R_\alpha(x, f) = \frac{1}{\left[1 - (\alpha\lambda)^{\frac{1}{1-p}}\right]^{1-p}}\, x^p, \quad x \in [0, \infty). \tag{21}$$

3) The optimal stationary policy for Equation (19) is the selector

$$\tilde{f}(\tilde{x}) = (\alpha\tilde{\lambda})^{\frac{1}{1-p}}\, \tilde{x}, \quad \tilde{x} \in [0, \infty). \tag{22}$$
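As a sanity check of the closed forms (20) and (21), the following minimal Python sketch verifies numerically that they satisfy the optimality equation $V(x) = \max_{0 \le a \le x} [(x-a)^p + \alpha E V(a\xi)]$, using $E(a\xi)^p = a^p\lambda$. The function names, the grid search and the parameter values are illustrative assumptions, not part of the source.

```python
def investment_fraction(alpha, lam, p):
    """k = (alpha * lam)**(1/(1-p)): fraction of capital invested under (20)."""
    return (alpha * lam) ** (1.0 / (1.0 - p))


def optimal_investment(x, alpha, lam, p):
    """Selector of Equation (20)."""
    return investment_fraction(alpha, lam, p) * x


def value_function(x, alpha, lam, p):
    """Value function of Equation (21)."""
    k = investment_fraction(alpha, lam, p)
    return x**p / (1.0 - k) ** (1.0 - p)


def bellman_rhs(x, a, alpha, lam, p):
    """(x - a)^p + alpha * E[V(a * xi)], with E[(a * xi)^p] = a^p * lam."""
    c = value_function(1.0, alpha, lam, p)  # V(y) = c * y^p
    return (x - a) ** p + alpha * lam * c * a**p


# Illustrative values satisfying alpha * lam < 1.
alpha, lam, p, x = 0.9, 1.0, 0.5, 2.0
grid = [x * i / 10000 for i in range(10001)]  # admissible investments in [0, x]
a_best = max(grid, key=lambda a: bellman_rhs(x, a, alpha, lam, p))

print(a_best, optimal_investment(x, alpha, lam, p))
print(bellman_rhs(x, a_best, alpha, lam, p), value_function(x, alpha, lam, p))
```

The grid maximizer should agree with the selector (20) up to the grid spacing, and the maximized right-hand side should agree with the value function (21).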

Next, we calculate the stability index for this control process explicitly; it will be used to perform the asymptotic evaluations. The derivation is given in the next section.

3.1. Explicit Calculation of the Stability Index for the Markov Control Consumption-Investment Process

In this section, the stability index ($\Delta R_\alpha$) is calculated explicitly for the consumption-investment control process presented in the previous section. As mentioned in the introduction, the expression that we find for the stability index is a function of the parameters $p$ and $\epsilon$, where $\epsilon$ measures the approximation between the probability distributions $F_\xi$ and $F_{\tilde{\xi}}$ (see Equations (14) and (15)), while $p$ is the parameter of the reward function (see Equation (16)).

In economics, this parameter p is associated with elasticity, that is, elasticity measures the percentage change in the consumer’s utility in response to percentage changes in the consumer’s money supply (for more details, see [16] or [17] ). For this reason, it is important to measure its effect on the asymptotic behavior of the stability index.

From Equation (16), the possible values of the parameter $p$ lie in the interval $0 < p < 1$.

Our goal is to calculate asymptotic evaluations of the stability index when $\epsilon \to 0$ (which implies that $F_{\tilde{\xi}}$ gets closer to $F_\xi$) and for values of $p$ across its range; that is, we are interested in $p \to 0$, $p \to \frac{1}{2}$ and $p \to 1$.

To calculate the stability index, we take the “initial” state $x = \tilde{x} = 1$ and the following distribution functions to measure the effect of the shock on the processes:

Assumption 3.2. (Only for this example)

We assume that the random variables in processes (14) and (15) have exponential distributions with parameters $\theta$ and $\tilde{\theta}$, respectively, i.e., $\xi \sim F_\xi \equiv \exp(\theta)$ and $\tilde{\xi} \sim F_{\tilde{\xi}} \equiv \exp(\tilde{\theta})$ with $\tilde{\theta} = \theta(1-\epsilon)$, where the value of $\epsilon$, $0 < \epsilon < 1$, measures the approximation between the two distributions.

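Under Assumptions 3.1 and 3.2, we have

$$\lambda := E\xi_1^p = \int_0^{\infty} \xi^p \left[ \frac{e^{-\xi/\theta}}{\theta} \right] d\xi,$$

and after some direct calculations,

$$\lambda = \theta^p\, \Gamma(p+1). \tag{23}$$

Similarly, for the perturbed random variable, we have

$$\tilde{\lambda} = \tilde{\theta}^p\, \Gamma(p+1),$$

and since $\tilde{\theta} = \theta(1-\epsilon)$, it follows from the above equality that

$$\tilde{\lambda} = \lambda (1-\epsilon)^p. \tag{24}$$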
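The moment calculation can be checked numerically. The minimal Python sketch below uses the standard identity $E\xi^p = \theta^p\,\Gamma(p+1)$ for an exponential variable with mean $\theta$ (density $e^{-\xi/\theta}/\theta$), compares it with a Monte Carlo estimate, and confirms the ratio in Equation (24). The parameter values, the sample size and the seed are arbitrary illustrative choices.

```python
import math
import random


def moment_closed_form(theta, p):
    """E[xi^p] for an exponential variable with mean theta."""
    return theta**p * math.gamma(p + 1.0)


def moment_monte_carlo(theta, p, n, rng):
    """Monte Carlo estimate of E[xi^p]; expovariate takes the rate 1/theta."""
    return sum(rng.expovariate(1.0 / theta) ** p for _ in range(n)) / n


theta, p, eps = 2.0, 0.5, 0.2  # illustrative values
rng = random.Random(0)          # fixed seed for reproducibility

lam = moment_closed_form(theta, p)
lam_mc = moment_monte_carlo(theta, p, 200_000, rng)
lam_tilde = moment_closed_form(theta * (1.0 - eps), p)

print(lam, lam_mc)
print(lam_tilde, lam * (1.0 - eps) ** p)
```

The ratio $\tilde{\lambda}/\lambda = (1-\epsilon)^p$ does not depend on the gamma factor, which is why only $p$ and $\epsilon$ enter the perturbation term once $\theta$ is chosen so that $\lambda = 1$.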

Next, the stability index is calculated.

From Equation (8) we have

$$\Delta R_\alpha(1) := R_\alpha(1, f) - R_\alpha(1, \tilde{f}) \ge 0. \tag{25}$$

The first term on the right side of Equation (25) is given by Equation (21) with $x = 1$. Next we calculate the second term on the right side of Equation (25). To do this, we substitute the optimal policy of the approximate process given in Equation (22), with $\tilde{x} = 1$, into the reward criterion of the “original” model given in Equation (18), and we have

$$R_\alpha(1, \tilde{f}) = E_1^{\tilde{f}} \sum_{t=1}^{\infty} \alpha^{t-1} (\tilde{x}_{t-1} - \tilde{a}_t)^p.$$

The above equation represents the discounted reward obtained when the trajectory of the “original” process given in Equation (14) is controlled by the optimal policy obtained from the “approximate” process given in Equation (15) and the “initial” state is $x = 1$.

Now, since $a_t \equiv f(x_{t-1})$ under a stationary policy (see [1] for details), so that here $\tilde{a}_t = \tilde{f}(\tilde{x}_{t-1})$, we have

$$R_\alpha(1, \tilde{f}) = \sum_{t=1}^{\infty} \alpha^{t-1} E_1^{\tilde{f}} \left( \tilde{x}_{t-1} - (\alpha\tilde{\lambda})^{\frac{1}{1-p}}\, \tilde{x}_{t-1} \right)^p,$$

and finally

$$R_\alpha(1, \tilde{f}) = \left[ 1 - (\alpha\tilde{\lambda})^{\frac{1}{1-p}} \right]^p \sum_{t=0}^{\infty} \alpha^t E_1^{\tilde{f}}\, \tilde{x}_t^p. \tag{26}$$

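Now, the evolution of the approximate process (see Equation (15)) is

$$\tilde{x}_t = \tilde{a}_t \tilde{\xi}_t,$$

so, with $\tilde{x}_0 = \tilde{x} = 1$,

$$\tilde{x}_1 = \tilde{a}_1 \tilde{\xi}_1 = (\alpha\tilde{\lambda})^{\frac{1}{1-p}}\, \tilde{x}\, \tilde{\xi}_1 = (\alpha\tilde{\lambda})^{\frac{1}{1-p}}\, \tilde{\xi}_1,$$

$$\tilde{x}_2 = \tilde{a}_2 \tilde{\xi}_2 = (\alpha\tilde{\lambda})^{\frac{1}{1-p}}\, \tilde{x}_1 \tilde{\xi}_2 = (\alpha\tilde{\lambda})^{\frac{2}{1-p}}\, \tilde{\xi}_1 \tilde{\xi}_2,$$

$$\tilde{x}_t = \tilde{a}_t \tilde{\xi}_t = (\alpha\tilde{\lambda})^{\frac{1}{1-p}}\, \tilde{x}_{t-1} \tilde{\xi}_t = (\alpha\tilde{\lambda})^{\frac{t}{1-p}}\, \tilde{\xi}_1 \tilde{\xi}_2 \cdots \tilde{\xi}_t.$$

If we raise the last equality to the power $p$, we have

$$\tilde{x}_t^p = (\alpha\tilde{\lambda})^{\frac{pt}{1-p}}\, \tilde{\xi}_1^p\, \tilde{\xi}_2^p \cdots \tilde{\xi}_t^p.$$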

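Taking the expected value on both sides of the above equality, and since the random variables are i.i.d.,

$$E_1^{\tilde{f}}\, \tilde{x}_t^p = (\alpha\tilde{\lambda})^{\frac{pt}{1-p}}\, E_1^{\tilde{f}} \left[ \tilde{\xi}_1^p\, \tilde{\xi}_2^p \cdots \tilde{\xi}_t^p \right],$$

$$E_1^{\tilde{f}}\, \tilde{x}_t^p = (\alpha\tilde{\lambda})^{\frac{pt}{1-p}}\, E_1^{\tilde{f}}(\tilde{\xi}_1^p)\, E_1^{\tilde{f}}(\tilde{\xi}_2^p) \cdots E_1^{\tilde{f}}(\tilde{\xi}_t^p).$$

Now, by the definition of $\tilde{\lambda}$ in (17),

$$E_1^{\tilde{f}}\, \tilde{x}_t^p = (\alpha\tilde{\lambda})^{\frac{pt}{1-p}}\, \tilde{\lambda}^t,$$

$$E_1^{\tilde{f}}\, \tilde{x}_t^p = \alpha^{\frac{p}{1-p} t}\, \tilde{\lambda}^{\frac{1}{1-p} t}. \tag{27}$$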

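Substituting Equation (27) in Equation (26) and performing some direct calculations, we have

$$R_\alpha(1, \tilde{f}) = \left[ 1 - (\alpha\tilde{\lambda})^{\frac{1}{1-p}} \right]^p \sum_{t=0}^{\infty} \alpha^t\, \alpha^{\frac{p}{1-p} t}\, \tilde{\lambda}^{\frac{1}{1-p} t},$$

$$R_\alpha(1, \tilde{f}) = \left[ 1 - (\alpha\tilde{\lambda})^{\frac{1}{1-p}} \right]^p \sum_{t=0}^{\infty} \left[ (\alpha\tilde{\lambda})^{\frac{1}{1-p}} \right]^t. \tag{28}$$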
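The passage to Equation (28) uses only exponent arithmetic on the general term of the series. Spelled out, since $1 + \frac{p}{1-p} = \frac{1}{1-p}$,

$$\alpha^{t}\, \alpha^{\frac{p}{1-p} t}\, \tilde{\lambda}^{\frac{1}{1-p} t} = \alpha^{\left(1 + \frac{p}{1-p}\right) t}\, \tilde{\lambda}^{\frac{1}{1-p} t} = \alpha^{\frac{1}{1-p} t}\, \tilde{\lambda}^{\frac{1}{1-p} t} = \left[ (\alpha\tilde{\lambda})^{\frac{1}{1-p}} \right]^t .$$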

Inequalities (17) guarantee that $\alpha\tilde{\lambda} < 1$; furthermore, since $0 < p < 1$, we have $1 < \frac{1}{1-p}$. Together, these two facts guarantee that $(\alpha\tilde{\lambda})^{\frac{1}{1-p}} < 1$.

Therefore, summing the geometric series in Equation (28), we obtain

$$R_\alpha(1, \tilde{f}) = \left[ 1 - (\alpha\tilde{\lambda})^{\frac{1}{1-p}} \right]^p \frac{1}{1 - (\alpha\tilde{\lambda})^{\frac{1}{1-p}}},$$

$$R_\alpha(1, \tilde{f}) = \frac{1}{\left[ 1 - (\alpha\tilde{\lambda})^{\frac{1}{1-p}} \right]^{1-p}}. \tag{29}$$

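Then, to obtain the stability index, Equation (21) with $x = 1$ and Equation (29) are substituted in Equation (25), and we obtain

$$\Delta R_\alpha(1) = \frac{1}{\left[ 1 - (\alpha\lambda)^{\frac{1}{1-p}} \right]^{1-p}} - \frac{1}{\left[ 1 - (\alpha\tilde{\lambda})^{\frac{1}{1-p}} \right]^{1-p}}. \tag{30}$$

Now, substituting Equation (24) in Equation (30), we have

$$\Delta R_\alpha(1) = \frac{1}{\left[ 1 - (\alpha\lambda)^{\frac{1}{1-p}} \right]^{1-p}} - \frac{1}{\left[ 1 - \left( \alpha\lambda(1-\epsilon)^p \right)^{\frac{1}{1-p}} \right]^{1-p}}. \tag{31}$$

For each fixed $p$, a value of $\theta$ in Equation (23) can be selected such that $\lambda = 1$, so Equation (31) can be written as

$$\Delta R_\alpha(1) = \frac{1}{\left[ 1 - \alpha^{\frac{1}{1-p}} \right]^{1-p}} - \frac{1}{\left[ 1 - \alpha^{\frac{1}{1-p}} (1-\epsilon)^{\frac{p}{1-p}} \right]^{1-p}}. \tag{32}$$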

The stability index given in Equation (32) is thus a function of the discount factor ($\alpha$), the parameter $p$ of the reward function (see Equation (16)), and the level of approximation $\epsilon$ between the distributions $F_\xi$ and $F_{\tilde{\xi}}$ (see Assumption 3.2).
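Equation (32) is straightforward to evaluate numerically. The following minimal Python sketch implements it as a function of $\alpha$, $p$ and $\epsilon$; the function name `stability_index` and the sample arguments are illustrative choices.

```python
def stability_index(alpha, p, eps):
    """Stability index of Equation (32), i.e. for x = 1 and lambda = 1."""
    q = 1.0 / (1.0 - p)
    # Reward of the optimal policy of the original process, Equation (21).
    reward_optimal = (1.0 - alpha**q) ** (p - 1.0)
    # Reward attributed to the approximate policy, Equation (29).
    reward_approx = (1.0 - alpha**q * (1.0 - eps) ** (p * q)) ** (p - 1.0)
    return reward_optimal - reward_approx


p, eps = 0.01, 0.2
for alpha in (0.5, 0.9, 0.99, 0.995):
    print(alpha, stability_index(alpha, p, eps))
```

For fixed $p$ and $\epsilon > 0$ the printed values increase with $\alpha$, and the index vanishes when $\epsilon = 0$, since both terms of Equation (32) then coincide.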

3.2. Study of the Asymptotic Evaluations of the Stability Index

The goal of this work is to perform asymptotic numerical estimations of the stability index as a function of $(1-\alpha)$, that is, to find its order ($\kappa$) when $\alpha \to 1$ (see Equation (10)). For this, we use the explicit expression for the stability index obtained in the previous section (see Equation (32)).

Equation (32) shows that the stability index is a function of $p$ and, as mentioned in the previous section, this parameter of the utility function is important in economics since it is related to elasticity. To estimate the effect of this parameter on the stability index, we select values of $p$ of three kinds: 1) values close to zero (consumers insensitive to monetary change); 2) values close to $\frac{1}{2}$ (average consumers); and 3) values close to 1 (sensitive consumers). For our purposes, these values of $p$ give information about the conditions under which the approximate policy $\tilde{f}$ can be used to control the original process $M$; that is, we want to study whether values of $p$ close to zero (to $\frac{1}{2}$ and to 1) in the reward function allow us to use this approximation.

Methodology and results obtained. For a fixed value of $p$ in Equation (32) and a given value of $\epsilon$, we generate 100 values of $\alpha$, starting at $\alpha = 0.5$ with increments of 0.005. Then, each of the 100 generated values $\alpha = 0.5, 0.505, \dots, 0.995$ is substituted in Equation (32), which yields 100 values of the stability index (as a function of $(1-\alpha)$). With these 100 pairs of $(1-\alpha)$ and $\Delta R_\alpha(1-\alpha)$, a simple linear regression model is fitted to estimate the parameter $\kappa$ in Equation (10); this estimate $\hat{\kappa}$ is the estimated order of the stability index with respect to $(1-\alpha)$, i.e., $\Delta R_\alpha \sim (1-\alpha)^{\hat{\kappa}}$. We are interested in the behavior of the estimate $\hat{\kappa}$ when $\alpha \to 1$ and $\epsilon \to 0$.

For example, if $p = \frac{1}{100}$, then from Equation (32) the stability index is

$$\Delta R_\alpha(1) = \frac{1}{\left[ 1 - \alpha^{\frac{100}{99}} \right]^{\frac{99}{100}}} - \frac{1}{\left[ 1 - \alpha^{\frac{100}{99}} (1-\epsilon)^{\frac{1}{99}} \right]^{\frac{99}{100}}}. \tag{33}$$

Recall that $\epsilon$ measures the approximation between the distributions $F_\xi$ and $F_{\tilde{\xi}}$ (see Assumption 3.2). Taking $\epsilon = 0.2$ and substituting it in Equation (33), we have

$$\Delta R_\alpha(1) = \frac{1}{\left[ 1 - \alpha^{\frac{100}{99}} \right]^{\frac{99}{100}}} - \frac{1}{\left[ 1 - \alpha^{\frac{100}{99}} (0.8)^{\frac{1}{99}} \right]^{\frac{99}{100}}}. \tag{34}$$

Now, we generate the 100 values $\alpha = 0.5, 0.505, \dots, 0.995$ and substitute them in Equation (34), which yields 100 values of the stability index; these are shown in Figure 1.

Remark 3.3. In Figure 1, the stability index $\Delta R_\alpha(1)$ given in Equation (34) is labeled delta, that is, $\Delta R_\alpha(1) \equiv$ delta, and the measure $\epsilon$ is labeled epsilon.

From Figure 1, we can see that when $\alpha \to 1$, delta $\equiv \Delta R_\alpha(1) \to \infty$; that is, it is very costly to use the optimal policy of the approximate process given in Equation (22) to control the original process given in Equation (14).

Figure 1. Scatterplot generated by 100 data points of the stability index ($\Delta R_\alpha$) obtained from Equation (34).

On the other hand, to obtain the asymptotic evaluation of the stability index when $\alpha \to 1$, that is, the estimate of the parameter $\kappa$ that appears in Equation (10),

$$\Delta R_\alpha \equiv \Delta R_\alpha(1-\alpha) = W(1-\alpha)^{\kappa}, \quad W \in \mathbb{R},\ \kappa \le -1,$$

we estimate the following simple linear regression model:

$$\ln[\Delta R_\alpha(1-\alpha)]_i = \ln(W) + \kappa \ln(1-\alpha)_i + \nu_i, \quad i = 1, \dots, 100, \tag{35}$$

where $\nu_i$ is white noise (see [18] for the definition), and $W$ and $\kappa$ are the parameters to be estimated from the 100 data points generated and represented in Figure 1. The results of estimating the regression in Equation (35) are shown below:

Regression Analysis: ln(delta) versus ln(1-alpha)

Therefore, from the above results we have $\hat{\kappa} = -2.1369$, and from Equation (10) it can be concluded that the asymptotic estimate of the order of the stability index when $\alpha \to 1$ is $\hat{\kappa} = -2.1369$; that is, the sensitivity of the stability index with respect to $(1-\alpha)$ is

$$\Delta R_\alpha(1-\alpha) \sim M(1-\alpha)^{-2.1369}. \tag{36}$$
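The regression in Equation (35) can be sketched in a few lines of Python. The example below is a minimal sketch, assuming ordinary least squares on the log-transformed data for $p = 1/100$ and $\epsilon = 0.2$; the helper names `stability_index` and `ols_fit` are illustrative, and the fit is written out by hand so that only the standard library is needed. It is not the statistical package used to produce the reported output.

```python
import math


def stability_index(alpha, p, eps):
    """Stability index of Equation (32), i.e. for x = 1 and lambda = 1."""
    q = 1.0 / (1.0 - p)
    first = (1.0 - alpha**q) ** (p - 1.0)
    second = (1.0 - alpha**q * (1.0 - eps) ** (p * q)) ** (p - 1.0)
    return first - second


def ols_fit(xs, ys):
    """Least-squares intercept and slope of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


p, eps = 0.01, 0.2
alphas = [0.5 + 0.005 * i for i in range(100)]  # 0.5, 0.505, ..., 0.995
log_x = [math.log(1.0 - a) for a in alphas]
log_y = [math.log(stability_index(a, p, eps)) for a in alphas]

ln_w_hat, kappa_hat = ols_fit(log_x, log_y)  # estimates of ln(W) and kappa
print(ln_w_hat, kappa_hat)
```

The slope `kappa_hat` plays the role of $\hat{\kappa}$ and can be compared with the estimate reported in the text.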

On the other hand, the estimate of this asymptotic evaluation of $\kappa$ improves as the distribution $F_{\tilde{\xi}}$ gets closer to the distribution $F_\xi$ (see Assumption 3.2), that is,

if $\epsilon \to 0$, then $F_{\tilde{\xi}} \to F_\xi$ (and so $\hat{\kappa} \to \kappa$). (37)

To see this, for the fixed value $p = \frac{1}{100}$, we replicated the estimation of $\kappa$ in Equation (35) for $\epsilon = 0.1, 0.05, 0.01, 0.001$.

For $p = \frac{1}{100}$ and $\epsilon = 0.1$, from Equation (33) we have the following stability index:

$$\Delta R_\alpha(1) = \frac{1}{\left[ 1 - \alpha^{\frac{100}{99}} \right]^{\frac{99}{100}}} - \frac{1}{\left[ 1 - \alpha^{\frac{100}{99}} (0.9)^{\frac{1}{99}} \right]^{\frac{99}{100}}}. \tag{38}$$

For the same 100 values of $(1-\alpha)$ and using the above equation, another 100 values of the stability index were generated, which are presented in Figure 2.

From Figure 2, we observe that for $p = \frac{1}{100}$ and $\epsilon = 0.1$ (when $\alpha \to 1$) it remains very costly to use the optimal policy of the approximate process given in Equation (22) to control the original process given in Equation (14); however, the stability index $\Delta R_\alpha(1) \equiv$ delta is reduced, owing to the greater precision ($\epsilon = 0.1$) in the approximation of the distribution $F_{\tilde{\xi}}$ to the distribution $F_\xi$.

Figure 2. Scatterplot generated by 100 data points of the stability index ($\Delta R_\alpha$) obtained from Equation (38).

Now, with these 100 new data points from Figure 2, the parameter $\kappa$ is re-estimated in the simple linear regression model given in Equation (35). The results obtained are the following:

Regression Analysis: ln(delta) versus ln(1-alpha)

The results show that $\hat{\kappa} = -2.1562$, so the stability index has order $-2.1562$ with respect to $(1-\alpha)$, that is, $\Delta R_\alpha(1-\alpha) \sim M(1-\alpha)^{-2.1562}$.

Now, to investigate the asymptotic behavior of this sensitivity $\kappa$, we make the approximation between the probability distributions better and better, i.e., $\epsilon \to 0$.

So, analogously to what has already been explained, the results for $p = \frac{1}{100}$ and $\epsilon = 0.05, 0.01, 0.001$ are presented in Figures 3-5.

Figures 1-5 show that, for fixed $p = \frac{1}{100}$, when $\epsilon$ tends to zero (which implies that $F_{\tilde{\xi}}$ approaches $F_\xi$), the stability index tends to zero (see the y-axis labels).

This interpretation is clearer in Figure 6 and Figure 7, in which the five previous figures are combined as contour lines of the stability index.

In these last two figures (note the y-axis labels), it is clear that when epsilon tends to zero, the stability index also tends to zero. This implies that the better the approximation between the distribution functions, the more reasonable it is to use the approximate optimal policy to control the original process.

Now, for each group of 100 data points generated in each of the five graphs, the $\kappa$ parameter of the simple linear regression model given in Equation (35) was estimated. The results obtained from these estimates are presented in Table 1.

Figure 3. Scatterplot generated by 100 data points of the stability index ($\Delta_{R_\alpha}$) obtained from Equation (33) with $\epsilon = 0.05$.

Figure 4. Scatterplot generated by 100 data points of the stability index ($\Delta_{R_\alpha}$) obtained from Equation (33) with $\epsilon = 0.01$.

Figure 5. Scatterplot generated by 100 data points of the stability index ($\Delta_{R_\alpha}$) obtained from Equation (33) with $\epsilon = 0.001$.

Figure 6. Results of the association of the stability index and the approximation measure in the probability distributions (epsilon).

Figure 7. Magnification of Figure 6 as $\alpha$ approaches 1.

Table 1. Asymptotic evaluation of the stability index ($\hat{\kappa}$).

Note that the first two cases ($\epsilon = 0.2$; $\epsilon = 0.1$) correspond to the results explained earlier.

In Table 1, the green cell shows the best approximation used between the distribution functions (see Assumption 3.2) with which the numerical estimate for the asymptotic evaluation of the stability index was found, which is shown in the skyblue cell.

Based on the results of Table 1, we can conclude that for $p = 1/100$, when $\epsilon \to 0$ and $\alpha \to 1$, the asymptotic evaluation of the stability index is $\hat{\kappa} \approx 2.175$, i.e., the stability index has order

$\Delta_{R_\alpha}(1-\alpha) \sim M(1-\alpha)^{-2.175}$.
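To make this order concrete, consider a short worked calculation. It assumes the power-law relation holds exactly with the same constant $M$, which the regression supports only approximately. Halving the distance $(1-\alpha)$, for example moving from $\alpha = 0.98$ to $\alpha = 0.99$, multiplies the stability index by roughly 4.5:

```latex
\frac{\Delta_{R_\alpha}\!\left(\tfrac{1-\alpha}{2}\right)}{\Delta_{R_\alpha}(1-\alpha)}
\approx
\frac{M\left(\tfrac{1-\alpha}{2}\right)^{-2.175}}{M\,(1-\alpha)^{-2.175}}
= 2^{2.175}
\approx 4.52
```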

To study the sensitivity of the stability index ($\hat{\kappa}$), numerical experiments were carried out for other values of $p$. Each of these $p$ values was substituted in Equation (32), and the stability indices ($\Delta_{R_\alpha}$) were obtained as functions of $\alpha$ and $\epsilon$, as shown in Table 2.

Then, for each fixed value of $p$ given in Table 2, we use $\epsilon = 0.2, 0.1, 0.05, 0.01, 0.001$. For each pair of fixed $p$ and $\epsilon$, 100 values of $(1-\alpha)$ were generated and substituted in the formulas of Table 2, giving 100 values of the stability index as a function of $(1-\alpha)$. Finally, these 100 pairs of $(1-\alpha)$ and $\Delta_{R_\alpha}(1-\alpha)$ were used for the asymptotic evaluation of the approximate stability index by estimating the $\kappa$ parameter of the simple linear regression model given in Equation (35). The results obtained in these numerical estimates are presented in Table 3.
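The procedure just described can be sketched as a double loop over $p$ and $\epsilon$. This is an illustration under stated assumptions: `toy_index` is a hypothetical placeholder and not one of the formulas of Table 2, the names `fit_order` and `order_table` are invented here, and dropping non-finite or non-positive values before taking logarithms is an assumed way of handling grid points where the index cannot be evaluated.

```python
import numpy as np


def fit_order(alpha, delta):
    """Slope of ln(delta) on ln(1 - alpha), using only finite, positive delta."""
    alpha = np.asarray(alpha, dtype=float)
    delta = np.asarray(delta, dtype=float)
    ok = np.isfinite(delta) & (delta > 0)  # assumed filter for unusable points
    slope, _ = np.polyfit(np.log(1.0 - alpha[ok]), np.log(delta[ok]), 1)
    return slope, int(ok.sum())


def order_table(stability_index, p_values, eps_values, alpha):
    """Return {(p, eps): (slope, n_used)} for every pair of p and eps."""
    table = {}
    for p in p_values:
        for eps in eps_values:
            delta = stability_index(alpha, eps, p)
            table[(p, eps)] = fit_order(alpha, delta)
    return table


def toy_index(alpha, eps, p):
    """Hypothetical placeholder, NOT a formula from Table 2."""
    return eps * (1.0 - alpha) ** (-(2.0 + p))


alpha = 0.5 + 0.005 * np.arange(100)  # 0.5, 0.505, ..., 0.995
eps_values = [0.2, 0.1, 0.05, 0.01, 0.001]
table = order_table(toy_index, [0.01, 0.5], eps_values, alpha)
for (p, eps), (slope, n_used) in table.items():
    print(f"p={p}, eps={eps}: slope={slope:.4f}, n={n_used}")
```

Passing the index as a callable keeps the loop independent of any particular formula, so the explicit expressions for each $p$ could be plugged in without changing the fitting code.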

Table 2. Explicit expression of the stability index for different values of the reward function parameter (p).

Table 3. Explicit expression of the stability index for different values of the reward function parameter (p).

Remark 3.4. In Table 3, for values of $p \geq 95/100$ the stability index tends to infinity faster, so it is not possible to obtain $n = 100$ values for $\alpha = 0.5, 0.505, \ldots, 0.995$. Thus, for example, the results shown in Table 3 for $p = 95/100$ were obtained with $n = 89$ data points, while those for $p = 99/100$ were obtained with $n = 50$ data points. The results presented for the rest of the $p$ values in Table 3 were obtained with 100 data points.

Clearly, the results presented for each $p$ value in Table 3 must be interpreted in the same way as the results in Table 1.

Discussion of results. The motivation for studying discounted reward (cost) problems is primarily economic. Capital accumulation processes of an economy, inventory problems, and portfolio management are applications of this type of optimization criterion. The reward function used in this work, see Equation (16), is widely used in economics; it belongs to the family of consumer utility functions, specifically the so-called Cobb-Douglas utility function (see [19] for definitions), so the parameter $p$ in Equation (16) must be selected carefully. The results obtained in this work on the asymptotic evaluations of the stability index (presented in Table 3) are interpreted as follows:

1) If $p \to 1$: $\Delta_{R_\alpha} \sim M(1-\alpha)^{-\infty}$. That is, if the parameter $p$ of the reward function approaches 1, then the sensitivity of the stability index grows indefinitely.

Therefore, for values of $p$ close to 1 the use of an approximate policy to control the original process is not recommended, because the results show (see Table 3) that for $p \geq 8/9$ we have $\hat{\kappa} \geq 2.284$, which is why the stability index can be as large as $M(1-\alpha)^{-8.05}$.

2) If $p \approx 1/2$: $\Delta_{R_\alpha} \sim M(1-\alpha)^{-1.75}$. In this case, the results obtained (see Table 3) suggest that if values $2/5 \leq p \leq 7/10$ are selected in the reward function given in Equation (16), then it would seem reasonable to use the approximate policy $\tilde{f}$ to control the original process $M$.

3) If $p \to 0$: $\Delta_{R_\alpha} \sim M(1-\alpha)^{-2}$. In this case, the results obtained in this work using statistical techniques agree with those found in articles [6] and [7], which instead used upper bounds such as the one given in Equation (1).

Recall that by definition $0 < p < 1$ (see Equation (16)). From the three previous points, the results show that for extreme values of $p$ (close to zero or one) it is not recommended to use an approximate policy to control the original process. The results suggest selecting a value of $p$ close to the middle of the range ($p \approx 1/2$) in the reward function when such an approximation is to be used.
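The orders quoted in the discussion can be compared numerically. The sketch below only evaluates the factor $(1-\alpha)^{-\kappa}$ for the orders mentioned above (1.75, 2, 2.284 and 8.05). The constant $M$ is not reported here and may differ between cases, so this is a comparison of growth rates under that caveat, not of actual stability-index values; the function name `power_law_factor` is introduced for illustration.

```python
def power_law_factor(order, alpha):
    """Return (1 - alpha) ** (-order), the factor that multiplies M."""
    return (1.0 - alpha) ** (-order)


# Orders quoted in the discussion of results.
orders = [1.75, 2.0, 2.284, 8.05]

for alpha in (0.9, 0.99):
    for order in orders:
        factor = power_law_factor(order, alpha)
        print(f"alpha={alpha}, order={order}: factor={factor:.4g}")
```

For any fixed $\alpha$ in $(0, 1)$ the factor increases with the order, which is why the smallest order, attained near $p \approx 1/2$, is the most favorable case for using the approximate policy.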

4. Conclusion

Despite the extensive literature on Markov control processes, few works address the estimation of the stability index. The study of stability for control processes represents a challenge from both a theoretical and an applied point of view. This applied work aims to contribute to the study of stability using statistical techniques instead of probabilistic metrics. Its limitations are the use of a simple Markov control process and the use of an exponential distribution function to measure the shock effect of the process. Nevertheless, the numerical estimates found are consistent and show the sensitivity of the stability index to changes in both the discount factor and the parameter of the reward function. The results answer favorably the original question posed in the introduction, so we conclude that the objective of this work was achieved. Finally, we recommend strengthening these results through the following future investigations: 1) using more complex Markov control processes; 2) validating the robustness of the results with another type of distribution function to measure the shock effect of the process; 3) using another type of reward function; and 4) using other statistical techniques for the asymptotic estimation of the stability index.

Acknowledgements

The author wishes to thank the referees for their valuable suggestions for improving the previous version of the paper.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Dynkin, E.B. and Yushkevich, A.A. (1979) Controlled Markov Processes. Springer-Verlag, New York.
[2] Hernandez-Lerma, O. (1989) Adaptive Markov Control Processes. Vol. 79, Springer-Verlag, New York.
https://doi.org/10.1007/978-1-4419-8714-3
[3] Gordienko, E.I. (1992) An Estimate of the Stability of Optimal Control of Certain Stochastic and Deterministic Systems. Journal of Soviet Mathematics, 59, 891-899.
https://doi.org/10.1007/BF01099115
[4] Gordienko, E.I. and Salem, F.S. (1998) Robustness Inequalities for Markov Control Processes with Unbounded Cost. Systems & Control Letters, 33, 125-130.
https://doi.org/10.1016/S0167-6911(97)00077-7
[5] Gordienko, E.I. and Yushkevich, A.A. (2003) Stability Estimates in the Problem of Average Optimal Switching of a Markov Chain. Mathematical Methods of Operations Research, 57, 345-365.
https://doi.org/10.1007/s001860200258
[6] Gordienko, E.I., Lemus-Rodriguez, E. and Montes-de-Oca, R. (2008) Discounted Cost Optimality Problem: Stability with Respect to Weak Metrics. Mathematical Methods of Operations Research, 68, 77-96.
https://doi.org/10.1007/s00186-007-0171-z
[7] Gordienko, E., Martínez, J. and Ruiz de Chávez, J. (2015) Stability Estimation of Transient Markov Decision Processes. In: Mena, R.H., Pardo, J.C., Rivero, V. and Bravo, G.U., Eds., XI Symposium on Probability and Stochastic Processes, Mexico, 18-22 November 2013, 157-176.
https://doi.org/10.1007/978-3-319-13984-5_8
[8] Martínez-Sánchez, J.E. (2020) Stability Estimation for Markov Control Processes with Discounted Cost. Applied Mathematics, 11, 491-509.
https://doi.org/10.4236/am.2020.116036
[9] Gordienko, E.I. and Salem-Silva, F. (2000) Estimates of Stability of Markov Control Processes with Unbounded Costs. Kybernetika, 36, 195-210.
[10] Montes-de-Oca, R. and Salem-Silva, F. (2005) Estimates for Perturbations of Average Markov Decision Process with a Minimal State and Upper Bounded by Stochastically Ordered Markov Chains. Kybernetika, 41, 757-772.
[11] Martinez, J. and Zaitzeva, E. (2015) Note on Stability Estimation in Average Markov Control Processes. Kybernetika, 51, 629-638.
http://doi.org/10.14736/kyb-2015-4-0629
[12] Rachev, S.T. (1991) Probability Metrics and the Stability of Stochastic Models. Wiley, Chichester.
[13] Hernandez-Lerma, O. and Lasserre, J. (1996) Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, New York.
https://doi.org/10.1007/978-1-4612-0729-0
[14] Hernandez-Lerma, O. and Lasserre, J.B. (1999) Further Topics on Discrete-Time Markov Control Processes. Springer, New York.
https://doi.org/10.1007/978-1-4612-0561-6
[15] Van Nunen, J.A. and Wessels, J. (1978) A Note on Dynamic Programming with Unbounded Rewards. Management Science, 24, 576-580.
https://doi.org/10.1287/mnsc.24.5.576
[16] Carlton, D. and Perloff, J. (2005) Modern Industrial Organization. Pearson, Addison Wesley, Boston.
[17] Viscusi, W., Harrington, J. and Vernon, J. (2005) Economics of Regulation and Antitrust. The MIT Press, Cambridge.
[18] Kmenta, J. (1971) Elements of Econometrics. 2nd Edition. Macmillan Publishing Company, New York.
[19] Cobb, C.W. and Douglas, P.H. (1928) A Theory of Production. American Economic Review, 18, 139-165.

Copyright © 2021 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.