Offline Robustness of Distributional Actor-Critic Ensemble Reinforcement Learning
1. Introduction
Offline reinforcement learning (RL) [1]-[3] concerns the problem of learning a policy from a fixed dataset without further interaction. Because it eliminates the need for online interaction, offline RL reduces risk and cost, making it well suited to real-world applications such as autonomous driving [4], healthcare [5] and robot control [6].
Applying standard policy improvement approaches to an offline dataset typically results in the distribution shift problem, making offline RL a challenging task [7] [8]. Prior works have relieved this issue by penalizing the action-value of out-of-distribution (OOD) actions [9]-[11]. Nevertheless, learning only the expectation of the action-value cannot quantify risk or ensure that the learned policy acts safely. To overcome this problem, efforts have been made to import distributional RL [12]-[14] into offline RL to learn the full distribution over future returns, which is then used to plan around risky and unsafe actions. In addition, with the establishment of risk-sensitive objectives [14], distributional offline RL methods [15] [16] learn better state representations since they acquire richer distributional signals, making them superior to traditional RL algorithms on both risk-seeking and risk-averse objectives.
Unfortunately, research on distributional offline RL remains incomplete. CODAC [15] brings distributional RL into the offline setting by penalizing the predicted return quantiles of OOD actions. Meanwhile, MQN-CQR [16] learns a worst-case policy by optimizing the conditional value-at-risk of the distributional value function. However, existing distributional offline RL methods focus only on the safety of the learned policy. Their conservative return distributions impair robustness, making the learned policies highly sensitive to even minor perturbations in observations [17]. As a result, safety alone cannot strike a fine balance between conservatism and robustness, because it does not pay enough attention to robustness.
In this paper, we propose Offline Robustness of Distributional actor-critic Ensemble Reinforcement Learning (ORDER), which introduces a smoothing technique to quantile networks. Firstly, we use a dynamic entropy regularizer for the quantile function instead of a fixed constant to ensure sufficient exploration. Secondly, increasing the number of quantile networks helps obtain a more robust distributional value function. Thirdly, smoothness regularization is applied to the return distribution and the policy for states near the dataset. In theory, we prove that ORDER, equipped with the distributional soft Bellman operator, obtains a uniform lower bound on all integrations of the quantiles, which controls the distribution shift. This bound applies equally to expected returns and risk-sensitive objectives. Overall, ORDER can mitigate the OOD problem while balancing conservatism and robustness.
In our experiments, ORDER outperforms existing distributional offline RL methods on the D4RL benchmark [18] and is also competitive with current advanced algorithms. Our ablation experiments demonstrate that strengthening the quantile network is critical to the performance of ORDER. In addition, the choice of risk measure function has little impact on performance, which further shows the robustness of our method.
2. Preliminaries
2.1. Markov Decision Process and Offline Reinforcement Learning
Consider an episodic Markov decision process (MDP) $(\mathcal{S}, \mathcal{A}, P, H, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(\cdot \mid s, a)$ is the transition distribution, $H$ is the length of the episode, $r(s, a)$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. For a stochastic policy $\pi(\cdot \mid s)$, the action-value function $Q^{\pi}(s, a)$ is defined as
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{H} \gamma^{t} r(s_t, a_t) \,\Big|\, s_0 = s,\ a_0 = a\Big].$$
Standard RL aims at learning the optimal policy $\pi^{*}$ such that $Q^{\pi^{*}}(s, a) \ge Q^{\pi}(s, a)$ for all policies $\pi$ and all $(s, a) \in \mathcal{S} \times \mathcal{A}$. The corresponding $Q$-function of the policy satisfies the Bellman equation:
$$\mathcal{B}^{\pi} Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a),\, a' \sim \pi(\cdot \mid s')}\big[Q(s', a')\big].$$
In the offline setting, the agent is not allowed to interact with the environment [18]. The objective is to learn an optimal policy only from a fixed dataset $\mathcal{D} = \{(s, a, r, s')\}$. For all states $s \in \mathcal{D}$, let $\hat{\pi}_{\beta}(a \mid s)$ be the empirical behavior policy. In order to avoid a zero denominator in the theoretical analysis, we assume that $\hat{\pi}_{\beta}(a \mid s) > 0$ for actions appearing in $\mathcal{D}$. Broadly, actions that are not drawn from $\hat{\pi}_{\beta}$ (i.e., those with low probability density) are out-of-distribution (OOD).
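As a concrete illustration of these definitions, the sketch below estimates an empirical behavior policy by counting state-action pairs in a small discrete dataset and flags OOD actions by thresholding their empirical density. The function names and toy data are illustrative, not from the paper, and a continuous-action setting would require a density model instead of counts.

```python
from collections import Counter, defaultdict

def empirical_behavior_policy(dataset):
    """Estimate pi_beta(a | s) by counting (s, a) pairs in a discrete dataset.

    `dataset` is a list of (state, action) pairs; returns a dict mapping
    state -> {action: empirical probability}.
    """
    counts = defaultdict(Counter)
    for s, a in dataset:
        counts[s][a] += 1
    policy = {}
    for s, actions in counts.items():
        total = sum(actions.values())
        policy[s] = {a: n / total for a, n in actions.items()}
    return policy

def is_ood(policy, s, a, threshold=0.0):
    """An action is OOD if its empirical density is at or below the threshold."""
    return policy.get(s, {}).get(a, 0.0) <= threshold

# Toy dataset: at state s0, "left" was chosen twice and "right" once.
data = [("s0", "left"), ("s0", "left"), ("s0", "right"), ("s1", "left")]
pi_beta = empirical_behavior_policy(data)
```

Here `is_ood` with the default zero threshold marks exactly the actions never observed at a state, matching the "not drawn from the behavior policy" notion above.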
2.2. Distributional Reinforcement Learning
Distributional RL directly models the full distribution of returns $Z^{\pi}(s, a)$, instead of merely learning its expected value [19]. The distributional Bellman operator for policy evaluation is
$$\mathcal{T}^{\pi} Z(s, a) \stackrel{D}{=} r(s, a) + \gamma Z(s', a'), \quad s' \sim P(\cdot \mid s, a),\ a' \sim \pi(\cdot \mid s'),$$
where $\stackrel{D}{=}$ indicates equality in distribution. Define the quantile function $F_Z^{-1}$ to be the inverse of the cumulative distribution function $F_Z$ as $F_Z^{-1}(\tau) := \inf\{z \in \mathbb{R} : \tau \le F_Z(z)\}$ [20], where $\tau \in [0, 1]$ denotes the quantile fraction. For random variables $U$ and $V$ with quantile functions $F_U^{-1}$ and $F_V^{-1}$, the $p$-Wasserstein distance
$$W_p(U, V) = \Big(\int_0^1 \big|F_U^{-1}(\tau) - F_V^{-1}(\tau)\big|^p \, d\tau\Big)^{1/p}$$
is the $L^p$ metric on quantile functions. [19] showed that the distributional Bellman operator $\mathcal{T}^{\pi}$ is a $\gamma$-contraction in $\bar{W}_p$: let $\bar{W}_p(Z_1, Z_2) = \sup_{s, a} W_p(Z_1(s, a), Z_2(s, a))$ be the largest Wasserstein distance over $(s, a)$, and let $\mathcal{Z}$ be the space of return distributions with bounded $p$-th moment; then
$$\bar{W}_p(\mathcal{T}^{\pi} Z_1, \mathcal{T}^{\pi} Z_2) \le \gamma\, \bar{W}_p(Z_1, Z_2).$$
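The $\gamma$-contraction property can be checked numerically. The sketch below (illustrative helper names, not the paper's code) computes the $p$-Wasserstein distance between two equal-size empirical distributions via their sorted samples, then applies a simplified Bellman backup with a fixed reward and shared next-state noise, $\mathcal{T}Z = r + \gamma Z$, under which the distance contracts by exactly $\gamma$.

```python
import numpy as np

def wasserstein_p(x, y, p=1):
    """p-Wasserstein distance between two equal-size empirical distributions:
    the L^p distance between their sorted samples (empirical quantile functions)."""
    x, y = np.sort(x), np.sort(y)
    return (np.mean(np.abs(x - y) ** p)) ** (1.0 / p)

rng = np.random.default_rng(0)
gamma, r = 0.9, 1.0
z1 = rng.normal(0.0, 1.0, size=1000)   # two candidate return distributions
z2 = rng.normal(2.0, 1.5, size=1000)

# One distributional Bellman backup with a deterministic reward: T Z = r + gamma * Z.
tz1, tz2 = r + gamma * z1, r + gamma * z2

before = wasserstein_p(z1, z2)
after = wasserstein_p(tz1, tz2)
```

Since the backup here is an affine map applied elementwise, `after` equals `gamma * before` exactly; with stochastic rewards or transitions the contraction holds as an inequality rather than an equality.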
Fitted distributional evaluation (FDE) [15] approximates $\mathcal{T}^{\pi}$ by an empirical operator $\hat{\mathcal{T}}^{\pi}$ estimated from $\mathcal{D}$; then $Z^{\pi}$ can be estimated by starting from an arbitrary $Z^{0}$ and iteratively computing
$$Z^{k+1} \leftarrow \arg\min_{Z}\ \mathbb{E}_{(s, a) \sim \mathcal{D}}\big[W_p^p\big(Z(s, a),\ \hat{\mathcal{T}}^{\pi} Z^{k}(s, a)\big)\big].$$
In distributional RL, let the risk measure function be a map from the value distribution space to the real numbers. Given a distortion function $g$ over $[0, 1]$, the distorted expectation of $Z$ is
$$\Phi_g(Z) = \int_0^1 F_Z^{-1}(\tau)\, dg(\tau),$$
and the corresponding policy is $\pi(s) = \arg\max_a \Phi_g(Z(s, a))$ [13]. Specially, if $g$ is the identity, then $\Phi_g(Z) = \mathbb{E}[Z]$. For other choices of $g$, please refer to Section 5.2.
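The distorted expectation above can be approximated on an empirical return distribution with a Riemann sum over quantile fractions. The sketch below (hypothetical helper `distorted_expectation`, not the paper's implementation) shows that the identity distortion recovers the plain mean, while a CVaR distortion downweights high returns and gives a risk-averse value.

```python
import numpy as np

def distorted_expectation(samples, g, n=64):
    """Approximate Phi_g(Z) = integral of F_Z^{-1}(tau) dg(tau) on an
    empirical return distribution by a Riemann sum over quantile fractions."""
    taus = np.linspace(0.0, 1.0, n + 1)
    mids = (taus[:-1] + taus[1:]) / 2          # midpoint quantile fractions
    quantiles = np.quantile(samples, mids)     # empirical F^{-1} at midpoints
    weights = np.diff(g(taus))                 # dg(tau) mass on each sub-interval
    return float(np.sum(quantiles * weights))

neutral = lambda t: t                               # identity -> plain expectation
cvar25 = lambda t: np.minimum(t / 0.25, 1.0)        # CVaR at level 0.25 (risk-averse)

rng = np.random.default_rng(1)
returns = rng.normal(10.0, 2.0, size=20000)         # synthetic return samples
```

With the identity distortion the weights are uniform, so the sum is just an average of quantiles and approximates the mean; under the CVaR distortion all mass concentrates on the lowest quartile, so the value is strictly smaller.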
2.3. Robust Reinforcement Learning
Robust RL learns the policy by introducing worst-case adversarial noise to the system dynamics and formulating the noise distribution as the solution of a zero-sum minimax game. In order to learn a robust policy, SR2L [21] obtains a perturbed state $\tilde{s}$ by adding a perturbation to the state $s$, where $\tilde{s} \in B_d(s, \epsilon)$, an $\epsilon$-radius ball around $s$ in metric $d$. The Jeffrey's divergence $D_J$ for two distributions $P$ and $Q$ is defined by
$$D_J(P, Q) = \tfrac{1}{2}\big[D_{\mathrm{KL}}(P \,\|\, Q) + D_{\mathrm{KL}}(Q \,\|\, P)\big],$$
and the smoothness regularizer for the policy is $R_{\pi} = \mathbb{E}_{s \sim \rho}\, \max_{\tilde{s} \in B_d(s, \epsilon)} D_J\big(\pi(\cdot \mid s), \pi(\cdot \mid \tilde{s})\big)$. Analogously, $R_Q = \mathbb{E}_{s \sim \rho,\, a \sim \pi}\, \max_{\tilde{s} \in B_d(s, \epsilon)} \big(Q(s, a) - Q(\tilde{s}, a)\big)^2$ is the smoothness regularizer for the Q-function, where $\rho$ is the state distribution.
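A minimal sketch of the SR2L-style policy regularizer for discrete action distributions follows. The helper names (`jeffreys`, `policy_smoothness_penalty`) and the softmax toy policy are assumptions for illustration; the inner maximization over the $\epsilon$-ball is approximated here by evaluating a finite set of candidate perturbations.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions on the same support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def jeffreys(p, q):
    """Jeffrey's divergence: the symmetrized KL used by the smoothness regularizer."""
    return 0.5 * (kl(p, q) + kl(q, p))

def policy_smoothness_penalty(pi, state, perturbations):
    """Approximate max over perturbed states of D_J(pi(.|s), pi(.|s~)).
    `pi` maps a state vector to a vector of action probabilities."""
    base = pi(state)
    return max(jeffreys(base, pi(state + d)) for d in perturbations)

def softmax_policy(s):
    """Toy policy: softmax over the state vector treated as action logits."""
    e = np.exp(s - s.max())
    return e / e.sum()

penalty = policy_smoothness_penalty(
    softmax_policy,
    np.array([1.0, -1.0]),
    [np.array([0.1, 0.0]), np.array([0.0, 0.1])],
)
```

Jeffrey's divergence is symmetric and vanishes only for identical distributions, so the penalty is zero exactly when the policy is locally constant around the state.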
3. Offline Robustness of Distributional Actor-Critic Ensemble
In ORDER, we first obtain a state with adversarial perturbations, and then apply smoothness regularization to both the policy and the distributional action-value function for these perturbed states. The smoothness regularization learns a smooth Z-function and yields a smooth policy, which makes the algorithm robust. However, introducing smoothness could result in overestimation of values at the boundaries of the dataset's support. To overcome this problem, we incorporate a penalty factor for OOD actions to reduce the quantile values of these actions. In addition, we strengthen the quantile network by increasing the number of quantile networks in the ensemble, which also benefits the robustness of our algorithm. The overall architecture of ORDER is shown in Figure 1.
Figure 1. Architecture diagram for ORDER.
3.1. Robust Distributional Action-Value Function
In this part, we sample three sets of state-action pairs and form three different loss functions to obtain a conservative smooth policy. First of all, we construct a perturbation set $B_d(s, \epsilon)$ to obtain perturbed pairs $(\tilde{s}, a)$, where $B_d(s, \epsilon)$ is an $\epsilon$-radius ball measured in metric $d$ and $\tilde{s} \in B_d(s, \epsilon)$. Then we sample pairs $(s, a')$ from the current policy, where $a'$ is drawn from the policy at $s$. ORDER contains an ensemble of $M$ Z-functions; we denote the parameters of the $i$-th Z-function and its target Z-function as $\theta_i$ and $\bar{\theta}_i$, respectively. With these constructions, we give different learning targets for the $(s, a)$, $(\tilde{s}, a)$ and $(s, a')$ pairs, respectively.
SAC [7] incorporates an entropy term in its objective function to optimize both cumulative rewards and policy stochasticity, enhancing policy robustness and generalization. We also introduce an entropy term: for a pair $(s, a)$ sampled from $\mathcal{D}$, the target is given by the distributional soft Bellman backup, where the next Z-value takes the minimum over the $M$ target Z-functions. The distributional TD loss is the quantile regression loss between each Z-function and this target.
Next, we introduce a smoothness regularizer for the distributional action-value function, designed to enhance the smoothness of the Z-function. Specifically, we minimize the difference between the return distributions at $(s, a)$ and $(\tilde{s}, a)$, where $(\tilde{s}, a)$ is a state-action pair with a perturbed state, and we take the adversarial $\tilde{s}$ that maximizes this difference. The final smoothing term is:

(1)

where a weighting factor balances the learning of in-distribution and out-of-distribution values. Thus, for the selected $\tilde{s}$, we minimize Equation (1) to obtain a smooth Z-function. Since the actions should stay near the offline data and close to the behavior actions in the dataset, we do not consider OOD actions for smoothing in this part.
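The inner "adversarial $\tilde{s}$" step can be approximated by random search inside the perturbation ball. The sketch below assumes an $\ell_\infty$ ball and a generic callable Z-function; the paper's actual norm and search procedure may differ, so treat the names and defaults as illustrative.

```python
import numpy as np

def adversarial_state(z_fn, state, action, epsilon=0.1, n_candidates=10, seed=0):
    """Pick the perturbed state inside an l-infinity epsilon-ball that maximizes
    the gap between Z(s, a) and Z(s~, a), approximating the inner maximization
    of the smoothing term by random search."""
    rng = np.random.default_rng(seed)
    base = z_fn(state, action)
    best, best_gap = state, -np.inf
    for _ in range(n_candidates):
        cand = state + rng.uniform(-epsilon, epsilon, size=state.shape)
        gap = np.abs(z_fn(cand, action) - base).mean()  # mean gap over quantiles
        if gap > best_gap:
            best, best_gap = cand, gap
    return best

# Toy Z-function returning a single "quantile" equal to the state sum.
z_fn = lambda s, a: np.array([s.sum()])
adv = adversarial_state(z_fn, np.zeros(3), None, epsilon=0.1)
```

In practice gradient-based attacks (e.g. projected gradient ascent) are a common stronger alternative to random search for this inner maximization.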
Finally, we consider a loss function that prevents overestimation of OOD actions, controlled by a state-action dependent scale factor and a threshold. Incorporating both the in-distribution target and the OOD target, we conclude the loss function in ORDER as follows:

(2)
3.2. Robust Policy
With the above smoothness constraints, we can learn a robust policy that changes little under perturbations. We choose a perturbed state $\tilde{s}$, as described above, that maximizes the policy divergence. Consequently, our loss function for the policy is:

(3)

where the first term is designed to obtain a conservative policy by maximizing the minimum of the distributional function ensemble, and the last term is a regularization term.
3.3. Implementation Details
In this subsection, we integrate the distributional evaluation and policy improvement algorithms introduced in Section 3.1 and Section 3.2 into the actor-critic framework. With the loss function introduced in Equation (2), we obtain the following iterative formula, starting from an arbitrary $Z^{0}$:

(4)
Following [17], we employ a min-max objective in which the inner loop selects the current policy to maximize the objective, while the outer loop minimizes the objective with respect to this policy, the inner maximization being over an actor policy. To establish a well-posed optimization problem, we introduce a regularization term into the original objective; detailed analysis and the final optimization objective are provided in Appendix A.1. To perform optimization with respect to the return distribution, we express the quantile function using a neural network. To calculate the quantile regression objective [22], we minimize the weighted pairwise Huber regression loss over various quantile fractions. The $\kappa$-Huber quantile regression loss [23] with threshold $\kappa$ is represented as
$$\rho^{\kappa}_{\tau}(\delta) = \big|\tau - \mathbb{I}\{\delta < 0\}\big|\, \frac{L_{\kappa}(\delta)}{\kappa}, \qquad L_{\kappa}(\delta) = \begin{cases} \frac{1}{2}\delta^{2}, & |\delta| \le \kappa, \\ \kappa\big(|\delta| - \frac{1}{2}\kappa\big), & \text{otherwise}, \end{cases}$$
where $\tau$ denotes random quantile fractions and $\delta$ is the pairwise temporal-difference error between quantile estimates. More details of ORDER are presented in Algorithm 1.
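The quantile Huber loss above translates directly into code. The sketch below is a NumPy rendition for clarity (the helper names are illustrative; a training implementation would use an autodiff framework): the Huber part bounds gradients near zero error, while the $|\tau - \mathbb{I}\{\delta < 0\}|$ weight penalizes under- and over-estimation asymmetrically per quantile fraction.

```python
import numpy as np

def huber(delta, kappa=1.0):
    """Huber loss L_kappa: quadratic within kappa, linear outside."""
    abs_d = np.abs(delta)
    return np.where(abs_d <= kappa,
                    0.5 * delta ** 2,
                    kappa * (abs_d - 0.5 * kappa))

def quantile_huber_loss(td_errors, taus, kappa=1.0):
    """kappa-Huber quantile regression loss: the Huber loss weighted
    asymmetrically by |tau - 1{delta < 0}| so each quantile fraction tau
    penalizes positive and negative TD errors differently."""
    weight = np.abs(taus - (td_errors < 0).astype(float))
    return float(np.mean(weight * huber(td_errors, kappa) / kappa))
```

At $\tau = 0.9$ a positive TD error (target above the estimate) is weighted by 0.9 while a negative one is weighted by 0.1, which is what pushes the estimate toward the 0.9-quantile of the target distribution.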
3.4. Theoretical Analysis
Before presenting our theorems, similar to [15], we first state some assumptions about the MDP and the dataset, and assume that the search space in Equation (4) includes all possible smooth functions.
Assumption 1. For all $(s, a)$ and all policies $\pi$, the CDF $F_{Z^{\pi}(s, a)}$ is smooth. Furthermore, the search space of the minimum over $Z$ in Equation (4) is over all smooth functions.
This assumption guarantees the boundedness of the $p$-th moments of the return distributions and of their Bellman backups.
Assumption 2. For all $(s, a)$ and $\pi$, there exists $\mu > 0$ such that $F_{Z^{\pi}(s, a)}$ is $\mu$-strongly monotone, i.e., $F(x) - F(y) \ge \mu (x - y)$ for all $x \ge y$.
This assumption is only designed to ensure convergence in our theoretical analysis. In the algorithm implementation, we introduce a policy smoothing term and choose the minimum value of the Z-function ensemble to restrict the policy update amplitude, thus implicitly satisfying monotonicity. Next, we assume that the infinity norm of the difference between the quantile functions with and without perturbation is bounded by a fixed constant.
Assumption 3. For all $(s, a)$, all $\pi$ and the selected perturbed state $\tilde{s}$, we assume that $\big\|F^{-1}_{Z(s, a)} - F^{-1}_{Z(\tilde{s}, a)}\big\|_{\infty} \le C$ for some constant $C$, where $\|\cdot\|_{\infty}$ denotes the infinity norm. Since the perturbed observation $\tilde{s}$ is randomly sampled from an $\epsilon$-ball around $s$, the assumption is reasonable.
Finally, we assume the penalty weight is positive. We then derive the following lemma, which characterizes the conservative distributional soft evaluation iterates in terms of the distributional soft Bellman operator.

Lemma 1. Suppose Assumptions 1 - 3 hold. Then for all $(s, a)$, all quantile fractions $\tau$ and all iterates, the quantiles of the next iterate equal those of the distributional soft Bellman backup shifted downward by a penalty term.

For the detailed proof, please refer to Appendix B.2. Briefly, it follows from a simple variational argument, treating $F$ as a function and setting the derivative of Equation (4) to zero.

Next, we define the conservative soft distributional evaluation operator by composing the distributional soft Bellman operator with the shift operator induced by this penalty.
Now, our main theorem shows that the conservative distributional soft evaluation obtains a conservative estimate of the true quantiles at all quantile fractions $\tau$.

Theorem 1. For any $\delta > 0$, with probability at least $1 - \delta$, the learned quantile estimates lower-bound the true quantiles for all $(s, a)$ and all $\tau$, provided the penalty weight is sufficiently large.

For the detailed proof, please refer to Appendix B.2. As the theorem shows, the quantile estimates obtained by the conservative operator are a lower bound of the true quantiles. Furthermore, we give a sufficient condition showing that the result in Theorem 1 is not vacuous. Therefore, Theorem 1 theoretically illustrates that ORDER does not exacerbate the distribution shift problem; the mitigation of the distribution shift problem is demonstrated in the experimental section.
Many RL algorithms exhibit different behaviors under different distorted expectations. The following corollary shows that ORDER acquires the same conservative estimates for these objectives, generalizing Theorem 1.
Corollary 1. For any $\delta > 0$, any distortion function $g$, and a sufficiently large penalty weight, with probability at least $1 - \delta$, the distorted expectation of the learned distribution lower-bounds that of the true distribution for all $(s, a)$.

In particular, the bound of Theorem 1 on the expected return is obtained by taking $g$ to be the identity. By choosing different risk measure functions, this conclusion applies to any risk-sensitive offline RL objective.
4. Related Works
Offline RL [1] [24]-[26] learns a policy from previously collected static datasets. As a subfield of RL [27] [28], it has achieved significant accomplishments in practice [4]-[6] [29]. However, the two main challenges of offline RL are the distribution shift problem and robustness [26] [30] [31], which require various techniques to improve the stability and performance of learned policies.
4.1. Distribution Shift
The distribution shift problem arises because the distribution of the collected offline training data differs from the distribution of data encountered at deployment. BCQ [9] addresses this problem through policy regularization, formulating the policy as an adaptable deviation constrained by maximum values [32]. BEAR [33] alleviates distribution shift by incorporating a weighted behavior-cloning loss, achieved by minimizing the maximum mean discrepancy (MMD), into the policy improvement step. By learning a conservative Q-function, CQL [17] tackles the overestimation of value functions caused by distribution shift, and theoretically proves that a lower bound of the true value is obtained. Introducing distributional RL into offline RL, CODAC [15] learns a conservative return distribution by penalizing the predicted return quantiles of OOD actions. From another perspective, MQN-CQR [16] uses continuous quantiles to learn the return distribution with non-crossing guarantees. ORDER builds on these approaches, but uses a dynamic entropy regularizer for the quantile function instead of a fixed constant to ensure sufficient exploration and relieve training imbalance.
4.2. Robustness Issues
Owing to distribution shift, current offline RL algorithms tend to be cautious in value estimation and action selection. However, this caution can compromise the robustness of learned policies, making them highly sensitive to even minor perturbations in observations. As a pioneering work, SR2L [21] achieves a more robust policy by introducing a smoothness regularizer into both the value function and the policy. REM [30] proposes a robust Q-learning algorithm that combines multiple Q-value networks in a random convex combination of Q-value estimates, ensuring that the final estimate remains robust. RFQI [34] addresses robust offline RL with arbitrarily large state spaces, learning the optimal policy via function approximation using only an offline dataset. In ORDER, we apply the smoothness regularizer to the distributional value functions and policies instead of simple action-value functions.
Our method is related to the previous offline RL algorithms based on constraining the learned value function [17] [35]. What sets our method apart is that it can better capture uncertain information about OOD actions and learn more robust policies with the introduction of the smoothing technique into distributional reinforcement learning. In addition, we enhance the network by using the entropy regularizer of quantile networks and increasing the number of network ensembles, which also improves robustness.
5. Experiments
In the sequel, we first compare ORDER against several offline RL algorithms. Section 5.2 then investigates how different risk measure functions impact the proposed algorithm, and Section 5.3 studies the effect of the ensemble size of the quantile network in ORDER. Our approach significantly exceeds the baseline algorithms in most tasks.
We evaluate our experiments on the D4RL benchmark [18] with various continuous control tasks and datasets. Specifically, we employ three environments (HalfCheetah, Hopper and Walker2d) and four dataset types (random, medium, medium-replay, and medium-expert). The random or medium dataset is generated by a single random or medium policy. The medium-replay dataset contains experiences collected in training a medium-level policy, and the medium-expert dataset is a mixture of medium and expert datasets.
5.1. Comparison with Offline RL Algorithms
On all the aforementioned datasets, we compare our method against several popular offline RL algorithms: 1) bootstrapping error accumulation reduction (BEAR) [33], 2) conservative Q-learning (CQL) [17], 3) CODAC [15], 4) robust offline reinforcement learning (RORL) [36], and 5) monotonic quantile network with conservative quantile regression (MQN-CQR) [16]. The results of BEAR and CQL are taken directly from [18]; the results of CODAC and MQN-CQR are taken from their original papers. Since the RORL paper does not report scores over five random seeds, we rerun RORL using the official code base. Implementation details are given in Appendix D, and Table A1 and Table A2 list the hyperparameters of ORDER for the different datasets. Without loss of generality, we employ the neutral risk measure in this subsection.
Table 1. Normalized average returns on D4RL benchmark, averaged over five random seeds. “r”, “m”, “m-r” and “m-e” indicate the abbreviations of random, medium, medium-replay and medium-expert, respectively. All methods are run for 1M gradient steps.
| Datasets | BEAR | CQL | CODAC | RORL (Reproduced) | MQN-CQR | ORDER |
|---|---|---|---|---|---|---|
| hopper-r | 3.9 ± 2.3 | 7.9 ± 0.4 | 11.0 ± 0.4 | 22.7 ± 8.4 | 13.2 ± 0.6 | 24.8 ± 7.8 |
| hopper-m | 51.8 ± 4.0 | 53.0 ± 28.5 | 70.8 ± 11.4 | 104.8 ± 0.3 | 94.7 ± 13.2 | 101.5 ± 0.2 |
| hopper-m-r | 52.2 ± 19.3 | 88.7 ± 12.9 | 100.2 ± 1.0 | 102.3 ± 0.5 | 95.6 ± 18.5 | 106.4 ± 0.1 |
| hopper-m-e | 50.6 ± 25.3 | 105.6 ± 12.9 | 112.0 ± 1.7 | 112.8 ± 0.2 | 113.0 ± 0.5 | 114.6 ± 3.3 |
| walker2d-r | 12.8 ± 10.2 | 5.1 ± 1.3 | 18.7 ± 4.5 | 21.5 ± 0.2 | 22.6 ± 6.1 | 28.4 ± 6.2 |
| walker2d-m | -0.2 ± 0.1 | 73.3 ± 17.7 | 82.0 ± 0.5 | 103.2 ± 1.7 | 80.0 ± 0.5 | 86.0 ± 0.2 |
| walker2d-m-r | 7.0 ± 7.8 | 81.8 ± 2.7 | 33.2 ± 17.6 | 90.1 ± 0.6 | 52.3 ± 16.7 | 87.9 ± 4.8 |
| walker2d-m-e | 22.1 ± 44.9 | 107.9 ± 1.6 | 106.0 ± 4.6 | 120.3 ± 1.8 | 112.1 ± 8.9 | 115.1 ± 1.2 |
| halfCheetah-r | 2.3 ± 0.0 | 17.5 ± 1.5 | 34.6 ± 1.3 | 28.2 ± 0.7 | 32.6 ± 2.9 | 31.5 ± 1.0 |
| halfCheetah-m | 43.0 ± 0.2 | 47.0 ± 0.5 | 46.3 ± 1.0 | 64.7 ± 1.1 | 45.1 ± 1.5 | 63.7 ± 0.4 |
| halfCheetah-m-r | 36.3 ± 3.1 | 45.5 ± 0.7 | 44.0 ± 0.8 | 61.1 ± 0.7 | 45.3 ± 7.9 | 57.4 ± 1.7 |
| halfCheetah-m-e | 46.0 ± 4.7 | 75.6 ± 25.7 | 70.4 ± 19.4 | 108.2 ± 0.8 | 71.1 ± 4.9 | 93.2 ± 1.1 |
Table 1 exhibits the performance of all these algorithms, reporting the average normalized scores along with their standard deviations. We observe that ORDER outperforms BEAR in all tasks and surpasses the performance of CQL. Significantly, our algorithm surpasses the current distributional offline RL methods (see CODAC and MQN-CQR in Table 1), which we attribute to the robustness guarantees in ORDER. Meanwhile, ORDER competes favorably with the current state-of-the-art algorithms, owing to the safety guaranteed by distributional RL.
5.2. Policy Training under Risk Measures Function
In this subsection, we investigate how risk measure functions affect the performance of ORDER. We compare three risk-averse learned policies [14] in distributional RL with the risk-neutral measure. Specifically, for a distortion parameter $\eta$, three distorted expectations are considered:

CPW: $g(\tau) = \tau^{\eta} / \big(\tau^{\eta} + (1 - \tau)^{\eta}\big)^{1/\eta}$, with $\eta$ set to 0.71.

Wang: $g(\tau) = \Phi\big(\Phi^{-1}(\tau) + \eta\big)$, where $\eta$ is set to 0.25 and $\Phi$ is the standard Gaussian CDF.

CVaR: $g(\tau) = \min\{\tau / \eta,\ 1\}$, with $\eta$ set to 0.25.
Besides, we evaluate three risk-seeking learned policies: mean-variance (Mean-Std in Table 3) with parameter −0.1, VaR with parameter 0.75, and Wang with $\eta$ set to −0.75.
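The distortion functions listed above are easy to implement and sanity-check. The sketch below (illustrative helpers; the inverse Gaussian CDF is computed by bisection purely to avoid external dependencies) verifies that each $g$ maps $[0, 1]$ to $[0, 1]$ monotonically, which is what makes it a valid distortion.

```python
import math

def gauss_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gauss_icdf(p, lo=-10.0, hi=10.0):
    """Inverse standard normal CDF by bisection (adequate for evaluating distortions)."""
    for _ in range(80):
        mid = (lo + hi) / 2
        if gauss_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def cpw(tau, eta=0.71):
    return tau ** eta / (tau ** eta + (1 - tau) ** eta) ** (1 / eta)

def wang(tau, eta=0.25):
    if tau <= 0.0:
        return 0.0
    if tau >= 1.0:
        return 1.0
    return gauss_cdf(gauss_icdf(tau) + eta)

def cvar(tau, eta=0.25):
    return min(tau / eta, 1.0)
```

Note the sign convention for Wang: the same formula serves both rows of the comparison, with $\eta = 0.25$ for the risk-averse variant and $\eta = -0.75$ for the risk-seeking one.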
The results, shown in Table 2 and Table 3, indicate little performance difference between the risk-averse and risk-seeking learned policies. This suggests that ORDER is not highly sensitive to the choice of risk measure function, which further empirically demonstrates the robustness of our approach.
Table 2. Performance of ORDER under various risk-averse methods in hopper-medium-replay-v2. Each method is run with five random seeds.
| Risk measure | Neutral | CPW (0.71) | CVaR (0.25) | Wang (0.75) |
|---|---|---|---|---|
| Performance of ORDER | 106.4 ± 0.1 | 105.3 ± 0.4 | 106.2 ± 0.4 | 106.0 ± 1.5 |
Table 3. Performance of ORDER under various risk-seeking methods in hopper-medium-replay-v2. Each method is run with five random seeds.
| Risk measure | Neutral | Mean-Std (−0.1) | VaR (0.75) | Wang (−0.75) |
|---|---|---|---|---|
| Performance of ORDER | 106.4 ± 0.1 | 107.5 ± 0.3 | 106.5 ± 0.6 | 106.4 ± 0.2 |
5.3. Ablations on Benchmark Results
Without loss of generality, we conduct the ablation study on the hopper-medium-replay-v2 dataset. Figure 2 visualizes the performance of ORDER under different ensemble sizes M. We observe that increasing M significantly improves both performance and stability, as shown by the yellow and purple lines. However, M should not be too large, presumably due to overfitting (see the blue line, where the normalized score fluctuates significantly around training epoch 700). In conclusion, M is set to four to balance robustness enhancement against computational cost.
Figure 2. The normalized score under different ensemble sizes. Each method is run with five random seeds.
6. Conclusion
In this work, we introduce Offline Robustness of Distributional actor-critic Ensemble Reinforcement Learning (ORDER) to balance the conservatism and robustness in the offline setting. To achieve robustness, we first take into account the entropy regularizer that helps fully explore the dataset and alleviates training imbalance issues. Moreover, we consider the ensemble of multiple quantile networks to enhance robustness. Furthermore, a smoothing technique is introduced to the policies and the distributional functions for the perturbed states. In addition, we theoretically prove that ORDER converges to a conservative lower bound, which also shows that we improve the robustness without exacerbating the OOD problem. Finally, ORDER shows its advantage against the existing distributional offline RL methods in the D4RL benchmark. We also validate the effectiveness of ORDER through ablation studies.
Appendix
A. Algorithm and Implementation Details
In this section, we provide a detailed account of our practical implementation of ORDER.
A.1. ORDER Objective
To establish a well-defined optimization problem, we introduce a regularization term into the original objective. Choosing the regularizer to be the entropy of the maximizing distribution, the inner maximization admits a closed-form solution. Substituting this solution into the regularized objective gives the final optimization objective.
As in [15], we also introduce a parameter to threshold the quantile value difference between in-distribution and OOD actions, and give this difference a weight. We then obtain a trainable expression through dual gradient descent. Since all our experiments take place in continuous-control domains, it is not feasible to enumerate all possible actions and compute the penalty term directly. In our implementation, we employ the importance sampling approximation described in [17], and obtain

(5)

where the proposal includes a uniform distribution over actions. Algorithm 1 summarizes a single step of the actor and critic updates used by ORDER.
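The CQL-style importance sampling idea referenced here can be sketched as follows: to estimate a log-sum-exp of Q-values over a continuous action space, draw half the actions uniformly and half from the current policy, and reweight each sample by its proposal density. The function names and signatures are illustrative assumptions, not the paper's code.

```python
import numpy as np

def logsumexp_estimate(q_fn, state, policy_sampler, action_low, action_high,
                       n=10, seed=0):
    """Importance-sampling estimate of log sum_a exp(Q(s, a)) over a continuous
    action space: half the samples come from a uniform proposal, half from the
    current policy, each reweighted by its proposal density.

    `policy_sampler(state, n)` must return (actions, densities)."""
    rng = np.random.default_rng(seed)
    dim = len(action_low)
    vol = np.prod(np.array(action_high) - np.array(action_low))
    unif_actions = rng.uniform(action_low, action_high, size=(n, dim))
    unif_dens = np.full(n, 1.0 / vol)
    pi_actions, pi_dens = policy_sampler(state, n)
    acts = np.vstack([unif_actions, pi_actions])
    dens = np.concatenate([unif_dens, pi_dens])
    q = np.array([q_fn(state, a) for a in acts])
    # Numerically stabilized log of the mean of exp(q) / density.
    m = q.max()
    return m + np.log(np.mean(np.exp(q - m) / dens))
```

As a sanity check, for a constant Q-function on the unit action interval the estimator recovers the constant exactly, since the integral of exp(Q) over a unit-volume space equals exp(Q).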
Algorithm 1. ORDER

1: Hyperparameters: number of generated quantiles N, number of quantile networks M, Huber loss threshold, discount rate, learning rates, tuning parameter, OOD penalty scale, OOD penalty threshold
2: Parameters: critic parameters, actor parameters, penalty weight
3: # Compute distributional TD loss
4: Get the next action using the current policy
5: for i = 1 to M do (for the i-th quantile network)
6:   for j = 1 to N do
7:     Compute the quantile TD target
8:   end for
9: end for
10: Compute the distributional TD loss
11: # Compute OOD penalty
12: Sample OOD actions and their quantile estimates
13: Estimate the penalty term according to Equation (5)
14: Compute the OOD penalty loss
15: # Update quantile network
16: Use Equation (1) to add perturbations to the state to obtain the perturbed state
17: Train the critics using Equation (2) by SGD
18: Update the target networks
19: # Update policy network
20: Get new actions with re-parameterized samples
21: Compute the policy objective
22: Take a gradient step on the policy loss, Equation (3)
23: Update the actor parameters
B. Proofs
B.1. Proof of Lemma 1
Proof. By the definition of the $p$-Wasserstein distance, we can rewrite Equation (4) in terms of quantile functions. For an arbitrary smooth function with compact support, consider a perturbation of the minimizer and take the derivative of the objective at the unperturbed point. If this derivative were nonzero, some perturbation direction would decrease the objective, contradicting optimality; and since the perturbation direction is arbitrary, the integrand itself must vanish for all quantile fractions $\tau$. Setting this term to zero yields the first-order condition. According to Assumption 3, the condition can be converted into the stated quantile shift, which holds if and only if each iterate equals the backed-up quantile minus the penalty term. □
B.2. Proof of Theorem 1
Lemma 2. For any $\delta > 0$, with probability at least $1 - \delta$, for any $(s, a)$ and $\tau$, the difference between the empirical and true backed-up CDFs is bounded by a term that shrinks with the number of occurrences of $(s, a)$ in $\mathcal{D}$. (6)

Proof. Applying the definition of the distributional soft Bellman operator to the cumulative distribution function, then adding and subtracting the intermediate true backup, we bound the two terms in the resulting summation; the first term is controlled by a concentration argument. The remaining derivation is similar to CODAC [15], which yields the stated conclusion.
It has been proved in CODAC that if two cumulative distribution functions $F$ and $G$ with common support are close in the supremum norm and $F$ is $\mu$-strongly monotone, then their quantile functions are correspondingly close. Thus, according to Lemma 2, we have:

Lemma 3. For any return distribution $Z$ with $\mu$-strongly monotone CDF and any $\delta > 0$, with probability at least $1 - \delta$, for all $(s, a)$ and $\tau$, the quantiles of the empirical and true backups are correspondingly close.

Combining Lemma 1 with Lemma 2 (the second step holding with probability at least $1 - \delta$), and choosing the penalty weight large enough that the penalty term dominates the sampling error for all $(s, a)$ and $\tau$, we obtain the lower bound (7). Notice that for the last term in Equation (7) to be positive, the penalty weight must exceed a threshold depending on the sampling error; under our earlier assumption this condition is equivalent to the stated requirement, and we thus prove the conclusion of Theorem 1.
C. Experimental Settings
Our experimental procedure largely adheres to [18], and the results of non-distributional methods are taken directly from [18]. For all experiments, we run algorithms for 1000 epochs (1000 training steps each epoch, i.e., 1M gradient steps in total). We then evaluate them over 10 test episodes in the original environment, each lasting 1000 steps. All benchmark results are averaged over 5 random seeds. The reported results are normalized D4RL scores that measure performance relative to an expert score and a random score:
$$\text{normalized score} = 100 \times \frac{\text{score} - \text{random score}}{\text{expert score} - \text{random score}}.$$
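The normalization above is a one-line function; the sketch below simply anchors the two reference points (a random policy scores 0, an expert scores 100), with the function name chosen for illustration.

```python
def d4rl_normalized_score(score, random_score, expert_score):
    """D4RL normalization: 0 corresponds to a random policy, 100 to an expert."""
    return 100.0 * (score - random_score) / (expert_score - random_score)
```

Scores above 100 are possible when a policy exceeds the expert reference return, as several entries in Table 1 do.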
D. Implementation Details
We implement ORDER based on DSAC and keep the DSAC-specific hyperparameters the same; these are detailed in Table A1. As with CODAC, we introduce the penalty hyperparameters described in Appendix A.1. In most cases, the penalty weight is a learnable parameter initialized to 1 with its own learning rate. In a few cases, it is held fixed throughout training, which we indicate by setting its learning rate to 0.
Table A1. ORDER backbone hyperparameters.
| Hyper-parameter | Value |
|---|---|
| Discount factor | 0.99 |
| Batch size | 256 |
| Replay buffer size | 1e6 |
| Optimizer | Adam |
| Minimum steps before training | 1e4 |
| Policy network learning rate | 3e−4 |
| Quantile network learning rate | 3e−5 |
| Huber regression threshold | 1 |
| Number of quantile fractions N | 32 |
| Quantile fraction embedding size | 64 |
Table A2. Hyperparameters of ORDER for the benchmark results.
| Dataset |  |  |  | entropy tuning |  |  |
|---|---|---|---|---|---|---|
| hopper-random | 1 | 10 | 3e−5 | yes | 0.0001 | 0.1 |
| hopper-medium | 10 | 10 | 3e−4 | yes | 0.0 | 0.0 |
| hopper-med-rep | 1 | 10 | 3e−5 | yes | 0.0 | 0.0 |
| hopper-med-exp | 10 | 10 | 3e−5 | no | 0.0001 | 0.1 |
| walker2d-random | 1 | 10 | 3e−5 | yes | 0.0001 | 1.0 |
| walker2d-medium | 10 | 10 | 3e−5 | no | 0.0001 | 1.0 |
| walker2d-med-rep | 1 | 10 | 3e−5 | yes | 0.0 | 0.0 |
| walker2d-med-exp | 10 | 10 | 3e−5 | no | 0.0001 | 1.0 |
| halfCheetah-random | 1 | 10 | 3e−5 | yes | 0.0001 | 0.1 |
| halfCheetah-medium | 10 | 10 | 3e−5 | no | 0.0001 | 0.1 |
| halfCheetah-med-rep | 1 | 10 | 3e−5 | yes | 0.0001 | 0.1 |
| halfCheetah-med-exp | 0.1 | −1 | 3e−4 | no | 0.0 | 0.0 |
Since we introduce smoothing techniques into the policy and quantile networks, we also add further hyperparameters. In Equation (2), the weight for the quantile network smoothing loss is searched over a small grid, as is the weight of the policy smoothing loss in Equation (3). When training the policy and the distributional action-value functions, we randomly sample perturbed observations from a norm ball of radius $\epsilon$ and select the one that maximizes the corresponding smoothness loss for the Z-function and the policy, respectively. For the smoothing loss in Equation (1), the balancing parameter is set to 0.2 for conservative value estimation. All the hyperparameters used in ORDER for the benchmark are listed in Table A2.