by Dominik Drexler

In reinforcement learning (RL), the intrinsic reward function defines reward signals that help learning policies in cases where the extrinsic rewards are sparse. The extrinsic rewards specify the high level goal of a problem and the intrinsic reward functions represent the hidden subgoal structure of a problem or even a family of problems. The subgoal structure represent stepping stones that tell us “what” to achieve whereas policies tell us “how” to behave for reaching the high level goal. The former being more general pieces of knowledge than the latter allowing it to be better suited for speeding up the learning of policies for different high level goals or different dynamics of the world. Furthermore, the subgoal structure introduces the possibility to obtain hierarchical decompositions on different levels of abstraction which makes it interesting pieces of knowledge in general.

RL vs. Planning

In contrast to RL formalisms where actions are often stochastic, we consider the classical planning formalism where actions are deterministic. Similar to RL, we do not require any structure on the actions itself but a simulator that returns the state after applying a given action in a state. We consider infinite classes of problems Q over a common planning domain, e.g., all problems over the Delivery domain where there is a set of packages that must be delivered to target locations one-by-one, in a gridlike world. W.l.o.g., we assume that Q is closed in the sense that if problem P with initial state s is in Q then P’ that is like P but with initial state s’ that is reachable from s is also in Q. This assumption makes it easier to consider all relevant subproblems that one can start from.

Subgoal Structure Representations

The representations for the subgoal structure in deep reinforcement learning (DRL) are usually deep neural nets (Zheng et al., 2020) that result in black boxes or symbolic reward machines (Icarte et al., 2022) that are easier to interpret and closely related to our work. In our work, we use the simple policy sketches language (Bonet and Geffner, 2021) where a policy sketch R consists of rules of form C -> E over state features F. The feature conditions C say what must hold in the current state s and the feature effects say how the features are supposed to change going from s to a state s’. Hence, the rules define subgoals dependent on a given state. Obviously, there is a huge design space to come up with sketches for a specfic class of problems Q. We are specifically interested to find a policy sketch that allows provably polynomial time solutions for each problem in Q. To this end, the policy sketch must induce subgoals in such a way that they can be exploited by a simple search algorithm called serialized iterated width with sketches SIW_R and the policy sketch must be simple or otherwise making it hard to proof that they work as intended.

Width, Iterated Width, and Serialized Iterated Width with Sketches

Serialized iterated width with sketches SIW_R(k) runs a sequence of iterated width IW(k) searches from one subgoal to another until the high level goal is achieved starting from the initial state. Each IW(k) search is a breadth-first search with novelty pruning where generated states are pruned if they do not contain sufficiently much unseen information, i.e., fact tuple of size at most k. The number of generated states is exponential in k meaning that the search is very efficient for small k (usually k=0,1,2) but infeasible for k > 2. Underlying IW(k) is the notion of problem width w(P[s,t]) that defines the subgoal fact tuples t of size at most k that are optimally reachable from the initial state s of problem P. Hence, if the policy sketch defines the states underlying such tuple t as additional closest goal states then this subproblem has width bounded by k. If every P in Q has width bounded by k then we say that the sketch width is bounded by k. However, we want to achieve the overall goal of the problem and not some arbitrary subgoal. To this end, we need to introduce the notion of termination which intuitively means that following the sketch will never end in a cycle and therefore, in the high level goal of the problem. To conclude, if the sketch is terminating and its width is bounded by k then this omits polynomial times solutions for any problem P in Q.

Domain General State Features

We consider domain general state features made up of description logics and counting (Bonet et al., 2019). Description logics is a language for knowledge representation consisting of two types of objects called concepts and roles. A concept represents a unary and a role represents a binary relation referring to predicates of the planning domain. More complex concepts and roles are obtained by composition such as intersection, union, etc. Furthermore, We consider two types of features: Boolean and numerical. A Boolean feature maps states into either true or false and a numerical feature maps states into natural numbers. We obtain a pool of relevant features by iteratively applying description logics grammar rules and counting. The feature complexity is the number of times we apply grammar rules. We use the DLPlan library to generate a feature pool. The important part is that these features are based on a simple language and that they can be evaluated on any given state of a problem over the domain, no matter the size of the problem.

Learning Policy Sketches

Depending of the difficulty of the tractable domain, a single up to very few fully expanded but small training problems suffice because it captures all the needed aspects. For example, in the Delivery domain, if we want to learn a 2-width sketch then we need a single problem where two packages have to be delivered. The reason is that a problem where a single package has to be delivered has width 2. However, information is missing that there can in principle be multiple packages. To learn a sketch, we solve a weighted-Max-SAT problem expressed in ASP where we let the solver select a set of features F from the feature pool and rules over F that must be acyclic and have sketch width bounded by k. To obtain the simplest solution, we require the total feature complexity to be minimal among the solutions. The learned 2-width sketch for the Delivery domain then looks as follows.

n := number of undelivered packages
rule = {n > 0} -> {dec(n)}

where r says that if the current state s has undelivered packages, then a state s’ is a subgoal state if the number of undelivered packages decreases. The sketch width is 2 implies that SIW_R solves any Delivery problem in polynomial time. More examples, proofs and details can be found in our paper (Drexler et al., 2022).

Limitations & Future Research

  • Our approach requires the domain predicates to derive a pool of domain general state features. Hence, how to derive domain general state features from images or other more complex inputs?

  • Weighted-Max-SAT solvers do not scale as good as deep-learning solvers. Therefore, How can we learn the subgoal structure as representations of target languages like policy sketches that allow for polynomial time solutions using deep-learning solvers?