Abstract
To improve the speed and accuracy of the trace-based policy evaluation method TD(λ), we derive and propose, under appropriate assumptions, an off-policy-compatible method for meta-learning state-based λ's online with efficient incremental updates. Furthermore, we prove that the derived bias-variance tradeoff minimization method, with slight adjustments, is equivalent to minimizing the overall target error in terms of state-based λ's. In experiments, the method performs significantly better than the existing method and the baselines.
URL
http://arxiv.org/abs/1904.11439