Measure-Theoretic View of Policy Gradients
Introduction

Why a Measure-Theoretic View of Policy Gradients?

Reinforcement learning (RL) has long relied on probability densities and likelihood ratios to compute policy gradients. The standard derivation arrives at the score-function (likelihood-ratio) form: $$ \nabla_\theta J(\pi_\theta) = \mathbb{E} \left[ R \nabla_\theta \log \pi_\theta(a | s) \right] $$ where $J(\pi_\theta)$ is the objective (e.g. expected reward), $\pi_\theta$ is the policy, $R$ is the reward, and $\nabla_\theta \log \pi_\theta(a | s)$ is the gradient of the log policy. This is essentially what we covered previously. ...
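To make the likelihood-ratio form above concrete, here is a minimal sketch of a single-sample estimate of $R \nabla_\theta \log \pi_\theta(a | s)$, assuming a linear-softmax policy over discrete actions in a one-step setting; all names, shapes, and the placeholder reward are illustrative, not from the original.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions, state_dim = 3, 4
theta = rng.normal(size=(n_actions, state_dim))  # policy parameters

def policy(theta, s):
    """Action probabilities pi_theta(. | s) for a linear-softmax policy."""
    logits = theta @ s
    logits -= logits.max()                    # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """grad_theta log pi_theta(a | s) for the linear-softmax policy."""
    p = policy(theta, s)
    # d log softmax / d logits = one_hot(a) - p; chain rule through logits = theta @ s
    d_logits = -p
    d_logits[a] += 1.0
    return np.outer(d_logits, s)              # same shape as theta

# One Monte Carlo sample of the policy gradient: R * grad_theta log pi_theta(a | s)
s = rng.normal(size=state_dim)                # observed state
a = rng.choice(n_actions, p=policy(theta, s)) # action sampled from the policy
R = 1.0                                       # placeholder reward for the sampled action

grad_estimate = R * grad_log_pi(theta, s, a)
print(grad_estimate.shape)                    # (n_actions, state_dim), matches theta
```

Averaging many such samples approximates the expectation in the formula above; that density-based estimator is exactly the object the measure-theoretic view will re-derive without assuming densities exist.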