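In an MDP, an action affects not only the immediate reward but also future rewards. (Figure: backup diagrams for $v_*$ and $q_*$.) Once $v_*$ is known, a one-step search is enough: from each state, choose the action whose one-step lookahead leads to the best value under $v_*$ (a greedy search). If $q_*$ is known it is even simpler: choose the action with the largest $q_*(s, a)$, with no knowledge of the environment's dynamics required.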
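The difference between acting greedily on $v_*$ and on $q_*$ can be sketched as follows. This is a minimal illustration, not a reference implementation: the two-state MDP, the names `dynamics`, `greedy_from_v`, and `greedy_from_q`, and the tabulated optimal values are all assumptions made up for the example.

```python
# Hypothetical two-state MDP, used only for illustration.
# dynamics[(s, a)] lists (probability, next_state, reward) outcomes.
GAMMA = 0.9
ACTIONS = ("stay", "go")
dynamics = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"): [(1.0, "s1", 1.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
    ("s1", "go"): [(1.0, "s0", 0.0)],
}

# Optimal values for this toy MDP, assumed to be given already.
# They satisfy the Bellman optimality equation for GAMMA = 0.9.
v_star = {"s0": 19.0, "s1": 20.0}
q_star = {
    ("s0", "stay"): 17.1, ("s0", "go"): 19.0,
    ("s1", "stay"): 20.0, ("s1", "go"): 17.1,
}


def greedy_from_v(state, v, dynamics, actions, gamma):
    """One-step search: needs the model to look one step ahead."""
    def lookahead(action):
        return sum(
            prob * (reward + gamma * v[s_next])
            for prob, s_next, reward in dynamics[(state, action)]
        )
    return max(actions, key=lookahead)


def greedy_from_q(state, q, actions):
    """Plain argmax over actions: no model of the environment is used."""
    return max(actions, key=lambda action: q[(state, action)])


policy_from_v = {s: greedy_from_v(s, v_star, dynamics, ACTIONS, GAMMA) for s in v_star}
policy_from_q = {s: greedy_from_q(s, q_star, ACTIONS) for s in v_star}
```

Note that `greedy_from_v` takes `dynamics` as an argument while `greedy_from_q` does not, which is exactly the practical advantage of knowing $q_*$.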
3.1 The Agent-Environment Interface
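As a concrete companion to this section, here is a minimal sketch of how the state-transition probabilities and the two kinds of expected reward follow from the four-argument dynamics function $p(s', r \mid s, a)$. The dictionary layout, the state and action names, and all numbers are assumptions invented for illustration.

```python
# p[(s, a)] maps (next_state, reward) -> probability.
# Hypothetical numbers; each inner dictionary sums to 1.
p = {
    ("s0", "a"): {("s0", 0.0): 0.2, ("s0", 1.0): 0.3, ("s1", 5.0): 0.5},
}


def transition_prob(p, s, a, s_next):
    """p(s' | s, a): sum the dynamics over all rewards."""
    return sum(prob for (s2, _), prob in p[(s, a)].items() if s2 == s_next)


def expected_reward(p, s, a):
    """r(s, a): expected reward for a state-action pair."""
    return sum(r * prob for (_, r), prob in p[(s, a)].items())


def expected_reward_given_next(p, s, a, s_next):
    """r(s, a, s'): expected reward given that the next state is s'."""
    weighted = sum(r * prob for (s2, r), prob in p[(s, a)].items() if s2 == s_next)
    return weighted / transition_prob(p, s, a, s_next)


prob_s0 = transition_prob(p, "s0", "a", "s0")
r_sa = expected_reward(p, "s0", "a")
r_sas0 = expected_reward_given_next(p, "s0", "a", "s0")
r_sas1 = expected_reward_given_next(p, "s0", "a", "s1")
```

Everything here is derived from the single table `p`, which is the point: the four-argument function fully characterizes the dynamics of a finite MDP.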
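Dynamics of a finite MDP: from the four-argument dynamics function (3.2), $p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$, we can compute the state-transition probabilities $p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a)$, the expected reward for a state-action pair $r(s, a) = \sum_{r} r \sum_{s'} p(s', r \mid s, a)$, and the expected reward for a state-action-next-state triple $r(s, a, s') = \sum_{r} r \, p(s', r \mid s, a) / p(s' \mid s, a)$.
In RL problems, the choice of states and actions is more art than science.
3.2 Goals and Rewards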
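The reward specifies the goal, but it should not be used to give the agent prior knowledge. In Go, for example, winning a game earns 1 point, yet the reward tells the agent nothing about how to play. The reward communicates what you want it to achieve, not how you want it achieved.
3.3 Returns and Episodes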
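The discounted return used in this section can be computed with a short backward pass. This is a minimal sketch that relies on the recursion $G_t = R_{t+1} + \gamma G_{t+1}$; the function name `discounted_return` and the reward lists are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Return G_t for rewards = [R_{t+1}, R_{t+2}, ...] (a finite list)."""
    g = 0.0
    # Work backwards so each step applies G = R + gamma * G_next once.
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


three_step = discounted_return([1.0, 1.0, 1.0], 0.5)   # 1 + 0.5 + 0.25
myopic = discounted_return([4.0, 100.0, 100.0], 0.0)   # only the first reward counts
long_run = discounted_return([1.0] * 1000, 0.9)        # close to 1 / (1 - 0.9)
```

With a constant reward of +1 and $\gamma = 0.9$, the truncated sum is already indistinguishable from the limit $1 / (1 - \gamma) = 10$, which illustrates why bounded rewards and $\gamma < 1$ give a finite return.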
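Episodic tasks: there are terminal states, so the return is simply the sum of the rewards, $G_t = R_{t+1} + R_{t+2} + \cdots + R_T$.
Continuing tasks: many RL problems have no terminal state and the interaction just keeps running, so we use discounting and compute the discounted return (3.8), $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, where $\gamma$, with $0 \le \gamma \le 1$, is the discount rate. If $\gamma < 1$, (3.8) has a finite value as long as the reward sequence $\{R_k\}$ is bounded; if $\gamma = 0$, only the immediate reward is taken into account.
3.4 Unified Notation for Episodic and Continuing Tasks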
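The unified return $G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$ and the effect of an absorbing state can be checked numerically. In this minimal sketch the function name `unified_return`, the indexing convention, and the reward values are assumptions for illustration; the absorbing state is modelled simply as extra zero rewards appended to the episode.

```python
def unified_return(rewards, t, gamma):
    """G_t = sum over k = t+1 .. T of gamma^(k-t-1) * R_k.

    rewards[k - 1] holds R_k, so T is len(rewards).
    """
    T = len(rewards)
    return sum(gamma ** (k - t - 1) * rewards[k - 1] for k in range(t + 1, T + 1))


episode = [1.0, 1.0, 1.0]          # R_1, R_2, R_3, then the episode ends
padded = episode + [0.0] * 50      # absorbing state: reward 0 from then on

undiscounted = unified_return(episode, 0, 1.0)
undiscounted_padded = unified_return(padded, 0, 1.0)
discounted = unified_return(episode, 0, 0.5)
discounted_padded = unified_return(padded, 0, 0.5)
```

Padding changes nothing because every extra term is zero, which is why the same formula can cover both episodic and continuing tasks.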
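An absorbing state, which transitions only to itself and generates rewards of zero, unifies episodic tasks and continuing tasks. The return in both cases can then be written as $G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$, allowing $T = \infty$ or $\gamma = 1$ (but not both).
3.5 Policies and Value Functions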
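The Bellman equation for $v_\pi$ that this section builds up to, $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_\pi(s')]$, can be turned into a small numerical check. The sketch below repeatedly applies the right-hand side as an update until it stops changing (this is iterative policy evaluation, a technique swapped in here purely for illustration). The two-state MDP, the uniform random policy, and all names are assumptions.

```python
GAMMA = 0.9
STATES = ("s0", "s1")
ACTIONS = ("stay", "go")

# Hypothetical dynamics: p[(s, a)] maps (next_state, reward) -> probability.
p = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "go"): {("s1", 1.0): 1.0},
    ("s1", "stay"): {("s1", 2.0): 1.0},
    ("s1", "go"): {("s0", 0.0): 1.0},
}

# Assumed policy: pick each action with probability 0.5 in every state.
policy = {s: {a: 0.5 for a in ACTIONS} for s in STATES}


def bellman_backup(s, v):
    """Right-hand side of the Bellman equation for v_pi at state s."""
    return sum(
        policy[s][a] * prob * (r + GAMMA * v[s_next])
        for a in ACTIONS
        for (s_next, r), prob in p[(s, a)].items()
    )


v = {s: 0.0 for s in STATES}
for _ in range(1000):
    # Synchronous sweep: every state is updated from the previous estimate.
    v = {s: bellman_backup(s, v) for s in STATES}

# At convergence v is a fixed point, so the residual is essentially zero.
residual = max(abs(v[s] - bellman_backup(s, v)) for s in STATES)
```

For this toy problem the two linear Bellman equations can also be solved by hand, giving $v_\pi(s0) = 7.25$ and $v_\pi(s1) = 7.75$, which the iteration reproduces.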
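Value function: how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).
Policy $\pi$: a mapping from states to probabilities of selecting each possible action.
The value of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the state-value function for policy $\pi$: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$.
The action-value function for policy $\pi$: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$.
Monte Carlo methods: estimate these values by averaging the returns observed over many random samples. $v_\pi$ and $q_\pi$ can also be represented by parameterized functions whose parameters are adjusted repeatedly; the result depends on the approximator, but it can still be quite accurate.
Value functions satisfy the Bellman equation, which is the basis of many ways of computing them: $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_\pi(s')]$.
3.6 Optimal Policies and Optimal Value Functions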
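There is always at least one policy that is better than or equal to all other policies; it is called an optimal policy, $\pi_*$. All optimal policies share the same state-value function, the optimal state-value function $v_*(s) = \max_\pi v_\pi(s)$, and the same optimal action-value function $q_*(s, a) = \max_\pi q_\pi(s, a)$. $v_*$ does not need to refer to any specific policy (it is independent of policies): just take the max.
Bellman optimality equation: analogous to the Bellman equation, with a max over actions in place of the average under the policy, $v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) [r + \gamma v_*(s')]$.
3.7 Optimality and Approximation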
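Compute, time, and memory are all limited, so approximation is unavoidable. There may also be many states that are encountered only rarely, and the approximation does not need to make good choices in those states. This is one of the main differences between RL and other approaches to solving MDPs.
3.8 Summary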
Title: Intro to RL Chapter 3: Finite Markov Decision Process