强化学习代考-ELEN E6885代写-Reinforcement Learning代写

Midterm Exam

ELEN E6885: Introduction to Reinforcement Learning

强化学习代考 If the only difference between two MDPs is the value of discount factor γ, then they must have the same optimal policy.

Formula Sheet

1.Bellman expectationequation:

2.Bellman optimalityequation:

3.TD(0)iteration:

v(s_t) ← v(s_t) + α(r_t₊₁ + γv(s_t₊₁) − v(s_t)).

Problem 1 (20 Points, 2 Points each)

True or False. No explanation is needed.

Reinforcementlearning uses the formal framework of Markov decision processes (MDP) to define the interaction between a learning agent and its environment in terms of states, actions and rewards.
MDP instances with small discount factors tend to emphasize near-term rewards.
Ifthe only difference between two MDPs is the value of discount factor γ, then they must have the same optimal policy.
Bothϵ-greedy policy and Softmax policy will balance between exploration and exploitation.
Everyfinite MDP with bounded rewards and discount factor γ [0, 1) has a unique optimal policy.
In practice, both policy iteration and value iteration are widely used, and it is not clear which,if either, is better in general. 强化学习代考
Generalized policy iteration (GPI) is a term used to refer to the general idea of letting policy-evaluationand policy improvement processes interact, independent of the granularity and other details of the two processes.
Sarsaconverges with probability 1 to an optimal policy and action-value function as long as all state–action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy.
Inpartially observable Markov decision process (POMDP), it is likely that an agent cannot identify its current So the best choice is to maintain a probability distribution over the states and actions. and then updates this probability distribution based on its real-time observations.
Manyoff-policy methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another.

Problem 2 (20 Points, 5 Points each) 强化学习代考

Short-answer questions.

Considera 100×100 grid world domain where the agent starts each episode in the bottom- left corner, and the goal is to reach the top-right corner in the least number of steps. To learn an optimal policy to solve this problem you decide on a reward formulation in which the agent receives a reward of +1 on reaching the goal state and 0 for all other Suppose you try two variants of this reward formulation, (P1), where you use discounted returns with γ (0, 1), and (P2), where no discounting is used. As a consequence, a good policy can be learnt in (P1) but no learning in (P2), why?
Supposethe reinforcement learning player was Might it always learn to play better, or worse, than a nongreedy player?
Isthe MDP framework adequate to usefully represent all goal-directed learning tasks? Can you think of any clear exceptions?
Givena stationary policy, is it possible that if the agent is in the same state at two different time steps, it can choose two different actions? If yes, please provide an example.

Problem 3 (30 Points)

Consider a 2-state MDP. The transition probabilities for the two actions a₁ and a₂ are summa- rized in the following two tables, where the row and column represents from-state and to-state, respectively. For example, the probability of transition from s₁ to s₂ by taking action a₁ is 0.8.The reward function is Assume the discount factor γ = 0.5.

Table 1: Transition probabilities

[10Pts] Let v1∗, v2∗ denote the optimal state-value function of state s₁, s₂, respectively. Write down the Bellman optimality equations with resepect to v1∗ and v2∗.
[6Pts] Based on your answer to the previous question, prove that v1∗ > v2∗.
[14Pts] Based on the fact that v1∗ > v2∗, find out the optimal value function v1∗ and v2∗. And write down the optimal policy of this MDP.

Problem 5 (30 Points) 强化学习代考

Consider an undiscounted MDP (γ = 1) with two non-terminal states A, B and a terminal state C. The transition function and reward function of the MDP are unknown. However, we have observed the following two episodes:

A, a₁, −1, A, a₁, +1, A, a₂, +3, C,

A, a₂, +1, B, a₃, +2, C,

where a₁, a₂, a₃ are actions, and the number after each action is an immediate reward. For example, A, a₁, −3, A means that the agent took action a₁ from state A, received an immediate

reward −3 and ended up in state A. 强化学习代考

[9Pts] Using a learning rate of α = 0.1, and assuming initial state values of 0, what updates to V (A) does on-line TD(0) method make after the first episode?
[10Pts] Draw a diagram of an MDP and policy π that can best fit these two episodes (i.e., the maximum likelihood Markov model). Write down the transition probability P ^a ′ , the reward function R^a of the MDP you draw, and your estimated policy π(a|s).

[6Pts] Based on your results in the previous question, solve Bellman equation to find out the state-value function of V_π(A) and V_π(B). (Assume V_π(C) = 0 as state C is the terminal state.)
[5Pts] What value function would batch TD(0) find, e., if TD(0) was applied repeatedly to these two episodes?

合作平台：essay代写论文代写写手招聘英国留学生代写

Midterm Exam

ELEN E6885: Introduction to Reinforcement Learning

Formula Sheet

Problem 1 (20 Points, 2 Points each)

Problem 2 (20 Points, 5 Points each) 强化学习代考

Problem 3 (30 Points)

Problem 5 (30 Points) 强化学习代考

你可能也喜欢

数据建模assignment代写-CSCI E-106代写-RStudio Cloud代写

php代写-容易被老师发现是真的么-自己实在没基础怎么办

政治网课代修 – 留学生的政治网课代修该怎么办 – 留学代写

发表回复 取消回复

发表回复取消回复