87. Implement the Sarsa Learning Algorithm to Exit the Maze#

87.1. Introduction#

In value-based reinforcement learning, we have so far mainly implemented the Q-Learning algorithm. In fact, the biggest difference between Sarsa and Q-Learning lies in how the Q-Table is updated. In this challenge, building on the Q-Learning implementation from the experiment, you will complete the maze task by following the Sarsa algorithm.
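To recall the difference (a brief comparison, using the same notation as the update formula later in this section): Q-Learning bootstraps from the greedy value of the next state, whereas Sarsa bootstraps from the action it actually takes next, which is why it is called an on-policy method.

\[ \text{Q-Learning:} \quad Q(s_{t},a_{t}) \leftarrow (1-\alpha) \cdot Q(s_{t},a_{t})+\alpha \cdot \left(r_{t}+\gamma \cdot \max_{a} Q(s_{t+1},a)\right) \]

\[ \text{Sarsa:} \quad Q(s_{t},a_{t}) \leftarrow (1-\alpha) \cdot Q(s_{t},a_{t})+\alpha \cdot \left(r_{t}+\gamma \cdot Q(s_{t+1},a_{t+1})\right) \]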

87.2. Key Points#

  • Q-Table Initialization

  • Q-Table Update Function

  • Complete Sarsa Algorithm Implementation

87.3. Q-Table Initialization#

From the previous experiments, you should know that both Q-Learning and Sarsa have value iteration at their core, so the first step is to initialize the Q-Table.

{exercise-start}
:label: chapter11_03_1

Challenge: Initialize the Q-Table as required.

Requirement: Construct a \(16*4\) DataFrame table (16 states, 4 actions) as the Q-Table.

Hint: The initialization method is the same as that in the Q-Learning experiment.

{exercise-end}
import numpy as np
import pandas as pd
import time
from IPython import display


def init_q_table():
    ### Start of code ### (≈ 2 lines of code)
    actions = None
    q_table = None
    ### End of code ###

    return q_table
{solution-start} chapter11_03_1
:class: dropdown
import numpy as np
import pandas as pd
import time
from IPython import display


def init_q_table():

    ### Start of code ### (≈ 2 lines of code)
    actions = np.array(['up', 'down', 'left', 'right'])
    q_table = pd.DataFrame(np.zeros((16, len(actions))),
                           columns=actions)  # Initialize the Q-Table to all 0s
    ### End of code ###

    return q_table
{solution-end}
init_q_table()
{note}
In this course, the Notebook challenge system cannot grade your work automatically. Fill in the missing code in the cell above and run it yourself. If the output matches the expected output below, the challenge is passed. After completing everything, click "Submit for Detection". This note will not be repeated in later challenges.

Expected output

      up  down  left  right
0    0.0   0.0   0.0    0.0
1    0.0   0.0   0.0    0.0
2    0.0   0.0   0.0    0.0
3    0.0   0.0   0.0    0.0
4    0.0   0.0   0.0    0.0
5    0.0   0.0   0.0    0.0
6    0.0   0.0   0.0    0.0
7    0.0   0.0   0.0    0.0
8    0.0   0.0   0.0    0.0
9    0.0   0.0   0.0    0.0
10   0.0   0.0   0.0    0.0
11   0.0   0.0   0.0    0.0
12   0.0   0.0   0.0    0.0
13   0.0   0.0   0.0    0.0
14   0.0   0.0   0.0    0.0
15   0.0   0.0   0.0    0.0

(A 16×4 DataFrame with columns up, down, left, and right, and every value initialized to 0.0.)

87.4. Action Selection#

Next, we need to use the \(\epsilon\)-greedy method to select actions based on the Q-Table. Here, we implement the act_choose function in the same way as in the experiment.

{exercise-start}
:label: chapter11_03_2

Challenge: Use the \(\epsilon\)-greedy method to select actions based on the Q-Table.

Specification: With probability \(1 - \epsilon\), or when all Q-values are 0, select an action at random; otherwise, select the action with the maximum Q-value. Store the result in action.

Hint: You can use an if-else statement for the decision, just as in the experiment.

{exercise-end}
def act_choose(state, q_table, epsilon):
    state_act = q_table.iloc[state, :]
    actions = np.array(["up", "down", "left", "right"])

    ### Start of code ### (≈ 4 lines of code)
    if None:
        action = None
    else:
        action = None
    ### End of code ###

    return action
{solution-start} chapter11_03_2
:class: dropdown
def act_choose(state, q_table, epsilon):

    state_act = q_table.iloc[state, :]
    actions = np.array(['up', 'down', 'left', 'right'])

    ### Start of code ### (≈ 4 lines of code)
    if np.random.uniform() > epsilon or (state_act == 0).all():
        # Explore: with probability 1 - epsilon, or while all Q-values are still 0
        action = np.random.choice(actions)
    else:
        # Exploit: choose the action with the largest Q-value
        action = state_act.idxmax()
    ### End of code ###

    return action
{solution-end}

Run the test

seed = np.random.RandomState(25)  # Fix the random seed so the verification result is reproducible
a = seed.rand(16, 4)
test_q_table = pd.DataFrame(a, columns=["up", "down", "left", "right"])
l = []
for s in [1, 4, 7, 12, 14]:
    l.append(act_choose(state=s, q_table=test_q_table, epsilon=1))
l

Expected output

['left', 'right', 'right', 'right', 'left']

87.5. Action Feedback#

For the environment feedback, we again set the reward at the terminal to 10 and the penalty for falling into the hole to -10. To encourage finding the shortest path, each step also costs -1. You can reuse the corresponding code block from the experiment.

def env_feedback(state, action, hole, terminal):
    reward = 0.0
    end = 0
    a, b = state
    if action == "up":
        a -= 1
        if a < 0:
            a = 0
        next_state = (a, b)
    elif action == "down":
        a += 1
        if a >= 4:
            a = 3
        next_state = (a, b)
    elif action == "left":
        b -= 1
        if b < 0:
            b = 0
        next_state = (a, b)
    elif action == "right":
        b += 1
        if b >= 4:
            b = 3
        next_state = (a, b)

    if next_state == terminal:
        reward = 10.0
        end = 2
    elif next_state == hole:
        reward = -10.0
        end = 1
    else:
        reward = -1.0

    return next_state, reward, end
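As an optional sanity check (not part of the challenge), you can call env_feedback directly. The hole and terminal coordinates below are the same ones used later in this section:

# A quick check of env_feedback (hole at (2, 1), terminal at (3, 2), as used below)
hole, terminal = (2, 1), (3, 2)

print(env_feedback(state=(0, 0), action="right", hole=hole, terminal=terminal))  # ((0, 1), -1.0, 0): ordinary step
print(env_feedback(state=(2, 2), action="down", hole=hole, terminal=terminal))   # ((3, 2), 10.0, 2): reached the terminal
print(env_feedback(state=(2, 0), action="right", hole=hole, terminal=terminal))  # ((2, 1), -10.0, 1): fell into the hole
print(env_feedback(state=(0, 0), action="up", hole=hole, terminal=terminal))     # ((0, 0), -1.0, 0): bumped into the wall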

87.6. Q-Table Update#

Next, we need to complete the Q-Table update function. As discussed in the experiment, the Q-Table update is where Sarsa differs most from Q-Learning, so it must be implemented according to the Sarsa update formula.

{exercise-start}
:label: chapter11_03_3

Challenge: Complete the Q-Table update function according to the Q-Table update formula of Sarsa below.

Hint: Modify the Q-Table update function from Q-Learning. Use .loc[] to access specific values of the DataFrame by label.

{exercise-end}
\[ Q(s_{t},a_{t})=(1-\alpha) \cdot Q(s_{t},a_{t})+\alpha \cdot (r_{t}+\gamma \cdot Q(s_{t+1},a_{t+1})) \]
def update_q_table(
    q_table, state, action, next_state, next_action, terminal, gamma, alpha, reward
):
    x, y = state
    next_x, next_y = next_state
    q_original = q_table.loc[x * 4 + y, action]

    if next_state != terminal:
        ### Start of code ### (≈ 1 line of code)
        q_predict = None
        ### End of code ###
    else:
        q_predict = reward

    ### Start of code ### (≈ 1 line of code)
    q_table.loc[None] = None
    ### End of code ###

    return q_table
{solution-start} chapter11_03_3
:class: dropdown
def update_q_table(q_table, state, action, next_state, next_action, terminal, gamma, alpha, reward):

    x, y = state
    next_x, next_y = next_state
    q_original = q_table.loc[x * 4 + y, action]

    if next_state != terminal:
        ### Start of code ### (≈ 1 line of code)
        q_predict = reward + gamma * \
            q_table.loc[next_x * 4 + next_y, next_action]  # Uses the next action actually taken, not .max()
        ### End of code ###
    else:
        q_predict = reward

    ### Start of code ### (≈ 1 line of code)
    # Consistent with Q-Learning
    q_table.loc[x * 4 + y, action] = (1 - alpha) * q_original + alpha * q_predict
    ### End of code ###

    return q_table
{solution-end}

Run the test (execute only once. Restart the kernel if you need to execute it again)

new_q_table = update_q_table(
    q_table=test_q_table,
    state=(2, 2),
    action="right",
    next_state=(2, 3),
    next_action="down",
    terminal=(3, 2),
    gamma=0.9,
    alpha=0.8,
    reward=10,
)

new_q_table.loc[10, "right"]

Expected output (the result of executing the cell exactly once):

8.740755431411795

As before, to visualize the learning process, define a state display function. You can simply reuse the corresponding code block from the experiment.

def show_state(end, state, episode, step, q_table):
    terminal = (3, 2)
    hole = (2, 1)
    env = np.array([["_ "] * 4] * 4)
    env[terminal] = "$ "
    env[hole] = "# "
    env[state] = "L "
    interaction = ""
    for i in env:
        interaction += "".join(i) + "\n"

    if state == terminal:
        message = "EPISODE: {}, STEP: {}".format(episode, step)
        interaction += message
        display.clear_output(wait=True)
        print(interaction)
        print("\n" + "q_table:")
        print(q_table)
        time.sleep(3)  # Wait 3 seconds after successfully reaching the terminal
    else:
        display.clear_output(wait=True)
        print(interaction)
        print("\n" + "q_table:")
        print(q_table)
        time.sleep(0.3)  # Controls how long each step is displayed

87.7. Sarsa Algorithm Implementation#

Finally, we implement the complete learning process according to the Sarsa algorithm pseudocode.

Exercise 87.1

Challenge: After successfully completing the functions above, implement the complete learning process according to the following Sarsa algorithm pseudocode. Complete the code by referring to your Q-Learning implementation.

(Image: Sarsa algorithm pseudocode)

def sarsa(max_episodes, alpha, gamma, epsilon):
    q_table = init_q_table()
    terminal = (3, 2)
    hole = (2, 1)
    episodes = 0
    while episodes < max_episodes:
        step = 0
        state = (0, 0)
        end = 0
        show_state(end, state, episodes, step, q_table)
        x, y = state

        ### Start of code ### (≈ 1 line of code)
        action = None  # Action selection
        ### End of code ###

        while end == 0:
            next_state, reward, end = env_feedback(
                state, action, hole, terminal
            )  # Environment feedback
            next_x, next_y = next_state
            next_action = act_choose(next_x * 4 + next_y, q_table, epsilon)  # Action selection

            ### Start of code ### (≈ 3 lines of code)
            q_table = None  # Update the Q-Table
            state = None
            action = None
            ### End of code ###

            step += 1
            show_state(end, state, episodes, step, q_table)
        if end == 2:
            episodes += 1
sarsa(max_episodes=20, alpha=0.8, gamma=0.9, epsilon=0.9)  # Run the test

Expected output

_ _ _ _ 
_ _ _ _ 
_ # _ _ 
_ _ L _ 
EPISODE: 19, STEP: 5

q_table:
          up       down      left     right
0  -4.421534  -3.457078 -3.936450 -4.152483
1  -3.409185  -9.062400 -3.433181 -3.596441
2  -2.213120   4.590499 -3.514029 -3.414613
3  -1.536000  -1.574400 -2.908418 -3.936450
4  -4.109114  -2.730086 -2.836070 -2.548000
5  -2.836070  -9.984000 -2.065920 -1.720000
6  -2.850867   8.000000 -2.562662 -0.800000
7  -2.342144  -0.800000 -1.982720  0.000000
8  -2.509531  -2.348544 -2.213120 -8.000000
9   0.000000   0.000000  0.000000  0.000000
10 -3.033926  10.000000 -8.000000 -0.800000
11 -2.844488   0.000000  0.000000 -0.800000
12 -2.766862  -1.536000 -1.536000  6.142464
13  0.000000   0.000000 -2.325504  9.600000
14  0.000000   0.000000  0.000000  0.000000
15  0.000000   0.000000  0.000000  0.000000

Because the \(\epsilon\)-greedy exploration is random, the Q-Table values above are for reference only. As long as the Q-Table keeps being updated as the number of episodes increases and the agent needs fewer and fewer steps, eventually approaching 5 steps, the implementation is correct.
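Once training is finished, you can also read the learned policy directly from the Q-Table by repeatedly taking the greedy action from the start state. The sketch below is optional; it assumes the same layout as above (start at (0, 0), hole at (2, 1), terminal at (3, 2)) and a trained q_table, which you could obtain, for example, by adding return q_table at the end of your sarsa function. The greedy_path helper is only for illustration and is not part of the challenge.

# Optional: trace the greedy path implied by a trained Q-Table
# (hypothetical helper; assumes sarsa() was modified to return q_table)
def greedy_path(q_table, start=(0, 0), hole=(2, 1), terminal=(3, 2), max_steps=20):
    state = start
    path = [state]
    for _ in range(max_steps):
        if state == terminal:
            break
        x, y = state
        action = q_table.iloc[x * 4 + y, :].idxmax()  # Greedy action, no exploration
        state, _, _ = env_feedback(state, action, hole, terminal)
        path.append(state)
    return path

# Example: greedy_path(trained_q_table)
# With a well-trained table, the path should reach (3, 2) in about 5 steps.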

