HOMEWORK 5 (CS 1573)

Assigned: April 6, 2004

Due: April 15, 2004

This exercise modifies HW1, on agents for the vacuum-cleaner world, by reformulating it as a reinforcement learning problem.

Adapt the reflex-vacuum-agent (Chapter 2 of Russell and Norvig) so that, instead of being given the policy in Figures 2.3/2.8, it knows only the set of actions available in each state and gathers information that could eventually let it learn a policy such as the one in Figure 2.8.

Assume the agent has a lifetime of 10 steps, and the reward is calculated by awarding one point for each clean square at each time step.
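
For concreteness, here is a minimal sketch of the two-square world and the 10-step episode it implies. The names (VacuumWorld, run_episode, and so on) are illustrative only, not part of the assignment, and this sketch counts the per-step reward after the action takes effect:

    SQUARES = ("A", "B")

    class VacuumWorld:
        def __init__(self, location, dirty_squares):
            self.location = location            # "A" or "B"
            self.dirty = set(dirty_squares)     # squares that are currently dirty

        def percept(self):
            status = "Dirty" if self.location in self.dirty else "Clean"
            return (self.location, status)

        def step(self, action):
            """Apply one action, then award one point per clean square."""
            if action == "Suck":
                self.dirty.discard(self.location)
            elif action == "Left":
                self.location = "A"
            elif action == "Right":
                self.location = "B"
            return sum(1 for sq in SQUARES if sq not in self.dirty)

    def run_episode(choose_action, location, dirty_squares, lifetime=10):
        """Run the agent for its 10-step lifetime and return the total reward."""
        world = VacuumWorld(location, dirty_squares)
        return sum(world.step(choose_action(world.percept())) for _ in range(lifetime))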

1. Assign initial action-value estimates that represent the policy in Figure 2.8. Then compute the average reward (averaged over each initial state) when the agent uses each of the following exploration-exploitation strategies (a sketch of both appears after this list):

  1. a greedy strategy (only exploit)
  2. an epsilon-greedy strategy with epsilon=1 (only explore)
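
One way to realize both strategies, sketched here with hypothetical names and values: seed a table of action-value estimates so that the Figure 2.8 action (Suck when the current square is dirty, otherwise move to the other square) looks best in every state, and select actions epsilon-greedily, where epsilon=0 gives strategy 1 and epsilon=1 gives strategy 2:

    import random

    ACTIONS = ["Left", "Right", "Suck"]

    # Hypothetical initial estimates favoring Figure 2.8's choices; states are
    # (location, status-of-current-square) percepts.
    q_initial = {
        ("A", "Dirty"): {"Left": 0.0, "Right": 0.0, "Suck": 1.0},
        ("A", "Clean"): {"Left": 0.0, "Right": 1.0, "Suck": 0.0},
        ("B", "Dirty"): {"Left": 0.0, "Right": 0.0, "Suck": 1.0},
        ("B", "Clean"): {"Left": 1.0, "Right": 0.0, "Suck": 0.0},
    }

    def epsilon_greedy(q, state, epsilon):
        """epsilon = 0: greedy (strategy 1); epsilon = 1: always explore (strategy 2)."""
        if random.random() < epsilon:
            return random.choice(ACTIONS)                   # explore: uniform random
        return max(ACTIONS, key=lambda a: q[state][a])      # exploit: highest estimate

With the run_episode sketch above, "averaged over each initial state" then amounts to averaging the 10-step total over every combination of starting location and dirt configuration.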

2. Now assign initial action-value estimates that represent the opposite of the information in Figure 2.8, and redo part 1.
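
One sketch of such an "opposite" initialization (again with my own names; the assignment does not prescribe this form), which simply makes the formerly preferred action the worst in every state:

    def opposite_estimates(q):
        """Invert the estimates so the formerly preferred action now looks worst."""
        return {state: {a: max(values.values()) - v for a, v in values.items()}
                for state, values in q.items()}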

3. Discuss what the results of parts 1 and 2 tell you.

4. If you had more time and could fully implement the "learning" part of this agent, what would you do with the reward information that your exploration/exploitation strategy gathered?
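
One standard possibility, offered only as a sketch rather than the required answer: push each action-value estimate toward the rewards actually observed, for example with an incremental sample-average update, so that later greedy choices reflect experience rather than the initial guesses.

    def update_estimate(q, counts, state, action, reward):
        """Incremental sample-average update: Q(s,a) <- Q(s,a) + (reward - Q(s,a)) / n."""
        counts[(state, action)] = counts.get((state, action), 0) + 1
        n = counts[(state, action)]
        q[state][action] += (reward - q[state][action]) / n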

Please submit hardcopy and electronic versions showing a trace of what your agent does in your experiments. Bring the hardcopy to class, and submit the electronic version following the class submission policies.

This is meant to be a programming assignment, but if you absolutely cannot get it done, a hand simulation will be accepted with a half-grade deduction.

IMPORTANT NOTE: Points will be deducted if the submission procedure is not followed.