This exercise modifies HW1 on agents for the vacuum-cleaner world by reformulating it as a reinforcement learning problem.
Adapt the reflex vacuum agent (Chapter 2 of Russell and Norvig) so that instead of having the policy of Figures 2.3 and 2.8 built in, it knows only the set of actions available in each state, and attempts to gather information that would enable it to learn a policy such as the one in Figure 2.8.
Assume the agent has a lifetime of 10 steps, and the reward is calculated by awarding one point for each clean square at each time step.
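The setup above can be sketched in code. The following is a minimal, illustrative simulation of the two-square vacuum world from Russell and Norvig, assuming the standard formulation (squares A and B, actions Left, Right, and Suck); the function and variable names are my own, not part of the assignment.

```python
# Two-square vacuum world: state = (location, dirt_A, dirt_B).
# Actions and dynamics follow the standard Russell & Norvig setup.
ACTIONS = ['Left', 'Right', 'Suck']

def step(state, action):
    """Apply an action and return the resulting state."""
    loc, dirt_a, dirt_b = state
    if action == 'Left':
        loc = 'A'
    elif action == 'Right':
        loc = 'B'
    elif action == 'Suck':
        if loc == 'A':
            dirt_a = False
        else:
            dirt_b = False
    return (loc, dirt_a, dirt_b)

def reward(state):
    """One point per clean square at each time step."""
    _, dirt_a, dirt_b = state
    return (not dirt_a) + (not dirt_b)

def run_episode(state, policy, lifetime=10):
    """Total reward over the agent's 10-step lifetime."""
    total = 0
    for _ in range(lifetime):
        state = step(state, policy(state))
        total += reward(state)
    return total

def reflex_policy(state):
    """The Figure 2.8 reflex policy: suck if the current square
    is dirty, otherwise move to the other square."""
    loc, dirt_a, dirt_b = state
    if (loc == 'A' and dirt_a) or (loc == 'B' and dirt_b):
        return 'Suck'
    return 'Left' if loc == 'B' else 'Right'
```

Averaging `run_episode` over every initial state gives the "average reward" quantity the parts below ask you to compute for each strategy.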
1. Assign initial action-value estimates that represent the policy in Figure 2.8. Compute the average reward (averaged over all initial states) when the agent uses each of the following exploration-exploitation strategies:
2. Now assign initial action-value estimates that represent the opposite of the information in Figure 2.8, and redo part 1.
3. Discuss what the results of parts 1 and 2 tell you.
4. If you had more time and could fully implement the "learning" part of this agent, what would you do with the reward information gathered by your exploration/exploitation strategy?
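To make the action-value and exploration ideas in the parts above concrete, here is a hedged sketch of epsilon-greedy selection over a tabular estimate `Q[state][action]`, together with the kind of incremental update the "learning" part could apply to the observed rewards. The epsilon and learning-rate values are illustrative choices, not specified by the assignment.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the highest-valued action (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])

def update(Q, state, action, observed_reward, alpha=0.5):
    """Nudge the estimate toward the observed reward; this is the
    simplest use of the reward information part 4 asks about."""
    Q[state][action] += alpha * (observed_reward - Q[state][action])
```

Initializing `Q` to favor the Figure 2.8 actions gives the setup for part 1; initializing it to favor the opposite actions gives part 2.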
Please submit hardcopies and electronic versions showing a trace of what your agent does in your experiments. Bring the hardcopies to class, and submit the electronic versions following the class submission policies.
This is meant to be a programming assignment, but if you absolutely cannot get it done, a hand simulation will be accepted with a half-grade deduction.
IMPORTANT NOTE: Points will be deducted if the submission procedure is not followed.