Robot Assistive Tasks

User-Guided Reinforcement Learning for an Intelligent Environment


Authors: Y. Wang, M. Huber, V. N. Papudesi, and D. J. Cook
Department of Computer Science and Engineering
University of Texas at Arlington

Abstract
Autonomous robots hold the possibility of performing a variety of assistive tasks in intelligent environments. However, widespread use of robot assistants in these environments requires ease of use by individuals who are generally not skilled robot operators. In this paper we present a method of training robots that bridges the gap between user programming of a robot and autonomous learning of a robot task. With our approach to variable autonomy, we integrate user commands at varying levels of abstraction into a reinforcement learner to permit faster policy acquisition. We illustrate the ideas using a robot assistant task, that of retrieving medicine for an inhabitant of a smart home.

1 Introduction
The application of robot technologies in complex, semi-structured environments and in the service of general end-users promises many benefits. In particular, such robots can perform repetitive and potentially dangerous tasks, as well as assist in operations that are physically challenging for the user. In the context of intelligent environments, assistive robots have a variety of functions to offer. They can move through the environment making sure that the contents and inhabitants are secure. They can also perform simple tasks such as cleaning and retrieving needed objects.

Moving robot systems from factory settings into more general environments, particularly environments requiring interaction with humans, poses large challenges for their control system and for the interface to the human user. The robot system must be able to operate based on direct user guidance or increasingly autonomously as the environment, robot experience, and task complexity dictate. Furthermore, it must do so in a safe and efficient manner without requiring constant, detailed user input, which can lead to rapid user fatigue (Wettergreen et al., 1995).

For personal robot applications, such as robot assistive tasks in intelligent environments, this requirement is further amplified by the fact that the user is generally not a skilled engineer and therefore cannot be expected to be able or willing to provide constant, detailed instructions. An inhabitant of a smart home, for example, would like to request that needed medicine be retrieved without giving detailed instructions of how to accomplish the task. For the user interface and the integration of human input into an autonomous control system this implies that a robot system must facilitate the incorporation of user commands at different levels of abstraction and at different bandwidths. This, in turn, requires operation at varying levels of autonomy (Dorais et al., 1998; Hexmore et al., 1999) depending on the available user feedback.

An additional challenge arises because efficient task-performing strategies that conform with the preferences of the user are often not available a priori. As a result, the system has to be able to acquire them on-line while ensuring that autonomous operation and user-provided commands do not lead to catastrophic failures.

In recent years, a number of researchers have investigated the issues of learning and user interfaces (Clouse & Utgoff, 1992; Smart & Kaelbling, 2000; Kawamura et al., 2001). However, this work was conducted largely in the context of mission-level interaction with the robot systems using skilled operators. In contrast, the approach presented here is aimed at the integration of potentially unreliable user instructions into an adaptive and flexible control framework in order to adjust control policies on-line. The learned policies should more closely reflect the preferences and requirements of the particular end-user. To achieve this, user commands at different levels of abstraction are integrated into an autonomous learning component. Their influence speeds learning of the control policy, but is limited so that it does not prevent ultimate task achievement. As a result, the robot can seamlessly switch between fully autonomous operation and the integration of high- and/or low-level user commands.

In the remainder of this paper, our approach to variable autonomy is presented. In particular, fully autonomous policy acquisition, the integration of high-level user commands in the form of subgoals, and the use of intermittent low-level instructions using direct teleoperation are introduced. Their use is demonstrated in the context of an intelligent environment task using a walking robot, that of retrieving an object as requested by the environment inhabitant.

2 Combining User Input and Autonomous Learning for Variable Autonomy
The approach presented here introduces a method of achieving variable autonomy by integrating user input and autonomous control policies in a Semi-Markov Decision Process (SMDP) model that is built on a hybrid control architecture. Overall behavior is derived from a set of reactive behavioral elements that address local perturbations autonomously. These elements are endowed with formal characteristics that permit the hybrid systems framework to impose a priori safety constraints that limit the overall behavior of the system (Huber & Grupen, 1999; Ramadge & Wonham, 1989). These constraints are enforced during autonomous operation as well as during phases with extensive user input. In the latter case, they override user commands that are inconsistent with the specified safety limitations and could thus endanger the system. The goal here is to provide the robot with the ability to avoid dangerous situations while facilitating flexible task performance.
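The override of unsafe user commands can be illustrated with a minimal sketch. It assumes, purely for illustration, that the safety constraints reduce to a per-state set of admissible actions; the table `SAFE_ACTIONS`, the state and action names, and the function `select_action` are hypothetical and do not come from the hybrid control framework itself, which is considerably richer.

```python
from typing import Dict, Optional, Set

# Hypothetical table: for each state, the actions the safety constraints admit.
SAFE_ACTIONS: Dict[str, Set[str]] = {
    "hallway": {"goto_kitchen", "goto_bedroom", "wait"},
    "stairs_edge": {"back_up", "wait"},
}


def select_action(state: str, user_command: Optional[str], policy_action: str) -> str:
    """Prefer the user's command, but only if it is admissible in this state."""
    allowed = SAFE_ACTIONS[state]
    if user_command is not None and user_command in allowed:
        return user_command          # user guidance drives the robot
    if policy_action in allowed:
        return policy_action         # unsafe or absent command: fall back to the policy
    return sorted(allowed)[0]        # last resort: any admissible action (deterministic)


# An unsafe command near the stairs is overridden by the autonomous policy.
chosen = select_action("stairs_edge", "goto_kitchen", "back_up")
```

The same filter applies whether or not a user command is present, which mirrors the statement that the constraints hold in both autonomous and user-guided phases.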

On top of this control substrate, task-specific control policies are represented as solutions to an SMDP, permitting new tasks to be specified by means of a reward structure rT that provides numeric feedback according to the task requirements. The advantage here is that specifying intermittent performance feedback is generally much simpler than determining a corresponding control policy. Using this reward structure, reinforcement learning (Barto et al., 1993; Kaelbling et al., 1996) permits the robot to learn and optimize appropriate control policies from its interaction with the environment. When no user input is available, this forms a completely autonomous mode of task acquisition and execution.

User input at various levels of abstraction is integrated into the same SMDP model. User commands temporarily guide the operation of the overall system and serve as training input to the reinforcement learning component. Use of such training input can dramatically improve the speed of policy acquisition by focusing the learning system on relevant parts of the behavioral space (Clouse & Utgoff, 1992). In addition, user commands provide additional information about user preferences and are used here to modify the way in which the robot performs a task. This integration of user commands with the help of, and as a jumpstart for, reinforcement learning facilitates a seamless transition between user operation of the robot and fully autonomous execution, based on the availability of user input. Furthermore, it permits user commands to alter the performance of autonomous control strategies without the user needing to provide a complete specification of the control policy. Figure 1 shows a high-level overview of the components of the control system.

In the work presented here, user commands at a high level of abstraction are presented to the SMDP model in the form of temporary subgoals to be achieved or suggested specific actions to execute. This input is used, as long as it conforms with the a priori safety constraints, to temporarily drive the robot. At the same time, user commands play the role of training input to the learning component, which optimizes the autonomous control policy for the current task. Here, Q-learning (Watkins, 1989) is used to estimate the utility function, Q(s,a), by updating its value when action a is executed from state s according to the formula

where r represents the obtained reward.
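As an illustration of the update step, the following minimal sketch assumes the standard one-step form of Watkins' rule, Q(s,a) <- Q(s,a) + alpha * (r + gamma * max over a' of Q(s',a') - Q(s,a)). The learning rate `alpha`, the discount factor `gamma`, the successor state `s_next`, and all state and action names are assumptions introduced here. The sketch also ignores the variable action durations of an SMDP.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update for the transition (s, a, r, s_next)."""
    # Value of the best action available in the successor state (0 if terminal).
    best_next = max((Q[(s_next, a2)] for a2 in actions_next), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)  # unseen state-action pairs start at 0
# Hypothetical transition: moving from "start" to via point "via_001" earns reward 1.
q_update(Q, "start", "goto_via_001", r=1.0,
         s_next="via_001", actions_next=["goto_via_005"])
```

Because the update only needs the executed action and the observed reward, it does not matter whether the action was chosen by the policy or commanded by the user, which is what lets user commands double as training input.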

Low-level user commands in the form of intermittent continuous input from devices such as a joystick are included in the same fashion into the learning component, serving as temporary guidance and training information.

3 User Commands as Reward Modifiers
To address the preferences of the user beyond a single execution of the action and to permit user commands to have long-term influence on the robot’s performance of a task, we employ user commands to modify the task-specific reward structure to more closely resemble the actions indicated by the user. This is achieved by means of a separate user reward function, ru, that represents the history of commands provided by the user. User input is captured by means of a bias function, bias(s,a), which is updated each time a user gives a command to the robot according to the function

where action a in state s is part of the user command and there are n possible actions in state s. The total reward used by the Q-learning algorithm throughout robot operation is then

leading to a change in the way a task is performed even when operating fully autonomously.

Incorporating user commands into the reward structure rather than directly into the policy permits the autonomous system to ignore actions that have previously been specified by the user if they were contradictory, if their cost is prohibitively high, or if they prevent the achievement of the overall task objective as specified by the task reward function, rT. This is particularly important in personal robot systems such as assistive robots in intelligent environments, where the user is often untrained and might not have a full understanding of the robot mechanism. For example, a user could specify a different, random action every time the robot enters a particular situation (e.g., a different fetch operation from a different location). Under these extreme circumstances, the user rewards introduced above would cancel out and no longer influence the learned policy. Similarly, the user might give a sequence of commands which, when followed, form a loop (e.g., perform sentry duty over the entire house, returning to the start location) and thus prevent the achievement of the task objective. To avoid this, the user reward function has to be limited to ensure that it does not lead to the formation of spurious loops. In the approach presented here, the following formal lower and upper bounds for the user reward, ru, applied to action a in state s, have been established and implemented. Details on the derivation of the bounds are reported elsewhere (Papudesi, 2002).

These bounds ensure that the additional user reward structure does not create any loops, even if explicitly commanded by the user. As a result, the system can successfully achieve the overall task objective provided by the task reward, rT.
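The effect of bounding the user reward can be seen in a toy example. The sketch below rests on several assumptions that are not taken from the text: the total reward is modeled as the sum of task reward and clipped user reward, the three-state world and its reward values are invented, the bound 0.05 is chosen by hand for this world rather than derived as in Papudesi (2002), and value iteration is used in place of Q-learning to keep the example deterministic.

```python
import math

GAMMA = 0.9
# Hypothetical deterministic world: TRANSITIONS[state][action] = next state.
TRANSITIONS = {
    "A": {"to_B": "B", "to_goal": "G"},
    "B": {"to_A": "A"},
    "G": {},  # terminal goal state
}
TASK_REWARD = {("A", "to_goal"): 1.0}                    # stands in for rT
USER_REWARD = {("A", "to_B"): 0.5, ("B", "to_A"): 0.5}   # stands in for ru: a commanded loop


def clip(x, lower, upper):
    return max(lower, min(upper, x))


def greedy_policy(lower, upper, sweeps=200):
    """Value iteration on task reward plus user reward clipped to [lower, upper]."""
    V = {s: 0.0 for s in TRANSITIONS}

    def q(s, a):
        r = TASK_REWARD.get((s, a), 0.0) + clip(USER_REWARD.get((s, a), 0.0), lower, upper)
        return r + GAMMA * V[TRANSITIONS[s][a]]

    for _ in range(sweeps):
        for s, acts in TRANSITIONS.items():
            if acts:
                V[s] = max(q(s, a) for a in acts)
    return {s: max(acts, key=lambda a: q(s, a)) for s, acts in TRANSITIONS.items() if acts}


unbounded = greedy_policy(-math.inf, math.inf)  # user reward dominates: A keeps looping to B
bounded = greedy_policy(0.0, 0.05)              # clipped user reward: A heads for the goal
```

In this toy world the loop is worth ru / (1 - gamma) per visit, so any upper bound below 0.1 keeps the goal action preferred. The bounds in the paper play the analogous role for the general case.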

4 Experiments
To demonstrate the power and applicability of the model of variable autonomy introduced here, a number of experiments in simulation and on mobile and walking robot tasks have been performed. These experiments demonstrate that the approach presented here provides an effective interface between robot and human as well as a valuable robot training mechanism.

4.1 High-Level User Commands
Our first experiment demonstrates the integration of user commands and autonomous learning. The goal of the robot navigation task is to learn to optimally navigate the environment and reach a specific target. The environment itself consists of a set, V, of via points superimposed on a collection of maps consisting of a 50x50 grid of square cells. These via points represent user-guided bias and thus affect the problem reward.

Actions are specified as instances of geometric motion controllers that permit the robot to move safely between subsets of the via points. These actions directly handle the continuous geometric space by computing collision-free paths to the selected via point, if such a path exists. Targets represented by via points are directly reachable by at least one controller. However, controllers are only applicable from a limited number of states, making it necessary to construct navigation strategies as a sequence of via points that lead to the target location. Here, harmonic path control (Connolly & Grupen, 1993), a potential-field path planner, is used to generate continuous robot trajectories while ensuring that the robot does not collide with an object. By abstracting the environment into a set of via points, the agent is capable of a combination of geometric and topological path planning. At the lower level, each harmonic controller generates velocity vectors that describe the path geometrically. At the higher level, the D-EDS Supervisor produces topological plans in the form of sequences of via points.
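The topological level can be pictured as a search over a graph whose nodes are via points and whose edges are the controllers applicable from each point. The sketch below is only an illustration of that abstraction: it uses breadth-first search rather than the learned policy or the supervisor described above, and the graph `EDGES` and its connectivity are hypothetical.

```python
from collections import deque

# Hypothetical connectivity: EDGES[p] lists via points reachable from p
# by some controller that is applicable at p.
EDGES = {
    "start": ["via-001", "via-003"],
    "via-001": ["via-003"],
    "via-003": ["via-004"],
    "via-004": ["via-005"],
    "via-005": [],
}


def plan_via_points(edges, start, target):
    """Return a shortest sequence of via points from start to target, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:      # walk parents back to the start
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in edges.get(node, []):
            if nxt not in parent:        # visit each via point once
                parent[nxt] = node
                queue.append(nxt)
    return None                          # no controller sequence reaches the target


plan = plan_via_points(EDGES, "start", "via-005")
```

Each consecutive pair in the returned sequence would then be handed to a geometric controller, which is responsible for the continuous, collision-free motion between the two points.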

To illustrate the guidance of the robot using high-level user commands in the form of subgoals, two experiments were performed on the Pioneer 2 mobile robot. These experiments demonstrate the ability of high-level user input to accelerate learning and modify autonomous behavior while avoiding unreliable user commands.

First, we demonstrate the capability of the approach to use sparse user input to modify the learned control policy. This forces the learned policy to more closely reflect the preferences of the user. We demonstrate this capability on a navigation task, which is first learned without user input and then modified by incorporating a single user command in the form of an intermediate subgoal. Because the subgoal is outside the chosen path, the learned path is modified based on user input, as shown in Figure 2. Here, the end location is marked with an X and the learned paths are highlighted. Figure 3 shows the corresponding changes in the Q-value and user reward functions for the previously best action (black line) and the new best action (grey line). These graphs illustrate the effect of the command on the reward function for the task and, as a result, on the value function and policy. Figure 4 shows the robot performing the navigation task.

Second, we illustrate the capability of the presented approach to override inconsistent user commands that would invalidate the overall task objective. Here, the user explicitly commands a loop between two via points. Figure 5 shows the loop specified by the user commands and the learned loop-free policy that the robot executes after learning.

Although the robot will execute the loop as long as the user explicitly commands this, it reverts to a policy that fulfills the original task objective as soon as no further user commands are received.

4.2 Multi-Level User Input
A second set of experiments was performed using a walking robot dog, Astro (shown in Figure 6), to demonstrate user-guided robot learning at multiple levels of abstraction. In these experiments, high-level subgoals as well as low-level joystick commands were integrated to demonstrate the capabilities of the presented model for variable autonomy.

Once again, the robot task is to navigate to a specified location, but user guidance takes multiple forms. First, user-specified subgoals represent via points that Astro should visit en route to the goal location. Second, user interaction guides the selection of low-level movement patterns for Astro to make. In the wheeled robot navigation task, a harmonic path is calculated for the robot to circumvent corners in the space that could cause collisions. However, this motion is inefficient for the smaller, and potentially more agile, walking dog. As a result, we provide two movement options for Astro: straight line and harmonic motion.

A reinforcement learning algorithm is used to select the movement pattern that is best for any pair of via points. As mentioned before, the system has two controllers, namely a line controller and a harmonic controller, to determine the most appropriate movement pattern of the robot. If the line controller is chosen, the robot travels between via points along a straight line, while with the harmonic controller the robot walks along a curve. For a given pair of via points, the user selects a direction for the robot to follow or allows the algorithm to select a motion consistent with the learned policy. If the user selects a direction, the dog moves in this direction for a fixed distance. The executed path is compared with the path generated using one of the two predetermined movement patterns, and the movement choices are given rewards based on the difference between planned and selected movement paths.
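One way to turn the comparison between executed and planned paths into a reward is sketched below. The text does not give the exact measure, so the mean point-wise distance used here is an assumption, as are the function names, the sample coordinates, and the requirement that all paths are sampled at the same number of points.

```python
import math


def mean_deviation(executed, planned):
    """Mean Euclidean distance between corresponding samples of two paths."""
    return sum(math.dist(p, q) for p, q in zip(executed, planned)) / len(executed)


def movement_rewards(executed, candidates):
    """Reward each movement pattern by how closely it matches the executed path."""
    return {name: -mean_deviation(executed, path) for name, path in candidates.items()}


# Hypothetical joystick-guided path between two via points, sampled at three points.
executed_path = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0)]
candidate_paths = {
    "line": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],       # straight-line controller
    "harmonic": [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)],   # curved path, made up here
}
rewards = movement_rewards(executed_path, candidate_paths)
best = max(rewards, key=rewards.get)
```

In this example the user steered almost straight, so the line pattern receives the higher (less negative) reward and would be reinforced for this pair of via points.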

This combination of high-level and low-level user guidance is validated in a walking robot-based navigation task. Here, Astro is successfully taught the best moving style to follow from one location to another based on joystick-controlled direction from the user as well as the via points shown in Figure 6. Localization for this task is based on heuristics, but future implementations will make use of paw joint angles to further improve estimation of the robot's current location.

In this experiment, point via-005 is specified as the goal. Initially, Astro chooses via-003 as the first subgoal. However, the user discourages that choice because it would move too close to the wall. Astro then selects via-005 as a subgoal. Because there is a wall between the start and goal locations, this choice also ultimately fails and Astro selects via-001 as the next choice.

Although the subgoal choice is viable, Astro selects a harmonic motion to reach via-001. The user intercedes using a joystick to flatten the path and the movement policy is refined based on this interaction.

After reaching point via-001, Astro begins to move straight toward point via-005, which again leads him to the corner. The user maneuvers the joystick to avoid this. After reaching via-003 using a curved motion, Astro wisely chooses point via-004 as a pass-through point and then walks straight to via-004 and finally to the goal. When repeating the same task, Astro improves his movement efficiency based on the same high-level and low-level feedback from the user.

4.3 Intelligent Environment Robot Task
For our final experiment, we utilize our variable autonomy approach to accomplish a retrieval task. This class of robot tasks is an important component of the MavHome smart home environment. The goal of MavHome is to view the home as an intelligent agent, able to make decisions to control the environment in a way that maximizes comfort for inhabitants while minimizing resource utilization (Das et al., 2002).

Robot agents in MavHome can perform a wide variety of assistive tasks. One such task is to retrieve an object at the request of an inhabitant. For example, a bedridden individual at home alone may request the robot to fetch some needed medicine when the person cannot get it himself. This experiment equips Astro with the capability of bringing medicine to a patient.

In this experiment, the patient is near the start point shown in Figure 6 and commands Astro to retrieve the medicine located at via-005. The task thus consists of navigating to point via-005, picking up the medicine, and returning to the start location.

In addition to the abstract navigation and low-level motion controller actions discussed in the previous sections, this application adds actions to search for a target object and to pick up the object. To accommodate retrieval tasks, we designed a pink basket to hold small objects, which the robot can identify and lift with its head. Driven by high-level user commands, the robot arrives at via-005 as he performs the navigation task and then needs to conduct the pickup action. Navigation is driven by the high- and low-level control policies learned earlier. The robot then identifies the pink basket and adjusts its position based on the current neck angle and the distance to the basket. After Astro adjusts his position, he uses his neck to pick up the basket with the medicine. Finally, Astro carries the medicine back to the start point where the patient is. Figure 7 shows Astro executing this task in the MavHome environment.
