You finally have it. The most awesome robot, with all the sensors and actuators you’ve ever dreamed of. LIDAR, RADAR, vision, far-field microphone array, pressure-sensitive touch, precision GPS, and odometry. Speakers, screens, legs, arms, hands, and maybe even a flamethrower. This robot has everything. Maybe it’s something you’ve put together in your garage in your spare time. However it came to be, many years of work, research, and rework have gone into building your dream robot, which you’re confident can sense and do anything.
So now the real work begins — you need to somehow get this robot to do what you want.
Let’s say you’ve got a task that you want this robot to perform. It could be anything, from juggling 6 flaming chainsaws to helping a stroke victim do their exercises to baking an apple pie. It could be the key to your money-making robot company concept, or just a useful behavior to help you around the house. Whatever it is, inside your brain, latent in your mind, is a policy that, if transferred to the robot, would make it perform the task just the way you want it. The problem of getting the task from your mind to the robot is what we call Human-Robot Policy Transfer (HRPT). And there are a few different ways to approach it.
The first, and generally simplest, way to get a robot to do something is direct teleoperation. That is, having a human directly control the robot to perform the task. You’ll need an appropriate method of controlling the robot, which could be anything from a set of joysticks to a full-body telemetry suit. You’ll also need a human, possibly you (or a paid expert), to control the robot, or maybe even a whole team of humans, each controlling a different part of the robot in different ways.
Whatever the control system, the idea is that humans implement the control policy themselves. That is, a human observes the robot’s state either directly or through sensors, then tells the robot how to use its actuators (such as manipulators, cutters, lifters, etc.) to achieve the task. With teleoperation, even though some low-level systems (for example, balance) might be automated, all high-level decision-making is done by a human.
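That observe-decide-command loop can be sketched in a few lines. This is a toy illustration, not a real robot API: the `Robot` class and its methods are hypothetical stand-ins for whatever hardware interface your platform actually exposes.

```python
# A minimal sketch of the teleoperation loop described above.
# The Robot class is a made-up stand-in for a real hardware interface.

class Robot:
    """Toy stand-in for a robot's sensor/actuator interface."""
    def __init__(self):
        self.position = 0.0

    def read_sensors(self):
        return {"position": self.position}

    def apply_command(self, velocity):
        # Low-level systems (e.g., balance) might run underneath this call.
        self.position += velocity


def teleoperate(robot, human_policy, steps):
    """The human, not the robot, implements the control policy:
    observe state, decide, command the actuators, repeat."""
    for _ in range(steps):
        state = robot.read_sensors()    # human observes the robot's state
        command = human_policy(state)   # decision-making stays in the human's head
        robot.apply_command(command)    # the robot just executes the command
    return robot.read_sensors()


# Example: a "human" operator who drives the robot toward position 5.0.
final = teleoperate(Robot(), lambda s: 1.0 if s["position"] < 5.0 else 0.0, steps=10)
```

The point of the sketch is where the `if` lives: inside `human_policy`, i.e., inside the human's head, not in the robot.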
There are many advantages to teleoperation. First and foremost, it will save you the trouble of figuring out how to state your goals explicitly; that is, the policy can remain latent in your mind. This feature is a big help if the policy is fairly complex. When you consider the effort it could take to program a robot to perform a complicated, multi-step task, it becomes clear that simply controlling the robot yourself might get the job done more quickly.
Further, with a human in the loop, adapting to a changing environment can be done immediately. Additionally, since a human is making the decisions, legal and ethical questions of liability and responsibility are easier to answer. An excellent example of teleoperation for HRPT is the use of surgical robots, where the precision and scale of the robot are well-used by the complex — and unexpressed — control policy inside the human surgeon’s head.
There are, however, also many drawbacks to teleoperation as an HRPT approach. The main one is that, since humans have to be involved with teleoperated robots at every moment, it doesn’t really achieve the dream of developing autonomous robots. A teleoperated robot doesn’t free humans from doing work or enable one human to multiply their effectiveness and become as productive as 10 people. (It does, however, allow humans to remove themselves from dangerous situations, which is sometimes the entire point.)
Similarly, since a human is, in effect, still doing the work with teleoperation, all the faults and issues of humans come through. For example, teleoperators get tired and distracted, show up late, or just make mistakes — all things that we expect (perhaps incorrectly) our robots not to do.
As a first-pass tool, teleoperation is great, in that it proves that the task you want your robot to perform is actually performable by the robot. It’s good to know that the robot’s hardware is physically capable of doing the task you want it to do, before you try and develop an autonomous control policy for it.
Further, if you limit yourself to only using the robot’s sensor information during teleoperation, you can prove that the task is decidable by the robot. That is, that the robot has access to enough information to actually determine what to do in response to its sensor stream. Again, this is highly useful to know before you pour hours/weeks/months/years into trying to get the robot to do the task on its own.
So let’s say you’ve proved that the task you want your robot to accomplish is actually decidable and performable by this awesome robot body, and now you want to make the robot perform the task by itself. Great! How’re you going to do that? Probably with the second major approach to HRPT.
That’s right. You’re going to sit down and try to — explicitly and in great detail, in your favorite programming language — write out exactly what the robot needs to do, and how it needs to react to its environment, in order to accomplish the task.
Honestly, to consider every possible situation that could potentially happen while your robot goes about its business would take too long (there are an infinity of them, after all, and even people paid to come up with odd situations miss a bunch). So, you’ll likely make some assumptions about what situations are unlikely to arise and ignore them in your code. Do you really need to consider what should happen if your robot encounters a wheelchair/duck/broom combo?
Even ignoring the more unlikely scenarios, there’s a lot of work required to program a robot to perform anything other than the most basic tasks in a wide variety of different contexts. In fact, for a given level of programming effort (say, a month), as the complexity of the task goes up, the variability of the environments it can be programmed to operate in goes down.
Sure, it’s easy to code a simple task (drive in a square) to be performed in a wide variety of environments (different floor surfaces/rooms). It’s also just about as easy to program a complex task (set the table) in one particular environment (all plates/utensils and tables/chairs in fixed, known locations). But programming a complex task for varying environments (setting the table in any kitchen) is much, MUCH harder. All of this effort (designing, architecting, coding, debugging, and testing) can in fact be a lot more work than just doing the task yourself.
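The easy end of that spectrum — explicitly coding a simple task like "drive in a square" — might look like the sketch below. The motion primitives here are hypothetical; the pose is just updated in software so the logic can be checked without hardware.

```python
# Explicitly programming the simple "drive in a square" task.
# The "drive forward" and "turn left" steps are hypothetical motion
# primitives, modeled here as pure pose updates.

def drive_square(side_length):
    x, y, heading = 0.0, 0.0, 0   # heading counted in 90-degree left turns
    path = [(x, y)]
    for _ in range(4):
        # Drive forward: move side_length along the current heading.
        dx, dy = [(1, 0), (0, 1), (-1, 0), (0, -1)][heading % 4]
        x, y = x + dx * side_length, y + dy * side_length
        path.append((x, y))
        # Turn left 90 degrees.
        heading += 1
    return path

corners = drive_square(2.0)
```

Every line of this is a decision the programmer made explicitly — which is exactly why the same exercise for "set the table in any kitchen" balloons so badly.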
So why bother trying to get a robot to do complex tasks autonomously? Because once your programming works — if it works — the robot can perform the task over and over and over again. With luck, you’ll recoup your effort in the long run.
Almost all autonomous robots you’re likely to have interacted with up to now (outside of research lab demos) have gone through this process. They’ve been painstakingly, explicitly coded by a team of developers to perform specific tasks in a set of constrained environments. Change one parameter (the height of a step, for example) beyond the constrained range, and the whole system comes tumbling down like, well, a robot trying to walk.
The third main approach to HRPT is to use learning: having a robot figure out how to perform the task by analyzing data.
Now, there are many different types of machine learning using many different technologies, but they all try and get at the same goal: to find patterns in data that are useful for accomplishing a desired task. The task might be identifying a cat in an image, predicting the stock market, or determining a control policy for a robot. When it comes to HRPT, this last one is exactly what we need.
A conceptually simple way to have a robot learn to perform a task is to tell the robot what the goal of the task is, and let the robot experiment to figure out how to achieve this goal. This is reinforcement learning (RL), where the robot gets a reward when the task is accomplished (or pays a penalty when the task is NOT accomplished) and tries to figure out how to use its capabilities to get to the goal. The reward itself is generated by a reward function that maps the robot’s state (or robot state-action pairs) to how good (or bad) it is for the robot to be in that state (or perform that action in that state).
On one hand, RL can be really easy to use, as the end-state, or desired goal, is often simple to describe. When a simple end-state is all that matters, the reward function is easy to write: If there is an omelette on my plate, +1000, else 0.
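That omelette reward function really is as short as the sentence suggests. Written out (with a made-up dictionary-based state representation, purely for illustration):

```python
# The end-state reward function from the text, written out literally.
# The state representation (a dict of flags) is an illustrative choice.

def omelette_reward(state):
    """+1000 if there is an omelette on my plate, else 0."""
    return 1000 if state.get("omelette_on_plate") else 0

r_good = omelette_reward({"omelette_on_plate": True})   # rewarded
r_bad = omelette_reward({"omelette_on_plate": False})   # nothing
```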
On the other hand, to find its way to the desired end-state, a robot might have to explore all of the possible configurations it and the world around it could be in (its state space) and all of the possible actions it could perform (its action space) in each of those states. If the state and action spaces are large, it can take a very, very long time for the robot to find ANY path to the goal, let alone an optimal one. But, for restricted situations with small state-spaces and limited action-spaces (such as simulated systems in a grid world), RL can often lead to robust and optimal control policies.
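A grid world like the one mentioned is small enough to sketch end to end. Below is a minimal tabular Q-learning example; the grid size, reward, and hyperparameters are illustrative choices, and exploration is purely random (Q-learning is off-policy, so it still converges toward the optimal values in a small deterministic world like this).

```python
import random

# Minimal tabular Q-learning in a tiny grid world -- the kind of small
# state/action space where RL can find robust, optimal policies.

random.seed(0)
SIZE, GOAL = 4, (3, 3)
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # right, left, up, down

def step(state, action):
    """Deterministic grid dynamics: move, clamped to the grid edges."""
    nx = min(max(state[0] + action[0], 0), SIZE - 1)
    ny = min(max(state[1] + action[1], 0), SIZE - 1)
    next_state = (nx, ny)
    # Sparse end-state reward: reach the goal or get nothing.
    return next_state, (1.0 if next_state == GOAL else 0.0), next_state == GOAL

Q = {((x, y), a): 0.0 for x in range(SIZE) for y in range(SIZE)
     for a in range(len(ACTIONS))}
alpha, gamma = 0.5, 0.9

# Learn from purely random exploration of the state and action spaces.
for _ in range(1000):
    state, done, steps = (0, 0), False, 0
    while not done and steps < 500:
        a = random.randrange(len(ACTIONS))
        next_state, reward, done = step(state, ACTIONS[a])
        best_next = max(Q[(next_state, i)] for i in range(len(ACTIONS)))
        Q[(state, a)] += alpha * (reward + gamma * best_next - Q[(state, a)])
        state, steps = next_state, steps + 1

# Follow the learned (greedy) policy from the start corner to the goal.
state, path = (0, 0), [(0, 0)]
while state != GOAL and len(path) < 20:
    a = max(range(len(ACTIONS)), key=lambda i: Q[(state, i)])
    state, _, _ = step(state, ACTIONS[a])
    path.append(state)
```

Even here, the learner takes tens of thousands of random steps to cover a 16-state world — a hint of how badly this scales as the spaces grow.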
One way to limit RL issues in larger spaces is to supply additional rewards, say, for when the robot accomplishes intermediate states on the way to the goal. Or, a developer can define heuristics to guide the robot toward more task-appropriate behavior. Unfortunately, providing these types of information makes the reward function more complicated and difficult to write down, and the effort can approach the complexity of explicit programming.
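Extending the omelette reward with intermediate rewards might look like the sketch below. The milestone states and bonus values are made up for illustration — and notice how each added term is another design decision, which is exactly how the reward function creeps toward the complexity of explicit code.

```python
# A sketch of the "intermediate rewards" idea: augmenting a sparse
# end-state reward with bonuses for subgoals along the way.
# The milestone flags and bonus values are illustrative assumptions.

def sparse_reward(state):
    return 1000 if state.get("omelette_on_plate") else 0

def shaped_reward(state):
    reward = sparse_reward(state)
    # Each milestone nudges the robot along -- and each one is another
    # thing the reward designer has to anticipate and get right.
    if state.get("eggs_cracked"):
        reward += 10
    if state.get("pan_heated"):
        reward += 10
    if state.get("eggs_in_pan"):
        reward += 50
    return reward

partial = shaped_reward({"eggs_cracked": True, "pan_heated": True})
```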
Learning… from Demonstration
Another learning approach bypasses the tricky parts of RL by just letting humans provide the robot with helpful examples, or demonstrations, of what it should do to accomplish the desired task. This approach is called Learning from Demonstration (LfD) or, sometimes, Programming by Demonstration (PbD).
LfD only assumes that you, the human, can perform the task you want the robot to do. Interestingly enough, one of the most common forms a demonstration takes is teleoperation of the robot through the task. Using teleoperation in service of robotic learning nicely leverages the benefits of that method of HRPT, namely proving the performability of the task by the robot.
Another form of generating demonstrations is direct kinesthetic manipulation, where the robot is physically guided through the task, again proving performability. Unfortunately, this approach (and teleoperation as well) can prove difficult when a robot has many degrees of freedom (think 2 arms, each with 7 joints, as well as legs, a neck, etc.).
A much-sought-after LfD approach is to have the robot merely observe the human perform the task themselves. If you can solve the correspondence problem (which parts of the human correspond to which parts of the robot), observation can be an extremely powerful way to collect demonstration data.
So let’s say you’ve got some nice demonstrations of how the task should go. What’s next?
Simply having the robot play back the demonstration is not enough, as the world is not generally guaranteed to behave exactly as it did before. So, instead, many LfD techniques build a model of the demonstration data, capturing both the expected behavior and the allowable variance around it. One common approach is to take the demonstrations, warp them all to be the same length in time, and then build a dynamical model of how the robot’s state evolves over time. You can even guarantee stability of the system by imposing different constraints on it. These systems can be used to do some pretty cool things, like play mini-golf.
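The warp-then-model step can be sketched as below. Simple linear resampling stands in for fancier time-alignment methods, and the "model" here is just a per-time-step mean and variance over a 1-D state — a deliberately stripped-down version of the dynamical models real LfD systems build.

```python
# A sketch of demonstration modeling: warp recorded trajectories to a
# common length (linear resampling stands in for real time alignment),
# then capture the expected behavior and the variance around it.

def resample(trajectory, length):
    """Linearly interpolate a 1-D trajectory to a fixed number of steps."""
    out = []
    for i in range(length):
        t = i * (len(trajectory) - 1) / (length - 1)
        lo = int(t)
        hi = min(lo + 1, len(trajectory) - 1)
        out.append(trajectory[lo] + (t - lo) * (trajectory[hi] - trajectory[lo]))
    return out

def model_demonstrations(demos, length=5):
    """Per-time-step mean and variance across time-warped demonstrations."""
    warped = [resample(d, length) for d in demos]
    means, variances = [], []
    for step_values in zip(*warped):
        mu = sum(step_values) / len(step_values)
        means.append(mu)
        variances.append(sum((v - mu) ** 2 for v in step_values) / len(step_values))
    return means, variances

# Three demos of "move from 0 to 1", recorded at different speeds.
demos = [[0.0, 0.5, 1.0], [0.0, 0.25, 0.5, 0.75, 1.0], [0.0, 1.0]]
means, variances = model_demonstrations(demos)
```

The variance track is the useful part: where demonstrations agree, the robot should track the mean tightly; where they vary, it has room to adapt.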
Still, a common issue with LfD is that the learned control policy is often limited to being as good as the human demonstrator. A different approach, termed Inverse Reinforcement Learning (IRL), is designed to combine the demonstration-based approach of LfD with the optimization powers of RL.
In this family of techniques, instead of relying just on a control policy derived directly from demonstrations, the system first tries to estimate the reward function underlying them. By using both the reward function and an initial control policy based on the demonstrations, IRL systems can optimize for task performance. Eventually, it’s even possible for such a system to perform a task better than the humans themselves! A classic example of this is the Stanford Autonomous Helicopter project.
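To make the "estimate the reward function" step concrete, here is a deliberately stripped-down sketch: assume the reward is a linear function of state features, and nudge the weights so demonstrated behavior scores higher than an alternative behavior. Real IRL methods (apprenticeship learning, maximum-entropy IRL, and the like) are far more involved; everything here — the feature, the trajectories, the update rule — is an illustrative assumption.

```python
# A toy sketch of the IRL idea: recover reward weights that explain
# why the demonstrated trajectories are preferred over others.

def feature_expectations(trajectories, featurize):
    """Average feature vector over all states in a set of trajectories."""
    feats = [featurize(s) for traj in trajectories for s in traj]
    n = len(feats)
    return [sum(f[i] for f in feats) / n for i in range(len(feats[0]))]

def estimate_reward_weights(demo_trajs, other_trajs, featurize,
                            steps=100, lr=0.1):
    mu_demo = feature_expectations(demo_trajs, featurize)
    mu_other = feature_expectations(other_trajs, featurize)
    w = [0.0 for _ in mu_demo]
    for _ in range(steps):
        # Move the weights toward explaining the demonstrators' preference.
        w = [wi + lr * (d - o) for wi, d, o in zip(w, mu_demo, mu_other)]
    return w

# Toy 1-D world: the single feature is distance to a goal at 5.0.
# Demos hug the goal; the alternative behavior wanders. The learned
# weight comes out negative: being far from the goal is bad.
featurize = lambda s: [abs(s - 5.0)]
w = estimate_reward_weights([[5.0, 5.0, 4.5]], [[0.0, 2.0, 9.0]], featurize)
```

Once a reward function like this is in hand, the RL machinery from earlier can optimize against it — which is how IRL systems can end up outperforming their demonstrators.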
I could go on and on, as there is so much interesting work in the field of LfD. There’s work that looks at discovering reusable sub-skills from demonstrations, so new, different tasks can be accomplished. And there’s work that looks at giving feedback to the human during the training process so they can generate more useful training data. And then there’s my own work that has looked at how to determine when there are multiple correct ways to achieve a goal, or how to extract useful information from failed demonstration attempts, instead of assuming the human can always perform the task correctly.
As exciting as it is, learning as a solution to HRPT still has drawbacks. The major issue, as with any learning system, is how to guarantee limits on the robot’s behavior. With direct teleoperation, or explicit programming, you can say with some degree of confidence that the robot butler will never put mustard in your coffee (barring perceptual issues, or a very tired or malicious human teleoperator). However, with a learned system, you’re never quite certain what the robot will do in any given situation.
And that’s where the real fun begins.