<!DOCTYPE html>
<html>
<head>
<title>Ch. 9 - Reinforcement Learning</title>
<meta name="Ch. 9 - Reinforcement Learning" content="text/html; charset=utf-8;" />
<link rel="canonical" href="http://manipulation.csail.mit.edu/rl.html" />
<script src="https://hypothes.is/embed.js" async></script>
<script type="text/javascript" src="chapters.js"></script>
<script type="text/javascript" src="htmlbook/book.js"></script>
<script src="htmlbook/mathjax-config.js" defer></script>
<script type="text/javascript" id="MathJax-script" defer
src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js">
</script>
<script>window.MathJax || document.write('<script type="text/javascript" src="htmlbook/MathJax/es5/tex-chtml.js" defer><\/script>')</script>
<link rel="stylesheet" href="htmlbook/highlight/styles/default.css">
<script src="htmlbook/highlight/highlight.pack.js"></script> <!-- http://highlightjs.readthedocs.io/en/latest/css-classes-reference.html#language-names-and-aliases -->
<script>hljs.initHighlightingOnLoad();</script>
<link rel="stylesheet" type="text/css" href="htmlbook/book.css" />
</head>
<body onload="loadChapter('manipulation');">
<div data-type="titlepage" pdf="no">
<header>
<h1><a href="index.html" style="text-decoration:none;">Robotic Manipulation</a></h1>
<p data-type="subtitle">Perception, Planning, and Control</p>
<p style="font-size: 18px;"><a href="http://people.csail.mit.edu/russt/">Russ Tedrake</a></p>
<p style="font-size: 14px; text-align: right;">
© Russ Tedrake, 2020-2022<br/>
Last modified <span id="last_modified"></span>.<br/>
<script>
var d = new Date(document.lastModified);
document.getElementById("last_modified").innerHTML = d.getFullYear() + "-" + (d.getMonth()+1) + "-" + d.getDate();</script>
<a href="misc.html">How to cite these notes, use annotations, and give feedback.</a><br/>
</p>
</header>
</div>
<p pdf="no"><b>Note:</b> These are working notes used for <a
href="http://manipulation.csail.mit.edu/Fall2022/">a course being taught
at MIT</a>. They will be updated throughout the Fall 2022 semester. <!-- <a
href="https://www.youtube.com/channel/UChfUOAhz7ynELF-s_1LPpWg">Lecture videos are available on YouTube</a>.--></p>
<table style="width:100%;" pdf="no"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=trajectories.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=drake.html>Next Chapter</a></td>
</tr></table>
<script type="text/javascript">document.write(notebook_header('rl'))
</script>
<!-- EVERYTHING ABOVE THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<chapter style="counter-reset: chapter 8"><h1>Reinforcement Learning</h1>
<div style="clear:right;"></div>
    <p>These days, there is a lot of excitement around reinforcement learning
    (RL), and a lot of literature available. The scope of what one might
    consider to be a reinforcement learning algorithm has also broadened
    significantly. The classic (and now updated) introduction to RL, and still
    the best, is the book by Sutton and Barto <elib>Sutton18</elib>. For a more
    theoretical view, try <elib>Agarwal20b+Szepesvari10</elib>. There
    are some great online courses, as well: <a
    href="http://web.stanford.edu/class/cs234/">Stanford CS234</a>, <a
    href="https://rail.eecs.berkeley.edu/deeprlcourse/">Berkeley CS285</a>, and <a
    href="https://deepmind.com/learning-resources/-introduction-reinforcement-learning-david-silver">DeepMind
    x UCL</a>.</p>
<p>My goal for this chapter is to provide enough of the fundamentals to get
us all on the same page, but to focus primarily on the RL ideas (and
examples) that are particularly relevant for manipulation. And manipulation
    is a great playground for RL, due to the need for rich perception, for
    operating in diverse environments, and for handling potentially rich
    contact mechanics. Many of the core applied research results have been motivated by
and demonstrated on manipulation examples!</p>
<section><h1>RL Software</h1>
<p>There are now a huge variety of RL toolboxes available online, with
widely varying levels of quality and sophistication. But there is one
standard that has clearly won out as the default interface with which one
should wrap up their simulator in order to connect to a variety of RL
algorithms: the <a href="https://www.gymlibrary.dev/">Gym</a>.</p>
<p>It's worth taking a minute to appreciate the difference in the OpenAI
Gym <a href="https://www.gymlibrary.dev/">Environments</a>
(<code>gym.Env</code>) interface and the Drake System interface; I think it
    is very telling. My goal (in Drake) is to present you with a rich and
beautiful interface to express and optimize dynamical systems, to expose
and exploit all possible structure in the governing equations. The goal in
Gym is to expose the absolute minimal details, so that it's possible to
easily wrap every possible system under the same common interface (it doesn't matter if it's a robot, an Atari game, or
even a <a href="https://github.com/facebookresearch/CompilerGym">compiler</a>). Almost
by definition, you can wrap any Drake system as a Gym Environment; I have
examples of how to do it correctly <a
href="https://github.com/RobotLocomotion/drake/issues/15508">here</a>. You
can also use any Gym environment in the Drake ecosystem; you just won't be
able to apply some of the more advanced algorithms that Drake provides. Of
course, I think that you should use Drake for your work in RL, too (many
    people do), because it provides a rich library of dynamical systems that
    has been rigorously authored and tested, and it leaves open the option to put RL
approaches head-to-head against more model-based alternatives. I admit I
might be a little biased. At any rate, that's the approach we will take in
these notes.</p>
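    <p>As a rough illustration (and emphatically not the implementation linked
    above), here is a hedged sketch of what such a wrapper can look like. It
    assumes a Drake diagram with a single vector-valued actuation input port, a
    single vector-valued observation output port (with no direct feedthrough),
    and a user-supplied reward callback; a real implementation also needs to
    handle randomized initial conditions, episode termination, and
    rendering:</p>
    <pre><code class="language-python">
# A hedged sketch of wrapping a Drake diagram as a (classic-API) Gym env.
# Assumes exactly one vector-valued input port and one vector-valued output
# port, and that the output port does not feed through directly from the input.
import gym
import numpy as np
from gym import spaces
from pydrake.systems.analysis import Simulator


class DrakeDiagramEnv(gym.Env):
    def __init__(self, diagram, time_step, reward_fn, action_limit=1.0):
        self.diagram = diagram
        self.time_step = time_step
        self.reward_fn = reward_fn  # maps (diagram, context) to a float
        self.simulator = Simulator(diagram)
        nu = diagram.get_input_port().size()
        ny = diagram.get_output_port().size()
        self.action_space = spaces.Box(-action_limit, action_limit,
                                       shape=(nu,), dtype=np.float64)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(ny,),
                                            dtype=np.float64)

    def reset(self):
        context = self.simulator.get_mutable_context()
        context.SetTime(0.0)
        self.diagram.SetDefaultContext(context)
        self.simulator.Initialize()
        return self.diagram.get_output_port().Eval(context)

    def step(self, action):
        context = self.simulator.get_mutable_context()
        self.diagram.get_input_port().FixValue(context, action)
        self.simulator.AdvanceTo(context.get_time() + self.time_step)
        observation = self.diagram.get_output_port().Eval(context)
        reward = self.reward_fn(self.diagram, context)
        return observation, reward, False, {}
    </code></pre>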
<p>Some people <a
href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">might
argue</a> that the more thoughtfully you model your system, the more
assumptions you have baked in, making yourself susceptible to "sim2real"
gaps; but I think that's simply not the case. Thoughtful modeling includes
making uncertainty models that can account for as narrow or broad of a
class of systems as we aim to explore; good things happen when we can make
the uncertainty models themselves <i>structured</i>. I think one of the
    most fundamental challenges waiting for us at the intersection of
    reinforcement learning and control is a deeper understanding of the class of
    models that is rich enough to describe the diversity and richness of the
    problems we are exploring in manipulation (e.g. with RGB cameras as inputs),
    while providing somewhat more structure that we can exploit with stronger
    algorithms and still allowing the models to continually expand and improve
    with data.</p> <todo>callout to intuitive physics chapter once it exists</todo>
<!-- another point: the RL approach has been successful in part because
it's easy. approachable with less training, and as a result got more
people/funding put against it. that's super valuable. controls as a field,
on the other hand, has made itself somewhat intentionally obscure, perhaps
partly out of arrogance. that is perhaps the more bitter lesson. -->
<p>The OpenAI Gym provides an interface for RL environments, but doesn't
provide the implementation of the actual RL algorithms. There are a large
number of popular repositories for the algorithms, too. As of this
writing, I would recommend <a
href="https://stable-baselines3.readthedocs.io/en/master/index.html">Stable
    Baselines 3</a>: it provides a very nice and thoughtfully-documented set of
implementations in PyTorch.</p>
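    <p>Getting started takes only a few lines. Here is a minimal sketch; the
    environment name, training budget, and (default) hyperparameters are
    placeholders, and the details depend on the Gym and Stable Baselines 3
    versions you have installed:</p>
    <pre><code class="language-python">
# A minimal sketch of training PPO with Stable Baselines 3 on a stock
# Gym environment (chosen here only as a placeholder).
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("Pendulum-v1")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5)
print(f"mean episode reward: {mean_reward:.1f} +/- {std_reward:.1f}")
    </code></pre>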
<!--
- the env interface is now standard, but the policy interface is not.
everyone rolls their own, and mostly assumes nn.Module as a base class.
- SB3 lets you define a custom "policy" (actually policy + value func),
but doesn't currently support dynamic policies (LSTM was removed at
some point; hopefully it will come back)
- I really dislike that the notion of value function and Q function are
applied directly to observations. (In SB3, but most other packages,
too). That's bad form.
-->
    <p>One other class of algorithms that is very relevant to RL, though not
    specifically designed for it, is black-box optimization. I
quite like <a
href="https://facebookresearch.github.io/nevergrad/">Nevergrad</a>, and
will also use that here.</p>
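    <p>Here is a minimal sketch of the Nevergrad workflow: define a
    (gradient-free) objective, choose a parametrization and a budget, and ask
    the optimizer for its recommendation. The optimizer choice and budget below
    are arbitrary:</p>
    <pre><code class="language-python">
# A minimal sketch of black-box optimization with Nevergrad on the
# six-hump camel test function (no gradients required).
import nevergrad as ng


def six_hump_camel(x, y):
    return ((4 - 2.1 * x**2 + x**4 / 3) * x**2 + x * y
            + (-4 + 4 * y**2) * y**2)


optimizer = ng.optimizers.NGOpt(
    parametrization=ng.p.Instrumentation(
        x=ng.p.Scalar(lower=-3, upper=3),
        y=ng.p.Scalar(lower=-2, upper=2)),
    budget=500)
recommendation = optimizer.minimize(six_hump_camel)
print(recommendation.kwargs)  # should be near one of the two global minima
    </code></pre>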
<!-- We have some relevant nevergrad <=> drake (e.g. mathematicalprogram)
code in anzu. -->
</section>
<section><h1>Policy-gradient methods</h1>
<!-- Note: I have code for the plate pickup in
Dropbox/notebooks/low_dof_plate_pick_experimentation and in Google
Drive/Colab notebook/Plate pickup.ipynb -->
<subsection><h1>Black-box optimization</h1>
<p></p>
</subsection>
<subsection><h1>Stochastic optimal control</h1>
</subsection>
<subsection><h1>Using gradients of the policy, but not the environment</h1>
<p>You can find more details on the derivation and some basic analysis of
these algorithms <a
href="http://underactuated.csail.mit.edu/rl_policy_search.html">here</a>.</p>
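      <p>The key identity, stated here for a single-step problem to keep the
      notation light, is the score-function (likelihood-ratio) trick: the
      gradient of the expected reward with respect to the policy parameters
      $\theta$ can be written using only the gradient of the log-probability of
      the policy, $$\nabla_\theta E_{a\sim\pi_\theta}\left[r(a)\right] =
      E_{a\sim\pi_\theta}\left[r(a)\,\nabla_\theta \log \pi_\theta(a)\right],$$
      which we can estimate with samples from the policy alone, without ever
      differentiating through the reward or the dynamics.</p>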
</subsection>
<subsection><h1>REINFORCE, PPO, TRPO</h1>
<todo>Example of each of these algorithms on six-hump camel (with trivial environment/policy).</todo>
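      <p>In the meantime, here is a hedged sketch of just the vanilla REINFORCE
      estimator on the six-hump camel, treated as a one-step problem: the
      "action" is a point $(x, y)$, the reward is the negative camel function,
      and the policy is a Gaussian with a learnable mean and (log) standard
      deviation. PPO and TRPO differ in how they constrain or clip the policy
      update, not in this basic estimator:</p>
      <pre><code class="language-python">
# A hedged sketch of vanilla REINFORCE with a Gaussian policy on the
# six-hump camel, treated as a one-step (bandit-style) problem.
import numpy as np


def reward(a):
    x, y = a
    return -((4 - 2.1 * x**2 + x**4 / 3) * x**2 + x * y
             + (-4 + 4 * y**2) * y**2)


mean = np.zeros(2)
log_std = np.zeros(2)
learning_rate = 0.01

for iteration in range(500):
    std = np.exp(log_std)
    actions = mean + std * np.random.randn(64, 2)  # sample a batch of actions
    rewards = np.array([reward(a) for a in actions])
    advantages = rewards - rewards.mean()          # a baseline reduces variance
    # Gradients of log N(a; mean, std) w.r.t. mean and log_std:
    d_mean = (actions - mean) / std**2
    d_log_std = (actions - mean)**2 / std**2 - 1.0
    # Gradient ascent on the expected reward (the REINFORCE update):
    mean += learning_rate * np.mean(advantages[:, None] * d_mean, axis=0)
    log_std += learning_rate * np.mean(advantages[:, None] * d_log_std, axis=0)

# The mean should drift toward one of the two global optima,
# roughly (0.09, -0.71) or (-0.09, 0.71).
print(mean)
      </code></pre>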
</subsection>
<subsection><h1>Control for manipulation should be easy</h1>
<p>This is a great time for theoretical RL + controls, with experts from
controls embracing new techniques and insights from machine learning, and
      vice versa. As a simple example, even though the cost landscape for many
      classical control problems (like the linear quadratic regulator) is not
      convex in the typical policy parameters, we <a
      href="http://underactuated.csail.mit.edu/policy_search.html">now
      understand that gradient descent still works</a> for these problems
      (there are no spurious local minima), and the class of problems/parameterizations
for which we can make statements like this is growing rapidly.</p>
</subsection>
</section>
<section><h1>Value-based methods</h1>
</section>
<section><h1>Model-based RL</h1>
<todo>MPC with learned models, typically represented with a neural network.</todo>
</section>
<section><h1>Exercises</h1>
<exercise><h1>Stochastic Optimization</h1>
<p> For this exercise, you will implement a stochastic optimization scheme that does not require exact analytical gradients. You will work exclusively in <script>document.write(notebook_link('stochastic_optimization', deepnote['exercises/rl'], link_text='this notebook'))</script>. You will be asked to complete the following steps: </p>
<ol type="a">
<li> Implement gradient descent with exact analytical gradients.
</li>
<li> Implement stochastic gradient descent with approximated gradients. </li>
<li> Prove that the expected value of the stochastic update does not change with baselines.</li>
<li> Implement stochastic gradient descent with baselines.
</li>
</ol>
</exercise>
<exercise><h1>REINFORCE</h1>
<p> For this exercise, you will implement the vanilla REINFORCE algorithm on a box pushing task. You will work exclusively in <script>document.write(notebook_link('policy_gradient', deepnote['exercises/rl'], link_text='this notebook'))</script>. You will be asked to complete the following steps: </p>
<ol type="a">
<li> Implement the policy loss function. </li>
<li> Implement the value loss function. </li>
<li> Implement the advantage function. </li>
</ol>
</exercise>
</section>
</chapter>
<!-- EVERYTHING BELOW THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<div id="references"><section><h1>References</h1>
<ol>
<li id=Sutton18>
<span class="author">Richard S. Sutton and Andrew G. Barto</span>,
<span class="title">"Reinforcement Learning: An Introduction"</span>, MIT Press
, <span class="year">2018</span>.
</li><br>
<li id=Agarwal20b>
<span class="author">Alekh Agarwal and Nan Jiang and Sham M. Kakade and Wen Sun</span>,
<span class="title">"Reinforcement Learning: Theory and Algorithms"</span>, Online Draft
, <span class="year">2020</span>.
</li><br>
<li id=Szepesvari10>
<span class="author">Csaba Szepesvari</span>,
<span class="title">"Algorithms for Reinforcement Learning"</span>, Morgan and Claypool Publishers
, <span class="year">2010</span>.
</li><br>
</ol>
</section><p/>
</div>
<table style="width:100%;" pdf="no"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=trajectories.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=drake.html>Next Chapter</a></td>
</tr></table>
<div id="footer" pdf="no">
<hr>
<table style="width:100%;">
<tr><td><a href="https://accessibility.mit.edu/">Accessibility</a></td><td style="text-align:right">© Russ
Tedrake, 2022</td></tr>
</table>
</div>
</body>
</html>