RFundamentals/Formula.xml at master · dsidavis/RFundamentals · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
<section
	 xmlns:r="http://www.r-project.org">
<title>Formula and Environments</title>

<para>
Like functions, model formula such as <r:expr>mpg ~ wt + cyl</r:expr>
also have an associated formula.
We can create a formula as a stand-alone object with, e.g.,
<r:code>
frm = b ~ a
</r:code>
We can examine its structure with
<r:code>
str(frm)
<r:output><![CDATA[
Class 'formula'  language b ~ a
  ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
]]></r:output>
</r:code>
We see that there is an attribute associated with the formula
and in this case it is the global environment.
Like functions, the environment of a formula is the environment
in which it is defined (unless explicitly set to be different.)
So if a formula was created in a call to a function, its environment
would be the call frame in which it was created.
</para>

<para>
Environments on formula allow functions manipulating a
formula to find the associated variables referenced
in the formula.
We'll look at this and how it can be somewhat surprising in some
respects.
Again, we'll construct an experiment we can explicitly debug
and understand in <r/>, rather than just stating the rules.
Let's create two variables <r:var>a</r:var> and <r:var>b</r:var>:
<r:code>
set.seed(12312)
a = runif(10)
b = 5 + 2.34*a + rnorm(length(a))
</r:code>
Now, consider the formula
<r:code>
b ~ a
</r:code>
As we saw, this has the global environment associated with it.
</para>

<para>
  Note also that <r:expr>b ~ a</r:expr>
  doesn't actually use the values of <r:var>a</r:var> and <r:var>b</r:var>.
  The formula is symbolic and will its contents will be evaluated
  later. It is like the <r:func>quote</r:func> function
  which doesn't evaluate its argument.  It is like (but not the same as) <r:func>parse</r:func>  in that it's contents will be
  used later. The expression <r:expr>b ~ a</r:expr>
  is a call to the function <r:func>~</r:func>
  which is a .Primitive function. That means it immediately
  passes its un-evaluated arguments to <c/> code
  and does not follow <r/>'s standard evaluation
  model.  In other words, it uses non-standard evaluation!
  So it doesn't actually get the value of <r:var>a</r:var>
  or <r:var>b</r:var>.
</para>


<para>
Consider the call
<r:code>
coef(lm(b ~ a))
</r:code>
This calls <r:func>coef</r:func> which <r/> locates
on the search path and matches the arguments to the parameters.
The call to <r:func>lm</r:func> is also evaluated in the global
environment and the same rules apply.
The formula <r:expr>b ~ a</r:expr> is also evaluated in the global
environment since this is a top-level expression.
So the formula's environment is the global environment just
as it would be if we had done this in two steps
<r:code>
frm = b ~ a
coef(lm(frm))
</r:code>
The answer we get is
<r:output><![CDATA[
(Intercept)           a
   5.097029    2.618424
]]></r:output>
</para>

<para>
  How does <r/> evaluate the call to <r:expr>lm(b ~ a)</r:expr>.
  We have described the general rules of finding the
  function (<r:var>lm</r:var>), creating
  the call frame with a variable for each parameter,
  matching the arguments to the parameters, and evaluating
  the expressions in the body of the function with the call frame
  as the environment for the evaluation.
  However, like <r:func>library</r:func>, <r:func>rm</r:func>,
  <r:func>?</r:func>, and a few more,  <r:func>lm</r:func>
  uses non-standard evaluation (NSE).
  And this is made more complicated by how formulae work.
</para>


<para>
The call <r:expr>lm(b ~ a)</r:expr>
passes the formula to <r:func>lm</r:func>.
<r:func>lm</r:func> uses an explicit call to the <r:func>eval</r:func>
function to create the design/model matrix for the regression.
Actually, <r:func>lm</r:func> behaves even more unusually
in that it constructs the call it passes to <r:func>eval</r:func>
by adapting how it itself was called.
We can see this by debugging the call to <r:func>lm</r:func>
<r:code>
debug(lm)
lm(b ~ a)
</r:code>
After stepping through the initial expressions,
we see the sequence of commands
<r:code><![CDATA[
mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data", "subset", "weights", "na.action", "offset"), names(mf), 0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
]]></r:code>
We have seen <r:func>match.call</r:func>.
This returns are original call with the parameter names explicitly matched
and added, i.e.,
<r:code>
lm(formula = b ~ a)
</r:code>
Next, we match the parameter names in the call to find formula, data, subset, etc.
and we then keep only those elements of the call, discarding any other arguments.
Why, because we are the arguments in the call to <r:func>lm</r:func>
to actually call <r:func>model.frame</r:func> in the <r:pkg>stats</r:pkg>
package. And <r:func>lm</r:func> doesn't want to have to check which arguments
were passed to it and pass those on to <r:func>model.frame</r:func>. Instead
it does this meta-programming on how it was invoked/called to create
the corresponding call to <r:func>model.frame</r:func>.
So the next two expressiins add a named argument <r:arg>drop.unused.levels</r:arg>
with a value of <r:true/>
and changing the function being invoked in this call object
t <r:expr>stats::model.frame</r:expr>.
So at the end of this sequence of expressions, we can see
that <r:var>mf</r:var> is
<r:output><![CDATA[
stats::model.frame(formula = b ~ a, drop.unused.levels = TRUE)
]]></r:output>
and has class <r:class>call</r:class>.
</para>

<para>
  Next, <r:func>lm</r:func> evaluates this call with
<r:code><![CDATA[
mf <- eval(mf, parent.frame())
]]></r:code>
This explicitly calls <r:func>eval</r:func>
and specifies an environment in which to evaluate that call.
This environment - computed as <r:expr>parent.frame()</r:expr> -
controls where <r/> will look for variables in the call.
<r:func>parent.frame</r:func> looks along the call stack,
not the search path of this call frame.
In our case, <r:func>parent.frame</r:func> is the global environment
since our call to <r:func>lm</r:func> is the only call in the current
call stack.
</para>

<para>
  So let's debug the call to <r:func>model.frame</r:func>
<r:code>
debug(model.frame)
</r:code>
This stops in
<r:output><![CDATA[
debugging in: stats::model.frame(formula = b ~ a, drop.unused.levels = TRUE)
debug: UseMethod("model.frame")
]]></r:output>
This is the <s3/> method and <r/> will dispatch to <r:func>model.frame.default</r:func>
and will debug that for us as it is related to <r:func>model.frame</r:func>.
We step through the expressions in <r:func>model.frame.default</r:func>
and we get to
<r:code><![CDATA[
if (missing(data) && inherits(formula, "data.frame")) {
    if (length(attr(formula, "terms")))
        return(formula)
    data <- formula
    formula <- as.formula(data)
}
]]></r:code>
We didn't provide a value for the <r:arg>data</r:arg>
parameter in our call (to <r:func>lm</r:func> that was mimiced to call <r:func>model.frame</r:func>). But
<r:var>formula</r:var> does not inherit from <r:class>data.frame</r:class>,
so this condition is false. And then we get to
<r:code><![CDATA[
formula <- as.formula(formula)
]]></r:code>
and then
<r:code><![CDATA[
data <- environment(formula)
]]></r:code>
So model.frame will use the environment of the formula as the source
of the data, i.e., the place in which to look for symbols.
Recall that the environment of our formula is the global environment.
And after a few more expressions, we get to
<r:code><![CDATA[
variables <- eval(predvars, data, env)
]]></r:code>
<r:var>predvars</r:var> is
<r:output><![CDATA[
list(b, a)
]]></r:output>
and <r:var>data</r:var> and <r:var>env</r:var> are the global environment.
So this call to evaluate will find <r:var>a</r:var> and <r:var>b</r:var>
in the global environment.
</para>

<question>
<para>
  Where and how will <r/> find the variables <r:var>a</r:var>
  and <r:var>b</r:var> referenced in the formula?
<r:code>
f = function() {
  set.seed(12314)
  a = runif(30)
  b = pi * a + rnorm(30)
  lm(b ~ a)
}
f()
</r:code>
</para>
<answer>
  <para>
<r:func>model.frame.default</r:func> will call <r:expr>eval(predvars, data, env)</r:expr>
with <r:var>data</r:var> and <r:var>env</r:var> being the environment of the formula.
This will be the call frame for the call to <r:func>f</r:func>.
So <r/> will find the variables <r:var>a</r:var> and <r:var>b</r:var>
in that call frame.
  </para>
</answer>
</question>


<question>
<para>
What would happen in the following
<r:code>
f = function() lm(b ~ a)
f()
</r:code>
Where would <r/> find <r:var>b</r:var> and <r:var>a</r:var> and how/why?
</para>

<answer>
<para>
  This  looks simpler than the previous question
  but is actually slightly more complicated, but
  builds on the rules we saw earlier.
Again, <r:func>model.frame.default</r:func> will get to
<r:code><![CDATA[
variables <- eval(predvars, data, env)
]]></r:code>
with <r:var>data</r:var> and <r:var>env</r:var> coming from the
environment on the formula.
That will be the call frame from the call to <r:func>f</r:func>.
However, <r:var>a</r:var> and <r:var>b</r:var> are not in that call frame,
whereas they were in the previous question.
So the evaluation uses <r/>'s regular rules to search for the variables
along the chain of environments.
We can look at this chain with
<r:code>
names(getEnvChain(env))
<r:output><![CDATA[
 [1] "<0x7fd259232f08>"  "globalenv"         "package:stats"
 [4] "package:graphics"  "package:grDevices" "package:datasets"
 [7] "package:utils"     "package:methods"   "Autoloads"
[10] "<base>"            "emptyenv"
]]></r:output>
</r:code>
So we see the global environment is next on the chain after
the call frame for <r:func>f</r:func>.
This is, as we know, because the environment of <r:func></r:func>
is the global environment since that is where <r:func>f</r:func>
was defined. And the parent environment of a call frame is the environment
of that function being called. (Don't confuse this with <r:func>sys.parent</r:func>
which refers to call frames on the call stack.)
</para>
</answer>
</question>

<para>
Now, we will switch from finding the values for <r:var>a</r:var>
and <r:var>b</r:var> in the global environment
to providing them in a <r:class>data.frame</r:class>.
<r:code>
d = data.frame(a = a, b = b)
coef(lm(b ~ a, d))
</r:code>
We actually have two instances of <r:var>a</r:var> -
one in the global environment and one as an element in the
<r:class>data.frame</r:class>.
Which one did <r/> use?  It doesn't matter, in this case, since they are
the same. But it would matter if they were different.
So which did it use? We can read the help file,
or simply know, or use the debugger to step through the computations
as we did before.
<r/>, and <r:func>model.frame.default</r:func> specifically, will use
the values in the <r:class>data.frame</r:class>,
not those in the global environment.
We can also verify this by removing <r:var>a</r:var>
from the global environment and seeing if there is an error
about not finding <r:var>a</r:var>:
<r:code>
a1 = a
rm(a)
coef(lm(b ~ a, d))
</r:code>
This succeeds and gives us the same answer as we originally got.
</para>


<para>
  We'll now make this a little more interesting.
  We'll use <r:func>lm</r:func>'s <r:arg>weights</r:arg> parameter
  to do weighted least squares.
  We'll use two sets of weights, one where all observations
  have weight 1 and another which are random numbers between
  1 and 2, i.e.,
<r:code>
w1 = rep(1, length(b))
w2 = runif(length(b), 1, 2)
</r:code>
When we use these
<r:code>
coef(lm(b ~ a, weights = w1))
coef(lm(b ~ a, weights = w2))
</r:code>
we get
<r:output><![CDATA[
 (Intercept)           a
    5.097029    2.618424
 (Intercept)           a
    5.305530    2.194265
]]></r:output>
respectively.
So we get different estimates for a (and the intercept)
as we expect.
</para>
<para>
We might expect <r/>'s usual computational model
is in effect here. Specifically,
in the call to <r:func>lm</r:func>,
<r/> matches the expression <r:var>w1</r:var>
to the <r:arg>weights</r:arg> parameter
and creates a promise to evaluate <r:expr>w1</r:expr>
in the global environment in which we made the call to <r:func>lm</r:func>.
And certainly, the result we got are consistent with this.
However, it is not the case. We are finding <r:var>w1</r:var>
via a slightly different mechanism which happens to give the same
result, but only in this case.
</para>


<para>
  Let's create a new column in our <r:class>data.frame</r:class>
  <r:var>d</r:var> and we'll name it <r:el>w1</r:el>
<r:code>
d$w1 = w1
</r:code>
Again, we call <r:func>lm</r:func>
<r:code>
coef(lm(b ~ a, d, weights = w1))
</r:code>
and we get the same answer.
</para>
<para>
Now, let's set the <r:el>w1</r:el> to have the other weights
<r:code>
d$w1 = w2
</r:code>
and call <r:func>lm</r:func>
<r:code>
coef(lm(b ~ a, d, weights = w1))
</r:code>
Now we get
<r:output><![CDATA[
(Intercept)           a
   5.305530    2.194265
]]></r:output>
So we are using the weights from <r:var>w2</r:var>.
This indicates that <r:func>lm</r:func> and <r:func>model.frame.default</r:func>
are looking in the <r:class>data.frame</r:class>
when evaluating <r:expr>w1</r:expr>.
This is very different from <r/>'s standard computational
model we described.
</para>


<para>
Let's create our formula more explicitly and
give it its own environment (just like we can for a function.)
We create an environment and create
variables in it named <r:var>x</r:var> and <r:var>y</r:var>,
using the original values of <r:var>a</r:var> and <r:var>b</r:var>
from the global environment.
<r:code>
e = new.env()
e$x = a1
e$y = b
</r:code>
We create our formula and  examine its environment
<r:code>
frm = y ~ x
environment(frm)
</r:code>
It is the global environment.
We now change this to be
<r:code>
environment(frm) = e
<r:output><![CDATA[
y ~ x
<environment: 0x7fd22963a238>
]]></r:output>
</r:code>
We can now use this formula without specifying the
<r:arg>data</r:arg> argument as <r:func>model.frame.default</r:func>
will use this environment to find <r:var>x</r:var> and <r:var>y</r:var>
<r:code>
coef(lm(frm))
<r:output><![CDATA[
(Intercept)           x
   5.097029    2.618424
]]></r:output>
</r:code>
And again, we'll create a variable named w1, this time
in <r:var>e</r:var> with weights from <r:var>w2</r:var>
<r:code>
e$w1 = w2
</r:code>
And now
<r:code>
coef(lm(frm, weights = w1))
</r:code>
gives
<r:output><![CDATA[
(Intercept)           x
   5.305530    2.194265
]]></r:output>
showing that it used the weights from <r:var>w1</r:var> in <r:var>e</r:var>
rather than those in <r:var>w1</r:var> in the global environment.
</para>

<para>
  This is non-standard evaluation and it means we don't use the regular
  rules for evaluation.
  So now we have at least two computational models.
  This makes it hard to reason about <r/> code.
  If you are luck, you get an error when you do the wrong thing.
  Often, you get the wrong answer and that is much worse than an error.
</para>


<para>
  NSE is a powerful feature. But it needs to be used wisely and sparingly.
  NSE is convenient for programmers to do interesting things.
  However, it places a significant burden on the users
  of functions that use NSE
  because they cannot reason about what will happen
  using a single, familiar computational model.
  Instead, they have to carefully learn how each
  function that uses NSE will actually behave.
</para>

</section>