<!DOCTYPE html>
<html lang="en" xml:lang="en">
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<title>Chapter 7 Neural networks | Machine Learning for Factor Investing</title>
<meta name="description" content="Chapter 7 Neural networks | Machine Learning for Factor Investing" />
<meta name="generator" content="bookdown 0.21 and GitBook 2.6.7" />
<meta property="og:title" content="Chapter 7 Neural networks | Machine Learning for Factor Investing" />
<meta property="og:type" content="book" />
<meta name="twitter:card" content="summary" />
<meta name="twitter:title" content="Chapter 7 Neural networks | Machine Learning for Factor Investing" />
<meta name="author" content="Guillaume Coqueret and Tony Guida" />
<meta name="date" content="2021-01-08" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black" />
<link rel="prev" href="trees.html"/>
<link rel="next" href="svm.html"/>
<script src="libs/jquery-2.2.3/jquery.min.js"></script>
<link href="libs/gitbook-2.6.7/css/style.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-table.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-bookdown.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-highlight.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-search.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-fontsettings.css" rel="stylesheet" />
<link href="libs/gitbook-2.6.7/css/plugin-clipboard.css" rel="stylesheet" />
<script src="libs/accessible-code-block-0.0.1/empty-anchor.js"></script>
<link href="libs/anchor-sections-1.0/anchor-sections.css" rel="stylesheet" />
<script src="libs/anchor-sections-1.0/anchor-sections.js"></script>
<script src="libs/kePrint-0.0.1/kePrint.js"></script>
<link href="libs/lightable-0.0.1/lightable.css" rel="stylesheet" />
<style type="text/css">
code.sourceCode > span { display: inline-block; line-height: 1.25; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
code.sourceCode > span:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode { white-space: pre; position: relative; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
code.sourceCode { white-space: pre-wrap; }
code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
}
pre.numberSource code
{ counter-reset: source-line 0; }
pre.numberSource code > span
{ position: relative; left: -4em; counter-increment: source-line; }
pre.numberSource code > span > a:first-child::before
{ content: counter(source-line);
position: relative; left: -1em; text-align: right; vertical-align: baseline;
border: none; display: inline-block;
-webkit-touch-callout: none; -webkit-user-select: none;
-khtml-user-select: none; -moz-user-select: none;
-ms-user-select: none; user-select: none;
padding: 0 4px; width: 4em;
color: #aaaaaa;
}
pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; }
div.sourceCode
{ }
@media screen {
code.sourceCode > span > a:first-child::before { text-decoration: underline; }
}
code span.al { color: #ff0000; font-weight: bold; } /* Alert */
code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code span.at { color: #7d9029; } /* Attribute */
code span.bn { color: #40a070; } /* BaseN */
code span.bu { } /* BuiltIn */
code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code span.ch { color: #4070a0; } /* Char */
code span.cn { color: #880000; } /* Constant */
code span.co { color: #60a0b0; font-style: italic; } /* Comment */
code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code span.do { color: #ba2121; font-style: italic; } /* Documentation */
code span.dt { color: #902000; } /* DataType */
code span.dv { color: #40a070; } /* DecVal */
code span.er { color: #ff0000; font-weight: bold; } /* Error */
code span.ex { } /* Extension */
code span.fl { color: #40a070; } /* Float */
code span.fu { color: #06287e; } /* Function */
code span.im { } /* Import */
code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
code span.kw { color: #007020; font-weight: bold; } /* Keyword */
code span.op { color: #666666; } /* Operator */
code span.ot { color: #007020; } /* Other */
code span.pp { color: #bc7a00; } /* Preprocessor */
code span.sc { color: #4070a0; } /* SpecialChar */
code span.ss { color: #bb6688; } /* SpecialString */
code span.st { color: #4070a0; } /* String */
code span.va { color: #19177c; } /* Variable */
code span.vs { color: #4070a0; } /* VerbatimString */
code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
</style>
</head>
<body>
<div class="book without-animation with-summary font-size-2 font-family-1" data-basepath=".">
<div class="book-summary">
<nav role="navigation">
<ul class="summary">
<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html"><i class="fa fa-check"></i>Preface</a><ul>
<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#what-this-book-is-not-about"><i class="fa fa-check"></i>What this book is not about</a></li>
<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#the-targeted-audience"><i class="fa fa-check"></i>The targeted audience</a></li>
<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#how-this-book-is-structured"><i class="fa fa-check"></i>How this book is structured</a></li>
<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#companion-website"><i class="fa fa-check"></i>Companion website</a></li>
<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#why-r"><i class="fa fa-check"></i>Why R?</a></li>
<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#coding-instructions"><i class="fa fa-check"></i>Coding instructions</a></li>
<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#acknowledgments"><i class="fa fa-check"></i>Acknowledgments</a></li>
<li class="chapter" data-level="" data-path="preface.html"><a href="preface.html#future-developments"><i class="fa fa-check"></i>Future developments</a></li>
</ul></li>
<li class="part"><span><b>I Introduction</b></span></li>
<li class="chapter" data-level="1" data-path="notdata.html"><a href="notdata.html"><i class="fa fa-check"></i><b>1</b> Notations and data</a><ul>
<li class="chapter" data-level="1.1" data-path="notdata.html"><a href="notdata.html#notations"><i class="fa fa-check"></i><b>1.1</b> Notations</a></li>
<li class="chapter" data-level="1.2" data-path="notdata.html"><a href="notdata.html#dataset"><i class="fa fa-check"></i><b>1.2</b> Dataset</a></li>
</ul></li>
<li class="chapter" data-level="2" data-path="intro.html"><a href="intro.html"><i class="fa fa-check"></i><b>2</b> Introduction</a><ul>
<li class="chapter" data-level="2.1" data-path="intro.html"><a href="intro.html#context"><i class="fa fa-check"></i><b>2.1</b> Context</a></li>
<li class="chapter" data-level="2.2" data-path="intro.html"><a href="intro.html#portfolio-construction-the-workflow"><i class="fa fa-check"></i><b>2.2</b> Portfolio construction: the workflow</a></li>
<li class="chapter" data-level="2.3" data-path="intro.html"><a href="intro.html#machine-learning-is-no-magic-wand"><i class="fa fa-check"></i><b>2.3</b> Machine learning is no magic wand</a></li>
</ul></li>
<li class="chapter" data-level="3" data-path="factor.html"><a href="factor.html"><i class="fa fa-check"></i><b>3</b> Factor investing and asset pricing anomalies</a><ul>
<li class="chapter" data-level="3.1" data-path="factor.html"><a href="factor.html#introduction"><i class="fa fa-check"></i><b>3.1</b> Introduction</a></li>
<li class="chapter" data-level="3.2" data-path="factor.html"><a href="factor.html#detecting-anomalies"><i class="fa fa-check"></i><b>3.2</b> Detecting anomalies</a><ul>
<li class="chapter" data-level="3.2.1" data-path="factor.html"><a href="factor.html#challenges"><i class="fa fa-check"></i><b>3.2.1</b> Challenges</a></li>
<li class="chapter" data-level="3.2.2" data-path="factor.html"><a href="factor.html#simple-portfolio-sorts"><i class="fa fa-check"></i><b>3.2.2</b> Simple portfolio sorts</a></li>
<li class="chapter" data-level="3.2.3" data-path="factor.html"><a href="factor.html#factors"><i class="fa fa-check"></i><b>3.2.3</b> Factors</a></li>
<li class="chapter" data-level="3.2.4" data-path="factor.html"><a href="factor.html#predictive-regressions-sorts-and-p-value-issues"><i class="fa fa-check"></i><b>3.2.4</b> Predictive regressions, sorts, and p-value issues</a></li>
<li class="chapter" data-level="3.2.5" data-path="factor.html"><a href="factor.html#fama-macbeth-regressions"><i class="fa fa-check"></i><b>3.2.5</b> Fama-MacBeth regressions</a></li>
<li class="chapter" data-level="3.2.6" data-path="factor.html"><a href="factor.html#factor-competition"><i class="fa fa-check"></i><b>3.2.6</b> Factor competition</a></li>
<li class="chapter" data-level="3.2.7" data-path="factor.html"><a href="factor.html#advanced-techniques"><i class="fa fa-check"></i><b>3.2.7</b> Advanced techniques</a></li>
</ul></li>
<li class="chapter" data-level="3.3" data-path="factor.html"><a href="factor.html#factors-or-characteristics"><i class="fa fa-check"></i><b>3.3</b> Factors or characteristics?</a></li>
<li class="chapter" data-level="3.4" data-path="factor.html"><a href="factor.html#hot-topics-momentum-timing-and-esg"><i class="fa fa-check"></i><b>3.4</b> Hot topics: momentum, timing and ESG</a><ul>
<li class="chapter" data-level="3.4.1" data-path="factor.html"><a href="factor.html#factor-momentum"><i class="fa fa-check"></i><b>3.4.1</b> Factor momentum</a></li>
<li class="chapter" data-level="3.4.2" data-path="factor.html"><a href="factor.html#factor-timing"><i class="fa fa-check"></i><b>3.4.2</b> Factor timing</a></li>
<li class="chapter" data-level="3.4.3" data-path="factor.html"><a href="factor.html#the-green-factors"><i class="fa fa-check"></i><b>3.4.3</b> The green factors</a></li>
</ul></li>
<li class="chapter" data-level="3.5" data-path="factor.html"><a href="factor.html#the-links-with-machine-learning"><i class="fa fa-check"></i><b>3.5</b> The links with machine learning</a><ul>
<li class="chapter" data-level="3.5.1" data-path="factor.html"><a href="factor.html#a-short-list-of-recent-references"><i class="fa fa-check"></i><b>3.5.1</b> A short list of recent references</a></li>
<li class="chapter" data-level="3.5.2" data-path="factor.html"><a href="factor.html#explicit-connections-with-asset-pricing-models"><i class="fa fa-check"></i><b>3.5.2</b> Explicit connections with asset pricing models</a></li>
</ul></li>
<li class="chapter" data-level="3.6" data-path="factor.html"><a href="factor.html#coding-exercises"><i class="fa fa-check"></i><b>3.6</b> Coding exercises</a></li>
</ul></li>
<li class="chapter" data-level="4" data-path="Data.html"><a href="Data.html"><i class="fa fa-check"></i><b>4</b> Data preprocessing</a><ul>
<li class="chapter" data-level="4.1" data-path="Data.html"><a href="Data.html#know-your-data"><i class="fa fa-check"></i><b>4.1</b> Know your data</a></li>
<li class="chapter" data-level="4.2" data-path="Data.html"><a href="Data.html#missing-data"><i class="fa fa-check"></i><b>4.2</b> Missing data</a></li>
<li class="chapter" data-level="4.3" data-path="Data.html"><a href="Data.html#outlier-detection"><i class="fa fa-check"></i><b>4.3</b> Outlier detection</a></li>
<li class="chapter" data-level="4.4" data-path="Data.html"><a href="Data.html#feateng"><i class="fa fa-check"></i><b>4.4</b> Feature engineering</a><ul>
<li class="chapter" data-level="4.4.1" data-path="Data.html"><a href="Data.html#feature-selection"><i class="fa fa-check"></i><b>4.4.1</b> Feature selection</a></li>
<li class="chapter" data-level="4.4.2" data-path="Data.html"><a href="Data.html#scaling"><i class="fa fa-check"></i><b>4.4.2</b> Scaling the predictors</a></li>
</ul></li>
<li class="chapter" data-level="4.5" data-path="Data.html"><a href="Data.html#labelling"><i class="fa fa-check"></i><b>4.5</b> Labelling</a><ul>
<li class="chapter" data-level="4.5.1" data-path="Data.html"><a href="Data.html#simple-labels"><i class="fa fa-check"></i><b>4.5.1</b> Simple labels</a></li>
<li class="chapter" data-level="4.5.2" data-path="Data.html"><a href="Data.html#categorical-labels"><i class="fa fa-check"></i><b>4.5.2</b> Categorical labels</a></li>
<li class="chapter" data-level="4.5.3" data-path="Data.html"><a href="Data.html#the-triple-barrier-method"><i class="fa fa-check"></i><b>4.5.3</b> The triple barrier method</a></li>
<li class="chapter" data-level="4.5.4" data-path="Data.html"><a href="Data.html#filtering-the-sample"><i class="fa fa-check"></i><b>4.5.4</b> Filtering the sample</a></li>
<li class="chapter" data-level="4.5.5" data-path="Data.html"><a href="Data.html#horizons"><i class="fa fa-check"></i><b>4.5.5</b> Return horizons</a></li>
</ul></li>
<li class="chapter" data-level="4.6" data-path="Data.html"><a href="Data.html#pers"><i class="fa fa-check"></i><b>4.6</b> Handling persistence</a></li>
<li class="chapter" data-level="4.7" data-path="Data.html"><a href="Data.html#extensions"><i class="fa fa-check"></i><b>4.7</b> Extensions</a><ul>
<li class="chapter" data-level="4.7.1" data-path="Data.html"><a href="Data.html#transforming-features"><i class="fa fa-check"></i><b>4.7.1</b> Transforming features</a></li>
<li class="chapter" data-level="4.7.2" data-path="Data.html"><a href="Data.html#macrovar"><i class="fa fa-check"></i><b>4.7.2</b> Macro-economic variables</a></li>
<li class="chapter" data-level="4.7.3" data-path="Data.html"><a href="Data.html#active-learning"><i class="fa fa-check"></i><b>4.7.3</b> Active learning</a></li>
</ul></li>
<li class="chapter" data-level="4.8" data-path="Data.html"><a href="Data.html#additional-code-and-results"><i class="fa fa-check"></i><b>4.8</b> Additional code and results</a><ul>
<li class="chapter" data-level="4.8.1" data-path="Data.html"><a href="Data.html#impact-of-rescaling-graphical-representation"><i class="fa fa-check"></i><b>4.8.1</b> Impact of rescaling: graphical representation</a></li>
<li class="chapter" data-level="4.8.2" data-path="Data.html"><a href="Data.html#impact-of-rescaling-toy-example"><i class="fa fa-check"></i><b>4.8.2</b> Impact of rescaling: toy example</a></li>
</ul></li>
<li class="chapter" data-level="4.9" data-path="Data.html"><a href="Data.html#coding-exercises-1"><i class="fa fa-check"></i><b>4.9</b> Coding exercises</a></li>
</ul></li>
<li class="part"><span><b>II Common supervised algorithms</b></span></li>
<li class="chapter" data-level="5" data-path="lasso.html"><a href="lasso.html"><i class="fa fa-check"></i><b>5</b> Penalized regressions and sparse hedging for minimum variance portfolios</a><ul>
<li class="chapter" data-level="5.1" data-path="lasso.html"><a href="lasso.html#penalized-regressions"><i class="fa fa-check"></i><b>5.1</b> Penalized regressions</a><ul>
<li class="chapter" data-level="5.1.1" data-path="lasso.html"><a href="lasso.html#penreg"><i class="fa fa-check"></i><b>5.1.1</b> Simple regressions</a></li>
<li class="chapter" data-level="5.1.2" data-path="lasso.html"><a href="lasso.html#forms-of-penalizations"><i class="fa fa-check"></i><b>5.1.2</b> Forms of penalizations</a></li>
<li class="chapter" data-level="5.1.3" data-path="lasso.html"><a href="lasso.html#illustrations"><i class="fa fa-check"></i><b>5.1.3</b> Illustrations</a></li>
</ul></li>
<li class="chapter" data-level="5.2" data-path="lasso.html"><a href="lasso.html#sparse-hedging-for-minimum-variance-portfolios"><i class="fa fa-check"></i><b>5.2</b> Sparse hedging for minimum variance portfolios</a><ul>
<li class="chapter" data-level="5.2.1" data-path="lasso.html"><a href="lasso.html#presentation-and-derivations"><i class="fa fa-check"></i><b>5.2.1</b> Presentation and derivations</a></li>
<li class="chapter" data-level="5.2.2" data-path="lasso.html"><a href="lasso.html#sparseex"><i class="fa fa-check"></i><b>5.2.2</b> Example</a></li>
</ul></li>
<li class="chapter" data-level="5.3" data-path="lasso.html"><a href="lasso.html#predictive-regressions"><i class="fa fa-check"></i><b>5.3</b> Predictive regressions</a><ul>
<li class="chapter" data-level="5.3.1" data-path="lasso.html"><a href="lasso.html#literature-review-and-principle"><i class="fa fa-check"></i><b>5.3.1</b> Literature review and principle</a></li>
<li class="chapter" data-level="5.3.2" data-path="lasso.html"><a href="lasso.html#code-and-results"><i class="fa fa-check"></i><b>5.3.2</b> Code and results</a></li>
</ul></li>
<li class="chapter" data-level="5.4" data-path="lasso.html"><a href="lasso.html#coding-exercise"><i class="fa fa-check"></i><b>5.4</b> Coding exercise</a></li>
</ul></li>
<li class="chapter" data-level="6" data-path="trees.html"><a href="trees.html"><i class="fa fa-check"></i><b>6</b> Tree-based methods</a><ul>
<li class="chapter" data-level="6.1" data-path="trees.html"><a href="trees.html#simple-trees"><i class="fa fa-check"></i><b>6.1</b> Simple trees</a><ul>
<li class="chapter" data-level="6.1.1" data-path="trees.html"><a href="trees.html#principle"><i class="fa fa-check"></i><b>6.1.1</b> Principle</a></li>
<li class="chapter" data-level="6.1.2" data-path="trees.html"><a href="trees.html#treeclass"><i class="fa fa-check"></i><b>6.1.2</b> Further details on classification</a></li>
<li class="chapter" data-level="6.1.3" data-path="trees.html"><a href="trees.html#pruning-criteria"><i class="fa fa-check"></i><b>6.1.3</b> Pruning criteria</a></li>
<li class="chapter" data-level="6.1.4" data-path="trees.html"><a href="trees.html#code-and-interpretation"><i class="fa fa-check"></i><b>6.1.4</b> Code and interpretation</a></li>
</ul></li>
<li class="chapter" data-level="6.2" data-path="trees.html"><a href="trees.html#random-forests"><i class="fa fa-check"></i><b>6.2</b> Random forests</a><ul>
<li class="chapter" data-level="6.2.1" data-path="trees.html"><a href="trees.html#principle-1"><i class="fa fa-check"></i><b>6.2.1</b> Principle</a></li>
<li class="chapter" data-level="6.2.2" data-path="trees.html"><a href="trees.html#code-and-results-1"><i class="fa fa-check"></i><b>6.2.2</b> Code and results</a></li>
</ul></li>
<li class="chapter" data-level="6.3" data-path="trees.html"><a href="trees.html#adaboost"><i class="fa fa-check"></i><b>6.3</b> Boosted trees: Adaboost</a><ul>
<li class="chapter" data-level="6.3.1" data-path="trees.html"><a href="trees.html#methodology"><i class="fa fa-check"></i><b>6.3.1</b> Methodology</a></li>
<li class="chapter" data-level="6.3.2" data-path="trees.html"><a href="trees.html#illustration"><i class="fa fa-check"></i><b>6.3.2</b> Illustration</a></li>
</ul></li>
<li class="chapter" data-level="6.4" data-path="trees.html"><a href="trees.html#boosted-trees-extreme-gradient-boosting"><i class="fa fa-check"></i><b>6.4</b> Boosted trees: extreme gradient boosting</a><ul>
<li class="chapter" data-level="6.4.1" data-path="trees.html"><a href="trees.html#managing-loss"><i class="fa fa-check"></i><b>6.4.1</b> Managing loss</a></li>
<li class="chapter" data-level="6.4.2" data-path="trees.html"><a href="trees.html#penalization"><i class="fa fa-check"></i><b>6.4.2</b> Penalization</a></li>
<li class="chapter" data-level="6.4.3" data-path="trees.html"><a href="trees.html#aggregation"><i class="fa fa-check"></i><b>6.4.3</b> Aggregation</a></li>
<li class="chapter" data-level="6.4.4" data-path="trees.html"><a href="trees.html#tree-structure"><i class="fa fa-check"></i><b>6.4.4</b> Tree structure</a></li>
<li class="chapter" data-level="6.4.5" data-path="trees.html"><a href="trees.html#boostext"><i class="fa fa-check"></i><b>6.4.5</b> Extensions</a></li>
<li class="chapter" data-level="6.4.6" data-path="trees.html"><a href="trees.html#boostcode"><i class="fa fa-check"></i><b>6.4.6</b> Code and results</a></li>
<li class="chapter" data-level="6.4.7" data-path="trees.html"><a href="trees.html#instweight"><i class="fa fa-check"></i><b>6.4.7</b> Instance weighting</a></li>
</ul></li>
<li class="chapter" data-level="6.5" data-path="trees.html"><a href="trees.html#discussion"><i class="fa fa-check"></i><b>6.5</b> Discussion</a></li>
<li class="chapter" data-level="6.6" data-path="trees.html"><a href="trees.html#coding-exercises-2"><i class="fa fa-check"></i><b>6.6</b> Coding exercises</a></li>
</ul></li>
<li class="chapter" data-level="7" data-path="NN.html"><a href="NN.html"><i class="fa fa-check"></i><b>7</b> Neural networks</a><ul>
<li class="chapter" data-level="7.1" data-path="NN.html"><a href="NN.html#the-original-perceptron"><i class="fa fa-check"></i><b>7.1</b> The original perceptron</a></li>
<li class="chapter" data-level="7.2" data-path="NN.html"><a href="NN.html#multilayer-perceptron"><i class="fa fa-check"></i><b>7.2</b> Multilayer perceptron</a><ul>
<li class="chapter" data-level="7.2.1" data-path="NN.html"><a href="NN.html#introduction-and-notations"><i class="fa fa-check"></i><b>7.2.1</b> Introduction and notations</a></li>
<li class="chapter" data-level="7.2.2" data-path="NN.html"><a href="NN.html#universal-approximation"><i class="fa fa-check"></i><b>7.2.2</b> Universal approximation</a></li>
<li class="chapter" data-level="7.2.3" data-path="NN.html"><a href="NN.html#backprop"><i class="fa fa-check"></i><b>7.2.3</b> Learning via back-propagation</a></li>
<li class="chapter" data-level="7.2.4" data-path="NN.html"><a href="NN.html#further-details-on-classification"><i class="fa fa-check"></i><b>7.2.4</b> Further details on classification</a></li>
</ul></li>
<li class="chapter" data-level="7.3" data-path="NN.html"><a href="NN.html#howdeep"><i class="fa fa-check"></i><b>7.3</b> How deep we should go and other practical issues</a><ul>
<li class="chapter" data-level="7.3.1" data-path="NN.html"><a href="NN.html#architectural-choices"><i class="fa fa-check"></i><b>7.3.1</b> Architectural choices</a></li>
<li class="chapter" data-level="7.3.2" data-path="NN.html"><a href="NN.html#frequency-of-weight-updates-and-learning-duration"><i class="fa fa-check"></i><b>7.3.2</b> Frequency of weight updates and learning duration</a></li>
<li class="chapter" data-level="7.3.3" data-path="NN.html"><a href="NN.html#penalizations-and-dropout"><i class="fa fa-check"></i><b>7.3.3</b> Penalizations and dropout</a></li>
</ul></li>
<li class="chapter" data-level="7.4" data-path="NN.html"><a href="NN.html#code-samples-and-comments-for-vanilla-mlp"><i class="fa fa-check"></i><b>7.4</b> Code samples and comments for vanilla MLP</a><ul>
<li class="chapter" data-level="7.4.1" data-path="NN.html"><a href="NN.html#regression-example"><i class="fa fa-check"></i><b>7.4.1</b> Regression example</a></li>
<li class="chapter" data-level="7.4.2" data-path="NN.html"><a href="NN.html#classification-example"><i class="fa fa-check"></i><b>7.4.2</b> Classification example</a></li>
<li class="chapter" data-level="7.4.3" data-path="NN.html"><a href="NN.html#custloss"><i class="fa fa-check"></i><b>7.4.3</b> Custom losses</a></li>
</ul></li>
<li class="chapter" data-level="7.5" data-path="NN.html"><a href="NN.html#recurrent-networks"><i class="fa fa-check"></i><b>7.5</b> Recurrent networks</a><ul>
<li class="chapter" data-level="7.5.1" data-path="NN.html"><a href="NN.html#presentation"><i class="fa fa-check"></i><b>7.5.1</b> Presentation</a></li>
<li class="chapter" data-level="7.5.2" data-path="NN.html"><a href="NN.html#code-and-results-2"><i class="fa fa-check"></i><b>7.5.2</b> Code and results</a></li>
</ul></li>
<li class="chapter" data-level="7.6" data-path="NN.html"><a href="NN.html#other-common-architectures"><i class="fa fa-check"></i><b>7.6</b> Other common architectures</a><ul>
<li class="chapter" data-level="7.6.1" data-path="NN.html"><a href="NN.html#generative-aversarial-networks"><i class="fa fa-check"></i><b>7.6.1</b> Generative adversarial networks</a></li>
<li class="chapter" data-level="7.6.2" data-path="NN.html"><a href="NN.html#autoencoders"><i class="fa fa-check"></i><b>7.6.2</b> Autoencoders</a></li>
<li class="chapter" data-level="7.6.3" data-path="NN.html"><a href="NN.html#a-word-on-convolutional-networks"><i class="fa fa-check"></i><b>7.6.3</b> A word on convolutional networks</a></li>
<li class="chapter" data-level="7.6.4" data-path="NN.html"><a href="NN.html#advanced-architectures"><i class="fa fa-check"></i><b>7.6.4</b> Advanced architectures</a></li>
</ul></li>
<li class="chapter" data-level="7.7" data-path="NN.html"><a href="NN.html#coding-exercise-1"><i class="fa fa-check"></i><b>7.7</b> Coding exercise</a></li>
</ul></li>
<li class="chapter" data-level="8" data-path="svm.html"><a href="svm.html"><i class="fa fa-check"></i><b>8</b> Support vector machines</a><ul>
<li class="chapter" data-level="8.1" data-path="svm.html"><a href="svm.html#svm-for-classification"><i class="fa fa-check"></i><b>8.1</b> SVM for classification</a></li>
<li class="chapter" data-level="8.2" data-path="svm.html"><a href="svm.html#svm-for-regression"><i class="fa fa-check"></i><b>8.2</b> SVM for regression</a></li>
<li class="chapter" data-level="8.3" data-path="svm.html"><a href="svm.html#practice"><i class="fa fa-check"></i><b>8.3</b> Practice</a></li>
<li class="chapter" data-level="8.4" data-path="svm.html"><a href="svm.html#coding-exercises-3"><i class="fa fa-check"></i><b>8.4</b> Coding exercises</a></li>
</ul></li>
<li class="chapter" data-level="9" data-path="bayes.html"><a href="bayes.html"><i class="fa fa-check"></i><b>9</b> Bayesian methods</a><ul>
<li class="chapter" data-level="9.1" data-path="bayes.html"><a href="bayes.html#the-bayesian-framework"><i class="fa fa-check"></i><b>9.1</b> The Bayesian framework</a></li>
<li class="chapter" data-level="9.2" data-path="bayes.html"><a href="bayes.html#bayesian-sampling"><i class="fa fa-check"></i><b>9.2</b> Bayesian sampling</a><ul>
<li class="chapter" data-level="9.2.1" data-path="bayes.html"><a href="bayes.html#gibbs-sampling"><i class="fa fa-check"></i><b>9.2.1</b> Gibbs sampling</a></li>
<li class="chapter" data-level="9.2.2" data-path="bayes.html"><a href="bayes.html#metropolis-hastings-sampling"><i class="fa fa-check"></i><b>9.2.2</b> Metropolis-Hastings sampling</a></li>
</ul></li>
<li class="chapter" data-level="9.3" data-path="bayes.html"><a href="bayes.html#bayesian-linear-regression"><i class="fa fa-check"></i><b>9.3</b> Bayesian linear regression</a></li>
<li class="chapter" data-level="9.4" data-path="bayes.html"><a href="bayes.html#naive-bayes-classifier"><i class="fa fa-check"></i><b>9.4</b> Naive Bayes classifier</a></li>
<li class="chapter" data-level="9.5" data-path="bayes.html"><a href="bayes.html#BART"><i class="fa fa-check"></i><b>9.5</b> Bayesian additive trees</a><ul>
<li class="chapter" data-level="9.5.1" data-path="bayes.html"><a href="bayes.html#general-formulation"><i class="fa fa-check"></i><b>9.5.1</b> General formulation</a></li>
<li class="chapter" data-level="9.5.2" data-path="bayes.html"><a href="bayes.html#priors"><i class="fa fa-check"></i><b>9.5.2</b> Priors</a></li>
<li class="chapter" data-level="9.5.3" data-path="bayes.html"><a href="bayes.html#sampling-and-predictions"><i class="fa fa-check"></i><b>9.5.3</b> Sampling and predictions</a></li>
<li class="chapter" data-level="9.5.4" data-path="bayes.html"><a href="bayes.html#code"><i class="fa fa-check"></i><b>9.5.4</b> Code</a></li>
</ul></li>
</ul></li>
<li class="part"><span><b>III From predictions to portfolios</b></span></li>
<li class="chapter" data-level="10" data-path="valtune.html"><a href="valtune.html"><i class="fa fa-check"></i><b>10</b> Validating and tuning</a><ul>
<li class="chapter" data-level="10.1" data-path="valtune.html"><a href="valtune.html#mlmetrics"><i class="fa fa-check"></i><b>10.1</b> Learning metrics</a><ul>
<li class="chapter" data-level="10.1.1" data-path="valtune.html"><a href="valtune.html#regression-analysis"><i class="fa fa-check"></i><b>10.1.1</b> Regression analysis</a></li>
<li class="chapter" data-level="10.1.2" data-path="valtune.html"><a href="valtune.html#classification-analysis"><i class="fa fa-check"></i><b>10.1.2</b> Classification analysis</a></li>
</ul></li>
<li class="chapter" data-level="10.2" data-path="valtune.html"><a href="valtune.html#validation"><i class="fa fa-check"></i><b>10.2</b> Validation</a><ul>
<li class="chapter" data-level="10.2.1" data-path="valtune.html"><a href="valtune.html#the-variance-bias-tradeoff-theory"><i class="fa fa-check"></i><b>10.2.1</b> The variance-bias tradeoff: theory</a></li>
<li class="chapter" data-level="10.2.2" data-path="valtune.html"><a href="valtune.html#the-variance-bias-tradeoff-illustration"><i class="fa fa-check"></i><b>10.2.2</b> The variance-bias tradeoff: illustration</a></li>
<li class="chapter" data-level="10.2.3" data-path="valtune.html"><a href="valtune.html#the-risk-of-overfitting-principle"><i class="fa fa-check"></i><b>10.2.3</b> The risk of overfitting: principle</a></li>
<li class="chapter" data-level="10.2.4" data-path="valtune.html"><a href="valtune.html#the-risk-of-overfitting-some-solutions"><i class="fa fa-check"></i><b>10.2.4</b> The risk of overfitting: some solutions</a></li>
</ul></li>
<li class="chapter" data-level="10.3" data-path="valtune.html"><a href="valtune.html#the-search-for-good-hyperparameters"><i class="fa fa-check"></i><b>10.3</b> The search for good hyperparameters</a><ul>
<li class="chapter" data-level="10.3.1" data-path="valtune.html"><a href="valtune.html#methods"><i class="fa fa-check"></i><b>10.3.1</b> Methods</a></li>
<li class="chapter" data-level="10.3.2" data-path="valtune.html"><a href="valtune.html#example-grid-search"><i class="fa fa-check"></i><b>10.3.2</b> Example: grid search</a></li>
<li class="chapter" data-level="10.3.3" data-path="valtune.html"><a href="valtune.html#example-bayesian-optimization"><i class="fa fa-check"></i><b>10.3.3</b> Example: Bayesian optimization</a></li>
</ul></li>
<li class="chapter" data-level="10.4" data-path="valtune.html"><a href="valtune.html#short-discussion-on-validation-in-backtests"><i class="fa fa-check"></i><b>10.4</b> Short discussion on validation in backtests</a></li>
</ul></li>
<li class="chapter" data-level="11" data-path="ensemble.html"><a href="ensemble.html"><i class="fa fa-check"></i><b>11</b> Ensemble models</a><ul>
<li class="chapter" data-level="11.1" data-path="ensemble.html"><a href="ensemble.html#linear-ensembles"><i class="fa fa-check"></i><b>11.1</b> Linear ensembles</a><ul>
<li class="chapter" data-level="11.1.1" data-path="ensemble.html"><a href="ensemble.html#principles"><i class="fa fa-check"></i><b>11.1.1</b> Principles</a></li>
<li class="chapter" data-level="11.1.2" data-path="ensemble.html"><a href="ensemble.html#example"><i class="fa fa-check"></i><b>11.1.2</b> Example</a></li>
</ul></li>
<li class="chapter" data-level="11.2" data-path="ensemble.html"><a href="ensemble.html#stacked-ensembles"><i class="fa fa-check"></i><b>11.2</b> Stacked ensembles</a><ul>
<li class="chapter" data-level="11.2.1" data-path="ensemble.html"><a href="ensemble.html#two-stage-training"><i class="fa fa-check"></i><b>11.2.1</b> Two-stage training</a></li>
<li class="chapter" data-level="11.2.2" data-path="ensemble.html"><a href="ensemble.html#code-and-results-3"><i class="fa fa-check"></i><b>11.2.2</b> Code and results</a></li>
</ul></li>
<li class="chapter" data-level="11.3" data-path="ensemble.html"><a href="ensemble.html#extensions-1"><i class="fa fa-check"></i><b>11.3</b> Extensions</a><ul>
<li class="chapter" data-level="11.3.1" data-path="ensemble.html"><a href="ensemble.html#exogenous-variables"><i class="fa fa-check"></i><b>11.3.1</b> Exogenous variables</a></li>
<li class="chapter" data-level="11.3.2" data-path="ensemble.html"><a href="ensemble.html#shrinking-inter-model-correlations"><i class="fa fa-check"></i><b>11.3.2</b> Shrinking inter-model correlations</a></li>
</ul></li>
<li class="chapter" data-level="11.4" data-path="ensemble.html"><a href="ensemble.html#exercise"><i class="fa fa-check"></i><b>11.4</b> Exercise</a></li>
</ul></li>
<li class="chapter" data-level="12" data-path="backtest.html"><a href="backtest.html"><i class="fa fa-check"></i><b>12</b> Portfolio backtesting</a><ul>
<li class="chapter" data-level="12.1" data-path="backtest.html"><a href="backtest.html#protocol"><i class="fa fa-check"></i><b>12.1</b> Setting the protocol</a></li>
<li class="chapter" data-level="12.2" data-path="backtest.html"><a href="backtest.html#turning-signals-into-portfolio-weights"><i class="fa fa-check"></i><b>12.2</b> Turning signals into portfolio weights</a></li>
<li class="chapter" data-level="12.3" data-path="backtest.html"><a href="backtest.html#perfmet"><i class="fa fa-check"></i><b>12.3</b> Performance metrics</a><ul>
<li class="chapter" data-level="12.3.1" data-path="backtest.html"><a href="backtest.html#discussion-1"><i class="fa fa-check"></i><b>12.3.1</b> Discussion</a></li>
<li class="chapter" data-level="12.3.2" data-path="backtest.html"><a href="backtest.html#pure-performance-and-risk-indicators"><i class="fa fa-check"></i><b>12.3.2</b> Pure performance and risk indicators</a></li>
<li class="chapter" data-level="12.3.3" data-path="backtest.html"><a href="backtest.html#factor-based-evaluation"><i class="fa fa-check"></i><b>12.3.3</b> Factor-based evaluation</a></li>
<li class="chapter" data-level="12.3.4" data-path="backtest.html"><a href="backtest.html#risk-adjusted-measures"><i class="fa fa-check"></i><b>12.3.4</b> Risk-adjusted measures</a></li>
<li class="chapter" data-level="12.3.5" data-path="backtest.html"><a href="backtest.html#transaction-costs-and-turnover"><i class="fa fa-check"></i><b>12.3.5</b> Transaction costs and turnover</a></li>
</ul></li>
<li class="chapter" data-level="12.4" data-path="backtest.html"><a href="backtest.html#common-errors-and-issues"><i class="fa fa-check"></i><b>12.4</b> Common errors and issues</a><ul>
<li class="chapter" data-level="12.4.1" data-path="backtest.html"><a href="backtest.html#forward-looking-data"><i class="fa fa-check"></i><b>12.4.1</b> Forward looking data</a></li>
<li class="chapter" data-level="12.4.2" data-path="backtest.html"><a href="backtest.html#backov"><i class="fa fa-check"></i><b>12.4.2</b> Backtest overfitting</a></li>
<li class="chapter" data-level="12.4.3" data-path="backtest.html"><a href="backtest.html#simple-safeguards"><i class="fa fa-check"></i><b>12.4.3</b> Simple safeguards</a></li>
</ul></li>
<li class="chapter" data-level="12.5" data-path="backtest.html"><a href="backtest.html#implication-of-non-stationarity-forecasting-is-hard"><i class="fa fa-check"></i><b>12.5</b> Implication of non-stationarity: forecasting is hard</a><ul>
<li class="chapter" data-level="12.5.1" data-path="backtest.html"><a href="backtest.html#general-comments"><i class="fa fa-check"></i><b>12.5.1</b> General comments</a></li>
<li class="chapter" data-level="12.5.2" data-path="backtest.html"><a href="backtest.html#the-no-free-lunch-theorem"><i class="fa fa-check"></i><b>12.5.2</b> The no free lunch theorem</a></li>
</ul></li>
<li class="chapter" data-level="12.6" data-path="backtest.html"><a href="backtest.html#first-example-a-complete-backtest"><i class="fa fa-check"></i><b>12.6</b> First example: a complete backtest</a></li>
<li class="chapter" data-level="12.7" data-path="backtest.html"><a href="backtest.html#second-example-backtest-overfitting"><i class="fa fa-check"></i><b>12.7</b> Second example: backtest overfitting</a></li>
<li class="chapter" data-level="12.8" data-path="backtest.html"><a href="backtest.html#coding-exercises-4"><i class="fa fa-check"></i><b>12.8</b> Coding exercises</a></li>
</ul></li>
<li class="part"><span><b>IV Further important topics</b></span></li>
<li class="chapter" data-level="13" data-path="interp.html"><a href="interp.html"><i class="fa fa-check"></i><b>13</b> Interpretability</a><ul>
<li class="chapter" data-level="13.1" data-path="interp.html"><a href="interp.html#global-interpretations"><i class="fa fa-check"></i><b>13.1</b> Global interpretations</a><ul>
<li class="chapter" data-level="13.1.1" data-path="interp.html"><a href="interp.html#surr"><i class="fa fa-check"></i><b>13.1.1</b> Simple models as surrogates</a></li>
<li class="chapter" data-level="13.1.2" data-path="interp.html"><a href="interp.html#variable-importance"><i class="fa fa-check"></i><b>13.1.2</b> Variable importance (tree-based)</a></li>
<li class="chapter" data-level="13.1.3" data-path="interp.html"><a href="interp.html#variable-importance-agnostic"><i class="fa fa-check"></i><b>13.1.3</b> Variable importance (agnostic)</a></li>
<li class="chapter" data-level="13.1.4" data-path="interp.html"><a href="interp.html#partial-dependence-plot"><i class="fa fa-check"></i><b>13.1.4</b> Partial dependence plot</a></li>
</ul></li>
<li class="chapter" data-level="13.2" data-path="interp.html"><a href="interp.html#local-interpretations"><i class="fa fa-check"></i><b>13.2</b> Local interpretations</a><ul>
<li class="chapter" data-level="13.2.1" data-path="interp.html"><a href="interp.html#lime"><i class="fa fa-check"></i><b>13.2.1</b> LIME</a></li>
<li class="chapter" data-level="13.2.2" data-path="interp.html"><a href="interp.html#shapley-values"><i class="fa fa-check"></i><b>13.2.2</b> Shapley values</a></li>
<li class="chapter" data-level="13.2.3" data-path="interp.html"><a href="interp.html#breakdown"><i class="fa fa-check"></i><b>13.2.3</b> Breakdown</a></li>
</ul></li>
</ul></li>
<li class="chapter" data-level="14" data-path="causality.html"><a href="causality.html"><i class="fa fa-check"></i><b>14</b> Two key concepts: causality and non-stationarity</a><ul>
<li class="chapter" data-level="14.1" data-path="causality.html"><a href="causality.html#causality-1"><i class="fa fa-check"></i><b>14.1</b> Causality</a><ul>
<li class="chapter" data-level="14.1.1" data-path="causality.html"><a href="causality.html#granger"><i class="fa fa-check"></i><b>14.1.1</b> Granger causality</a></li>
<li class="chapter" data-level="14.1.2" data-path="causality.html"><a href="causality.html#causal-additive-models"><i class="fa fa-check"></i><b>14.1.2</b> Causal additive models</a></li>
<li class="chapter" data-level="14.1.3" data-path="causality.html"><a href="causality.html#structural-time-series-models"><i class="fa fa-check"></i><b>14.1.3</b> Structural time series models</a></li>
</ul></li>
<li class="chapter" data-level="14.2" data-path="causality.html"><a href="causality.html#nonstat"><i class="fa fa-check"></i><b>14.2</b> Dealing with changing environments</a><ul>
<li class="chapter" data-level="14.2.1" data-path="causality.html"><a href="causality.html#non-stationarity-yet-another-illustration"><i class="fa fa-check"></i><b>14.2.1</b> Non-stationarity: yet another illustration</a></li>
<li class="chapter" data-level="14.2.2" data-path="causality.html"><a href="causality.html#online-learning"><i class="fa fa-check"></i><b>14.2.2</b> Online learning</a></li>
<li class="chapter" data-level="14.2.3" data-path="causality.html"><a href="causality.html#homogeneous-transfer-learning"><i class="fa fa-check"></i><b>14.2.3</b> Homogeneous transfer learning</a></li>
</ul></li>
</ul></li>
<li class="chapter" data-level="15" data-path="unsup.html"><a href="unsup.html"><i class="fa fa-check"></i><b>15</b> Unsupervised learning</a><ul>
<li class="chapter" data-level="15.1" data-path="unsup.html"><a href="unsup.html#corpred"><i class="fa fa-check"></i><b>15.1</b> The problem with correlated predictors</a></li>
<li class="chapter" data-level="15.2" data-path="unsup.html"><a href="unsup.html#principal-component-analysis-and-autoencoders"><i class="fa fa-check"></i><b>15.2</b> Principal component analysis and autoencoders</a><ul>
<li class="chapter" data-level="15.2.1" data-path="unsup.html"><a href="unsup.html#a-bit-of-algebra"><i class="fa fa-check"></i><b>15.2.1</b> A bit of algebra</a></li>
<li class="chapter" data-level="15.2.2" data-path="unsup.html"><a href="unsup.html#pca"><i class="fa fa-check"></i><b>15.2.2</b> PCA</a></li>
<li class="chapter" data-level="15.2.3" data-path="unsup.html"><a href="unsup.html#ae"><i class="fa fa-check"></i><b>15.2.3</b> Autoencoders</a></li>
<li class="chapter" data-level="15.2.4" data-path="unsup.html"><a href="unsup.html#application"><i class="fa fa-check"></i><b>15.2.4</b> Application</a></li>
</ul></li>
<li class="chapter" data-level="15.3" data-path="unsup.html"><a href="unsup.html#clustering-via-k-means"><i class="fa fa-check"></i><b>15.3</b> Clustering via k-means</a></li>
<li class="chapter" data-level="15.4" data-path="unsup.html"><a href="unsup.html#nearest-neighbors"><i class="fa fa-check"></i><b>15.4</b> Nearest neighbors</a></li>
<li class="chapter" data-level="15.5" data-path="unsup.html"><a href="unsup.html#coding-exercise-2"><i class="fa fa-check"></i><b>15.5</b> Coding exercise</a></li>
</ul></li>
<li class="chapter" data-level="16" data-path="RL.html"><a href="RL.html"><i class="fa fa-check"></i><b>16</b> Reinforcement learning</a><ul>
<li class="chapter" data-level="16.1" data-path="RL.html"><a href="RL.html#theoretical-layout"><i class="fa fa-check"></i><b>16.1</b> Theoretical layout</a><ul>
<li class="chapter" data-level="16.1.1" data-path="RL.html"><a href="RL.html#general-framework"><i class="fa fa-check"></i><b>16.1.1</b> General framework</a></li>
<li class="chapter" data-level="16.1.2" data-path="RL.html"><a href="RL.html#q-learning"><i class="fa fa-check"></i><b>16.1.2</b> Q-learning</a></li>
<li class="chapter" data-level="16.1.3" data-path="RL.html"><a href="RL.html#sarsa"><i class="fa fa-check"></i><b>16.1.3</b> SARSA</a></li>
</ul></li>
<li class="chapter" data-level="16.2" data-path="RL.html"><a href="RL.html#the-curse-of-dimensionality"><i class="fa fa-check"></i><b>16.2</b> The curse of dimensionality</a></li>
<li class="chapter" data-level="16.3" data-path="RL.html"><a href="RL.html#policy-gradient"><i class="fa fa-check"></i><b>16.3</b> Policy gradient</a><ul>
<li class="chapter" data-level="16.3.1" data-path="RL.html"><a href="RL.html#principle-2"><i class="fa fa-check"></i><b>16.3.1</b> Principle</a></li>
<li class="chapter" data-level="16.3.2" data-path="RL.html"><a href="RL.html#extensions-2"><i class="fa fa-check"></i><b>16.3.2</b> Extensions</a></li>
</ul></li>
<li class="chapter" data-level="16.4" data-path="RL.html"><a href="RL.html#simple-examples"><i class="fa fa-check"></i><b>16.4</b> Simple examples</a><ul>
<li class="chapter" data-level="16.4.1" data-path="RL.html"><a href="RL.html#q-learning-with-simulations"><i class="fa fa-check"></i><b>16.4.1</b> Q-learning with simulations</a></li>
<li class="chapter" data-level="16.4.2" data-path="RL.html"><a href="RL.html#RLemp2"><i class="fa fa-check"></i><b>16.4.2</b> Q-learning with market data</a></li>
</ul></li>
<li class="chapter" data-level="16.5" data-path="RL.html"><a href="RL.html#concluding-remarks"><i class="fa fa-check"></i><b>16.5</b> Concluding remarks</a></li>
<li class="chapter" data-level="16.6" data-path="RL.html"><a href="RL.html#exercises"><i class="fa fa-check"></i><b>16.6</b> Exercises</a></li>
</ul></li>
<li class="part"><span><b>V Appendix</b></span></li>
<li class="chapter" data-level="17" data-path="data-description.html"><a href="data-description.html"><i class="fa fa-check"></i><b>17</b> Data description</a></li>
<li class="chapter" data-level="18" data-path="solutions-to-exercises.html"><a href="solutions-to-exercises.html"><i class="fa fa-check"></i><b>18</b> Solutions to exercises</a><ul>
<li class="chapter" data-level="18.1" data-path="solutions-to-exercises.html"><a href="solutions-to-exercises.html#chapter-3"><i class="fa fa-check"></i><b>18.1</b> Chapter 3</a></li>
<li class="chapter" data-level="18.2" data-path="solutions-to-exercises.html"><a href="solutions-to-exercises.html#chapter-4"><i class="fa fa-check"></i><b>18.2</b> Chapter 4</a></li>
<li class="chapter" data-level="18.3" data-path="solutions-to-exercises.html"><a href="solutions-to-exercises.html#chapter-5"><i class="fa fa-check"></i><b>18.3</b> Chapter 5</a></li>
<li class="chapter" data-level="18.4" data-path="solutions-to-exercises.html"><a href="solutions-to-exercises.html#chapter-6"><i class="fa fa-check"></i><b>18.4</b> Chapter 6</a></li>
<li class="chapter" data-level="18.5" data-path="solutions-to-exercises.html"><a href="solutions-to-exercises.html#chapter-7-the-autoencoder-model"><i class="fa fa-check"></i><b>18.5</b> Chapter 7: the autoencoder model</a></li>
<li class="chapter" data-level="18.6" data-path="solutions-to-exercises.html"><a href="solutions-to-exercises.html#chapter-8"><i class="fa fa-check"></i><b>18.6</b> Chapter 8</a></li>
<li class="chapter" data-level="18.7" data-path="solutions-to-exercises.html"><a href="solutions-to-exercises.html#chapter-11-ensemble-neural-network"><i class="fa fa-check"></i><b>18.7</b> Chapter 11: ensemble neural network</a></li>
<li class="chapter" data-level="18.8" data-path="solutions-to-exercises.html"><a href="solutions-to-exercises.html#chapter-12"><i class="fa fa-check"></i><b>18.8</b> Chapter 12</a><ul>
<li class="chapter" data-level="18.8.1" data-path="solutions-to-exercises.html"><a href="solutions-to-exercises.html#ew-portfolios-with-the-tidyverse"><i class="fa fa-check"></i><b>18.8.1</b> EW portfolios with the tidyverse</a></li>
<li class="chapter" data-level="18.8.2" data-path="solutions-to-exercises.html"><a href="solutions-to-exercises.html#advanced-weighting-function"><i class="fa fa-check"></i><b>18.8.2</b> Advanced weighting function</a></li>
<li class="chapter" data-level="18.8.3" data-path="solutions-to-exercises.html"><a href="solutions-to-exercises.html#functional-programming-in-the-backtest"><i class="fa fa-check"></i><b>18.8.3</b> Functional programming in the backtest</a></li>
</ul></li>
<li class="chapter" data-level="18.9" data-path="solutions-to-exercises.html"><a href="solutions-to-exercises.html#chapter-15"><i class="fa fa-check"></i><b>18.9</b> Chapter 15</a></li>
<li class="chapter" data-level="18.10" data-path="solutions-to-exercises.html"><a href="solutions-to-exercises.html#chapter-16"><i class="fa fa-check"></i><b>18.10</b> Chapter 16</a></li>
</ul></li>
</ul>
</nav>
</div>
<div class="book-body">
<div class="body-inner">
<div class="book-header" role="navigation">
<h1>
<i class="fa fa-circle-o-notch fa-spin"></i><a href="./">Machine Learning for Factor Investing</a>
</h1>
</div>
<div class="page-wrapper" tabindex="-1" role="main">
<div class="page-inner">
<section class="normal" id="section-">
<div id="NN" class="section level1">
<h1><span class="header-section-number">Chapter 7</span> Neural networks</h1>
<p>Neural networks (NNs) are an immensely rich and complicated topic. In this chapter, we introduce the main ideas and concepts behind the simplest architectures of NNs. For more exhaustive treatments of NN idiosyncrasies, we refer to the monographs by <span class="citation">Haykin (<a href="#ref-haykin2009neural" role="doc-biblioref">2009</a>)</span>, <span class="citation">Du and Swamy (<a href="#ref-du2013neural" role="doc-biblioref">2013</a>)</span> and <span class="citation">Goodfellow et al. (<a href="#ref-goodfellow2016deep" role="doc-biblioref">2016</a>)</span>. The latter is available freely online: www.deeplearningbook.org. For a practical introduction, we recommend the great book of <span class="citation">Chollet (<a href="#ref-chollet2017deep" role="doc-biblioref">2017</a>)</span>.</p>
<p>For starters, we briefly comment on the qualification “neural network”. Most experts agree that the term is not very well chosen, as NNs have little to do with how the human brain works (of which we know not that much). This explains why they are often referred to as “artificial neural networks”; we drop the adjective for notational simplicity. We prefer the definition of NNs given by François Chollet: “<em>chains of differentiable, parameterised geometric functions, trained with gradient descent (with gradients obtained via the chain rule)</em>”.</p>
<p>Early references of neural networks in finance are <span class="citation">Bansal and Viswanathan (<a href="#ref-bansal1993no" role="doc-biblioref">1993</a>)</span> and <span class="citation">Eakins, Stansell, and Buck (<a href="#ref-eakins1998analyzing" role="doc-biblioref">1998</a>)</span>. Both have very different goals. In the first one, the authors aim to estimate a <strong>nonlinear form</strong> for the pricing kernel. In the second one, the purpose is to identify and quantify relationships between institutional investments in stocks and the attributes of the firms (an early contribution towards factor investing). An early review (<span class="citation">Burrell and Folarin (<a href="#ref-burrell1997impact" role="doc-biblioref">1997</a>)</span>) lists financial applications of NNs during the 1990s. More recently, <span class="citation">Sezer, Gudelek, and Ozbayoglu (<a href="#ref-sezer2019financial" role="doc-biblioref">2019</a>)</span>, <span class="citation">W. Jiang (<a href="#ref-jiang2020applications" role="doc-biblioref">2020</a>)</span> and <span class="citation">Lim and Zohren (<a href="#ref-lim2020time" role="doc-biblioref">2020</a>)</span> survey the attempts to forecast financial time series with deep-learning models, mainly by computer science scholars.</p>
<p>The pure predictive ability of NNs in financial markets is a popular subject and we further cite for example <span class="citation">Kimoto et al. (<a href="#ref-kimoto1990stock" role="doc-biblioref">1990</a>)</span>, <span class="citation">Enke and Thawornwong (<a href="#ref-enke2005use" role="doc-biblioref">2005</a>)</span>, <span class="citation">Zhang and Wu (<a href="#ref-zhang2009stock" role="doc-biblioref">2009</a>)</span>, <span class="citation">Guresen, Kayakutlu, and Daim (<a href="#ref-guresen2011using" role="doc-biblioref">2011</a>)</span>, <span class="citation">Krauss, Do, and Huck (<a href="#ref-krauss2017deep" role="doc-biblioref">2017</a>)</span>, <span class="citation">Fischer and Krauss (<a href="#ref-fischer2018deep" role="doc-biblioref">2018</a>)</span>, <span class="citation">Aldridge and Avellaneda (<a href="#ref-aldridge2019neural" role="doc-biblioref">2019</a>)</span>, <span class="citation">Babiak and Barunik (<a href="#ref-babiak2020deep" role="doc-biblioref">2020</a>)</span>, <span class="citation">Y. Ma, Han, and Wang (<a href="#ref-ma2020portfolio" role="doc-biblioref">2020</a>)</span>, and <span class="citation">Soleymani and Paquet (<a href="#ref-soleymani2020financial" role="doc-biblioref">2020</a>)</span>.<a href="#fn17" class="footnote-ref" id="fnref17"><sup>17</sup></a> The last reference even combines several types of NNs embedded inside an overarching reinforcement learning structure. This list is very far from exhaustive. In the field of financial economics, recent research on neural networks includes:</p>
<ul>
<li><span class="citation">Feng, Polson, and Xu (<a href="#ref-feng2019deep" role="doc-biblioref">2019</a>)</span> use neural networks to find factors that are the best at explaining the cross-section of stock returns.<br />
</li>
<li><span class="citation">Gu, Kelly, and Xiu (<a href="#ref-gu2018empirical" role="doc-biblioref">2020</a><a href="#ref-gu2018empirical" role="doc-biblioref">b</a>)</span> map firm attributes and macro-economic variables into future returns. This yields a strong predictive tool that forecasts future returns accurately.<br />
</li>
<li><span class="citation">Luyang Chen, Pelger, and Zhu (<a href="#ref-chen2019deep" role="doc-biblioref">2020</a>)</span> estimate the pricing kernel with a complex neural network structure including a generative adversarial network. This again gives crucial information on the structure of expected stock returns and can be used for portfolio construction (by building an accurate maximum Sharpe ratio policy).</li>
</ul>
<div id="the-original-perceptron" class="section level2">
<h2><span class="header-section-number">7.1</span> The original perceptron</h2>
<p>
The origins of NNs go back at least to <span class="citation">Rosenblatt (<a href="#ref-rosenblatt1958perceptron" role="doc-biblioref">1958</a>)</span>, whose perceptron was designed for binary classification. For simplicity, let us assume that the output is <span class="math inline">\(\{0\)</span> = do not invest<span class="math inline">\(\}\)</span> versus <span class="math inline">\(\{1\)</span> = invest<span class="math inline">\(\}\)</span> (e.g., derived from the sign of the return). Given the current nomenclature, a perceptron can be defined as an activated linear mapping. The model is the following:</p>
<p><span class="math display">\[f(\mathbf{x})=\left\{ \begin{array}{lll}
1 & \text{if } \mathbf{x}'\mathbf{w}+b >0\\
0 &\text{otherwise}
\end{array}\right.\]</span>
The vector of weights <span class="math inline">\(\mathbf{w}\)</span> scales the variables and the bias <span class="math inline">\(b\)</span> shifts the decision barrier. Given values for <span class="math inline">\(b\)</span> and <span class="math inline">\(w_j\)</span>, the error is <span class="math inline">\(\epsilon_i=y_i-1_{\left\{\sum_{j=1}^Jx_{i,j}w_j+b>0\right\}}\)</span>. As is customary, we set <span class="math inline">\(b=w_0\)</span> and add an initial constant column to <span class="math inline">\(x\)</span>: <span class="math inline">\(x_{i,0}=1\)</span>, so that <span class="math inline">\(\epsilon_i=y_i-1_{\left\{\sum_{j=0}^Jx_{i,j}w_j>0\right\}}\)</span>. In contrast to regressions, perceptrons do not have closed-form solutions. The optimal weights can only be approximated. Just like for regression, one way to derive good weights is to minimize the sum of squared errors. To this end, the simplest way to proceed is to</p>
<ol style="list-style-type: decimal">
<li>compute the current model value at point <span class="math inline">\(\textbf{x}_i\)</span>: <span class="math inline">\(\tilde{y}_i=1_{\left\{\sum_{j=0}^Jw_jx_{i,j}>0\right\}}\)</span>,</li>
<li>adjust the weight vector: <span class="math inline">\(w_j \leftarrow w_j + \eta (y_i-\tilde{y}_i)x_{i,j}\)</span>,</li>
</ol>
<p>which amounts to shifting the weights in the direction of the error. Just like for tree methods, the scaling factor <span class="math inline">\(\eta\)</span> is the learning rate. A large <span class="math inline">\(\eta\)</span> implies large shifts: learning will be rapid, but the process may overshoot and fail to converge. A small <span class="math inline">\(\eta\)</span> is usually preferable, as it helps reduce the risk of overfitting.</p>
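<p>The two-step procedure above can be sketched in a few lines. The following is a purely illustrative Python/NumPy version (not taken from a library); the toy data, the function name and all parameter values are ours:</p>

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, n_epochs=20, seed=0):
    """Online perceptron training with the two-step update rule.

    X: (n, J) feature matrix; y: (n,) labels in {0, 1}.
    A constant column is prepended so the bias b = w_0 is learned
    alongside the other weights.
    """
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # x_{i,0} = 1
    w = rng.normal(scale=0.01, size=Xb.shape[1])    # small random start
    for _ in range(n_epochs):
        for x_i, y_i in zip(Xb, y):
            y_tilde = 1.0 if x_i @ w > 0 else 0.0   # current model value
            w += eta * (y_i - y_tilde) * x_i        # shift the weights
    return w

# Linearly separable toy data: invest (1) when characteristics are high
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -1.5]])
y = np.array([1, 1, 0, 0])
w = train_perceptron(X, y)
preds = (np.hstack([np.ones((4, 1)), X]) @ w > 0).astype(int)
```

<p>On separable data such as this, the loop recovers a separating hyperplane after a handful of epochs; the learning rate <span class="math inline">\(\eta\)</span> only changes how quickly it gets there.</p>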
<p>In Figure <a href="NN.html#fig:perceptron">7.1</a>, we illustrate this mechanism. The initial model (dashed grey line) was trained on 7 points (3 red and 4 blue). A new black point comes in.</p>
<div class="figure" style="text-align: center"><span id="fig:perceptron"></span>
<img src="images/NN_percep_scheme.png" alt="Scheme of a perceptron." width="450px" />
<p class="caption">
FIGURE 7.1: Scheme of a perceptron.
</p>
</div>
<ul>
<li>if the point is red, there is no need for adjustment: it is labelled correctly as it lies on the right side of the border.<br />
</li>
<li>if the point is blue, then the model needs to be updated appropriately. Given the rule mentioned above, this means adjusting the slope of the line downwards. Depending on <span class="math inline">\(\eta\)</span>, the shift may or may not be sufficient to change the classification of the new point.</li>
</ul>
<p>At the time of its inception, the perceptron was an immense breakthrough which received intense media coverage (see <span class="citation">Olazaran (<a href="#ref-olazaran1996sociological" role="doc-biblioref">1996</a>)</span> and <span class="citation">Anderson and Rosenfeld (<a href="#ref-anderson2000talking" role="doc-biblioref">2000</a>)</span>). Its rather simple structure was progressively generalized to networks (combinations) of perceptrons. Each one of them is a simple unit, and units are gathered into layers. The next section describes the organization of simple multilayer perceptrons (MLPs).</p>
</div>
<div id="multilayer-perceptron" class="section level2">
<h2><span class="header-section-number">7.2</span> Multilayer perceptron</h2>
<p></p>
<div id="introduction-and-notations" class="section level3">
<h3><span class="header-section-number">7.2.1</span> Introduction and notations</h3>
<p>A perceptron can be viewed as a linear model to which a particular function is applied: the Heaviside (step) function. Other choices of functions are naturally possible. In the NN jargon, they are called activation functions. Their purpose is to introduce nonlinearity into otherwise purely linear models.</p>
<p>Just as random forests combine trees, the idea behind neural networks is to combine perceptron-like building blocks. A popular representation of neural networks is shown in Figure <a href="NN.html#fig:NNnaive">7.2</a>. This scheme is overly simplistic: it hides what is really going on, namely that there is a perceptron in each green circle and that each output is activated by some function before it is sent to the final output aggregation. This is why such a model is called a multilayer perceptron (MLP).</p>
<div class="figure" style="text-align: center"><span id="fig:NNnaive"></span>
<img src="images/nn.png" alt="Simplified scheme of a multi-layer perceptron." width="480px" />
<p class="caption">
FIGURE 7.2: Simplified scheme of a multi-layer perceptron.
</p>
</div>
<p>A more faithful account of what is going on is laid out in Figure <a href="NN.html#fig:MLperceptron">7.3</a>.</p>
<div class="figure" style="text-align: center"><span id="fig:MLperceptron"></span>
<img src="images/NN_scheme.png" alt="Detailed scheme of a perceptron with 2 intermediate layers." width="793" />
<p class="caption">
FIGURE 7.3: Detailed scheme of a perceptron with 2 intermediate layers.
</p>
</div>
<p>Before we proceed with comments, we introduce some notation that will be used throughout the chapter.</p>
<ul>
<li>The data is separated into a matrix <span class="math inline">\(\textbf{X}=x_{i,j}\)</span> of features and a vector of output values <span class="math inline">\(\textbf{y}=y_i\)</span>. <span class="math inline">\(\textbf{x}\)</span> or <span class="math inline">\(\textbf{x}_i\)</span> denotes one line of <span class="math inline">\(\textbf{X}\)</span>.</li>
<li>A neural network will have <span class="math inline">\(L\ge1\)</span> layers and for each layer <span class="math inline">\(l\)</span>, the number of units is <span class="math inline">\(U_l\ge1\)</span>.</li>
<li>The weights for unit <span class="math inline">\(k\)</span> located in layer <span class="math inline">\(l\)</span> are denoted with <span class="math inline">\(\textbf{w}_{k}^{(l)}=w_{k,j}^{(l)}\)</span> and the corresponding biases <span class="math inline">\(b_{k}^{(l)}\)</span>. The length of <span class="math inline">\(\textbf{w}_{k}^{(l)}\)</span> is equal to <span class="math inline">\(U_{l-1}\)</span>. <span class="math inline">\(k\)</span> refers to the location of the unit in layer <span class="math inline">\(l\)</span>, while <span class="math inline">\(j\)</span> refers to the unit in layer <span class="math inline">\(l-1\)</span>.</li>
<li>Outputs (post-activation) are denoted <span class="math inline">\(o_{i,k}^{(l)}\)</span> for instance <span class="math inline">\(i\)</span>, layer <span class="math inline">\(l\)</span> and unit <span class="math inline">\(k\)</span>.</li>
</ul>
<p>The process is the following. When entering the network, the data goes though the initial linear mapping:<br />
<span class="math display">\[v_{i,k}^{(1)}=\textbf{x}_i'\textbf{w}^{(1)}_k+b_k^{(1)}, \text{for } l=1, \quad k \in [1,U_1], \]</span><br />
which is then transformed by a non-linear function <span class="math inline">\(f^{(1)}\)</span>. The result of this alteration is then given as input to the next layer and so on. The linear forms will be repeated (with different weights) for each layer of the network:
<span class="math display">\[v_{i,k}^{(l)}=(\textbf{o}^{(l-1)}_i)'\textbf{w}^{(l)}_k+b_k^{(l)}, \text{for } l \ge 2, \quad k \in [1,U_l]. \]</span><br />
The connections between the layers are the so-called outputs, which are basically the linear mappings to which the activation functions <span class="math inline">\(f^{(l)}\)</span> have been applied. The output of layer <span class="math inline">\(l\)</span> is the input of layer <span class="math inline">\(l+1\)</span>.
<span class="math display">\[o_{i,k}^{(l)}=f^{(l)}\left(v_{i,k}^{(l)}\right).\]</span><br />
Finally, the terminal stage aggregates the outputs from the last layer:<br />
<span class="math display">\[\tilde{y}_i =f^{(L+1)} \left((\textbf{o}^{(L)}_i)'\textbf{w}^{(L+1)}+b^{(L+1)}\right).\]</span></p>
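<p>To make the forward pass concrete, the recursion above can be sketched in a few lines of Python (the code samples in this book use R/Keras; this is a standalone sketch in plain Python, and the weights are toy values chosen purely for illustration):</p>

```python
# Minimal sketch of the forward pass described above (toy weights, not the book's code).
def relu(v):
    # A common choice of activation function f^{(l)}
    return max(0.0, v)

def layer_forward(inputs, weights, biases, activation):
    # One layer: o_k = f(v_k) with v_k = sum_j w_{k,j} * input_j + b_k
    return [activation(sum(w * x for w, x in zip(w_k, inputs)) + b_k)
            for w_k, b_k in zip(weights, biases)]

def forward(x, layers):
    # Chain the layers: the output of layer l is the input of layer l+1
    o = x
    for weights, biases, activation in layers:
        o = layer_forward(o, weights, biases, activation)
    return o

# Toy architecture: 3 features -> 2 hidden units (ReLU) -> 1 output (identity)
network = [
    ([[0.1, -0.2, 0.3], [0.4, 0.0, -0.1]], [0.05, -0.05], relu),
    ([[0.5, -0.5]], [0.0], lambda v: v),
]
y_tilde = forward([1.0, 2.0, 3.0], network)[0]   # ≈ 0.3
```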
<p>In the forward-propagation of the input, the activation function naturally plays an important role. In Figure <a href="NN.html#fig:activationf">7.4</a>, we plot the most usual activation functions used by neural network libraries.</p>
<div class="figure" style="text-align: center"><span id="fig:activationf"></span>
<img src="images/activation.png" alt="Plot of the most common activation functions." width="580" />
<p class="caption">
FIGURE 7.4: Plot of the most common activation functions.
</p>
</div>
<p>Let us rephrase the process through the lens of factor investing. The input <span class="math inline">\(\textbf{x}\)</span> consists of the characteristics of the firms. The first step is to multiply their values by weights and add a bias. This is performed for all the units of the first layer. The output, which is a linear combination of the inputs, is then transformed by the activation function. Each unit provides one value and all of these values are fed to the second layer following the same process. This is iterated until the end of the network. The purpose of the last layer is to yield an output shape that corresponds to the label: if the label is numerical, the output is a single number; if it is categorical, it is usually a vector with length equal to the number of categories, where each coordinate indicates the probability that the instance belongs to the corresponding category.</p>
<p>It is possible to use a final activation function after the output. This can have a huge impact on the result. Indeed, if the labels are returns, applying a sigmoid function at the very end will be disastrous because the sigmoid is always positive, while returns can be negative.</p>
</div>
<div id="universal-approximation" class="section level3">
<h3><span class="header-section-number">7.2.2</span> Universal approximation</h3>
<p></p>
<p>One reason neural networks work well is that they are <em>universal approximators</em>. Given any bounded continuous function, there exists a one-layer network that can approximate this function up to arbitrary precision (see <span class="citation">Cybenko (<a href="#ref-cybenko1989approximation" role="doc-biblioref">1989</a>)</span> for early references, section 4.2 in <span class="citation">Du and Swamy (<a href="#ref-du2013neural" role="doc-biblioref">2013</a>)</span> and section 6.4.1 in <span class="citation">Goodfellow et al. (<a href="#ref-goodfellow2016deep" role="doc-biblioref">2016</a>)</span> for more exhaustive lists of papers, and <span class="citation">Guliyev and Ismailov (<a href="#ref-guliyev2018approximation" role="doc-biblioref">2018</a>)</span> for recent results).</p>
<p>Formally, a one-layer perceptron is defined by
<span class="math display">\[f_n(\textbf{x})=\sum_{l=1}^nc_l\phi(\textbf{x}\textbf{w}_l+\textbf{b}_l)+c_0,\]</span>
where <span class="math inline">\(\phi\)</span> is a (non-constant) bounded continuous function. Then, for any <span class="math inline">\(\epsilon>0\)</span>, it is possible to find one <span class="math inline">\(n\)</span> such that for any continuous function <span class="math inline">\(f\)</span> on the unit hypercube <span class="math inline">\([0,1]^d\)</span>,
<span class="math display">\[|f(\textbf{x})-f_n(\textbf{x})|< \epsilon, \quad \forall \textbf{x} \in [0,1]^d.\]</span></p>
<p>This result is rather intuitive: it suffices to add units to the layer to improve the fit. The process is more or less analogous to polynomial approximation, though some subtleties arise depending on the properties of the activation functions (boundedness, smoothness, convexity, etc.). We refer to <span class="citation">Costarelli, Spigler, and Vinti (<a href="#ref-costarelli2016survey" role="doc-biblioref">2016</a>)</span> for a survey on this topic.</p>
<p>The raw results on universal approximation imply that any well-behaved function <span class="math inline">\(f\)</span> can be approached sufficiently closely by a simple neural network, as long as the number of units can be arbitrarily large. However, they do not directly relate to the learning phase, i.e., when the model is optimized with respect to a particular dataset. In a series of papers (<span class="citation">Barron (<a href="#ref-barron1993universal" role="doc-biblioref">1993</a>)</span> and <span class="citation">Barron (<a href="#ref-barron1994approximation" role="doc-biblioref">1994</a>)</span>, notably), Barron gives a much more precise characterization of what neural networks can achieve. <span class="citation">Barron (<a href="#ref-barron1993universal" role="doc-biblioref">1993</a>)</span>, for instance, proves a more precise version of universal approximation: for particular neural networks (with sigmoid activation), <span class="math inline">\(\mathbb{E}[(f(\textbf{x})-f_n(\textbf{x}))^2]\le c_f/n\)</span>, which gives a speed of convergence related to the size of the network. In the expectation, the random term is <span class="math inline">\(\textbf{x}\)</span>: this corresponds to the case where the data is considered to be a sample of i.i.d. observations from a fixed distribution (the most common assumption in machine learning).</p>
<p>Below, we state one important result that is easy to interpret; it is taken from <span class="citation">Barron (<a href="#ref-barron1994approximation" role="doc-biblioref">1994</a>)</span>.</p>
<p>In the sequel, <span class="math inline">\(f_n\)</span> corresponds to a possibly penalized neural network with only one intermediate layer with <span class="math inline">\(n\)</span> units and sigmoid activation function. Moreover, both the supports of the predictors and the label are assumed to be bounded (which is not a major constraint). The most important metric in a regression exercise is the mean squared error (MSE) and the main result is a bound (in order of magnitude) on this quantity. For <span class="math inline">\(N\)</span> randomly sampled i.i.d. points <span class="math inline">\(y_i=f(x_i)+\epsilon_i\)</span> on which <span class="math inline">\(f_n\)</span> is trained, the best possible empirical MSE behaves like</p>
<p><span class="math display" id="eq:univapprox">\[\begin{equation}
\tag{7.1}
\mathbb{E}\left[(f(x)-f_n(x))^2 \right]=\underbrace{O\left(\frac{c_f}{n} \right)}_{\text{size of network}}+\ \underbrace{O\left(\frac{nK \log(N)}{N} \right)}_{\text{size of sample}},
\end{equation}\]</span>
where <span class="math inline">\(K\)</span> is the dimension of the input (number of columns) and <span class="math inline">\(c_f\)</span> is a constant that depends on the generator function <span class="math inline">\(f\)</span>. The above quantity provides a bound on the error that can be achieved by the best possible neural network given a dataset of size <span class="math inline">\(N\)</span>.</p>
<p>There are clearly two components in the decomposition of this bound. The first one pertains to the complexity of the network. Just as in the original universal approximation theorem, the error decreases with the number of units in the network. But this is not enough! Indeed, the sample size is of course a key driver of the quality of learning (of i.i.d. observations). The second component of the bound indicates that the error decreases at a slightly slower pace with respect to the number of observations (<span class="math inline">\(\log(N)/N\)</span>) and is linear in the number of units and the size of the input. This clearly underlines the link (trade-off?) between sample size and model complexity: having a very complex model is useless if the sample is small, just as a simple model will not capture the fine relationships in a large dataset.</p>
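<p>The trade-off embedded in Equation (7.1) is easy to visualize numerically. The short Python sketch below evaluates the two components of the bound for a fixed sample size (the constants <span class="math inline">\(c_f=1\)</span> and <span class="math inline">\(K=10\)</span> are arbitrary illustrative choices): the bound first decreases with the number of units and then increases once the estimation term dominates.</p>

```python
import math

def mse_bound(n, N, K, c_f=1.0):
    # Order-of-magnitude bound of Eq. (7.1): approximation term + estimation term
    return c_f / n + n * K * math.log(N) / N

N, K = 100_000, 10                      # illustrative sample size and input dimension
bounds = {n: mse_bound(n, N, K) for n in (1, 10, 30, 100, 1000)}
best_n = min(bounds, key=bounds.get)    # intermediate sizes beat both extremes
```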
<p>Overall, a neural network is a possibly very complicated function with a lot of parameters. In linear regressions, it is possible to increase the fit by spuriously adding exogenous variables. In neural networks, it suffices to increase the number of parameters by arbitrarily adding units to the layer(s). This is of course a very bad idea because high-dimensional networks will mostly capture the particularities of the sample they are trained on.</p>
</div>
<div id="backprop" class="section level3">
<h3><span class="header-section-number">7.2.3</span> Learning via back-propagation</h3>
<p></p>
<p>Just like for tree methods, neural networks are trained by minimizing some loss function subject to some penalization:
<span class="math display">\[O=\sum_{i=1}^I \text{loss}(y_i,\tilde{y}_i)+ \text{penalization},\]</span>
where <span class="math inline">\(\tilde{y}_i\)</span> are the values obtained by the model and <span class="math inline">\(y_i\)</span> are the <em>true</em> values of the instances. A simple requirement that eases computation is that the loss function be differentiable. The most common choices are the squared error for regression tasks and cross-entropy for classification tasks. We discuss the technicalities of classification in the next subsection.</p>
<p>The training of a neural network amounts to altering the weights (and biases) of all units in all layers so that <span class="math inline">\(O\)</span> defined above is as small as possible. To ease the notation and given that the <span class="math inline">\(y_i\)</span> are fixed, let us write <span class="math inline">\(D(\tilde{y}_i(\textbf{W}))=\text{loss}(y_i,\tilde{y}_i)\)</span>, where <span class="math inline">\(\textbf{W}\)</span> denotes the entirety of weights and biases in the network. The updating of the weights will be performed via gradient descent, i.e., via</p>
<p><span class="math display" id="eq:graddesc">\[\begin{equation}
\tag{7.2}
\textbf{W} \leftarrow \textbf{W}-\eta \frac{\partial D(\tilde{y}_i) }{\partial \textbf{W}}.
\end{equation}\]</span></p>
<p> </p>
<p>This mechanism is the most classical in the optimization literature and we illustrate it in Figure <a href="NN.html#fig:newton">7.5</a>. We highlight the possible suboptimality of large learning rates. In the diagram, the descent associated with the high <span class="math inline">\(\eta\)</span> will oscillate around the optimal point, whereas the one related to the small <span class="math inline">\(\eta\)</span> will converge more directly.</p>
<p>The complicated task in the above equation is to compute the gradient (derivative), which indicates the direction in which the adjustment should be made. The problem is that the successive nested layers and associated activations require many iterations of the chain rule for differentiation.</p>
<div class="figure" style="text-align: center"><span id="fig:newton"></span>
<img src="images/Newton.png" alt="Outline of gradient descent." width="300" />
<p class="caption">
FIGURE 7.5: Outline of gradient descent.
</p>
</div>
<p>The most common way to approximate a derivative is probably the finite difference method. Under the usual assumptions (the loss is twice differentiable), the centered difference satisfies:</p>
<p><span class="math display">\[\frac{\partial D(\tilde{y}_i(w_k))}{\partial w_k} = \frac{D(\tilde{y}_i(w_k+h))-D(\tilde{y}_i(w_k-h))}{2h}+O(h^2),\]</span>
where <span class="math inline">\(h>0\)</span> is some arbitrarily small number. In spite of its apparent simplicity, this method is costly computationally because it requires a number of operations of the magnitude of the number of weights.</p>
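<p>A quick sanity check of the centered difference on a function with a known derivative (a standalone toy example, unrelated to any particular network):</p>

```python
def central_diff(f, w, h=1e-5):
    # Centered finite difference: (f(w+h) - f(w-h)) / (2h), accurate up to O(h^2)
    return (f(w + h) - f(w - h)) / (2.0 * h)

# Check on f(w) = w^3, whose exact derivative at w = 2 is 3 * 2^2 = 12
grad_approx = central_diff(lambda w: w ** 3, 2.0)
```

<p>Note that each partial derivative requires two function evaluations, which is why this method scales poorly with the number of weights.</p>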
<p>Luckily, there is a small trick that can considerably ease and speed up the computation. The idea is to simply follow the chain rule and recycle terms along the way. Let us start by recalling
<span class="math display">\[\tilde{y}_i =f^{(L+1)} \left((\textbf{o}^{(L)}_i)'\textbf{w}^{(L+1)}+b^{(L+1)}\right)=f^{(L+1)}\left(b^{(L+1)}+\sum_{k=1}^{U_L} w^{(L+1)}_ko^{(L)}_{i,k} \right),\]</span> so that if we differentiate with the most immediate weights and biases, we get:
<span class="math display" id="eq:backprop1">\[\begin{align}
\frac{\partial D(\tilde{y}_i)}{\partial w_k^{(L+1)}}&=D'(\tilde{y}_i) \left(f^{(L+1)} \right)'\left( b^{(L+1)}+\sum_{k=1}^{U_L} w^{(L+1)}_ko^{(L)}_{i,k} \right)o^{(L)}_{i,k} \\ \tag{7.3}
&= D'(\tilde{y}_i) \left(f^{(L+1)} \right)'\left( v^{(L+1)}_{i,k} \right)o^{(L)}_{i,k} \\
\frac{\partial D(\tilde{y}_i)}{\partial b^{(L+1)}}&=D'(\tilde{y}_i) \left(f^{(L+1)} \right)'\left( b^{(L+1)}+\sum_{k=1}^{U_L} w^{(L+1)}_ko^{(L)}_{i,k} \right).
\end{align}\]</span></p>
<p>This is the easiest part. We must now go back one layer and this can only be done via the chain rule. To access layer <span class="math inline">\(L\)</span>, we recall the identity <span class="math inline">\(v_{i,k}^{(L)}=(\textbf{o}^{(L-1)}_i)'\textbf{w}^{(L)}_k+b_k^{(L)}=b_k^{(L)}+\sum_{j=1}^{U_{L-1}}o^{(L-1)}_{i,j}w^{(L)}_{k,j}\)</span>.
We can then proceed:</p>
<p><span class="math display">\[\begin{align}
\frac{\partial D(\tilde{y}_i)}{\partial w_{k,j}^{(L)}}&=\frac{\partial D(\tilde{y}_i)}{\partial v^{(L)}_{i,k}}\frac{\partial v^{(L)}_{i,k}}{\partial w_{k,j}^{(L)}} = \frac{\partial D(\tilde{y}_i)}{\partial v^{(L)}_{i,k}}o^{(L-1)}_{i,j}\\
&=\frac{\partial D(\tilde{y}_i)}{\partial o^{(L)}_{i,k}} \frac{\partial o^{(L)}_{i,k} }{\partial v^{(L)}_{i,k}} o^{(L-1)}_{i,j} = \frac{\partial D(\tilde{y}_i)}{\partial o^{(L)}_{i,k}} (f^{(L)})'(v_{i,k}^{(L)}) o^{(L-1)}_{i,j} \\
&=\underbrace{D'(\tilde{y}_i) \left(f^{(L+1)} \right)'\left(v^{(L+1)}_{i,k} \right)}_{\text{computed above!}} w^{(L+1)}_k (f^{(L)})'(v_{i,k}^{(L)}) o^{(L-1)}_{i,j},
\end{align}\]</span></p>
<p>where, as we show in the last line, one part of the derivative was already computed in the previous step (Equation <a href="NN.html#eq:backprop1">(7.3)</a>). Hence, we can recycle this number and only focus on the right part of the expression.</p>
<p>The magic of the so-called back-propagation is that this will hold true for each step of the differentiation. When computing the gradient for weights and biases in layer <span class="math inline">\(l\)</span>, there will be two parts: one that can be recycled from previous layers and another, local part, that depends only on the values and activation function of the current layer. A nice illustration of this process is given by the Google developer team: playground.tensorflow.org.</p>
<p>When the data is formatted using tensors, it is possible to resort to vectorization so that the number of calls is limited to the order of magnitude of the number of nodes (units) in the network.</p>
<p>The back-propagation algorithm can be summarized as follows. Given a sample of points (possibly just one):</p>
<ol style="list-style-type: decimal">
<li>the data flows from left as is described in Figure <a href="NN.html#fig:backp">7.6</a>. The blue arrows show the <strong>forward pass</strong>;<br />
</li>
<li>this allows the computation of the error or loss function;<br />
</li>
<li>all derivatives of this function (w.r.t. weights and biases) are computed, starting from the last layer and diffusing to the left (hence the term back-propagation) - the green arrows show the <strong>backward pass</strong>;<br />
</li>
<li>all weights and biases can be updated to take the sample points into account (the model is adjusted to reduce the loss/error stemming from these points).</li>
</ol>
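<p>The four steps above can be sketched for a minimal network with one hidden unit (sigmoid activation, squared-error loss). This is a standalone Python illustration with toy values, not the book's code; the finite-difference check at the end confirms that the backward pass recovers the correct gradient.</p>

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward_backward(x, y, w1, b1, w2, b2):
    # Forward pass (steps 1-2): compute the prediction and the loss
    v1 = w1 * x + b1
    o1 = sigmoid(v1)
    y_tilde = w2 * o1 + b2               # identity activation at the output
    loss = 0.5 * (y_tilde - y) ** 2
    # Backward pass (step 3): chain rule, recycling dloss/dy_tilde along the way
    d_y = y_tilde - y                    # D'(y_tilde)
    g_w2, g_b2 = d_y * o1, d_y           # gradients of the last layer
    d_v1 = d_y * w2 * o1 * (1.0 - o1)    # recycled term times sigmoid'(v1)
    g_w1, g_b1 = d_v1 * x, d_v1          # gradients of the hidden layer
    return loss, (g_w1, g_b1, g_w2, g_b2)

loss, grads = forward_backward(x=0.5, y=1.0, w1=0.8, b1=-0.1, w2=1.2, b2=0.0)

# Sanity check of the hidden-layer gradient with a centered finite difference
h = 1e-6
lp, _ = forward_backward(0.5, 1.0, 0.8 + h, -0.1, 1.2, 0.0)
lm, _ = forward_backward(0.5, 1.0, 0.8 - h, -0.1, 1.2, 0.0)
assert abs(grads[0] - (lp - lm) / (2 * h)) < 1e-6
```

<p>Step 4 (the update itself) is then a single application of Equation (7.2) with the gradients returned above.</p>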
<div class="figure" style="text-align: center"><span id="fig:backp"></span>
<img src="images/backprop.png" alt="Diagram of back-propagation." width="768" />
<p class="caption">
FIGURE 7.6: Diagram of back-propagation.
</p>
</div>
<p>This operation can be performed any number of times with different sample sizes. We discuss this issue in Section <a href="NN.html#howdeep">7.3</a>.</p>
<p>The learning rate <span class="math inline">\(\eta\)</span> can be refined. One option to reduce overfitting is to impose that after each epoch, the intensity of the update decreases. One possible parametric form is <span class="math inline">\(\eta=\alpha e^{- \beta t}\)</span>, where <span class="math inline">\(t\)</span> is the epoch and <span class="math inline">\(\alpha,\beta>0\)</span>. One further sophistication is to resort to so-called <em>momentum</em> (which originates from <span class="citation">Polyak (<a href="#ref-polyak1964some" role="doc-biblioref">1964</a>)</span>):
<span class="math display" id="eq:gradmom">\[\begin{align}
\tag{7.4}
\textbf{W}_{t+1} & \leftarrow \textbf{W}_{t} - \textbf{m}_t \quad \text{with} \nonumber \\
\textbf{m}_t & \leftarrow \eta \frac{\partial D(\tilde{y}_i)}{\partial \textbf{W}_{t}}+\gamma \textbf{m}_{t-1},
\end{align}\]</span>
where <span class="math inline">\(t\)</span> is the index of the weight update. The idea of momentum is to speed up the convergence by including a memory term of the last adjustment (<span class="math inline">\(\textbf{m}_{t-1}\)</span>) and going in the same direction in the current update. The parameter <span class="math inline">\(\gamma\)</span> is often taken to be 0.9.</p>
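<p>A minimal Python sketch of the momentum update (7.4), combined with the exponentially decayed learning rate <span class="math inline">\(\eta=\alpha e^{-\beta t}\)</span> mentioned above, on a toy quadratic loss (the loss function and the constants are illustrative assumptions, except <span class="math inline">\(\gamma=0.9\)</span>, the value quoted in the text):</p>

```python
import math

def grad(w):
    # Gradient of the toy loss D(w) = (w - 3)^2, minimized at w = 3
    return 2.0 * (w - 3.0)

w, m = 0.0, 0.0
alpha, beta, gamma = 0.1, 0.005, 0.9
for t in range(500):
    eta = alpha * math.exp(-beta * t)   # decayed learning rate
    m = eta * grad(w) + gamma * m       # Eq. (7.4): memory of past adjustments
    w = w - m                           # keep moving in the same direction
```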
<p>More complex and enhanced methods have progressively been developed:<br />
- <span class="citation">Nesterov (<a href="#ref-nesterov1983method" role="doc-biblioref">1983</a>)</span> improves the momentum term by forecasting the future shift in parameters;<br />
- Adagrad (<span class="citation">Duchi, Hazan, and Singer (<a href="#ref-duchi2011adaptive" role="doc-biblioref">2011</a>)</span>) uses a different learning rate for each parameter;<br />
- Adadelta (<span class="citation">Zeiler (<a href="#ref-zeiler2012adadelta" role="doc-biblioref">2012</a>)</span>) and Adam (<span class="citation">Kingma and Ba (<a href="#ref-kingma2014adam" role="doc-biblioref">2014</a>)</span>) combine the ideas of Adagrad and momentum.</p>
<p>Lastly, in some degenerate case, some gradients may explode and push weights far from their optimal values. In order to avoid this phenomenon, learning libraries implement gradient clipping. The user specifies a maximum magnitude for gradients, usually expressed as a norm. Whenever the gradient surpasses this magnitude, it is rescaled to reach the authorized threshold. Thus, the direction remains the same, but the adjustment is smaller.</p>
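<p>Gradient clipping by norm can be sketched as follows (a generic standalone illustration; in practice the learning library applies this internally):</p>

```python
import math

def clip_by_norm(gradient, max_norm):
    # Rescale the gradient when its L2 norm exceeds max_norm;
    # the direction is preserved, only the magnitude shrinks
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm > max_norm:
        return [g * max_norm / norm for g in gradient]
    return list(gradient)

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5 -> rescaled to norm 1
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5 -> left as is
```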
</div>
<div id="further-details-on-classification" class="section level3">
<h3><span class="header-section-number">7.2.4</span> Further details on classification</h3>
<p></p>
<p>In decision trees, the ultimate goal is to create homogeneous clusters, and the process to reach this goal was outlined in the previous chapter. For neural networks, things work differently because the objective is explicitly to minimize the error between the prediction <span class="math inline">\(\tilde{\textbf{y}}_i\)</span> and a target label <span class="math inline">\(\textbf{y}_i\)</span>. Again, here <span class="math inline">\(\textbf{y}_i\)</span> is a vector full of zeros with only one <em>one</em> denoting the class of the instance.</p>
<p>Facing a classification problem, the trick is to use an appropriate activation function at the very end of the network. The dimension of the terminal output of the network should be equal to <span class="math inline">\(J\)</span> (number of classes to predict), and if, for simplicity, we write <span class="math inline">\(\textbf{x}_i\)</span> for the values of this output, the most commonly used activation is the so-called <em>softmax</em> function:</p>
<p><span class="math display">\[\tilde{\textbf{y}}_i=s(\textbf{x})_i=\frac{e^{x_i}}{\sum_{j=1}^Je^{x_j}}.\]</span></p>
<p>The justification of this choice is straightforward: it can take any value as input (over the real line) and it sums to one over any (finite-valued) output. As for trees, this yields a ‘probability’ vector over the classes. Often, the chosen loss is a generalization of the entropy used for trees. Given the target label <span class="math inline">\(\textbf{y}_i=(y_{i,1},\dots,y_{i,J})=(0,0,\dots,0,1,0,\dots,0)\)</span> and the predicted output <span class="math inline">\(\tilde{\textbf{y}}_i=(\tilde{y}_{i,1},\dots,\tilde{y}_{i,J})\)</span>, the cross-entropy is defined as</p>
<p><span class="math display" id="eq:crossentropy">\[\begin{equation}
\tag{7.5}
\text{CE}(\textbf{y}_i,\tilde{\textbf{y}}_i)=-\sum_{j=1}^J\log(\tilde{y}_{i,j})y_{i,j}.
\end{equation}\]</span></p>
<p>Basically, it is a proxy of the dissimilarity between its two arguments. One simple interpretation is the following. For the nonzero label value, the loss is <span class="math inline">\(-\log(\tilde{y}_{i,l})\)</span>, while for all others, it is zero. Because of the log, the loss is minimal when <span class="math inline">\(\tilde{y}_{i,l}=1\)</span>, which is exactly what we seek (i.e., <span class="math inline">\(y_{i,l}=\tilde{y}_{i,l}\)</span>). In applications, this best-case scenario will not happen, and the loss simply increases as <span class="math inline">\(\tilde{y}_{i,l}\)</span> drifts downwards, away from one.</p>
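<p>The softmax and the cross-entropy of Equation (7.5) fit in a few lines of Python (a standalone sketch; the max-subtraction is a standard numerical-stability trick, not required by the formula itself):</p>

```python
import math

def softmax(x):
    # Subtracting the max leaves the result unchanged but avoids overflow in exp
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y, y_tilde):
    # Eq. (7.5): only the coordinate of the true class contributes to the sum
    return -sum(y_j * math.log(t_j) for y_j, t_j in zip(y, y_tilde))

probs = softmax([2.0, 1.0, -1.0])   # sums to one; largest score gets largest probability
label = [1, 0, 0]                   # one-hot target: the instance belongs to the first class
loss = cross_entropy(label, probs)  # equals -log(probs[0])
```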
</div>
</div>
<div id="howdeep" class="section level2">
<h2><span class="header-section-number">7.3</span> How deep we should go and other practical issues</h2>
<p>Beyond the ones presented in the previous sections, the user faces many degrees of freedom when building a neural network. We present a few classical choices that are available when constructing and training neural networks.</p>
<div id="architectural-choices" class="section level3">
<h3><span class="header-section-number">7.3.1</span> Architectural choices</h3>
<p>Arguably, the first choice pertains to the structure of the network. Beyond the dichotomy feed-forward versus recurrent (see Section <a href="NN.html#recurrent-networks">7.5</a>), the immediate question is: how big (or how deep) should the network be?
First of all, let us calculate the number of parameters (i.e., weights plus biases) that are estimated (optimized) in a network.</p>
<ul>
<li>For the first layer, this gives <span class="math inline">\((U_0+1)U_1\)</span> parameters, where <span class="math inline">\(U_0\)</span> is the number of columns in <span class="math inline">\(\textbf{X}\)</span> (i.e., number of explanatory variables) and <span class="math inline">\(U_1\)</span> is the number of units in the layer.<br />
</li>
<li>For layer <span class="math inline">\(l\in[2,L]\)</span>, the number of parameters is <span class="math inline">\((U_{l-1}+1)U_l\)</span>.<br />
</li>
<li>For the final output, there are simply <span class="math inline">\(U_L+1\)</span> parameters.<br />
</li>
<li>In total, this means the total number of values to optimize is
<span class="math display">\[\mathcal{N}=\left(\sum_{l=1}^L(U_{l-1}+1)U_l\right)+U_L+1\]</span></li>
</ul>
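<p>The count <span class="math inline">\(\mathcal{N}\)</span> is straightforward to compute. The sketch below is a generic Python helper (the layer sizes in the example are arbitrary illustrative choices):</p>

```python
def n_params(units):
    # units = [U_0, U_1, ..., U_L]: input dimension followed by the hidden layer sizes;
    # each layer contributes (U_{l-1} + 1) * U_l, the scalar output adds U_L + 1
    hidden = sum((units[l - 1] + 1) * units[l] for l in range(1, len(units)))
    return hidden + units[-1] + 1

# Example: 93 features, two hidden layers with 32 and 16 units, one numerical output
count = n_params([93, 32, 16])   # (94 * 32) + (33 * 16) + 17 = 3553
```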
<p>As in any model, the number of parameters should be much smaller than the number of instances. There is no fixed ratio, but it is preferable if the sample size is <em>at least</em> ten times larger than the number of parameters. Below a ratio of 5, the risk of overfitting is high. Given the amount of data readily available, this constraint is seldom an issue, unless one wishes to work with a very large network.</p>
<p>The number of hidden layers in current financial applications rarely exceeds three or four. The number of units per layer <span class="math inline">\((U_k)\)</span> is often chosen to follow the geometric pyramid rule (see, e.g., <span class="citation">Masters (<a href="#ref-masters1993practical" role="doc-biblioref">1993</a>)</span>). If there are <span class="math inline">\(L\)</span> hidden layers, with <span class="math inline">\(I\)</span> features in the input and <span class="math inline">\(O\)</span> dimensions in the output (for regression tasks, <span class="math inline">\(O=1\)</span>), then, for the <span class="math inline">\(k^{th}\)</span> layer, a rule of thumb for the number of units is
<span class="math display">\[U_k\approx \left\lfloor O\left( \frac{I}{O}\right)^{\frac{L+1-k}{L+1}}\right\rfloor.\]</span>
If there is only one intermediate layer, the recommended proxy is the integer part of <span class="math inline">\(\sqrt{IO}\)</span>. If not, the network starts with many units and the number of units decreases exponentially towards the output size. Often, the number of units is a power of two because, in high dimensions, networks are trained on Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs). Both pieces of hardware can be used optimally when the inputs have sizes equal to powers of two.</p>
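<p>The geometric pyramid rule translates directly into code (a short Python sketch; the input and output sizes below are arbitrary examples):</p>

```python
def pyramid_units(I, O, L):
    # U_k = floor(O * (I/O)^((L+1-k)/(L+1))) for hidden layers k = 1, ..., L
    return [int(O * (I / O) ** ((L + 1 - k) / (L + 1))) for k in range(1, L + 1)]

hidden = pyramid_units(I=200, O=1, L=2)   # units decrease towards the output size
single = pyramid_units(I=200, O=1, L=1)   # one layer: integer part of sqrt(I * O)
```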
<p>Several studies have shown that very large architectures do not always perform better than shallower ones (e.g., <span class="citation">Gu, Kelly, and Xiu (<a href="#ref-gu2018empirical" role="doc-biblioref">2020</a><a href="#ref-gu2018empirical" role="doc-biblioref">b</a>)</span> and <span class="citation">Orimoloye et al. (<a href="#ref-orimoloye2019comparing" role="doc-biblioref">2019</a>)</span> for high frequency data, i.e., not factor-based). As a rule of thumb, a maximum of three hidden layers seems to be sufficient for prediction purposes.</p>
</div>
<div id="frequency-of-weight-updates-and-learning-duration" class="section level3">
<h3><span class="header-section-number">7.3.2</span> Frequency of weight updates and learning duration</h3>
<p>In the expression <a href="NN.html#eq:graddesc">(7.2)</a>, it is implicit that the computation is performed for one given instance. If the sample size is very large (hundreds of thousands or millions of instances), updating the weights according to each point is computationally too costly. The updating is then performed on groups of instances which are called batches. The sample is (randomly) split into batches of fixed sizes and each update is performed following the rule:</p>
<p><span class="math display" id="eq:gradbatch">\[\begin{equation}
\tag{7.6}
\textbf{W} \leftarrow \textbf{W}-\eta \frac{\partial \sum_{i \in \text{batch}} D(\tilde{y}_i)/\text{card}(\text{batch}) }{\partial \textbf{W}}.
\end{equation}\]</span></p>
<p>The change in weights is computed over the average loss computed over all instances in the batch. The terminology for training includes:</p>
<ul>
<li><strong>epoch</strong>: one epoch is reached when each instance of the sample has contributed to the update of the weights (i.e., the training). Often, training a NN requires several epochs and up to a few dozen.<br />
</li>
<li><strong>batch size</strong>: the batch size is the number of samples used for one single update of weights.<br />
</li>
<li><strong>iterations</strong>: the number of iterations can mean either the ratio of sample size to batch size, or this ratio multiplied by the number of epochs: it is either the number of weight updates required to complete one epoch or the total number of updates over the whole training.</li>
</ul>
<p>When the batch is equal to only one instance, the method is referred to as ‘stochastic gradient descent’ (SGD): the instance is chosen randomly. When the batch size is strictly above one and below the total number of instances, the learning is performed via ‘mini’ batches, that is, small groups of instances. The batches are also chosen randomly, but without replacement in the sample because for one epoch, the union of batches must be equal to the full training sample.</p>
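<p>The partition of one epoch into mini-batches can be sketched as follows (a generic Python illustration; in practice the learning library handles this internally):</p>

```python
import random

def make_batches(n_samples, batch_size, seed=0):
    # Shuffle the indices (sampling without replacement), then slice into batches;
    # over one epoch, the union of the batches covers the full training sample
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i:i + batch_size] for i in range(0, n_samples, batch_size)]

batches = make_batches(n_samples=10, batch_size=3)
updates_per_epoch = len(batches)   # number of weight updates in one epoch
```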
<p>It is impossible to know in advance what a good number of epochs is. Sometimes, the network stops learning after just 5 epochs (the validation loss does not decrease anymore). In other cases, when the validation sample is drawn from a distribution close to that of the training sample, the network continues to learn even after 200 epochs. It is up to the user to test different values to evaluate the learning speed. In the examples below, we keep the number of epochs low for computational purposes.</p>
</div>
<div id="penalizations-and-dropout" class="section level3">
<h3><span class="header-section-number">7.3.3</span> Penalizations and dropout</h3>
<p>
At each level (layer), it is possible to enforce constraints or penalizations on the weights (and biases). Just as for tree methods, this helps slow down the learning to prevent overfitting on the training sample. Penalizations are enforced directly on the loss function and the objective function takes the form</p>
<p><span class="math display">\[O=\sum_{i=1}^I \text{loss}(y_i,\tilde{y}_i)+ \sum_{k} \lambda_k||\textbf{W}_k||_1+ \sum_j\delta_j||\textbf{W}_j||_2^2,\]</span>
where the subscripts <span class="math inline">\(k\)</span> and <span class="math inline">\(j\)</span> pertain to the weights to which the <span class="math inline">\(L^1\)</span> and (or) <span class="math inline">\(L^2\)</span> penalization is applied.</p>
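<p>As an illustration, the penalized objective can be computed as follows (a Python sketch that, for simplicity, applies both penalties to the same set of weights; the values of <span class="math inline">\(\lambda\)</span> and <span class="math inline">\(\delta\)</span> are arbitrary):</p>

```python
def penalized_objective(losses, weights, lam=0.1, delta=0.1):
    # Sum of individual losses + lambda * L1 norm + delta * squared L2 norm
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return sum(losses) + lam * l1 + delta * l2

obj = penalized_objective([0.5, 0.25], [1.0, -2.0, 0.5])
# L1 norm = 3.5, squared L2 norm = 5.25, so obj = 0.75 + 0.35 + 0.525 = 1.625
```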
<p>In addition, specific constraints can be enforced on the weights directly during the training. Typically, two types of constraints are used:</p>
<ul>
<li>norm constraints: a maximum norm is fixed for the weight vectors or matrices;<br />
</li>
<li>non-negativity constraint: all weights must be positive or zero.</li>
</ul>
<p>Lastly, another (somewhat exotic) way to reduce the risk of overfitting is simply to reduce the size (number of parameters) of the model. <span class="citation">Srivastava et al. (<a href="#ref-srivastava2014dropout" role="doc-biblioref">2014</a>)</span> propose to omit units during training (hence the term ‘<strong>dropout</strong>’). The weights of randomly chosen units are set to zero during training. All links from and to the unit are ignored, which mechanically shrinks the network. In the testing phase, all units are back, but the values (weights) must be scaled to account for the missing activations during the training phase.</p>
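<p>The mechanism can be sketched in a few lines (a standalone Python illustration of the scheme described above: units are zeroed during training and the activations are scaled by the keep probability in the testing phase):</p>

```python
import random

def apply_dropout(outputs, rate, rng, training):
    if training:
        # Each unit is dropped (set to zero) with probability `rate`
        return [0.0 if rng.random() < rate else o for o in outputs]
    # Testing phase: all units are kept, but scaled by the keep probability
    return [o * (1.0 - rate) for o in outputs]

rng = random.Random(0)
train_out = apply_dropout([1.0, 1.0, 1.0, 1.0], rate=0.5, rng=rng, training=True)
test_out = apply_dropout([1.0, 1.0, 1.0, 1.0], rate=0.5, rng=rng, training=False)
```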
<p>The interested reader can check the advice compiled in <span class="citation">Bengio (<a href="#ref-bengio2012practical" role="doc-biblioref">2012</a>)</span>, <span class="citation">Hanin and Rolnick (<a href="#ref-hanin2018start" role="doc-biblioref">2018</a>)</span>, and <span class="citation">Smith (<a href="#ref-smith2018disciplined" role="doc-biblioref">2018</a>)</span> for further tips on how to configure neural networks. A paper dedicated to hyperparameter tuning for stock return prediction is <span class="citation">Lee (<a href="#ref-lee2020hyperparameter" role="doc-biblioref">2020</a>)</span>.</p>
</div>
</div>
<div id="code-samples-and-comments-for-vanilla-mlp" class="section level2">
<h2><span class="header-section-number">7.4</span> Code samples and comments for vanilla MLP</h2>
<p>There are several frameworks and libraries that allow robust and flexible constructions of neural networks. Among them, Keras and Tensorflow (developed by Google) are probably the most used at the time we write this book (PyTorch, from Facebook, is one alternative). For simplicity and because we believe it is the best choice, we implement the NN with Keras (which is the high level API of Tensorflow, see <a href="https://www.tensorflow.org" class="uri">https://www.tensorflow.org</a>). The original Python implementation is referenced on <a href="https://keras.io" class="uri">https://keras.io</a>, and the details for the R version can be found here: <a href="https://keras.rstudio.com" class="uri">https://keras.rstudio.com</a>. We recommend a thorough installation before proceeding. Because the native versions of Tensorflow and Keras are written in Python (and accessed by R via the <em>reticulate</em> package), a running version of Python is required below. To install Keras, please follow the instructions provided at <a href="https://keras.rstudio.com" class="uri">https://keras.rstudio.com</a>.</p>
<p>In this section, we provide a detailed (though far from exhaustive) account of how to train neural networks with Keras. For the sake of completeness, we proceed in two steps. The first one relates to a very simple regression exercise. Its purpose is to get the reader familiar with the syntax of Keras. In the second step, we lay out many of the options proposed by Keras to perform a classification exercise. With these two examples, we thus cover most of the mainstream topics falling under the umbrella of feed-forward multilayered perceptrons.</p>
<div id="regression-example" class="section level3">
<h3><span class="header-section-number">7.4.1</span> Regression example</h3>
<p>Before we head to the core of the NN, a short stage of data preparation is required. Just as for penalized regressions (glmnet package) and boosted trees (xgboost package), the data must be sorted into four parts which are the combination of two dichotomies: training versus testing and labels versus features. We define the corresponding variables below. For simplicity, the first example is a regression exercise. A classification task will be detailed below.</p>
<div class="sourceCode" id="cb78"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb78-1"><a href="NN.html#cb78-1"></a>NN_train_features <-<span class="st"> </span>dplyr<span class="op">::</span><span class="kw">select</span>(training_sample, features) <span class="op">%>%</span><span class="st"> </span><span class="co"># Training features</span></span>
<span id="cb78-2"><a href="NN.html#cb78-2"></a><span class="st"> </span><span class="kw">as.matrix</span>() <span class="co"># Matrix = important</span></span>
<span id="cb78-3"><a href="NN.html#cb78-3"></a>NN_train_labels <-<span class="st"> </span>training_sample<span class="op">$</span>R1M_Usd <span class="co"># Training labels</span></span>
<span id="cb78-4"><a href="NN.html#cb78-4"></a>NN_test_features <-<span class="st"> </span>dplyr<span class="op">::</span><span class="kw">select</span>(testing_sample, features) <span class="op">%>%</span><span class="st"> </span><span class="co"># Testing features</span></span>
<span id="cb78-5"><a href="NN.html#cb78-5"></a><span class="st"> </span><span class="kw">as.matrix</span>() <span class="co"># Matrix = important</span></span>
<span id="cb78-6"><a href="NN.html#cb78-6"></a>NN_test_labels <-<span class="st"> </span>testing_sample<span class="op">$</span>R1M_Usd <span class="co"># Testing labels</span></span></code></pre></div>
<p>In Keras, the training of neural networks is performed through three steps:</p>
<ol style="list-style-type: decimal">
<li>Defining the structure/architecture of the network;<br />
</li>
<li>Setting the loss function and learning process (options on the updating of weights);<br />
</li>
<li>Train by specifying the batch sizes and number of rounds (epochs).</li>
</ol>
<p>We start with a very simple architecture with two hidden layers.</p>
<div class="sourceCode" id="cb79"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb79-1"><a href="NN.html#cb79-1"></a><span class="kw">library</span>(keras)</span>
<span id="cb79-2"><a href="NN.html#cb79-2"></a><span class="co"># install_keras() # To complete installation</span></span>
<span id="cb79-3"><a href="NN.html#cb79-3"></a>model <-<span class="st"> </span><span class="kw">keras_model_sequential</span>()</span>
<span id="cb79-4"><a href="NN.html#cb79-4"></a>model <span class="op">%>%</span><span class="st"> </span><span class="co"># This defines the structure of the network, i.e. how layers are organized</span></span>
<span id="cb79-5"><a href="NN.html#cb79-5"></a><span class="st"> </span><span class="kw">layer_dense</span>(<span class="dt">units =</span> <span class="dv">16</span>, <span class="dt">activation =</span> <span class="st">'relu'</span>, <span class="dt">input_shape =</span> <span class="kw">ncol</span>(NN_train_features)) <span class="op">%>%</span></span>
<span id="cb79-6"><a href="NN.html#cb79-6"></a><span class="st"> </span><span class="kw">layer_dense</span>(<span class="dt">units =</span> <span class="dv">8</span>, <span class="dt">activation =</span> <span class="st">'tanh'</span>) <span class="op">%>%</span></span>
<span id="cb79-7"><a href="NN.html#cb79-7"></a><span class="st"> </span><span class="kw">layer_dense</span>(<span class="dt">units =</span> <span class="dv">1</span>) <span class="co"># No activation means linear activation: f(x) = x.</span></span></code></pre></div>
<p>The definition of the structure is very intuitive and follows the <em>sequential</em> syntax, in which the input is transformed by one layer after another until the last layer delivers the output. Each layer depends on two parameters: the number of units and the activation function that is applied to the output of the layer. One important point is the input_shape parameter. It is required for the first layer only and is equal to the number of features. For the subsequent layers, the input shape is dictated by the number of units of the previous layer; hence it is not required. The activations that are currently available are listed on <a href="https://keras.io/activations/" class="uri">https://keras.io/activations/</a>. We use the hyperbolic tangent in the second-to-last layer because it yields both positive and negative outputs. Of course, the last layer can generate negative values as well, but it is preferable to satisfy this property one step ahead of the final output.</p>
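<p>To make the difference between the two activations concrete, here is a minimal sketch in plain Python (NumPy), independent of the Keras code above:</p>

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)   # rectified linear unit: output in [0, +inf)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))      # negative inputs are clipped to zero
print(np.tanh(x))   # hyperbolic tangent: output in (-1, 1), sign of input preserved
```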
<div class="sourceCode" id="cb80"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb80-1"><a href="NN.html#cb80-1"></a>model <span class="op">%>%</span><span class="st"> </span><span class="kw">compile</span>( <span class="co"># Model specification</span></span>
<span id="cb80-2"><a href="NN.html#cb80-2"></a> <span class="dt">loss =</span> <span class="st">'mean_squared_error'</span>, <span class="co"># Loss function</span></span>
<span id="cb80-3"><a href="NN.html#cb80-3"></a> <span class="dt">optimizer =</span> <span class="kw">optimizer_rmsprop</span>(), <span class="co"># Optimisation method (weight updating)</span></span>
<span id="cb80-4"><a href="NN.html#cb80-4"></a> <span class="dt">metrics =</span> <span class="kw">c</span>(<span class="st">'mean_absolute_error'</span>) <span class="co"># Output metric</span></span>
<span id="cb80-5"><a href="NN.html#cb80-5"></a>)</span>
<span id="cb80-6"><a href="NN.html#cb80-6"></a><span class="kw">summary</span>(model) <span class="co"># Model architecture</span></span></code></pre></div>
<pre><code>## Model: "sequential"
## __________________________________________________________________________________________
## Layer (type) Output Shape Param #
## ==========================================================================================
## dense (Dense) (None, 16) 1504
## __________________________________________________________________________________________
## dense_1 (Dense) (None, 8) 136
## __________________________________________________________________________________________
## dense_2 (Dense) (None, 1) 9
## ==========================================================================================
## Total params: 1,649
## Trainable params: 1,649
## Non-trainable params: 0
## __________________________________________________________________________________________</code></pre>
<p>The summary of the model lists the layers in their order from input to output (forward pass). Because we are working with 93 features, the number of parameters for the first layer (16 units) is 93 plus one (for the bias) multiplied by 16, which makes 1504. For the second layer, the number of inputs is equal to the size of the output from the previous layer (16). Hence given the fact that the second layer has 8 units, the total number of parameters is (16+1)*8 = 136.</p>
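<p>The arithmetic behind the Param # column can be verified directly (a short Python sketch, independent of Keras):</p>

```python
def dense_params(n_inputs, n_units):
    """Parameters of a dense layer: one weight per input per unit, plus one bias per unit."""
    return (n_inputs + 1) * n_units

layers = [(93, 16), (16, 8), (8, 1)]       # (inputs, units) for each layer
counts = [dense_params(i, u) for i, u in layers]
print(counts)                              # [1504, 136, 9]
print(sum(counts))                         # 1649, as reported by summary(model)
```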
<p>We set the loss function to the standard mean squared error. Other losses are listed on <a href="https://keras.io/losses/" class="uri">https://keras.io/losses/</a>; some of them work only for regressions (MSE, MAE) and others only for classification (categorical cross-entropy, see Equation <a href="NN.html#eq:crossentropy">(7.5)</a>). The RMS propagation optimizer is the classical mini-batch back-propagation implementation. For other weight updating algorithms, we refer to <a href="https://keras.io/optimizers/" class="uri">https://keras.io/optimizers/</a>. The metric is the function used to assess the quality of the model; it can be different from the loss, for instance using entropy for training and accuracy as the performance metric.</p>
<p>The final stage fits the model to the data and requires some additional training parameters:</p>
<div class="sourceCode" id="cb82"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb82-1"><a href="NN.html#cb82-1"></a>fit_NN <-<span class="st"> </span>model <span class="op">%>%</span><span class="st"> </span></span>
<span id="cb82-2"><a href="NN.html#cb82-2"></a><span class="st"> </span><span class="kw">fit</span>(NN_train_features, <span class="co"># Training features</span></span>
<span id="cb82-3"><a href="NN.html#cb82-3"></a> NN_train_labels, <span class="co"># Training labels</span></span>
<span id="cb82-4"><a href="NN.html#cb82-4"></a> <span class="dt">epochs =</span> <span class="dv">10</span>, <span class="dt">batch_size =</span> <span class="dv">512</span>, <span class="co"># Training parameters</span></span>
<span id="cb82-5"><a href="NN.html#cb82-5"></a> <span class="dt">validation_data =</span> <span class="kw">list</span>(NN_test_features, NN_test_labels) <span class="co"># Test data</span></span>
<span id="cb82-6"><a href="NN.html#cb82-6"></a>) </span>
<span id="cb82-7"><a href="NN.html#cb82-7"></a><span class="kw">plot</span>(fit_NN) <span class="co"># Plot, evidently!</span></span></code></pre></div>
<div class="figure" style="text-align: center"><span id="fig:NN3"></span>
<img src="ML_factor_files/figure-html/NN3-1.png" alt="Output from a trained neural network (regression task)." width="480" />
<p class="caption">
FIGURE 7.7: Output from a trained neural network (regression task).
</p>
</div>
<p>The batch size is quite arbitrary. For technical reasons pertaining to training on GPUs, these sizes are often powers of 2.</p>
<p>In Keras, the plot of the trained model shows four different curves (shown here in Figure <a href="NN.html#fig:NN3">7.7</a>). The top graph displays the improvement (or lack thereof) in loss as the number of epochs increases. Usually, the algorithm starts by learning rapidly and then converges to a point where any additional epoch does not improve the fit. In the example above, this point arrives rather quickly because it is hard to notice any gain beyond the fourth epoch. The two colors show the performance on the two samples: the training sample and the testing sample. By construction, the loss will always improve (even marginally) on the training sample. When the impact is negligible on the testing sample (the curve is flat, as is the case here), the model fails to generalize out-of-sample: the gains obtained by training on the original sample do not translate to gains on previously unseen data; thus, the model seems to be learning noise.</p>
<p>The second graph shows the same behavior but is computed using the metric function. The correlation (in absolute terms) between the two curves (loss and metric) is usually high. If one of them is flat, the other should be as well.</p>
<p>In order to obtain the parameters of the model, the user can call get_weights(model).<a href="#fn18" class="footnote-ref" id="fnref18"><sup>18</sup></a> We do not execute the code here because the size of the output is much too large, as there are thousands of weights.</p>
<p>Finally, from a practical point of view, the prediction is obtained via the usual predict() function. We use this function below on the testing sample to calculate the hit ratio.</p>
<div class="sourceCode" id="cb83"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb83-1"><a href="NN.html#cb83-1"></a><span class="kw">mean</span>(<span class="kw">predict</span>(model, NN_test_features) <span class="op">*</span><span class="st"> </span>NN_test_labels <span class="op">></span><span class="st"> </span><span class="dv">0</span>) <span class="co"># Hit ratio</span></span></code></pre></div>
<pre><code>## [1] 0.5427159</code></pre>
<p>Again, the hit ratio lies between 50% and 55%, which <em>seems</em> reasonably good. Most of the time, neural networks have their weights initialized randomly. Hence, two independently trained networks with the same architecture and same training data may well lead to very different predictions and performance! One way to bypass this issue is to fix the seed of the random number generator. Models can also be easily exchanged by loading weights via the set_weights() function.</p>
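<p>The hit ratio computed above simply measures the proportion of predictions that share the sign of the realized return. A toy sketch (with made-up numbers) of the same computation in Python:</p>

```python
import numpy as np

pred     = np.array([0.02, -0.01,  0.03, -0.02, 0.01])   # hypothetical predictions
realized = np.array([0.05, -0.02, -0.01, -0.03, 0.04])   # hypothetical realized returns

hit_ratio = np.mean(pred * realized > 0)   # signs agree <=> the product is positive
print(hit_ratio)                           # 0.8: four of the five signs agree
```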
</div>
<div id="classification-example" class="section level3">
<h3><span class="header-section-number">7.4.2</span> Classification example</h3>
<p>
We pursue our exploration of neural networks with a much more detailed example. The aim is to carry out a classification task on the binary label R1M_Usd_C. Before we proceed, we need to format the label properly. To this purpose, we resort to one-hot encoding (see Section <a href="Data.html#categorical-labels">4.5.2</a>).</p>
<div class="sourceCode" id="cb85"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb85-1"><a href="NN.html#cb85-1"></a><span class="kw">library</span>(dummies) <span class="co"># Package for one-hot encoding</span></span>
<span id="cb85-2"><a href="NN.html#cb85-2"></a>NN_train_labels_C <-<span class="st"> </span>training_sample<span class="op">$</span>R1M_Usd_C <span class="op">%>%</span><span class="st"> </span><span class="kw">dummy</span>() <span class="co"># One-hot encoding of the label</span></span>
<span id="cb85-3"><a href="NN.html#cb85-3"></a>NN_test_labels_C <-<span class="st"> </span>testing_sample<span class="op">$</span>R1M_Usd_C <span class="op">%>%</span><span class="st"> </span><span class="kw">dummy</span>() <span class="co"># One-hot encoding of the label</span></span></code></pre></div>
<p>The labels NN_train_labels_C and NN_test_labels_C have two columns: the first flags the instances with above-median returns and the second flags those with below-median returns. Note that we do not alter the feature variables: they remain unchanged. Below, we set the structure of the network with many more options than in the first example.</p>
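<p>For reference, one-hot encoding itself is elementary. The sketch below (hypothetical labels, plain Python) mimics what dummy() produces for a binary label; the actual column order depends on the package:</p>

```python
import numpy as np

labels = np.array(["above", "below", "below", "above"])   # hypothetical binary classes
classes = sorted(set(labels))                             # ['above', 'below']

# One column per class; each row has a single 1 in the column of its class
one_hot = np.array([[float(lab == c) for c in classes] for lab in labels])
print(one_hot)
```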
<div class="sourceCode" id="cb86"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb86-1"><a href="NN.html#cb86-1"></a>model_C <-<span class="st"> </span><span class="kw">keras_model_sequential</span>()</span>
<span id="cb86-2"><a href="NN.html#cb86-2"></a>model_C <span class="op">%>%</span><span class="st"> </span><span class="co"># This defines the structure of the network, i.e. how layers are organized</span></span>
<span id="cb86-3"><a href="NN.html#cb86-3"></a><span class="st"> </span><span class="kw">layer_dense</span>(<span class="dt">units =</span> <span class="dv">16</span>, <span class="dt">activation =</span> <span class="st">'tanh'</span>, <span class="co"># Nb units & activation</span></span>
<span id="cb86-4"><a href="NN.html#cb86-4"></a> <span class="dt">input_shape =</span> <span class="kw">ncol</span>(NN_train_features), <span class="co"># Size of input</span></span>
<span id="cb86-5"><a href="NN.html#cb86-5"></a> <span class="dt">kernel_initializer =</span> <span class="st">"random_normal"</span>, <span class="co"># Initialization of weights</span></span>
<span id="cb86-6"><a href="NN.html#cb86-6"></a> <span class="dt">kernel_constraint =</span> <span class="kw">constraint_nonneg</span>()) <span class="op">%>%</span><span class="st"> </span><span class="co"># Weights should be nonneg</span></span>
<span id="cb86-7"><a href="NN.html#cb86-7"></a><span class="st"> </span><span class="kw">layer_dropout</span>(<span class="dt">rate =</span> <span class="fl">0.25</span>) <span class="op">%>%</span><span class="st"> </span><span class="co"># Dropping out 25% units</span></span>
<span id="cb86-8"><a href="NN.html#cb86-8"></a><span class="st"> </span><span class="kw">layer_dense</span>(<span class="dt">units =</span> <span class="dv">8</span>, <span class="dt">activation =</span> <span class="st">'elu'</span>, <span class="co"># Nb units & activation</span></span>
<span id="cb86-9"><a href="NN.html#cb86-9"></a> <span class="dt">bias_initializer =</span> <span class="kw">initializer_constant</span>(<span class="fl">0.2</span>), <span class="co"># Initialization of biases</span></span>
<span id="cb86-10"><a href="NN.html#cb86-10"></a> <span class="dt">kernel_regularizer =</span> <span class="kw">regularizer_l2</span>(<span class="fl">0.01</span>)) <span class="op">%>%</span><span class="st"> </span><span class="co"># Penalization of weights </span></span>
<span id="cb86-11"><a href="NN.html#cb86-11"></a><span class="st"> </span><span class="kw">layer_dense</span>(<span class="dt">units =</span> <span class="dv">2</span>, <span class="dt">activation =</span> <span class="st">'softmax'</span>) <span class="co"># Softmax for categorical output</span></span></code></pre></div>
<p>Before we start commenting on the many options used above, we highlight that Keras models, unlike many R variables, are mutable objects. This means that any piping %>% after calling a model will alter it. Hence, successive trainings do not start from scratch but from the result of the previous training.</p>
<p>First, the options used above and below were chosen as illustrative examples and do not serve to particularly improve the quality of the model. The first change compared to Section <a href="NN.html#regression-example">7.4.1</a> is the set of activation functions. The first two are simply new cases, while the third one (for the output layer) is imperative. Indeed, since the goal is classification, the dimension of the output must be equal to the number of categories of the labels. The activation that yields such a multivariate output is the softmax function. Note that we must also specify the number of classes (categories) in the terminal layer.</p>
<p>The second major innovation is the set of options pertaining to parameters. One family of options deals with the initialization of weights and biases. In Keras, weights are referred to as the ‘kernel’. The list of initializers is quite long, and we suggest the interested reader have a look at the Keras reference (<a href="https://keras.io/initializers/" class="uri">https://keras.io/initializers/</a>). Most of them are random, but some are constant.</p>
<p>Another family of options is the constraints and norm penalization that are applied on the weights and biases during training. In the above example, the weights of the first layer are coerced to be non-negative, while the weights of the second layer see their magnitude penalized by a factor (0.01) times their <span class="math inline">\(L^2\)</span> norm. </p>
<p>Lastly, the final novelty is the dropout layer (see Section <a href="NN.html#penalizations-and-dropout">7.3.3</a>) between the first and second layers. With this layer, one quarter of the units of the first layer are (randomly) omitted during each training pass.</p>
<p>The specification of the training is outlined below.</p>
<div class="sourceCode" id="cb87"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb87-1"><a href="NN.html#cb87-1"></a>model_C <span class="op">%>%</span><span class="st"> </span><span class="kw">compile</span>( <span class="co"># Model specification</span></span>
<span id="cb87-2"><a href="NN.html#cb87-2"></a> <span class="dt">loss =</span> <span class="st">'binary_crossentropy'</span>, <span class="co"># Loss function</span></span>
<span id="cb87-3"><a href="NN.html#cb87-3"></a> <span class="dt">optimizer =</span> <span class="kw">optimizer_adam</span>(<span class="dt">lr =</span> <span class="fl">0.005</span>, <span class="co"># Optimisation method (weight updating)</span></span>
<span id="cb87-4"><a href="NN.html#cb87-4"></a> <span class="dt">beta_1 =</span> <span class="fl">0.9</span>, </span>
<span id="cb87-5"><a href="NN.html#cb87-5"></a> <span class="dt">beta_2 =</span> <span class="fl">0.95</span>), </span>
<span id="cb87-6"><a href="NN.html#cb87-6"></a> <span class="dt">metrics =</span> <span class="kw">c</span>(<span class="st">'categorical_accuracy'</span>) <span class="co"># Output metric</span></span>
<span id="cb87-7"><a href="NN.html#cb87-7"></a>)</span>
<span id="cb87-8"><a href="NN.html#cb87-8"></a><span class="kw">summary</span>(model_C) <span class="co"># Model structure</span></span></code></pre></div>
<pre><code>## Model: "sequential_1"
## __________________________________________________________________________________________
## Layer (type) Output Shape Param #
## ==========================================================================================
## dense_3 (Dense) (None, 16) 1504
## __________________________________________________________________________________________
## dropout (Dropout) (None, 16) 0
## __________________________________________________________________________________________
## dense_4 (Dense) (None, 8) 136
## __________________________________________________________________________________________
## dense_5 (Dense) (None, 2) 18
## ==========================================================================================
## Total params: 1,658
## Trainable params: 1,658
## Non-trainable params: 0
## __________________________________________________________________________________________</code></pre>
<p>Here again, many changes have been made: all levels have been revised. The loss is now the cross-entropy. Because we work with two categories, we resort to a specific choice (binary cross-entropy), but the more general option, categorical_crossentropy, works for any number of classes (strictly above one). The optimizer is also different and allows for several parameters; we refer to <span class="citation">Kingma and Ba (<a href="#ref-kingma2014adam" role="doc-biblioref">2014</a>)</span> for details. Simply put, the two beta parameters control the decay rates of the exponentially weighted moving averages used in the update of weights. These two averages estimate the first and second moments of the gradient and can be exploited to increase the speed of learning. The performance metric in the above chunk is the categorical accuracy. In multiclass classification, the accuracy is defined as the average accuracy over all classes and all predictions. Since a prediction for one instance is a vector of weights, the ‘terminal’ prediction is the class associated with the largest weight. The accuracy then measures the proportion of times the prediction is equal to the realized value (i.e., when the class is correctly guessed by the model).</p>
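<p>The role of the two beta parameters can be sketched outside of Keras. The toy implementation below (a single scalar parameter, following the update rule of Kingma and Ba (2014)) maintains the two moving averages, applies their bias corrections, and minimizes <span class="math inline">\(w^2\)</span> with the same lr, beta_1 and beta_2 as in the chunk above:</p>

```python
import numpy as np

def adam_step(m, v, grad, t, lr=0.005, beta1=0.9, beta2=0.95, eps=1e-8):
    """One Adam update for a single parameter."""
    m = beta1 * m + (1 - beta1) * grad       # 1st moment: average of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # 2nd moment: average of squared gradients
    m_hat = m / (1 - beta1**t)               # bias corrections (both averages start at 0)
    v_hat = v / (1 - beta2**t)
    return m, v, lr * m_hat / (np.sqrt(v_hat) + eps)

w, m, v = 1.0, 0.0, 0.0                      # minimize f(w) = w^2, whose gradient is 2w
for t in range(1, 401):
    m, v, update = adam_step(m, v, 2.0 * w, t)
    w -= update
print(abs(w))                                # small: w has converged toward the minimum at 0
```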
<p>Finally, we proceed with the training of the model.</p>
<div class="sourceCode" id="cb89"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb89-1"><a href="NN.html#cb89-1"></a>fit_NN_C <-<span class="st"> </span>model_C <span class="op">%>%</span><span class="st"> </span></span>
<span id="cb89-2"><a href="NN.html#cb89-2"></a><span class="st"> </span><span class="kw">fit</span>(NN_train_features, <span class="co"># Training features</span></span>
<span id="cb89-3"><a href="NN.html#cb89-3"></a> NN_train_labels_C, <span class="co"># Training labels</span></span>
<span id="cb89-4"><a href="NN.html#cb89-4"></a> <span class="dt">epochs =</span> <span class="dv">20</span>, <span class="dt">batch_size =</span> <span class="dv">512</span>, <span class="co"># Training parameters</span></span>
<span id="cb89-5"><a href="NN.html#cb89-5"></a> <span class="dt">validation_data =</span> <span class="kw">list</span>(NN_test_features, </span>
<span id="cb89-6"><a href="NN.html#cb89-6"></a> NN_test_labels_C), <span class="co"># Test data</span></span>
<span id="cb89-7"><a href="NN.html#cb89-7"></a> <span class="dt">verbose =</span> <span class="dv">0</span>, <span class="co"># No comments from algo</span></span>
<span id="cb89-8"><a href="NN.html#cb89-8"></a> <span class="dt">callbacks =</span> <span class="kw">list</span>(</span>
<span id="cb89-9"><a href="NN.html#cb89-9"></a> <span class="kw">callback_early_stopping</span>(<span class="dt">monitor =</span> <span class="st">"val_loss"</span>, <span class="co"># Early stopping:</span></span>
<span id="cb89-10"><a href="NN.html#cb89-10"></a> <span class="dt">min_delta =</span> <span class="fl">0.001</span>, <span class="co"># Improvement threshold</span></span>
<span id="cb89-11"><a href="NN.html#cb89-11"></a>                                <span class="dt">patience =</span> <span class="dv">3</span>,         <span class="co"># Nb epochs with no improvement</span></span>
<span id="cb89-12"><a href="NN.html#cb89-12"></a> <span class="dt">verbose =</span> <span class="dv">0</span> <span class="co"># No warnings</span></span>
<span id="cb89-13"><a href="NN.html#cb89-13"></a> )</span>
<span id="cb89-14"><a href="NN.html#cb89-14"></a> )</span>
<span id="cb89-15"><a href="NN.html#cb89-15"></a> )</span>
<span id="cb89-16"><a href="NN.html#cb89-16"></a><span class="kw">plot</span>(fit_NN_C) </span></code></pre></div>
<div class="figure" style="text-align: center"><span id="fig:NN3C"></span>
<img src="ML_factor_files/figure-html/NN3C-1.png" alt="Output from a trained neural network (classification task) with early stopping." width="400px" />
<p class="caption">
FIGURE 7.8: Output from a trained neural network (classification task) with early stopping.
</p>
</div>
<p>There is only one major difference here compared to the previous training call. In Keras, callbacks are functions that can be used at given stages of the learning process. In the above example, we use one such function to stop the algorithm when no progress has been made for some time.</p>
<p>When datasets are large, the training can be long, especially when batch sizes are small and/or the number of epochs is high. It is not guaranteed that going to the full number of epochs is useful, as the loss or metric functions may be plateauing much sooner. Hence, it can be very convenient to stop the process if no improvement is achieved during a specified time-frame. We set the number of epochs to 20, but the process will likely stop before that.</p>
<p>In the above code, the improvement is monitored on the validation loss (“val_loss”; one alternative is the validation accuracy, “val_acc”). The min_delta value sets the minimum improvement that must be attained for training to carry on: unless the validation loss improves by more than 0.001 over its best value so far, the epoch counts as a failure. Some flexibility is introduced via the patience parameter, which in our case stipulates that the halting decision is made only after three consecutive epochs without improvement. Finally, the verbose parameter dictates the amount of comments made by the function; for simplicity, we do not want any comments, hence its value is set to zero.</p>
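<p>The halting rule can be emulated in a few lines of plain Python (a hypothetical re-implementation of the min_delta/patience logic, not the Keras source code):</p>

```python
def early_stop_epoch(val_losses, min_delta=0.001, patience=3):
    """Return the 1-based epoch at which training halts, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best - loss > min_delta:   # improvement over the best loss so far
            best = loss
            wait = 0
        else:
            wait += 1                 # stale epoch
            if wait >= patience:      # three stale epochs in a row: stop
                return epoch
    return None

losses = [0.70, 0.68, 0.66, 0.6599, 0.6597, 0.6595, 0.50]
print(early_stop_epoch(losses))       # 6: the drop at epoch 7 is never seen
```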
<p>In Figure <a href="NN.html#fig:NN3C">7.8</a>, the two graphs yield very different curves. One reason for that is the scale of the second graph. The range of accuracies is very narrow. Any change in this range does not represent much variation overall. The pattern is relatively clear on the training sample: the loss decreases, while the accuracy improves. Unfortunately, this does not translate to the testing sample which indicates that the model does not generalize well out-of-sample.</p>
</div>
<div id="custloss" class="section level3">
<h3><span class="header-section-number">7.4.3</span> Custom losses</h3>
<p>
In Keras, it is possible to define user-specified loss functions. This may be interesting in some cases. For instance, the quadratic error has three terms <span class="math inline">\(y_i^2\)</span>, <span class="math inline">\(\tilde{y}_i^2\)</span> and <span class="math inline">\(-2y_i\tilde{y}_i\)</span>. In practice, it can make sense to focus more on the last (cross) term because it is the most essential: we do want predictions and realized values to have the same sign! Below, we show how to optimize a simple (product-based) loss in Keras, <span class="math inline">\(l(y_i,\tilde{y}_i)=(\tilde{y}_i-\tilde{m})^2-\gamma (y_i-m)(\tilde{y}_i-\tilde{m})\)</span>, where <span class="math inline">\(m\)</span> and <span class="math inline">\(\tilde{m}\)</span> are the sample averages of <span class="math inline">\(y_i\)</span> and <span class="math inline">\(\tilde{y}_i\)</span>. With <span class="math inline">\(\gamma>2\)</span>, we give more weight to the cross term. We start with a simple architecture.</p>
<div class="sourceCode" id="cb90"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb90-1"><a href="NN.html#cb90-1"></a>model_custom <-<span class="st"> </span><span class="kw">keras_model_sequential</span>()</span>
<span id="cb90-2"><a href="NN.html#cb90-2"></a>model_custom <span class="op">%>%</span><span class="st"> </span><span class="co"># This defines the structure of the network, i.e. how layers are organized</span></span>
<span id="cb90-3"><a href="NN.html#cb90-3"></a><span class="st"> </span><span class="kw">layer_dense</span>(<span class="dt">units =</span> <span class="dv">16</span>, <span class="dt">activation =</span> <span class="st">'relu'</span>, <span class="dt">input_shape =</span> <span class="kw">ncol</span>(NN_train_features)) <span class="op">%>%</span></span>
<span id="cb90-4"><a href="NN.html#cb90-4"></a><span class="st"> </span><span class="kw">layer_dense</span>(<span class="dt">units =</span> <span class="dv">8</span>, <span class="dt">activation =</span> <span class="st">'sigmoid'</span>) <span class="op">%>%</span></span>
<span id="cb90-5"><a href="NN.html#cb90-5"></a><span class="st"> </span><span class="kw">layer_dense</span>(<span class="dt">units =</span> <span class="dv">1</span>) <span class="co"># No activation means linear activation: f(x) = x.</span></span></code></pre></div>
<p>Then we code the loss function and integrate it into the model. The important trick is to resort to functions that are specific to the library (the k_<em>functions</em>). We code the variance of predicted values minus the scaled covariance between realized and predicted values. Below, we use a scale of five.</p>
<div class="sourceCode" id="cb91"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb91-1"><a href="NN.html#cb91-1"></a><span class="co"># Defines the loss, we use gamma = 5</span></span>
<span id="cb91-2"><a href="NN.html#cb91-2"></a>metric_cust <-<span class="st"> </span><span class="kw">custom_metric</span>(<span class="st">"custom_loss"</span>, </span>
<span id="cb91-3"><a href="NN.html#cb91-3"></a> <span class="cf">function</span>(y_true, y_pred) {</span>
<span id="cb91-4"><a href="NN.html#cb91-4"></a> <span class="kw">k_mean</span>((y_pred <span class="op">-</span><span class="st"> </span><span class="kw">k_mean</span>(y_pred))<span class="op">*</span>(y_pred <span class="op">-</span><span class="st"> </span><span class="kw">k_mean</span>(y_pred)))<span class="op">-</span><span class="dv">5</span><span class="op">*</span><span class="kw">k_mean</span>((y_true <span class="op">-</span><span class="st"> </span><span class="kw">k_mean</span>(y_true))<span class="op">*</span>(y_pred <span class="op">-</span><span class="st"> </span><span class="kw">k_mean</span>(y_pred)))</span>
<span id="cb91-5"><a href="NN.html#cb91-5"></a>})</span>
<span id="cb91-6"><a href="NN.html#cb91-6"></a></span>
<span id="cb91-7"><a href="NN.html#cb91-7"></a>model_custom <span class="op">%>%</span><span class="st"> </span><span class="kw">compile</span>( <span class="co"># Model specification</span></span>
<span id="cb91-8"><a href="NN.html#cb91-8"></a>  <span class="dt">loss =</span> metric_cust,                     <span class="co"># Custom loss function!</span></span>
<span id="cb91-9"><a href="NN.html#cb91-9"></a> <span class="dt">optimizer =</span> <span class="kw">optimizer_rmsprop</span>(), <span class="co"># Optim method </span></span>
<span id="cb91-10"><a href="NN.html#cb91-10"></a> <span class="dt">metrics =</span> <span class="kw">c</span>(<span class="st">'mean_absolute_error'</span>) <span class="co"># Output metric</span></span>
<span id="cb91-11"><a href="NN.html#cb91-11"></a>)</span></code></pre></div>
<p>Finally, we are ready to train and briefly evaluate the performance of the model.</p>
<div class="sourceCode" id="cb92"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb92-1"><a href="NN.html#cb92-1"></a>fit_NN_cust <-<span class="st"> </span>model_custom <span class="op">%>%</span><span class="st"> </span></span>
<span id="cb92-2"><a href="NN.html#cb92-2"></a><span class="st"> </span><span class="kw">fit</span>(NN_train_features, <span class="co"># Training features</span></span>
<span id="cb92-3"><a href="NN.html#cb92-3"></a> NN_train_labels, <span class="co"># Training labels</span></span>
<span id="cb92-4"><a href="NN.html#cb92-4"></a> <span class="dt">epochs =</span> <span class="dv">10</span>, <span class="dt">batch_size =</span> <span class="dv">512</span>, <span class="co"># Training parameters</span></span>
<span id="cb92-5"><a href="NN.html#cb92-5"></a> <span class="dt">validation_data =</span> <span class="kw">list</span>(NN_test_features, NN_test_labels) <span class="co"># Test data</span></span>
<span id="cb92-6"><a href="NN.html#cb92-6"></a>) </span>
<span id="cb92-7"><a href="NN.html#cb92-7"></a><span class="kw">plot</span>(fit_NN_cust) </span></code></pre></div>
<p><img src="ML_factor_files/figure-html/NN2cust-1.png" width="672" /></p>
<p>The curves may go in opposite directions. One reason for this is that while the training improves the correlation between realized and predicted values, it can also increase the dispersion of the predicted returns (the first term of the loss).</p>
<div class="sourceCode" id="cb93"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb93-1"><a href="NN.html#cb93-1"></a><span class="kw">mean</span>(<span class="kw">predict</span>(model_custom, NN_test_features) <span class="op">*</span><span class="st"> </span>NN_test_labels <span class="op">></span><span class="st"> </span><span class="dv">0</span>) <span class="co"># Hit ratio</span></span></code></pre></div>
<pre><code>## [1] 0.5460346</code></pre>
<p>The outcome could be improved in several directions. One of them is arguably to make the model dynamic rather than static (see Chapter <a href="backtest.html#backtest">12</a>).</p>
</div>
</div>
<div id="recurrent-networks" class="section level2">
<h2><span class="header-section-number">7.5</span> Recurrent networks</h2>
<div id="presentation" class="section level3">
<h3><span class="header-section-number">7.5.1</span> Presentation</h3>
<p>
Multilayer perceptrons are feed-forward networks because the data flows from left to right with no looping in between. For some particular tasks with sequential linkages (e.g., time-series or speech recognition), it might be useful to keep track of what happened with the previous sample (i.e., there is a natural ordering). One simple way to model ‘memory’ would be to consider the following network with only one intermediate layer:
<span class="math display">\[\begin{align*}
\tilde{y}_i&=f^{(y)}\left(\sum_{j=1}^{U_1}h_{i,j}w^{(y)}_j+b^{(2)}\right) \\
\textbf{h}_{i} &=f^{(h)}\left(\sum_{k=1}^{U_0}x_{i,k}w^{(h,1)}_k+b^{(1)}+ \underbrace{\sum_{k=1}^{U_1} w^{(h,2)}_{k}h_{i-1,k}}_{\text{memory part}} \right),
\end{align*}\]</span></p>
<p>where <span class="math inline">\(h_0\)</span> is customarily set at zero (vector-wise).</p>
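<p>This recurrence can be sketched in a few lines of base R. All dimensions and weights below are illustrative placeholders (they are not taken from the chapter's dataset), with <span class="math inline">\(f^{(h)}=\tanh\)</span> and a linear output activation.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Toy Elman-type forward pass; all dimensions and weights are illustrative
set.seed(42)
U0 <- 3; U1 <- 2; n <- 5                     # Nb features, hidden units, instances
x    <- matrix(rnorm(n * U0), nrow = n)      # Inputs: one row per instance
W_h1 <- matrix(rnorm(U1 * U0), nrow = U1)    # Input-to-hidden weights w^(h,1)
W_h2 <- matrix(rnorm(U1 * U1), nrow = U1)    # Hidden-to-hidden weights w^(h,2): memory part
w_y  <- rnorm(U1); b1 <- rep(0, U1); b2 <- 0 # Output weights and biases
h <- rep(0, U1)                              # h_0 is set at zero
y_tilde <- numeric(n)
for (i in 1:n) {
  h <- tanh(W_h1 %*% x[i, ] + b1 + W_h2 %*% h) # Hidden state, with memory of h_{i-1}
  y_tilde[i] <- sum(h * w_y) + b2              # Output (linear activation)
}
y_tilde                                      # One prediction per instance</code></pre></div>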
<p>These kinds of models are often referred to as <span class="citation">Elman (<a href="#ref-elman1990finding" role="doc-biblioref">1990</a>)</span> models, or as <span class="citation">Jordan (<a href="#ref-jordan1997serial" role="doc-biblioref">1997</a>)</span> models when <span class="math inline">\(h_{i-1}\)</span> is replaced by <span class="math inline">\(y_{i-1}\)</span> in the computation of <span class="math inline">\(h_i\)</span>. Both types of models fall under the overarching umbrella of recurrent neural networks (RNNs).</p>
<p>The <span class="math inline">\(h_i\)</span> is usually called the state or the hidden layer. The training of this model is complicated and must be done by unfolding the network over all instances to obtain a simple feed-forward network and train it regularly. We illustrate the unfolding principle in Figure <a href="NN.html#fig:recnet">7.9</a>. It shows a very deep network. The first input impacts the first layer and then the second one via <span class="math inline">\(h_1\)</span> and all following layers in the same fashion. Likewise, the second input impacts all layers except the first and each instance <span class="math inline">\(i-1\)</span> is going to impact the output <span class="math inline">\(\tilde{y}_i\)</span> and all outputs <span class="math inline">\(\tilde{y}_j\)</span> for <span class="math inline">\(j \ge i\)</span>. In Figure <a href="NN.html#fig:recnet">7.9</a>, the parameters that are trained are shown in blue. They appear many times, in fact, at each level of the unfolded network.</p>
<div class="figure" style="text-align: center"><span id="fig:recnet"></span>
<img src="images/RN.png" alt="Unfolding a recurrent network." width="480px" />
<p class="caption">
FIGURE 7.9: Unfolding a recurrent network.
</p>
</div>
<p>The main problem with the above architecture is the loss of memory induced by <strong>vanishing gradients</strong>. Because of the depth of the model, the chain rule used in the back-propagation will imply a large number of products of derivatives of activation functions. Now, as is shown in Figure <a href="NN.html#fig:activationf">7.4</a>, these functions are very smooth and their derivatives are most of the time smaller than one (in absolute value). Hence, multiplying many numbers smaller than one leads to very small figures: beyond some layers, the learning does not propagate because the adjustments are too small.</p>
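<p>The order of magnitude is easy to verify numerically: the derivative of the sigmoid function is at most 1/4, so chaining even a moderate number of layers (or time steps) shrinks the gradient dramatically. The depths below are purely illustrative.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">sig_prime_max <- 0.25 # Maximum derivative of the sigmoid: sig'(0) = 1/4
sig_prime_max^10      # After 10 chained derivatives: ~9.5e-07
sig_prime_max^50      # After 50: ~7.9e-31, i.e., numerically negligible</code></pre></div>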
<p>One way to prevent this progressive discounting of the memory was introduced in <span class="citation">Hochreiter and Schmidhuber (<a href="#ref-hochreiter1997long" role="doc-biblioref">1997</a>)</span> (the Long Short-Term Memory, or LSTM, model). This model was subsequently simplified in <span class="citation">Chung et al. (<a href="#ref-chung2015gated" role="doc-biblioref">2015</a>)</span>, and we present this more parsimonious version below. The Gated Recurrent Unit (GRU) is a slightly more sophisticated version of the vanilla recurrent network defined above. It has the following representation:
<span class="math display">\[\begin{align*}
\tilde{y}_i&=z_i\tilde{y}_{i-1}+ (1-z_i)\tanh \left(\textbf{w}_y'\textbf{x}_i+ b_y+ u_yr_i\tilde{y}_{i-1}\right) \quad \text{output (prediction)} \\
z_i &= \text{sig}(\textbf{w}_z'\textbf{x}_i+b_z+u_z\tilde{y}_{i-1}) \hspace{9mm} \text{`update gate'} \ \in (0,1)\\
r_i &= \text{sig}(\textbf{w}_r'\textbf{x}_i+b_r+u_r\tilde{y}_{i-1}) \hspace{9mm} \text{`reset gate'} \ \in (0,1).
\end{align*}\]</span>
In compact form, this gives
<span class="math display">\[\tilde{y}_i=\underbrace{z_i}_{\text{weight}}\underbrace{\tilde{y}_{i-1}}_{\text{past value}}+ \underbrace{(1-z_i)}_{\text{weight}}\underbrace{\tanh \left(\textbf{w}_y'\textbf{x}_i+ b_y+ u_yr_i\tilde{y}_{i-1}\right)}_{\text{candidate value (classical RNN)}}, \]</span></p>
<p>where <span class="math inline">\(z_i\)</span> decides the optimal mix between the current and past values. For the candidate value, <span class="math inline">\(r_i\)</span> decides how much of the past (memory) to retain. <span class="math inline">\(r_i\)</span> is commonly referred to as the ‘<em>reset gate</em>’ and <span class="math inline">\(z_i\)</span> as the ‘<em>update gate</em>’.</p>
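<p>A single GRU update can be coded almost verbatim from the equations above. Everything below (weights, biases, input) is a random placeholder chosen for illustration only.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># One GRU step; all weights and inputs are illustrative placeholders
set.seed(42)
sig <- function(u) 1 / (1 + exp(-u))              # Sigmoid activation
K <- 3                                            # Nb features (illustrative)
x_i <- rnorm(K); y_prev <- 0.1                    # Current input and previous output
w_z <- rnorm(K); w_r <- rnorm(K); w_y <- rnorm(K) # Gate and output weights
b_z <- b_r <- b_y <- 0; u_z <- u_r <- u_y <- 0.5  # Biases and recurrent weights
z_i <- sig(sum(w_z * x_i) + b_z + u_z * y_prev)   # Update gate, in (0,1)
r_i <- sig(sum(w_r * x_i) + b_r + u_r * y_prev)   # Reset gate, in (0,1)
cand <- tanh(sum(w_y * x_i) + b_y + u_y * r_i * y_prev) # Candidate value (classical RNN)
y_i <- z_i * y_prev + (1 - z_i) * cand            # Mix of past value and candidate</code></pre></div>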
<p>There are some subtleties in the training of a recurrent network. Indeed, because of the chaining between the instances, each batch must correspond to a coherent time series. A logical choice is thus one batch per asset, with instances (logically) chronologically ordered. Lastly, one option in some frameworks is to keep some memory between the batches by passing the final value of <span class="math inline">\(\tilde{y}_i\)</span> to the next batch (for which it will be <span class="math inline">\(\tilde{y}_0\)</span>). This is often referred to as the stateful mode and should be handled with care. It does not seem desirable in a portfolio prediction setting if the batch size corresponds to all observations for each asset: there is no particular link between assets. If the dataset is divided into several parts for each given asset, then the training must be handled very cautiously.</p>
<p>Recurrent networks, and LSTMs in particular, have been found to be good forecasting tools in financial contexts (see, e.g., <span class="citation">Fischer and Krauss (<a href="#ref-fischer2018deep" role="doc-biblioref">2018</a>)</span> and <span class="citation">Wang et al. (<a href="#ref-wang2019portfolio" role="doc-biblioref">2020</a>)</span>).</p>
</div>
<div id="code-and-results-2" class="section level3">
<h3><span class="header-section-number">7.5.2</span> Code and results</h3>
<p>Recurrent networks are theoretically more complex than multilayer perceptrons and, in practice, also more challenging to implement. Indeed, the serial linkages require more attention than feed-forward architectures do. In an asset pricing framework, we must separate the assets because the stock-specific time series cannot be bundled together. The learning will be sequential, one stock at a time.</p>
<p>The dimensions of variables are crucial. In Keras, they are defined for RNNs as:</p>
<ol style="list-style-type: decimal">
<li>The size of the batch: in our case, it will be the number of assets. Indeed, the recurrence relationship holds at the asset level, hence each asset represents a separate sequence on which the model will learn.<br />
</li>
<li>The time steps: in our case, it will simply be the number of dates.<br />
</li>
<li>The number of features: in our case, this is simply the number of predictors.</li>
</ol>
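<p>A toy array makes this dimension convention concrete (the figures below are arbitrary):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Arbitrary toy dimensions: (batch = assets, time steps = dates, features)
nb_assets <- 3; nb_dates <- 4; nb_preds <- 2
x <- array(rnorm(nb_assets * nb_dates * nb_preds),
           dim = c(nb_assets, nb_dates, nb_preds))
dim(x) # 3 4 2</code></pre></div>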
<p>For simplicity and in order to reduce computation times, we will use the same subset of stocks as that from Section <a href="lasso.html#sparseex">5.2.2</a>. This yields a perfectly rectangular dataset in which all dates have the same number of observations.</p>
<p>First, we create some new, intermediate variables.</p>
<div class="sourceCode" id="cb95"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb95-1"><a href="NN.html#cb95-1"></a>data_rnn <-<span class="st"> </span>data_ml <span class="op">%>%</span><span class="st"> </span><span class="co"># Dedicated dataset</span></span>
<span id="cb95-2"><a href="NN.html#cb95-2"></a><span class="st"> </span><span class="kw">filter</span>(stock_id <span class="op">%in%</span><span class="st"> </span>stock_ids_short)</span>
<span id="cb95-3"><a href="NN.html#cb95-3"></a>training_sample_rnn <-<span class="st"> </span><span class="kw">filter</span>(data_rnn, date <span class="op"><</span><span class="st"> </span>separation_date)</span>
<span id="cb95-4"><a href="NN.html#cb95-4"></a>testing_sample_rnn <-<span class="st"> </span><span class="kw">filter</span>(data_rnn, date <span class="op">></span><span class="st"> </span>separation_date)</span>
<span id="cb95-5"><a href="NN.html#cb95-5"></a>nb_stocks <-<span class="st"> </span><span class="kw">length</span>(stock_ids_short) <span class="co"># Nb stocks </span></span>
<span id="cb95-6"><a href="NN.html#cb95-6"></a>nb_feats <-<span class="st"> </span><span class="kw">length</span>(features) <span class="co"># Nb features</span></span>
<span id="cb95-7"><a href="NN.html#cb95-7"></a>nb_dates_train <-<span class="st"> </span><span class="kw">nrow</span>(training_sample) <span class="op">/</span><span class="st"> </span>nb_stocks <span class="co"># Nb training dates (size of sample)</span></span>
<span id="cb95-8"><a href="NN.html#cb95-8"></a>nb_dates_test <-<span class="st"> </span><span class="kw">nrow</span>(testing_sample) <span class="op">/</span><span class="st"> </span>nb_stocks <span class="co"># Nb testing dates</span></span></code></pre></div>
<p>Then, we construct the variables we will pass as arguments. We recall that the data file was ordered first by stocks and then by date (see Section <a href="notdata.html#dataset">1.2</a>).</p>
<div class="sourceCode" id="cb96"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb96-1"><a href="NN.html#cb96-1"></a>train_features_rnn <-<span class="st"> </span><span class="kw">array</span>(NN_train_features, <span class="co"># Formats the training data into array</span></span>
<span id="cb96-2"><a href="NN.html#cb96-2"></a> <span class="dt">dim =</span> <span class="kw">c</span>(nb_dates_train, nb_stocks, nb_feats)) <span class="op">%>%</span><span class="st"> </span><span class="co"># Tricky order</span></span>
<span id="cb96-3"><a href="NN.html#cb96-3"></a><span class="st"> </span><span class="kw">aperm</span>(<span class="kw">c</span>(<span class="dv">2</span>,<span class="dv">1</span>,<span class="dv">3</span>)) <span class="co"># The order is: stock, date, feature </span></span>
<span id="cb96-4"><a href="NN.html#cb96-4"></a>test_features_rnn <-<span class="st"> </span><span class="kw">array</span>(NN_test_features, <span class="co"># Formats the testing data into array</span></span>
<span id="cb96-5"><a href="NN.html#cb96-5"></a> <span class="dt">dim =</span> <span class="kw">c</span>(nb_dates_test, nb_stocks, nb_feats)) <span class="op">%>%</span><span class="st"> </span><span class="co"># Tricky order</span></span>
<span id="cb96-6"><a href="NN.html#cb96-6"></a><span class="st"> </span><span class="kw">aperm</span>(<span class="kw">c</span>(<span class="dv">2</span>,<span class="dv">1</span>,<span class="dv">3</span>)) <span class="co"># The order is: stock, date, feature </span></span>
<span id="cb96-7"><a href="NN.html#cb96-7"></a>train_labels_rnn <-<span class="st"> </span><span class="kw">as.matrix</span>(NN_train_labels) <span class="op">%>%</span><span class="st"> </span></span>
<span id="cb96-8"><a href="NN.html#cb96-8"></a><span class="st"> </span><span class="kw">array</span>(<span class="dt">dim =</span> <span class="kw">c</span>(nb_dates_train, nb_stocks, <span class="dv">1</span>)) <span class="op">%>%</span><span class="st"> </span><span class="kw">aperm</span>(<span class="kw">c</span>(<span class="dv">2</span>,<span class="dv">1</span>,<span class="dv">3</span>))</span>
<span id="cb96-9"><a href="NN.html#cb96-9"></a>test_labels_rnn <-<span class="st"> </span><span class="kw">as.matrix</span>(NN_test_labels) <span class="op">%>%</span><span class="st"> </span></span>
<span id="cb96-10"><a href="NN.html#cb96-10"></a><span class="st"> </span><span class="kw">array</span>(<span class="dt">dim =</span> <span class="kw">c</span>(nb_dates_test, nb_stocks, <span class="dv">1</span>)) <span class="op">%>%</span><span class="st"> </span><span class="kw">aperm</span>(<span class="kw">c</span>(<span class="dv">2</span>,<span class="dv">1</span>,<span class="dv">3</span>))</span></code></pre></div>
<p>Finally, we move on to the training stage. For simplicity, we consider a simple RNN with only one layer; its structure is outlined below. In terms of recurrence, we pick a Gated Recurrent Unit (GRU).</p>
<div class="sourceCode" id="cb97"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb97-1"><a href="NN.html#cb97-1"></a>model_RNN <-<span class="st"> </span><span class="kw">keras_model_sequential</span>() <span class="op">%>%</span><span class="st"> </span></span>
<span id="cb97-2"><a href="NN.html#cb97-2"></a><span class="st"> </span><span class="kw">layer_gru</span>(<span class="dt">units =</span> <span class="dv">16</span>, <span class="co"># Nb units in hidden layer</span></span>
<span id="cb97-3"><a href="NN.html#cb97-3"></a> <span class="dt">batch_input_shape =</span> <span class="kw">c</span>(nb_stocks, <span class="co"># Dimensions = tricky part!</span></span>
<span id="cb97-4"><a href="NN.html#cb97-4"></a> nb_dates_train, </span>
<span id="cb97-5"><a href="NN.html#cb97-5"></a> nb_feats), </span>
<span id="cb97-6"><a href="NN.html#cb97-6"></a> <span class="dt">activation =</span> <span class="st">'tanh'</span>, <span class="co"># Activation function</span></span>
<span id="cb97-7"><a href="NN.html#cb97-7"></a> <span class="dt">return_sequences =</span> <span class="ot">TRUE</span>) <span class="op">%>%</span><span class="st"> </span><span class="co"># Return all the sequence</span></span>
<span id="cb97-8"><a href="NN.html#cb97-8"></a><span class="st"> </span><span class="kw">layer_dense</span>(<span class="dt">units =</span> <span class="dv">1</span>) <span class="co"># Final aggregation layer</span></span>
<span id="cb97-9"><a href="NN.html#cb97-9"></a>model_RNN <span class="op">%>%</span><span class="st"> </span><span class="kw">compile</span>(</span>
<span id="cb97-10"><a href="NN.html#cb97-10"></a> <span class="dt">loss =</span> <span class="st">'mean_squared_error'</span>, <span class="co"># Loss = quadratic</span></span>
<span id="cb97-11"><a href="NN.html#cb97-11"></a> <span class="dt">optimizer =</span> <span class="kw">optimizer_rmsprop</span>(), <span class="co"># Backprop</span></span>
<span id="cb97-12"><a href="NN.html#cb97-12"></a> <span class="dt">metrics =</span> <span class="kw">c</span>(<span class="st">'mean_absolute_error'</span>) <span class="co"># Output metric MAE</span></span>
<span id="cb97-13"><a href="NN.html#cb97-13"></a>)</span></code></pre></div>
<p>There are many options available for recurrent layers. For GRUs, we refer to the Keras documentation <a href="https://keras.rstudio.com/reference/layer_gru.html" class="uri">https://keras.rstudio.com/reference/layer_gru.html</a>. We comment briefly on the <code>return_sequences</code> option, which we activate. In many cases, the output of a recurrent layer is simply the terminal value of the sequence. Here, if we did not require the entire sequence to be returned, the dimensions would not match, because the label is itself a full sequence.
Once the structure is determined, we can move forward to the training stage.</p>
<div class="sourceCode" id="cb98"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb98-1"><a href="NN.html#cb98-1"></a>fit_RNN <-<span class="st"> </span>model_RNN <span class="op">%>%</span><span class="st"> </span><span class="kw">fit</span>(train_features_rnn, <span class="co"># Training features </span></span>
<span id="cb98-2"><a href="NN.html#cb98-2"></a> train_labels_rnn, <span class="co"># Training labels</span></span>