<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>Tea &amp; Stats</title>
<link>https://selbydavid.com/</link>
<description>Tea, statistics and t-statistics: a data science blog by David Selby.</description>
<atom:author>
<atom:name>David Selby</atom:name>
<atom:uri>https://selbydavid.com</atom:uri>
</atom:author>
<image>
<link>https://selbydavid.com/</link>
<title>Tea &amp; Stats</title>
<url>https://selbydavid.com/logo.png</url>
</image>
<generator>Hugo -- gohugo.io</generator>
<language>en-GB</language>
<lastBuildDate>Thu, 13 Mar 2025 15:00:00 +0100</lastBuildDate>
<atom:link href="https://selbydavid.com/" rel="self" type="application/rss+xml" />
<item>
<title>Learning to Denglisch</title>
<link>https://selbydavid.com/2025/03/13/denglisch/</link>
<pubDate>Thu, 13 Mar 2025 15:00:00 +0100</pubDate>
<guid>https://selbydavid.com/2025/03/13/denglisch/</guid>
<description><p>At the railway station, a lost-looking US soldier asked me if I spoke English.
Do I? At times it feels like it, but the Germans keep me guessing.</p>
<p>Since moving to Germany, I have been continually tested on the true meanings of English words.
Here are a few examples.</p>
<h2 id="denglisch">Denglisch</h2>
<p>It turns out I&rsquo;d been using these words wrong all along.</p>
<table>
<thead>
<tr>
<th>Denglisch</th>
<th>German meaning</th>
<th>to Anglophones</th>
</tr>
</thead>
<tbody>
<tr>
<td><em>der Body</em></td>
<td>a babygrow or bodysuit</td>
<td>a corpse</td>
</tr>
<tr>
<td><em>die Bodybag</em></td>
<td>a messenger bag</td>
<td>a bag for corpses</td>
</tr>
<tr>
<td><em>das Public Viewing</em></td>
<td>an open-air screening</td>
<td>showing a corpse at a wake</td>
</tr>
<tr>
<td><em>das Shooting</em></td>
<td>a photoshoot</td>
<td>somebody getting shot</td>
</tr>
<tr>
<td><em>der Beamer</em></td>
<td>a projector</td>
<td>a BMW</td>
</tr>
<tr>
<td><em>das Gymnasium</em></td>
<td>a grammar school</td>
<td>a gym</td>
</tr>
<tr>
<td><em>Homeoffice</em></td>
<td>remote working from home</td>
<td>the interior ministry</td>
</tr>
<tr>
<td><em>das Handy</em></td>
<td>a mobile phone</td>
<td>useful</td>
</tr>
<tr>
<td><em>der Smoking</em></td>
<td>a dinner jacket</td>
<td>a smoking jacket</td>
</tr>
<tr>
<td><em>der Oldtimer</em></td>
<td>classic car</td>
<td>old person</td>
</tr>
<tr>
<td><em>das Notebook</em></td>
<td>a laptop computer</td>
<td>a notepad</td>
</tr>
<tr>
<td><em>die Mail</em></td>
<td>an email</td>
<td>snail mail</td>
</tr>
</tbody>
</table>
<p><img src="https://images.unsplash.com/photo-1535016120720-40c646be5580?q=80&amp;w=1740&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" alt="Ein Beamer" title="ein Beamer"></p>
<h2 id="scheinanglizismus">Scheinanglizismus</h2>
<p>I am grateful to German speakers for teaching me these words, which I would never have learnt otherwise.</p>
<table>
<thead>
<tr>
<th>Pseudo-anglicism</th>
<th>German definition</th>
</tr>
</thead>
<tbody>
<tr>
<td><em>Fully</em></td>
<td>a full-suspension mountain bike</td>
</tr>
<tr>
<td><em>Highboard</em></td>
<td>a tall sideboard, a drinks cabinet</td>
</tr>
<tr>
<td><em>Jobticket</em></td>
<td>a subsidized transport pass</td>
</tr>
<tr>
<td><em>Kicker</em></td>
<td>table football</td>
</tr>
<tr>
<td><em>Lowboard</em></td>
<td>a low sideboard, a media cabinet</td>
</tr>
<tr>
<td><em>Partnerlook</em></td>
<td>a couple wearing matching outfits</td>
</tr>
<tr>
<td><em>Pullunder</em></td>
<td>a sleeveless cardigan or sweater vest (US)</td>
</tr>
<tr>
<td><em>eine suggestive Frage</em></td>
<td>I don&rsquo;t live here, I&rsquo;m just visiting</td>
</tr>
<tr>
<td><em>Wellness</em></td>
<td>a spa</td>
</tr>
</tbody>
</table>
<p><img src="https://images.unsplash.com/photo-1608531428470-4471739c4359?q=80&amp;w=1727&amp;auto=format&amp;fit=crop&amp;ixlib=rb-4.0.3&amp;ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" alt="Ein Fully" title="ein Fully"></p>
<!--
## Miscellaneous
- I'd probably say *rucksack* over *backpack*.
- A *dachshund* is *ein Dackel*.
- Nobody ever says *auf Wiedersehen*.
-->
<p>I might update this list as I encounter more.</p>
</description>
</item>
<item>
<title>Turning over a new leaf</title>
<link>https://selbydavid.com/2025/03/04/overleaf/</link>
<pubDate>Tue, 04 Mar 2025 15:00:00 +0100</pubDate>
<guid>https://selbydavid.com/2025/03/04/overleaf/</guid>
<description><p><a href="https://www.overleaf.com/">Overleaf</a>, formerly known as Share$\LaTeX$, is the go-to collaborative document editor for many researchers, who have taken advantage of its free tier.
It&rsquo;s a web-based editor that compiles $\LaTeX$ documents in real time, with word-processor-style features such as commenting, tracked changes and a GUI.
However, the company has made collaboration a paid feature, meaning you either need to pay for a premium membership or find an alternative if you want to continue editing academic papers with your colleagues in real time.</p>
<p>Despite having enjoyed access to an institutional subscription to Overleaf Premium for some time, I find this a great opportunity to explore alternatives.</p>
<h2 id="collaborative-markdown-editing">Collaborative Markdown editing</h2>
<p>If you have read <a href="https://selbydavid.com/">other posts on this blog</a>, you&rsquo;ll be aware that I am a fan of literate programming using Markdown. This blog, many of <a href="https://scholar.google.com/citations?user=e7e8nfUAAAAJ&amp;hl=en">my papers</a> and <a href="http://webcat.warwick.ac.uk/record=b3690782">my PhD thesis</a> are all written in R Markdown, which integrates text and generation of plots and tables for truly reproducible writing.
<a href="https://quarto.org/">Quarto</a>, R Markdown&rsquo;s less R-centric spiritual successor, integrates a few popular extensions to provide better native support for things like equation cross-referencing and writing books.</p>
<p>One of the biggest weaknesses of (R) Markdown is its lack of a simple collaborative editing feature to rival Overleaf. Although <a href="https://www.overleaf.com/learn/latex/Knitr">Overleaf does support <strong>knitr</strong></a>, you still have to write in $\LaTeX$ syntax, which is less intuitive than Markdown and offers fewer output formats.
Some niche solutions exist, such as <a href="https://bookdown.org/yihui/rmarkdown-cookbook/google-drive.html"><strong>trackdown</strong></a>, which lets you do tracked changes via Google Drive, <del>RStudio</del> <a href="https://posit.cloud/">Posit Cloud</a>, which has a free tier allowing one shared space with up to five collaborators, and one or two experimental online R Markdown editors that I can no longer find.</p>
<p><img src="https://sharelatex-wiki-cdn-671420.c.cdn77.org/learn-scripts/images/3/39/KnitrDemo3.png" alt="Knitr in Overleaf" title="knitr in Overleaf"></p>
<p>The &lsquo;proper&rsquo; solution is of course to use a shared Git repository, but although this is apparently even how <a href="https://github.com/overleaf/overleaf/issues/10">Overleaf works behind the scenes</a>, it&rsquo;s not quite as instant or accessible as editing a Word or Overleaf document online, and it doesn&rsquo;t run your code or render the output automatically.</p>
<h2 id="visual-studio-code-live-share">Visual Studio Code Live Share</h2>
<p>However, <a href="https://code.visualstudio.com/">Visual Studio Code</a> has a few handy features that make it a promising alternative.
As well as being able to <a href="https://code.visualstudio.com/docs/remote/wsl">edit files on WSL</a> and <a href="https://code.visualstudio.com/docs/remote/ssh">on remote servers</a>, there are countless extensions for <a href="https://code.visualstudio.com/docs/languages/r">linting R code</a>, <a href="https://marketplace.visualstudio.com/items?itemName=mechatroner.rainbow-csv">highlighting columns in CSV files</a> and <a href="https://code.visualstudio.com/docs/copilot/overview">AI integration with GitHub Copilot</a>.
You can therefore write your R Markdown and Quarto documents in VS Code, just as you might have done using RStudio.
But the real killer feature is the <a href="https://visualstudio.microsoft.com/services/live-share/">Live Share extension</a>, which lets you share a workspace with others, who can edit the same files in real time, with syntax highlighting and code completion.</p>
<p><img src="https://visualstudio.microsoft.com/wp-content/uploads/2023/01/v2-Edit-Comp_FINAL-optimized750-1.gif" alt="Visual Studio Live Share" title="Visual Studio Live Share"></p>
<p>This means collaborators can edit the same document in real time, with one author hosting the live session and nobody else needing to install all the same dependencies or have access to datasets on their own machines.
When the host saves the document at the end of the session, it can then be committed to Git or your shared repository as normal, without conflicts.
And instead of the Google Docs and Overleaf chat functions, which I don&rsquo;t believe anybody in the history of this Earth has ever used, you can communicate using your usual medium, be it Teams, Slack or WhatsApp.</p>
<h2 id="github-actions">GitHub Actions</h2>
<p>The obvious limitation here is that at least one user has to be hosting a Live Share session from a computer with all relevant dependencies for rendering the document.
The other half of the solution is <a href="https://github.com/features/actions">GitHub Actions</a> (or your favourite equivalent continuous integration tool).
On committing to your repository, a script will run on a virtual machine to compile the document and save the results as an &lsquo;artifact&rsquo;.
This can then be downloaded like a software release or committed back into a branch of the repository.
It is not quite instantaneous, and can require a bit of TLC to keep it running, but it means you can edit text and commit from anywhere.</p>
<p><a href="https://github.com/quarto-dev/quarto-actions">Quarto provides several GitHub Actions</a> for rendering and publishing projects, and <a href="https://github.com/r-lib/actions/blob/v2-branch/examples/render-rmarkdown.yaml">the same are available for R Markdown</a>.</p>
<h2 id="putting-it-all-together">Putting it all together</h2>
<p><em>Coming soon: along with a bonus post on reproducible collaborative poster presentations.</em></p>
</description>
</item>
<item>
<title>World Cup 2022 results</title>
<link>https://selbydavid.com/2023/02/07/world-cup-2022-results/</link>
<pubDate>Tue, 07 Feb 2023 23:00:00 +0100</pubDate>
<guid>https://selbydavid.com/2023/02/07/world-cup-2022-results/</guid>
<description><p>There is a widespread critique that too many pundits fail to make measurable predictions.</p>
<p>For example, Philip Tetlock takes aim at what he calls <a href="https://goodjudgment.com/vague-verbiage-forecasting/"><em>vague verbiage</em></a>, the use of vaguely probabilistic phrases such as &ldquo;it&rsquo;s a real possibility&rdquo; or &ldquo;there&rsquo;s a fair chance&rdquo;.
We&rsquo;d like to think that with respect to the World Cup we escaped the crowd of vague-verbiageurs and <a href="https://selbydavid.com/world-cup-2022">nailed our colours to the mast</a>.
There&rsquo;s very little point in making predictions if you are not going to be held accountable for them. That reckoning could be from the market in the form of actually betting based on your predictions, but we are not really gambling types.</p>
<p>So, instead, in what follows, we present our review of those predictions.</p>
<p>The aim of our model was to use publicly accessible data available pre-tournament to predict the outcome of all possible matches. We chose to use betting market data as, to a first approximation, this represented the knowledge and analysis of a lot of highly informed and capable people.
Betting market odds were available for all group games, for the outright winner and for reaching the final, and it was these data that we used, giving us a total of 159 data points to work with.
We will use the (negative) log-loss metric to assess performance, where the lower your score, the better you did.
This is defined as
$$ \text{log loss}= -\sum_k \log(p_k), $$
where $p_k$ is the probability that we assigned to the outcome that was observed in match $k$.
It is a commonly used measure for judging predictions, with <a href="https://arxiv.org/pdf/1502.06254.pdf">some appealing features</a> and is the one that was used in that original RSS Euro 2020 prediction competition that inspired us to come up with the model in the first place.
We will compare the log-loss from the probabilities derived from our model with those derived from the market odds immediately prior to each match.
For group stage matches and 90-minute KO-round matches we source those from <a href="https://www.oddsportal.com/soccer/world/world-championship/results/">here</a>, and for the knock-out round outcomes (in the form of the odds for each team progressing to the next round) from <a href="https://www.oddschecker.com">oddschecker.com</a> (recorded manually at the time).
We&rsquo;ll discuss a main competing alternative to the log-loss measure later in this article.</p>
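As a minimal sketch of the metric above (with made-up probabilities, not the tournament data), log-loss can be computed as follows:

```python
import math

def log_loss(probs_of_observed):
    """Negative sum of log-probabilities assigned to the observed outcomes.

    Each entry is the probability the model gave to the outcome that
    actually occurred in that match.  Lower is better; a perfect
    predictor (every probability 1) scores 0.
    """
    return -sum(math.log(p) for p in probs_of_observed)

# Example: three matches where the observed outcome was assigned
# probability 0.5, 0.8 and 0.25 respectively.
print(round(log_loss([0.5, 0.8, 0.25]), 3))  # 2.303
```

Note that a single confident prediction that goes wrong (a small $p_k$) dominates the sum, which is why upsets are so costly under this measure.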
<h3 id="group-stage">Group stage</h3>
<p>Since we took the group game odds directly from the market, our model was not really applied to these matches.
Nevertheless it might be interesting to see how we performed.
Effectively this is a test of pre-tournament against pre-match odds. The two figures below present the cumulative and game-by-game log-loss comparisons.</p>
<p><img src="https://selbydavid.com/img/2023/eval_group_stage_cumulative.png" alt="Pre-tournament odds (&ldquo;our_neg_log-loss&rdquo;) vs. pre-match odds (&ldquo;market_neg_log-loss&rdquo;) measured in log loss. Cumulative log-loss over the course of the group stage."></p>
<p><img src="https://selbydavid.com/img/2023/eval_group_stage_by_game.png" alt="Log-loss by game. The largest losses are incurred in only three games which are highlighted by the dotted lines. These correspond to the matches (from left to right): Tunisia beating France, Cameroon beating Brazil and South Korea beating Portugal."></p>
<p>The two sets of predictions track each other very closely, with some divergence occurring towards the end of the group stage.
This seems very natural: betting odds capture the information available at the time, and more relevant information becomes available as more matches are observed.</p>
<p>One obvious version of new information would be that having seen each team play a couple of matches, we have a better sense of their tournament ability.
But, in fact, there appears to be little effect from this.
The divergence between pre-tournament and pre-match odds can be ascribed here to a single factor&mdash;asymmetrically dead rubbers.</p>
<p>As the game-by-game plot above shows, the difference is explained by just three matches: Tunisia&ndash;France, Cameroon&ndash;Brazil and South Korea&ndash;Portugal, which all ended in upsets and are indicated by the dotted lines.</p>
<p>These were precisely the matches where the favourite had already won both of their group matches so far and qualified for the next round, likely as group winners, before the final group match.
We might then reasonably expect them to rest some of their key players and generally not be so motivated.
On the other hand, the other team were super-motivated both by the possibility of a qualifying spot and of being able to topple a favourite and thus take something positive from the World Cup.
The immediate pre-match odds took account of this when compared to the pre-tournament odds.</p>
<h3 id="knock-out-stage">Knock-out stage</h3>
<p>Turning to the more interesting part, how did we do on the knock-out stage matches? As a reminder, our initial predictions for the knock-out stage that we made in <a href="https://selbydavid.com/2022/11/20/world-cup-2022">our original blog post</a> are shown below (based on running 10,000 simulations of the World Cup with the probabilities determined by our model).</p>
<p><img src="https://selbydavid.com/img/2023/knockout_stage_original.png" alt="The original predictions we made in our initial blog post based on simulating the tournament with the probabilities calculated by our model 10,000 times."></p>
<p>For comparison, we also present the actual outcomes of the knock-out stage together with the probabilities our model assigned to them.</p>
<p><img src="https://selbydavid.com/img/2023/Knockout_stage_semi_finals_actual_2.png" alt="The actual outcomes of the knock-out stages with the probabilities as predicted by our model."></p>
<p>The overall performance of our model in terms of cumulative log-loss is summarised in the plot below. Dotted lines mark the end of the Round of 16, quarter-finals and semi-finals respectively.</p>
<p><img src="https://selbydavid.com/img/2023/eval_ko_cumulative.png" alt="Cumulative log-loss. Comparison between our method (&ldquo;c_our_log_loss&rdquo;) and market odds (&ldquo;c_market_log_loss&rdquo;). Dotted lines mark the end of the Round of 16, quarter-finals and semi-finals respectively."></p>
<p>Coming into the quarter-final weekend we felt pretty good; we had correctly predicted six of the eight quarter-finalists and six of the eight group winners. In the round of sixteen, we correctly predicted seven of the eight match outcomes, with the one we got wrong being Morocco&rsquo;s victory over Spain on penalties. This is a notable success compared to <a href="https://www.youtube.com/watch?v=KjISuZ5o06Q">other attention-grabbing predictions</a>.</p>
<p>Then, as the plot above shows, it began to unravel a little with the victories of Morocco over Portugal and Croatia over Brazil. But then we made some ground back in the semi-finals and even more with Argentina&rsquo;s victory over France in the final.</p>
<p>In the end the market odds bested us by just 0.04. The median of our absolute match-level log-loss differences in the KO-stage was 0.16, so it seems not unfair to claim 0.04 as noise, and that our model based solely on pre-tournament data matched the performance of pre-match odds in prediction terms.</p>
<h3 id="an-alternative-metric">An alternative metric</h3>
<p>That is a decent result given how much more information those pre-match odds included compared to our pre-tournament data. But we&rsquo;re not content with a draw; we think we (might) deserve the win on this one.</p>
<p>One of the joys of football is the propensity for upsets. One of the authors of this post knows this well as a Coventry City fan. Two years after winning the FA Cup in 1987 (in probably the greatest ever FA Cup final match - there will never be a goal quite like <a href="https://www.youtube.com/watch?v=1Q5-ANGlhuM">Keith Houchen&rsquo;s diving header</a>), Coventry, at the time in fifth place in the top league of English football, went on to <a href="https://en.wikipedia.org/wiki/Sutton_United_2-1_Coventry_City_(1989)">lose to Sutton United</a>, a non-league team languishing in 13th place in the Vauxhall Conference, 100 places below them. These events, while rare enough to be intriguing, are more common in football than in other high-profile sports.</p>
<p>High-scoring sports such as rugby union, basketball or cricket exhibit a kind of central limit theorem for scores: a lesser-favoured team may score, and even win portions of a game, but over the entire match the aggregate score tends towards the mean with reduced variance. In contrast, football matches often have only a few goals. The randomness of the form of a striker or goalkeeper on a particular day, or the bounce of the ball, can therefore have a larger impact. Over many matches these more arbitrary elements balance out, so that shocks over a league season are much rarer (though <a href="https://www.eurosport.com/football/premier-league/2015-2016/leicester-city-s-premier-league-title-win-the-greatest-underdog-story-of-all_sto5521114/story.shtml">Leicester</a>!).</p>
<p>This might suggest that there could be a better measure for a match than goals, in the sense that it would be more reflective of a team&rsquo;s performance and would be a better predictor of future performance. This is indeed one of the claims made for the <a href="https://www.goal.com/en-gb/news/what-is-xg-football-how-statistic-calculated/h42z0iiv8mdg1ub10iisg1dju">expected goals (xG) metric</a>.</p>
<p>If we are prepared to believe this, and there seems to be some evidence for the <a href="https://www.americansocceranalysis.com/home/2022/7/19/the-replication-project-is-xg-the-best-predictor-of-future-results">claim</a>, then it would be reasonable to measure predictions against the xG outcome of a match, i.e. where the winner is the team that achieved the higher xG, since this is a better reflection of actual performance, stripped of the arbitrariness of goals.</p>
<p>For this purpose we take our xG outcomes from <a href="https://twitter.com/xGPhilosophy">@xGPhilosophy</a> on Twitter. If we re-run the log-loss measure with outcomes based on xG, what do we find? We win, and clearly! We beat the market by 1.15, with the median absolute match-level log-loss difference now being 0.10.</p>
<p><img src="https://selbydavid.com/img/2023/eval_xG_cumulative.png" alt="Cumulative log-loss over the course of the knock-out stage for xG outcomes. Dotted lines mark the end of the Round of 16, quarter-finals and semi-finals respectively."></p>
<p><img src="https://selbydavid.com/img/2023/eval_xG_by_game.png" alt="Log-loss by game for xG outcomes. The solid vertical line corresponds to the match France - Poland, which France won 3-1 in reality, but Poland won 1.81-1.22 on xG."></p>
<p>We suspect that a number of readers at this point might be a bit sceptical. You buy the argument about goals being a bit arbitrary, but some of the things that contribute towards the difference between xG and goals&mdash;striker or goalkeeper proficiency, for example&mdash;are important parts of the game.</p>
<p>Sceptics might also reasonably point out that our xG outperformance would be wiped out with the reversal of a single match: Poland-France, which Poland won 1.81-1.22 on xG, but France won 3-1 in reality (solid vertical line in the plot above). On the other hand, the argument that xG is more predictive of future performance seems quite persuasive, and this is the only one of the fifteen matches that could have reversed the finding with some other match reversals working in our favour.</p>
<p>Overall, we share the concerns, and wouldn&rsquo;t claim that xG is the right way to look at it and goals the wrong way, but rather that they both have merit and given we drew on goals and won on xG, it would seem that there might be something being captured in our crude model that is working.</p>
<p>Ultimately, if the claim that xG is a better long-term predictor than goals is correct, and our outperformance based on xG is meaningful, then if you examined enough tournaments you would expect our method to work based on goals too. We have only analysed two tournaments, so can&rsquo;t really comment on this, but they do not contradict the claim.</p>
<h3 id="lets-talk-about-money">Let&rsquo;s talk about money</h3>
<p>Before speculating on what it might be that causes the model to work, it is worth addressing the measure used here. Log-loss is a reasonable way to measure predictions, but is more of academic interest (meant literally and pejoratively).</p>
<p>In the real world, when looking at betting odds, people care more about whether they can make money.</p>
<p>In order to turn predictions into money one needs a betting scheme: a method by which one turns those predictions into bets. One can then measure the success of the predictions by the profit or loss they generate under that scheme.</p>
<p>A good place to start with a betting scheme is to consider the expected return from a bet, given your predictions. In the knock-out stage matches, we were predicting the probability of reaching the next round i.e. it was a binary outcome, either Team 1 progressed or Team 2. If $p_i$ is the probability that Team $i$ progresses then $p_2 = 1 - p_1$. Our expected return in pounds from betting £1 on Team $i$ is then,</p>
<p>$$ \mathbb{E}[\text{return}]=p_i (o_i - 1) - (1-p_i) = p_io_i - 1, $$</p>
<p>where $o_i$ are the European odds for Team $i$ winning.<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> For each match, we can calculate the expected return from backing either Team 1 or Team 2 in this way. Note that there may not be a positive expected-return bet from backing either team if the market is sufficiently wide and our predictions are sufficiently well-calibrated to the market.<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></p>
<p>A typical betting scheme would be to bet an amount, £1 say, whenever the expected return is above a specified threshold. One might argue that this threshold should be zero&mdash;you should bet whenever you think you are going to win&mdash;but often people are a bit more cautious. Models have noise and you want to identify the real opportunities.</p>
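A minimal sketch of this expected-return calculation and threshold rule, using hypothetical probabilities and odds rather than the tournament data:

```python
def expected_return(p, odds):
    """Expected return in pounds from a £1 bet.

    p: our probability that the team progresses; odds: European (decimal)
    odds.  A win pays odds - 1, a loss costs the £1 stake, so
    E[return] = p*(odds - 1) - (1 - p) = p*odds - 1.
    """
    return p * odds - 1

def place_bet(p1, o1, o2, threshold=0.05):
    """Binary outcome: p2 = 1 - p1.  Back whichever team offers the higher
    expected return, provided it exceeds the threshold; else abstain."""
    candidates = {1: expected_return(p1, o1), 2: expected_return(1 - p1, o2)}
    team, best = max(candidates.items(), key=lambda kv: kv[1])
    return team if best > threshold else None

# Our model gives team 1 a 70% chance; market odds 1.60 / 2.90.
# E[return] backing team 1: 0.7 * 1.60 - 1 = 0.12 > 0.05, so we bet.
print(place_bet(0.70, 1.60, 2.90))  # 1
```

With a well-calibrated market and a bookmaker's margin, both candidate bets can have negative expected return, in which case the scheme sits the match out.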
<p>Also betting has operational costs, the time taken to post money with bookmakers and make the bet. So often this threshold is set above zero (see, for example, <a href="https://arxiv.org/vc/arxiv/papers/1710/1710.02824v1.pdf">this paper</a>, which is very much in the spirit of our model, in using market odds as its data, and effectively takes a threshold of approximately 5%).</p>
<p>The figure below shows the return, calculated as net return divided by total money wagered for different thresholds, based on both goals and xG outcomes for confidence thresholds for which at least five bets would have been placed<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>.</p>
<p><img src="https://selbydavid.com/img/2023/betting_returns_truncated.png" alt="Betting returns as function of confidence threshold. A bet is placed if the expected return of wagering £1 exceeds the threshold. Including thresholds for which at least five bets are placed."></p>
<p>The next plot shows the number of bets placed as a function of the confidence threshold.</p>
<p><img src="https://selbydavid.com/img/2023/number_bets_taken_truncated.png" alt="The number of bets placed as a function of confidence threshold."></p>
<p>Using xG is even more tenuous here&mdash;good luck finding a bookie who will pay out based on xG&mdash;but if the argument is correct that xG is a better future predictor of outcomes then it may give a better indicator of expected performance of the betting scheme. The maximum possible return from this strategy for actual outcomes is 12.9% of the capital wagered, which is realized when using a confidence threshold between 0.07 and 0.1.</p>
<p>While picking a threshold somewhere in the range between, say, 0.05 and 0.1 might have made for a reasonable heuristic, these optimal betting returns are obviously based on hindsight.</p>
<p>Another, more serious, concern is that while there look to be some tempting returns here, we have by this point travelled well down Gelman&rsquo;s <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">garden of forking paths</a> and based our argument on very little data, even more so once the betting scheme filters it down to a handful of matches.</p>
<h3 id="why-does-the-model-work">Why does the model work?</h3>
<p>Whichever way you cut the data, and admitting its limitations given just fifteen data points, the model does seem to work well against pre-match odds, which raises the question &ldquo;why?&rdquo;</p>
<p>Often when doing data analysis one finds a pattern and is interested in determining whether it is real or just a happenstance of the data. One natural consideration is whether there are good substantive reasons for why the pattern would exist.</p>
<p>In our case, we seem to have found that a crude model based on much less information can perform comparably with, perhaps even better than, the wisdom of the market. What could account for such a phenomenon?</p>
<p>In our last post, we discussed how our model had the effect of extremising the strength of teams, so that, for example, the probability of winning the tournament was increased for the strongest teams and decreased for weaker ones when compared to betting market odds, as shown in the plot below.</p>
<p><img src="https://selbydavid.com/img/2023/winning_probabilities.png" alt="Probabilities of winning the tournament as predicted by pre-tournament market odds (x-axis) vs. winning probabilities predicted by our model based on 10,000 simulation runs (y-axis)."></p>
<p>For the purpose of predicting individual matches, this meant that in most cases the estimated probability of a favourite winning was greater under our model than under the market odds, see the next plot.</p>
<p><img src="https://selbydavid.com/img/2023/favourites_extremising.png" alt="KO stage probabilities of the favourite team winning as implied by pre-match market odds (x-axis) vs. probability of winning as predicted by our model (y-axis). The first-named team is always the favourite."></p>
<p>The most obvious explanation then would be the so-called &ldquo;favourite-longshot bias&rdquo; (FLB), for which there is an academic literature stretching back decades. Somewhat confusingly, the term &ldquo;favourite-longshot bias&rdquo; seems to be applied both to situations where backing favourites produces better returns than backing long-shots and to the exact opposite.</p>
<p>As a group of cynical statisticians, we can&rsquo;t help sniffing the whiff of publication bias. On the other hand, <a href="https://www.mdpi.com/2227-9091/9/1/22">this recent review article</a> suggests that the findings for situations like ours, where there are two competitors and a binary outcome, are more consistently in favour of better returns from backing the favourite. It offers a number of possible reasons, as do <a href="https://web.econ.ku.dk/sorensen/papers/FLBsurvey.pdf">other well-cited articles here</a> and <a href="https://www.nber.org/system/files/working_papers/w15923/w15923.pdf">here</a>.</p>
<p>Perhaps the most appealing explanations recognise that the activity of betting, and therefore the betting market, is not driven solely by a rational evaluation of the chances of different teams.</p>
<!-- As the daughter of one the authors of this blogpost put it "it's more fun when it's a surprise, isn't it?''
Whether this is the case or not, there do at least seem to be plausible explanations along these lines.
However, this does not seem to be the whole explanation. -->
<p>The figure below shows the results of a betting scheme where we back the favourite if the betting market&rsquo;s implied probability of it winning is greater than a certain percentage.
The results seem not especially persuasive for a favourite-based betting scheme, and not at all similar to the betting outcomes observed under our model, so it doesn&rsquo;t seem like a very satisfactory explanation for the success of the model, such as it is.</p>
<p><img src="https://selbydavid.com/img/2023/opportunistic_threshold_betting.png" alt="Returns for the favourites betting scheme: A bet is placed on the favourite if their probability of winning as implied by market odds exceeds a certain threshold (x-axis). Showing results for which at least five bets are placed."></p>
<p>The next plot shows the number of bets placed when backing the favourite as a function of the market-implied winning probability.</p>
<p><img src="https://selbydavid.com/img/2023/number_bets_taken_opportunistic_threshold_betting.png" alt="Number of bets placed under the favourites betting scheme as a function of market winning probability of the favourite."></p>
<p>In conclusion, we&rsquo;re not really sure what to make of this. Any conclusions based on so few matches, especially when filtered in the betting schemes, are likely to be unreliable at best.
But the results we got here are consistent with what we saw at Euro 2020, so we still think there might be something to the method. We look forward to wheeling it out again at Euro 2024 and exploring some updated methods and analysis.</p>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>For example, if the European odds are 1.6, equivalent to 3/5 in UK odds, then if you wager £1 you will get £1.60 back. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>The market being wide here refers to the gap between the price one could back or lay the same team. In a perfect market they would be the same. In practice market-makers need to make a profit and so they generally are not. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>We find it hard to make any claims beyond that point because so few data points are involved. For example, for thresholds larger than 0.35 we would only be betting on a single game, France&ndash;Poland: betting on Poland because of the phenomenal returns if the bet works out. The return from wagering £1 on Poland, had Poland won, would be £6, with an expected return of £1.44. Poland beat France on xG, but lost in reality. C&rsquo;est la vie. <a href="#fnref:3" class="footnote-backref" role="doc-backlink">&#x21a9;&#xfe0e;</a></p>
</li>
</ol>
</section>
</description>
</item>
<item>
<title>World Cup 2022 predictions</title>
<link>https://selbydavid.com/2022/11/20/world-cup-2022/</link>
<pubDate>Sun, 20 Nov 2022 22:15:00 +0100</pubDate>
<guid>https://selbydavid.com/2022/11/20/world-cup-2022/</guid>
<description><p>Sports prediction has exploded in the last couple of decades with entire <a href="https://libguides.northwestern.edu/c.php?g=403355&amp;p=2745062">journals</a>, <a href="https://medium.com/sports-data-science/the-complete-list-of-sports-analytics-conferences-5935846f98c0">conferences</a> and books devoted to it.</p>
<p>Much of this focuses on utilising ever-greater amounts of data, with soccer, for example, now providing sub-second ball and player tracking data.</p>
<p>But sometimes it is nice to try to do more with less.</p>
<p>Here we (<strong>Ian Hamilton</strong>, <strong>Stefan Stein</strong> and <strong>David Selby</strong>) describe a method that we applied to predict the results (indeed all possible match outcomes) of the Euro 2020 football tournament, that took just 120 data points and two linear regressions, yet managed to beat the market over the course of the tournament (based on the accumulated log-loss of match outcomes against predictions taken from the market odds immediately prior to the match).</p>
<p>All data analysts know the well-worn phrase: &ldquo;garbage in, garbage out&rdquo;.
The corollary to this though is that when you have good data then it is easier to make good inferences.
When one is considering predictions, what data would one most want?</p>
<p>One great form of data would be to have the predictions of a bunch of sophisticated data analysts and to be able to take some kind of conviction-weighted poll of those predictions.
Fortunately (at least in some economists' imaginations) that is exactly what a market provides (we&rsquo;re not entirely convinced by this, a point to which we will return).
Betting markets on major tournaments are highly liquid and their prices readily accessible.
They therefore provide excellent data for us to work from. But our aim here is to predict the outcome of all possible matches including those that might occur in the later rounds based on what we know pre-tournament.</p>
<p>There are, of course, no odds available for those KO matches pre-tournament, because no one knows which matches they will be.
So how can we use what we do know to construct those probabilities?</p>
<p>We begin with a foundational model in sports statistics (and the model that formed the basis of the PhDs of two of the authors of this blog post).
It was first described by <a href="https://doi.org/10.1007/BF01180541" title="Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung">Ernst Zermelo in 1929</a>, who made the crucial error of publishing the relevant article in German, allowing two upstart Americans to <a href="https://doi.org/10.2307/2334029">nab the glory nearly a quarter of a century later</a> and have the model named after them.
Thus it is known now as the <a href="https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model">Bradley&ndash;Terry model</a>.
It expresses the probability that a team $i$ beats a team $j$ as
$$
p_{ij} = \frac{\pi_i}{\pi_i + \pi_j},
$$
where $\pi_i$ is the &lsquo;strength&rsquo; of $i$. It can also be represented as a generalised linear model
$$
\text{logit}(p_{ij}) = \lambda_i - \lambda_j,
$$
where $\lambda_i = \log (\pi_i)$.</p>
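<p>As a quick numerical illustration (the strengths below are made-up numbers, not estimates from any real data), the two parameterisations give the same probability:</p>

```r
# Bradley–Terry win probability on the strength scale.
bt_prob <- function(pi_i, pi_j) pi_i / (pi_i + pi_j)

# The same probability on the logit scale, with lambda = log(pi).
bt_prob_logit <- function(lambda_i, lambda_j) plogis(lambda_i - lambda_j)

bt_prob(2, 1)                  # 2/3
bt_prob_logit(log(2), log(1))  # also 2/3
```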
<p>The Bradley&ndash;Terry model has some appealing statistical features, such as being the unique model for which the number of wins for each team is a sufficient statistic for the strength parameters (consistent with round robin ranking), and being the entropy and likelihood maximising model subject to the (highly plausible) constraint that the expected number of wins for a team given the matches observed is equal to the actual wins observed.
As <a href="https://doi.org/10.1080/00029890.1984.11971405" title="A supplement to 'A Mathematician's Guide to Popular Sports'">Stob (1984)</a> put it, &ldquo;What sort of a claim is it that a team solely on the basis of the results should have expected to win more games than they did?&rdquo;
This might be seen as failing to appreciate the bias present from finite observations.
Nevertheless, it reflects the intuitive appeal of the condition.</p>
<p>Typically the Bradley&ndash;Terry model is applied to a set of results, for the purpose of prediction or ranking.
Strength parameters can be estimated for each team by maximum likelihood estimation, using the likelihood function
$$
L(\boldsymbol{\lambda}) = \prod_{i&lt;j}\binom{m_{ij}}{c_{ij}}p_{ij}^{c_{ij}}(1-p_{ij})^{m_{ij}-c_{ij}},
$$
where $c_{ij}$ is the number of times $i$ beats $j$ and $m_{ij}= c_{ij}+c_{ji}$ is the number of matches between $i$ and $j$.</p>
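<p>As a sketch of how this maximum likelihood estimation could be carried out (the results matrix below is invented for illustration; in practice one might reach for an established package such as <strong>BradleyTerry2</strong>), we can maximise the log-likelihood directly with <code>optim()</code>, fixing one log-strength at zero for identifiability. The binomial coefficient is dropped since it does not affect the maximiser:</p>

```r
# wins[i, j] = number of times team i beat team j (made-up results).
wins <- matrix(c(0, 3, 2,
                 1, 0, 2,
                 0, 1, 0),
               nrow = 3, byrow = TRUE,
               dimnames = list(LETTERS[1:3], LETTERS[1:3]))

# Negative log-likelihood in terms of the free log-strengths
# (lambda_A is fixed at 0 for identifiability).
neg_loglik <- function(free_lambda) {
  lambda <- c(0, free_lambda)
  ll <- 0
  for (i in 1:3) for (j in 1:3) if (i != j)
    ll <- ll + wins[i, j] * plogis(lambda[i] - lambda[j], log.p = TRUE)
  -ll
}

fit <- optim(c(0, 0), neg_loglik, method = "BFGS")
exp(c(A = 0, fit$par))  # estimated strengths relative to team A
```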
<p>We could apply that method here based on historic performances, but:</p>
<ol>
<li>there are not enough recent useful results to estimate strengths reliably;</li>
<li>market prices are likely to be more informative;</li>
<li>this doesn&rsquo;t account for draws.</li>
</ol>
<p>Addressing the last of these first, the model was extended to account for draws by <a href="https://doi.org/10.2307/2283595" title="On Extending the Bradley-Terry Model to Accommodate Ties in Paired Comparison Experiments">Davidson (1970)</a>, and later <a href="https://alt3.uk/">by David Firth</a> to take account of the standard football point scheme of three for a win, one for a draw, to give,
$$
\mathbb{P}(i \text{ beats } j) = \frac{\pi_i}{\pi_i + \pi_j + \nu(\pi_i \pi_j)^{\frac{1}{3}}},
$$
$$
\mathbb{P}(i \text{ draws with } j) = \frac{\nu(\pi_i \pi_j)^{\frac{1}{3}}}{\pi_i + \pi_j + \nu(\pi_i \pi_j)^{\frac{1}{3}}}.
$$</p>
<p>Note that even with draws,
$$ \frac{p_{ij}}{p_{ji}} = \frac{\pi_i}{\pi_j} \quad \text{ or } \quad \text{logit}(p_{ij})= \lambda_i - \lambda_j. $$</p>
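<p>A small numerical sketch of this draws model (the strengths and the draw parameter below are illustrative values, not fitted ones):</p>

```r
# Outcome probabilities for i vs j under the draws model above.
match_probs <- function(pi_i, pi_j, nu) {
  draw_term <- nu * (pi_i * pi_j)^(1/3)
  denom <- pi_i + pi_j + draw_term
  c(win = pi_i / denom, draw = draw_term / denom, lose = pi_j / denom)
}

p <- match_probs(pi_i = 2, pi_j = 1, nu = 0.5)
sum(p)                    # the three probabilities sum to 1
p[["win"]] / p[["lose"]]  # odds ratio recovers pi_i / pi_j = 2
```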
<p>In the present situation, we can estimate the intra-group log-strengths $r_i=\log s_i$ by linear regression:
$$ \log \left(\frac{p_{ij}}{p_{ji}}\right) = r_i - r_j, $$
since $p_{ij}$ are known from market odds.</p>
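<p>In code, this intra-group regression might look like the following sketch, where the market probabilities are replaced by noiseless made-up values so that least squares recovers the log-strengths exactly:</p>

```r
# Estimate intra-group log-strengths r_i from pairwise log odds ratios.
teams  <- c("A", "B", "C", "D")
r_true <- c(A = 0, B = -0.4, C = -0.9, D = -1.5)  # illustrative only

pairs <- t(combn(teams, 2))
y <- r_true[pairs[, 1]] - r_true[pairs[, 2]]  # log(p_ij / p_ji)

# Design matrix: +1 for team i, -1 for team j in each pairing.
X <- matrix(0, nrow(pairs), length(teams), dimnames = list(NULL, teams))
X[cbind(seq_len(nrow(pairs)), pairs[, 1])] <- 1
X[cbind(seq_len(nrow(pairs)), pairs[, 2])] <- -1

fit <- lm(y ~ X[, -1] - 1)  # drop team A's column, i.e. fix r_A = 0
coef(fit)                   # recovers r_B, r_C, r_D
```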
<p>So now we have an estimate for the strength of a team relative to other teams in its group, but to be able to predict any possible match we need to know the strength of a team relative to all other teams.
In order to do this, we make some assumptions:</p>
<ol>
<li>Team $i$&rsquo;s overall strength $\pi_i$ is a scaling of its intra-group strength $s_i$ by a factor dependent on its group $\gamma_{G(i)}$
$$ \pi_i = \gamma_{G(i)} s_i \quad \text{ or equivalently } \quad \lambda_i = \log\gamma_{G(i)} + r_i $$</li>
<li>The strength of every team&rsquo;s unknown final opponent is the same
$$ p_{io} = \mathbb{P}(i \text{ winning tournament} \mid i \text{ reaches final})
= \frac{\pi_i}{\pi_i + \pi_o}, $$
where $\pi_o$ is the strength of the unknown final opponent.</li>
</ol>
<p>We can calculate $p_{io}$ from market odds since
$$ p_{io}
= \frac{\mathbb{P}(i \text{ winning tournament})}{\mathbb{P}(i \text{ reaches final})} $$
and both these odds &mdash; outright winner and reaching the final &mdash; are available pre-tournament. Then we have that
$$ \log \left(\frac{p_{io}}{p_{oi}}\right) = \lambda_i - \lambda_o = \log\gamma_{G(i)} + r_i - \lambda_o, $$
and we can estimate $\log\gamma_{G(i)}$ and $\lambda_o$ through linear regression.</p>
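<p>The second regression can be sketched as follows, with all numbers invented for illustration. Rearranging the display above, the quantity <code>qlogis(p_io) - r</code> equals the group effect minus the common final-opponent strength, so regressing it on group indicators estimates the group effects with the final-opponent strength absorbed into the intercept:</p>

```r
# Made-up inputs: three groups of two teams each.
group <- factor(c("G1", "G1", "G2", "G2", "G3", "G3"))
r     <- c(0, -0.5, 0, -0.3, 0, -0.8)           # intra-group log-strengths
p_io  <- c(0.60, 0.48, 0.55, 0.47, 0.40, 0.25)  # P(win final | reach final)

z <- qlogis(p_io) - r  # = log(gamma_G) - lambda_o for each team
fit <- lm(z ~ group)
coef(fit)  # intercept absorbs lambda_o; contrasts compare groups to G1
```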
<p>One thing to look out for here is that some of these odds are very large and not well-calibrated.
For example, according to <a href="https://oddschecker.com">oddschecker.com</a> the odds of Qatar winning the tournament are 500/1 and the odds of them reaching the final are 500/1, so conditional on them reaching the final the market says they have a 100% chance of winning!
This is clearly nonsense, and is due to the fact that for the more unlikely winning teams, these odds are not well-calibrated and so should be discarded.
Exactly which ones to discard is a matter of judgement.
For Euro 2020 we arbitrarily excluded teams where the outright win odds were greater than 100/1.</p>
<p>For the World Cup we graphed $\log \left(\mathbb{P} \left[i \text{ winning tournament} \mid i \text{ reaches final} \right] \right)$ against $\log\left (\mathbb{P}\left[i \text{ winning tournament}\right] \right)$ and excluded at the point where a consistent relationship seemed to start to break down.
For the World Cup we have included teams up to and including Denmark, the tenth-ranked team, whose odds of winning the tournament were 30/1.</p>
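<p>For concreteness, here is how fractional odds translate into implied probabilities, and how a long-shot cut-off might be applied (the odds below are round numbers for illustration, and the bookmaker&rsquo;s margin is ignored):</p>

```r
# Fractional odds of a/b imply a probability of b / (a + b).
implied_prob <- function(a, b = 1) b / (a + b)

implied_prob(500)  # 500/1: 1/501, about 0.002
implied_prob(30)   # 30/1: 1/31, about 0.032

# Exclude teams whose outright-win odds exceed some cut-off, e.g. 100/1
# (where to cut is a judgement call, as noted above).
outright <- c(Brazil = 3, France = 6, Denmark = 30, Qatar = 500)  # x/1 odds
names(outright)[outright <= 100]
```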
<p><img src="https://selbydavid.com/img/2022/probs_plot.png" alt="Conditional probabilities of winning the tournament"></p>
<p>Now we can calculate the strengths of each team
$$ \pi_i = \gamma_{G(i)} s_i, $$
and apply these through the Bradley-Terry model to predict the KO match results by applying
$$ p_{ij} = \frac{\pi_i}{\pi_i + \pi_j}. $$</p>
<p>Using this method, we have determined a strength parameter for each team that can then be used to simulate the outcome of each match.
We were keen to produce a route-to-the-final map, like the one in <a href="https://www.youtube.com/watch?v=KjISuZ5o06Q">the presentation that inspired us to write this post</a> (well worth a watch).</p>
<p>To get a better idea of how these probabilities play out, we simulated 10,000 runs of the World Cup with them.
Here are the empirical probabilities of winning the tournament.
Our method only gives match outcomes (win, lose, draw), so where teams were tied on points in a group, we used a procedure whereby with 50% probability the tied teams were ranked in order of their strength parameters and with 50% probability they were ranked randomly.</p>
<p><img src="https://selbydavid.com/img/2022/probability_of_winning_truncated.png" alt="Empirical probabilities of winning the tournament"></p>
<p>Brazil is clearly the favourite with a 37.9% chance of taking home the Cup, followed by Argentina, with a 24.8% chance.
Both of them are far ahead of the third-ranked (France, 6.2%) and fourth-ranked (England, 5.8%) teams.</p>
<p>Looking at the results of the simulation runs, we also extracted the &ldquo;dominant path&rdquo; most likely to lead to Brazil&rsquo;s victory.
Without further ado, here are our headline predictions:</p>
<p><img src="https://selbydavid.com/img/2022/knockout_stage.png" alt="Headline predictions"></p>
<p>As with any model, there are limitations.
It is obviously (and unapologetically) a bit crude. For example, we are calibrating the strengths based on data for 90-minute match results and applying them to the outcomes of KO matches, which may be decided after extra time or penalties.
We are also trusting betting odds to represent neutral value, with the marginal price-maker being an informed individual, but betting is often a pursuit of the heart as much as the head, and the notably different betting proclivities in different countries might suggest that some teams&rsquo; odds are likely to be skewed by this more than others.</p>
<p>But given these limitations, it is perhaps remarkable that when applied to Euro 2020 it outperformed the market (if taking the odds immediately prior to each match). Three possible explanations are:</p>
<ol>
<li>betting markets overweight in-tournament performance in their odds creation. In most football prediction, analysts have huge amounts of data to work with, since clubs play dozens of competitive matches each season. In the international arena, teams play fewer matches, against opposition of more varied quality and in more varied situations of competitiveness (qualifying matches vs friendlies vs tournament etc.), so perhaps their methods are not so well-calibrated to this sparser data scenario;</li>
<li>the equal strength final opposition assumption is a strong one that boosts the probability of well-favoured teams winning, and it happened to be that well-favoured teams did well in Euro 2020;</li>
<li>England&rsquo;s (surprising?) run to the final meant that the pre-tournament odds that (perhaps) featured a lot of heart-based backing (in possibly the biggest sports betting market in Europe) were supportive of the success of our scheme.</li>
</ol>
<p>We would be inclined to think that some combination of the last two of these explanations is most likely, but another piece of evidence comes from other predictions for that tournament.</p>
<p>Our prediction work was done as part of the <a href="https://github.com/mberk/rss-euro-2020-prediction-competition">RSS Euro 2020 prediction competition</a>.
We came second, but the winner had a <a href="https://journals.plos.org/plosone/article/authors?id=10.1371/journal.pone.0268511">completely different method based on previous match outcomes</a> and produced very similar performance to us, also beating the market, with the same being true of the third-placed entrant, who took a different approach again.</p>
<p>It&rsquo;s possible that all our approaches overly favoured favourites and that was how we came to be ranked highest.
Alternatively, perhaps markets do over-react to in-tournament performance compared to pre-tournament assessments.</p>
<p>For more information and the full table of outcome predictions, check out the <a href="https://github.com/stefan-stein/world-cup-2022">GitHub repository</a>.</p>
</description>
</item>
<item>
<title>Indexing from zero in R</title>
<link>https://selbydavid.com/2021/12/06/indexing/</link>
<pubDate>Mon, 06 Dec 2021 10:00:00 +0000</pubDate>
<guid>https://selbydavid.com/2021/12/06/indexing/</guid>
<description><p>Everybody knows that R is an inferior programming language, because vector
indices start from 1, whereas in <em>real</em> programming languages like C and Python,
<a href="https://en.wikipedia.org/wiki/Zero-based_numbering">array indexing begins from 0</a>.</p>
<p>Sometimes this can be quite annoying if a problem&mdash;be it a mathematical
algorithm or a <a href="https://selbydavid.com/2021/12/01/advent-2021/">coding challenge</a>&mdash;calls for zero-based indexing.
You find yourself having to add <code>+ 1</code> to all your indices and it&rsquo;s easy to
introduce bugs or mix up values with their positions.</p>
<p>Help is at hand.
I have worked out how to break R so utterly that it starts counting from zero
instead of from one.
Someone on Stack Overflow said <strong><a href="https://stackoverflow.com/a/25308710">“just don&rsquo;t do it!”</a></strong> so naturally I&rsquo;ve gone ahead and <em>done it</em>.</p>
<p>What&rsquo;s the first letter of the alphabet?
In a normal R session, you get:</p>
<pre><code class="language-r">x &lt;- letters
x[1]
#&gt; [1] &quot;a&quot;
</code></pre>
<p>But with my enhanced version, you get the <em>real</em> answer:</p>
<pre><code class="language-r">x &lt;- index_from_0(letters)
x[1]
#&gt; &quot;b&quot;
</code></pre>
<p>Where&rsquo;s the &ldquo;a&rdquo; then? In the zeroth position, of course:</p>
<pre><code class="language-r">x[0]
#&gt; &quot;a&quot;
</code></pre>
<p>It works for replacing elements, and for matrices as well:</p>
<pre><code class="language-r">m &lt;- index_from_0(matrix(0, 2, 2))
m[0, 1] &lt;- 42
m[1] &lt;- 7
m
#&gt; [,1] [,2]
#&gt; [1,] 0 42
#&gt; [2,] 7 0
</code></pre>
<p>This is made possible with some abuse of S3 objects to redefine the <code>`[`</code>
and <code>`[&lt;-`</code> operators such that different methods are used every time you
subset a vector assigned the special class <code>index0</code>.</p>
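<p>A minimal sketch of the mechanism (the real <strong>index0</strong> package handles more cases, such as negative and logical indices):</p>

```r
# Tag a vector with a class whose subsetting methods shift indices by one.
index_from_0 <- function(x) structure(x, class = "index0")

`[.index0` <- function(x, i) {
  structure(unclass(x)[i + 1], class = "index0")
}

`[<-.index0` <- function(x, i, value) {
  y <- unclass(x)
  y[i + 1] <- value
  structure(y, class = "index0")
}

x <- index_from_0(letters)
unclass(x[0])  # "a"
unclass(x[1])  # "b"
```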
<p>Want to try it yourself? Download the R package <a href="https://cran.r-project.org/package=index0"><strong>index0</strong> from CRAN</a>:</p>
<pre><code class="language-r">install.packages('index0')
</code></pre>
<p>Or view the source code <a href="https://github.com/Selbosh/index0">on GitHub</a>.</p>
<p>I&rsquo;m sure you&rsquo;ll agree this will be very useful.</p>
</description>
</item>
<item>
<title>Advent of Code 2021</title>
<link>https://selbydavid.com/2021/12/01/advent-2021/</link>
<pubDate>Wed, 01 Dec 2021 09:00:00 +0000</pubDate>
<guid>https://selbydavid.com/2021/12/01/advent-2021/</guid>
<description>
<script src="https://selbydavid.com/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>It’s that time of year again.
And not just for <a href="https://selbydavid.com/2016/12/07/santa/">Secret Santa</a>—it’s time for the <a href="https://adventofcode.com/">Advent of Code</a>, a series of programming
puzzles in the lead-up to Christmas.</p>
<p>I’m doing the 2021 challenge in R—in the form of an open-source <a href="https://github.com/Selbosh/adventofcode2021">R package</a>, to demonstrate a <a href="https://personalpages.manchester.ac.uk/staff/david.selby/rthritis/2021-11-19-unittesting/">test-driven</a> workflow.</p>
<div style="text-align:center;">
<div class="github-card" data-github="Selbosh/adventofcode2021" data-width="400" data-height="" data-theme="default" style="display:block; margin:0 auto;">
</div>
</div>
<script src="//cdn.jsdelivr.net/github-cards/latest/widget.js"></script>
<p>Each puzzle description typically comes with a few simple examples of inputs and outputs.
We can use these to define expectations for unit tests with the <a href="https://testthat.r-lib.org/"><strong>testthat</strong></a> package.
Once a function passes the unit tests, it should be ready to try with the main puzzle input.</p>
<p>Check my <a href="https://github.com/Selbosh/adventofcode2021"><strong>adventofcode2021</strong></a> repository on GitHub for the latest.</p>
<pre class="r"><code>remotes::install_github(&#39;Selbosh/adventofcode2021&#39;)</code></pre>
<ol style="list-style-type: decimal">
<li><a href="#day1">Sonar Sweep</a></li>
<li><a href="#day2">Dive!</a></li>
<li><a href="#day3">Binary Diagnostic</a></li>
<li><a href="#day4">Giant Squid</a></li>
<li><a href="#day5">Hydrothermal Venture</a></li>
<li><a href="#day6">Lanternfish</a></li>
<li><a href="#day7">The Treachery of Whales</a></li>
<li><a href="#day8">Seven Segment Search</a></li>
<li><a href="#day9">Smoke Basin</a></li>
<li><a href="#day10">Syntax Scoring</a></li>
<li><a href="#day11">Dumbo Octopus</a></li>
<li><a href="#day12">Passage Pathing</a></li>
<li><a href="#day13">Transparent Origami</a></li>
<li><a href="#day14">Extended Polymerization</a></li>
<li><a href="#day15">Chiton</a></li>
<li><a href="#day16">Packet Decoder</a></li>
<li><a href="#day17">Trick Shot</a></li>
<li><a href="#day18">Snailfish</a></li>
<li><a href="#day19">Beacon Scanner</a></li>
<li><a href="#day20">Trench Map</a></li>
<li><a href="#day21">Dirac Dice</a></li>
<li><a href="#day22">Reactor Reboot</a></li>
<li><a href="#day23">Amphipod</a></li>
<li><a href="#day24">Arithmetic Logic Unit</a></li>
<li><a href="#day25">Sea Cucumber</a></li>
</ol>
<div id="day1" class="section level2">
<h2>Day 1 - <a href="https://adventofcode.com/2021/day/1">Sonar Sweep</a></h2>
<div id="increases" class="section level3">
<h3>Increases</h3>
<p>To count the number of times elements are increasing in a vector it’s as simple as</p>
<pre class="r"><code>depths &lt;- c(199, 200, 208, 210, 200, 207, 240, 269, 260, 263)
sum(diff(depths) &gt; 0)</code></pre>
<pre><code>## [1] 7</code></pre>
<p>for which I defined a function called <a href="https://github.com/Selbosh/adventofcode2021/blob/main/R/day01.R#L91-L93"><code>increases</code></a>.</p>
</div>
<div id="rolling-sum" class="section level3">
<h3>Rolling sum</h3>
<p>For part two, we first want to calculate the three-depth moving sum, then we count the increases as in part one.
There are plenty of solutions in external R packages for getting lagged (and leading) vectors, for instance <code>dplyr::lag()</code> and <code>dplyr::lead()</code>:</p>
<pre class="r"><code>depths + dplyr::lead(depths) + dplyr::lead(depths, 2)</code></pre>
<pre><code>## [1] 607 618 618 617 647 716 769 792 NA NA</code></pre>
<p>Or you could even calculate the rolling sum using a pre-made solution in <strong>zoo</strong> (Z’s Ordered Observations, a time-series analysis package).</p>
<pre class="r"><code>zoo::rollsum(depths, 3)</code></pre>
<pre><code>## [1] 607 618 618 617 647 716 769 792</code></pre>
<p>To avoid loading any external packages at this stage, I defined my own base R function called <a href="https://github.com/Selbosh/adventofcode2021/blob/main/R/day01.R#L99-L101"><code>rolling_sum()</code></a>, which uses <code>tail</code> and <code>head</code> with negative lengths to omit the first and last elements of the vector:</p>
<pre class="r"><code>head(depths, -2) + head(tail(depths, -1), -1) + tail(depths, -2)</code></pre>
<pre><code>## [1] 607 618 618 617 647 716 769 792</code></pre>
<p>As <a href="https://twitter.com/schochastics/status/1466062839077027845">David Schoch points out</a>, you can just use the <code>lag</code> argument of <code>diff</code> to make this entire puzzle into a one-liner:</p>
<pre class="r"><code>sapply(c(1, 3), \(lag) sum(diff(depths, lag) &gt; 0))</code></pre>
<pre><code>## [1] 7 5</code></pre>
</div>
</div>
<div id="day2" class="section level2">
<h2>Day 2 - <a href="https://adventofcode.com/2021/day/2">Dive!</a></h2>
<div id="depth-sum" class="section level3">
<h3>Depth sum</h3>
<p>Read in the input as a two-column data frame using <code>read.table()</code>.
I gave mine nice column names, <code>cmd</code> and <code>value</code>, but this isn’t essential.</p>
<p>Then take advantage of the fact that <code>TRUE == 1</code> and <code>FALSE == 0</code> to make a mathematical <code>ifelse</code>-type statement for the horizontal and vertical movements.
In my R package, this is implemented as a function called <a href="https://github.com/Selbosh/adventofcode2021/blob/main/R/day02.R#L70-L76"><code>dive()</code></a>:</p>
<pre class="r"><code>x &lt;- (cmd == &#39;forward&#39;) * value
y &lt;- ((cmd == &#39;down&#39;) - (cmd == &#39;up&#39;)) * value
sum(x) * sum(y)</code></pre>
</div>
<div id="cumulative-depth-sum" class="section level3">
<h3>Cumulative depth sum</h3>
<p>Part two is much like part one, but now <code>y</code> represents (change in) aim, and (change in) depth is derived from that.
Don’t forget the function <code>cumsum()</code>, which can save you writing a loop!
Here is the body of my function <a href="https://github.com/Selbosh/adventofcode2021/blob/main/R/day02.R#L80-L87"><code>dive2()</code></a>:</p>
<pre class="r"><code>x &lt;- (cmd == &#39;forward&#39;) * value
y &lt;- ((cmd == &#39;down&#39;) - (cmd == &#39;up&#39;)) * value
depth &lt;- cumsum(y) * x
sum(x) * sum(depth)</code></pre>
</div>
</div>
<div id="day3" class="section level2">
<h2>Day 3 - <a href="https://adventofcode.com/2021/day/3">Binary Diagnostic</a></h2>
<div id="power-consumption" class="section level3">
<h3>Power consumption</h3>
<p>There are a few different ways you could approach part one, but my approach was first to read in the data as a data frame of binary integers using the function <code>read.fwf()</code>.
Then, find the most common value in each column using the base function <code>colMeans()</code> and rounding the result.</p>
<p>According to the instructions, in the event of a tie you should take 1 to be the most common digit.
Although this is familiar from real life—0.5 rounds up to 1—computers <a href="https://en.wikipedia.org/wiki/Rounding#Round_half_to_even">don’t work this way</a>: R rounds half to even instead (see <code>?round</code>).
Because zero is even, that means <code>round(0.5)</code> yields 0.
To get around this, add 1 before rounding, then subtract it again.</p>
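<p>A quick demonstration of the rounding behaviour and the workaround:</p>

```r
round(0.5)          # 0: R rounds halves to the nearest even number
round(1.5)          # 2
round(0.5 + 1) - 1  # 1: add one, round, subtract one to force "half up"
```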
<p>My function <a href="https://github.com/Selbosh/adventofcode2021/blob/main/R/day03.R#L78-L81"><code>power_consumption()</code></a>, which once again takes advantage of <code>TRUE</code> being equivalent to 1 and <code>FALSE</code> to 0:</p>
<pre class="r"><code>common &lt;- round(colMeans(x) + 1) - 1
binary_to_int(common) * binary_to_int(!common)</code></pre>
<p>To convert a vector of binary digits to decimal, I use the following <a href="https://github.com/Selbosh/adventofcode2021/blob/main/R/day03.R#L104-L106">utility function</a>:</p>
<pre class="r"><code>binary_to_int &lt;- function(x) {
sum(x * 2 ^ rev(seq_along(x) - 1))
}</code></pre>
<p>However, if using a string representation then there’s a handy function in base R called <code>strtoi()</code> that you could also use for this (<a href="https://twitter.com/_Riinu_/status/1466681283887648769">thanks to Riinu Pius for that tip</a>).</p>
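<p>For example, <code>strtoi()</code> parses a binary string directly, matching the digit-vector helper above:</p>

```r
strtoi("10110", base = 2)  # 22

binary_to_int <- function(x) sum(x * 2^rev(seq_along(x) - 1))
binary_to_int(c(1, 0, 1, 1, 0))  # 22
```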
</div>
<div id="life-support" class="section level3">
<h3>Life support</h3>
<p>Part two finds the common digits in a successively decreasing set of binary numbers.
A loop is appropriate here, since we can halt once there is only one number left.
As this loop will only run (at most) 12 times in total, it shouldn’t be too slow in R.</p>
<p>Function <a href="https://github.com/Selbosh/adventofcode2021/blob/main/R/day03.R#L85-L98"><code>life_support()</code></a>:</p>
<pre class="r"><code>life_support &lt;- function(x) {
oxygen &lt;- co2 &lt;- x
for (j in 1:ncol(x)) {
if (nrow(oxygen) &gt; 1) {
common &lt;- most_common(oxygen)
oxygen &lt;- oxygen[oxygen[, j] == common[j], ]
}
if (nrow(co2) &gt; 1) {
common &lt;- most_common(co2)
co2 &lt;- co2[co2[, j] != common[j], ]
}
}
binary_to_int(oxygen) * binary_to_int(co2)
}</code></pre>
<p>There might be cleverer ways of doing this.</p>
</div>
</div>
<div id="day4" class="section level2">
<h2>Day 4 - <a href="https://adventofcode.com/2021/day/4">Giant Squid</a></h2>
<div id="bingo" class="section level3">
<h3>Bingo</h3>
<p>This is one of those problems where half the battle is working out which data structure to use.
I wrote a function <a href="https://github.com/Selbosh/adventofcode2021/blob/main/R/day04.R#L80-L84"><code>read_draws()</code></a> that reads in the first line of the file to get the drawn numbers, then separately reads in the remainder of the file to get the bingo cards stacked as a data frame.
Later we take advantage of the fact that the bingo cards are square to split the data frame into a list of matrices.</p>
<pre class="r"><code>read_draws &lt;- function(file) {
draws &lt;- scan(file, sep = &#39;,&#39;, nlines = 1, quiet = TRUE)
cards &lt;- read.table(file, skip = 1)
list(draws = draws, cards = cards)
}</code></pre>
<p>As numbers are called out, I replace them in the dataset with <code>NA</code>s.
Then the helper <a href="https://github.com/Selbosh/adventofcode2021/blob/main/R/day04.R#L86-L91"><code>score_card()</code></a> counts the number of <code>NA</code>s in each row and column.
If there are not enough, we return zero, else we calculate the score.</p>
<pre class="r"><code>score_card &lt;- function(mat, draw) {
marked &lt;- is.na(mat)
if (all(c(rowMeans(marked), colMeans(marked)) != 1))
return(0)
sum(mat, na.rm = TRUE) * draw
}</code></pre>
<p>Then we put it all together, looping through the draws, replacing numbers with <code>NA</code> and halting as soon as someone wins.
Function <a href="https://github.com/Selbosh/adventofcode2021/blob/main/R/day04.R#L98-L111"><code>play_bingo()</code></a> is defined as follows, using just base R commands:</p>
<pre class="r"><code>play_bingo &lt;- function(draws, cards) {
size &lt;- ncol(cards)
ncards &lt;- nrow(cards) / size
ids &lt;- rep(1:ncards, each = size)
for (d in draws) {
cards[cards == d] &lt;- NA
score &lt;- sapply(split(cards, ids), score_card, draw = d)
if (any(score &gt; 0))
return(score[score &gt; 0])
}
}</code></pre>
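<p>As a quick sanity check, here is a toy game with two hypothetical 3×3 cards stacked into a data frame (the puzzle’s cards are 5×5, but the function only assumes they’re square).
The second card completes its top row on the fourth draw, scoring the sum of its unmarked numbers times that draw:</p>
<pre class="r"><code>draws &lt;- c(10, 11, 1, 12)
cards &lt;- as.data.frame(rbind(matrix(1:9, 3, 3, byrow = TRUE),
matrix(10:18, 3, 3, byrow = TRUE)))
play_bingo(draws, cards) # (13 + 14 + ... + 18) * 12 = 1116</code></pre>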
</div>
<div id="last-caller" class="section level3">
<h3>Last caller</h3>
<p>Part two is very similar, but we throw away each winning bingo card as we go to avoid redundant computation, eventually returning the score when there is only one left.
Here is function <a href="https://github.com/Selbosh/adventofcode2021/blob/main/R/day04.R#L115-L131"><code>play_bingo2()</code></a>, which uses the same two utility functions:</p>
<pre class="r"><code>play_bingo2 &lt;- function(draws, cards) {
size &lt;- ncol(cards)
for (d in draws) {
ncards &lt;- nrow(cards) / size
ids &lt;- rep(1:ncards, each = size)
cards[cards == d] &lt;- NA
score &lt;- sapply(split(cards, ids), score_card, draw = d)
if (any(score &gt; 0)) {
if (ncards == 1)
return(score[score &gt; 0])
cards &lt;- cards[ids %in% which(score == 0), ]
}
}
}</code></pre>
<p>Further optimizations are possible.
For example: as written, we calculate every intermediate winner’s score, but we only really need to do it for the first (part 1) and last (part 2) winners.</p>
<p>Also, we could draw more than one number at a time, as we know that nobody’s going to win until at least the fifth draw (for 5×5 cards) and from there, increment according to the minimum number of unmarked numbers on any row or column.</p>
<p>I didn’t bother implementing either of these, as it already runs quickly enough.</p>
</div>
</div>
<div id="day5" class="section level2">
<h2>Day 5 - <a href="https://adventofcode.com/2021/day/5">Hydrothermal Venture</a></h2>
<p>For a while I tried to think about clever mathematical ways to solve the system of inequalities, but this gets complicated when working on a grid, and where some segments are collinear.
In the end it worked out quicker to use what seems like a ‘brute force’ approach:
generate all the points on the line segments and then simply count how many times they appear.</p>
<p>This is a problem that really lends itself to use of <strong>tidyr</strong> functions like <a href="https://tidyr.tidyverse.org/reference/separate.html"><code>separate()</code></a> and <a href="https://tidyr.tidyverse.org/reference/nest.html"><code>unnest()</code></a>, so naturally I made life harder for myself by doing it in base R, instead.</p>
<p>First, read in the coordinates as a data frame with four columns, <code>x1</code>, <code>y1</code>, <code>x2</code> and <code>y2</code>.
The <em>nice</em> way to do this is with <code>tidyr::separate()</code> but <code>strsplit()</code> works just fine too.
Here is my parsing function, <a href="https://github.com/Selbosh/adventofcode2021/blob/main/R/day05.R#L77-L82"><code>read_segments()</code></a>:</p>
<pre class="r"><code>read_segments &lt;- function(x) {
lines &lt;- do.call(rbind, strsplit(readLines(x), &#39;( -&gt; |,)&#39;))
storage.mode(lines) &lt;- &#39;numeric&#39;
colnames(lines) &lt;- c(&#39;x1&#39;, &#39;y1&#39;, &#39;x2&#39;, &#39;y2&#39;)
as.data.frame(lines)
}</code></pre>
<p>This is one of the few puzzles where the solution to part two is essentially contained in part one.
Depending on how you implement your home-rolled <code>unnest</code>-like function, it could just be a case of filtering out the diagonal lines in part one.
I make liberal use of <code>mapply</code> for looping over two vectors at once.</p>
<p>In the penultimate line, we take advantage of vector recycling, which handles all the horizontal and vertical lines where you have multiple coordinates on one axis paired with a single coordinate on the other.
For the diagonal lines, there is a 1:1 relationship so the coordinates just bind together in pairs.
Finally, we work out how to count the rows, without using <code>dplyr::count()</code>.
If you convert to a data frame, then <code>table()</code> does this for you.</p>
<pre class="r"><code>count_intersections &lt;- function(lines, part2 = FALSE) {
if (!part2)
lines &lt;- subset(lines, x1 == x2 | y1 == y2)
x &lt;- mapply(seq, lines$x1, lines$x2)
y &lt;- mapply(seq, lines$y1, lines$y2)
xy &lt;- do.call(rbind, mapply(cbind, x, y))
sum(table(as.data.frame(xy)) &gt; 1)
}</code></pre>
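<p>Checking against the ten example segments from the puzzle text, which should give 5 overlapping points in part one and 12 in part two:</p>
<pre class="r"><code>segs &lt;- data.frame(x1 = c(0, 8, 9, 2, 7, 6, 0, 3, 0, 5),
y1 = c(9, 0, 4, 2, 0, 4, 9, 4, 0, 5),
x2 = c(5, 0, 3, 2, 7, 2, 2, 1, 8, 8),
y2 = c(9, 8, 4, 1, 4, 0, 9, 4, 8, 2))
count_intersections(segs) # 5
count_intersections(segs, part2 = TRUE) # 12</code></pre>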
<p>I’m fairly pleased to get the main solution down to <a href="https://github.com/Selbosh/adventofcode2021/blob/main/R/day05.R#L89-L96">essentially four lines of code</a>, though I’m certain that there are more computationally efficient ways of tackling this problem—if you value computer time more than your own time.</p>
<p>For the tidyverse approach, see <a href="https://twitter.com/drob/status/1467361848525787138">David Robinson’s solution</a>.</p>
</div>
<div id="day6" class="section level2">
<h2>Day 6 - <a href="https://adventofcode.com/2021/day/6">Lanternfish</a></h2>
<p>In this problem, we have many fish with internal timers.
As the instructions suggest, we will have exponential growth, so it’s not a good idea to keep track of each individual fish as you’ll soon run out of memory.
On the other hand, there are only nine possible states for any given fish to be in: the number of days until they next reproduce.
So we can store a vector that simply tallies the number of fish in each state.</p>
<p>On each day, we can shuffle the fish along the vector, decreasing the number of days for each group of fish by 1, and adding new cohorts of fish at day 6, to represent parent fish resetting their timers, and at day 8 to represent the newly hatched lanternfish.
My short function <a href="https://github.com/Selbosh/adventofcode2021/blob/main/R/day06.R#L72-L77"><code>lanternfish()</code></a>:</p>
<pre class="r"><code>lanternfish &lt;- function(x, days = 80) {
fish &lt;- as.double(table(factor(x, levels = 0:8)))
for (i in 1:days)
fish &lt;- c(fish[2:7], fish[8] + fish[1], fish[9], fish[1])
sum(fish)
}</code></pre>
<p>Because R indexes from 1 rather than 0, the element <code>fish[1]</code> represents the number of fish with 0 days left, <code>fish[2]</code> represents the number with 1 day left, and so on.
If you find this confusing, you can index from zero instead, thanks to the new <a href="https://github.com/Selbosh/index0"><strong>index0</strong> package</a>:</p>
<pre class="r"><code>lanternfish0 &lt;- function(x, days = 80) {
fish &lt;- as.double(table(factor(x, levels = 0:8)))
for (i in 1:days) {
fish &lt;- index0::index_from_0(fish)
fish &lt;- c(fish[1:6], fish[7] + fish[0], fish[8], fish[0])
}
sum(fish)
}</code></pre>
<p>There is a slightly different way to perform the updates.
<a href="https://twitter.com/drob/status/1467727330663534594">David Robinson suggested</a> an approach based on linear algebra.
Here we apply the same procedure as above, but via matrix multiplication.
It takes about the same time to run.</p>
<pre class="r"><code>lanternfish &lt;- function(x, days = 80) {
fish &lt;- table(factor(x, levels = 0:8))
mat &lt;- matrix(0, 9, 9)
mat[cbind(2:9, 1:8)] &lt;- 1 # decrease timer for fish w/ 1-8 days left
mat[1, c(7, 9)] &lt;- 1 # add &#39;new&#39; fish with 6 &amp; 8 days left
for (i in 1:days)
fish &lt;- fish %*% mat
sum(fish)
}</code></pre>
<p>Day 6 is another puzzle where the solutions for parts one and two are essentially the same.
The only thing to be careful of on part two is that you don’t run into integer overflow.
If you do, make sure the numbers you’re adding together are of type <code>double</code>.</p>
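<p>The puzzle’s worked example (initial timers 3, 4, 3, 1, 2) makes a handy regression test; restating the first version so the check stands alone, and noting that the 256-day total is far beyond <code>.Machine$integer.max</code>:</p>
<pre class="r"><code>lanternfish &lt;- function(x, days = 80) {
fish &lt;- as.double(table(factor(x, levels = 0:8)))
for (i in 1:days)
fish &lt;- c(fish[2:7], fish[8] + fish[1], fish[9], fish[1])
sum(fish)
}
lanternfish(c(3, 4, 3, 1, 2), days = 18) # 26
lanternfish(c(3, 4, 3, 1, 2), days = 80) # 5934
lanternfish(c(3, 4, 3, 1, 2), days = 256) # 26984457539</code></pre>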
</div>
<div id="day7" class="section level2">
<h2>Day 7 - <a href="https://adventofcode.com/2021/day/7">The Treachery of Whales</a></h2>
<div id="median" class="section level3">
<h3>Median</h3>
<p>While it’s possible to brute-force this puzzle by simply calculating the fuel requirement at every single point (within the range of the inputs), you can do it about 200× faster by treating it as an optimization problem.</p>
<p>The total fuel required for any potential position is</p>
<pre class="r"><code>x &lt;- scan(&#39;input.txt&#39;, sep = &#39;,&#39;)
f &lt;- function(pos) sum(abs(x - pos))</code></pre>
<p>where <code>x</code> are the initial locations of the crabs.
Then run it through <code>optimize()</code>, and round to the nearest integer position:</p>
<pre class="r"><code>sol &lt;- optimize(f, range(x))$minimum
f(round(sol))</code></pre>
<p>However, there is an even faster analytical solution!</p>
<pre class="r"><code>sol &lt;- median(x)</code></pre>
<p>Thanks to <a href="https://twitter.com/claire_little1">Claire Little</a> for pointing this out.</p>
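<p>On the example crab positions from the puzzle text, the median lands on position 2 and the total fuel cost is 37:</p>
<pre class="r"><code>x &lt;- c(16, 1, 2, 0, 4, 2, 7, 1, 2, 14)
f &lt;- function(pos) sum(abs(x - pos))
f(median(x)) # 37</code></pre>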
</div>
<div id="mean" class="section level3">
<h3>Mean</h3>
<p>Part two just has a slightly different function to optimize.
Using the formula for the sum of an <a href="https://en.wikipedia.org/wiki/Arithmetic_progression">arithmetic progression</a>:</p>
<pre class="r"><code>f2 &lt;- function(pos) {
n &lt;- abs(x - pos)
sum(n / 2 * (1 + n))
}</code></pre>
<p>Then we can simply minimize this function as before.</p>
<pre class="r"><code>sol &lt;- optimize(f2, range(x))$minimum
f2(round(sol))</code></pre>
<p>However, there’s a shortcut for this part as well!
Calculate the mean of the initial positions, and work out which of the two nearest integers gives the minimum result:</p>
<pre class="r"><code>min(
f2(floor(mean(x))),
f2(ceiling(mean(x)))
)</code></pre>
<p>Thanks to <a href="https://twitter.com/jonatanpallesen/status/1468165025575624704">Jonatan Pallesen</a>.
This is about 5 times faster than my optimizer.</p>
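<p>Again on the example positions: the mean is 4.9, and of its two neighbouring integers, position 5 gives the minimum cost of 168:</p>
<pre class="r"><code>x &lt;- c(16, 1, 2, 0, 4, 2, 7, 1, 2, 14)
f2 &lt;- function(pos) {
n &lt;- abs(x - pos)
sum(n / 2 * (1 + n))
}
min(f2(floor(mean(x))), f2(ceiling(mean(x)))) # 168</code></pre>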
<p>And here is what the functions look like for my input dataset:</p>
<p><img src="https://selbydavid.com/post/2021-12-01-advent_files/figure-html/day7-1.png" width="576" /></p>
</div>
</div>
<div id="day8" class="section level2">
<h2>Day 8 - <a href="https://adventofcode.com/2021/day/8">Seven Segment Search</a></h2>
<div id="unique-digits" class="section level3">
<h3>Unique digits</h3>
<p>Read in the data and then the first part is just a one-liner:</p>
<pre class="r"><code>input &lt;- do.call(rbind, strsplit(readLines(input_file(8)), &#39;[^a-z]+&#39;))
count_unique &lt;- function(x) {
sum(nchar(x[, -(1:10)]) %in% c(2, 3, 4, 7))
}</code></pre>
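<p>For a quick check, here is a single hypothetical entry (not from the puzzle input) whose four output digits have 2, 3, 4 and 5 segments, so three of them have the uniquely identifiable lengths:</p>
<pre class="r"><code>count_unique &lt;- function(x) sum(nchar(x[, -(1:10)]) %in% c(2, 3, 4, 7))
entry &lt;- matrix(c(rep(&#39;abcdefg&#39;, 10), &#39;ab&#39;, &#39;abc&#39;, &#39;abcd&#39;, &#39;abcde&#39;), nrow = 1)
count_unique(entry) # 3</code></pre>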
</div>
<div id="segment-matching" class="section level3">
<h3>Segment matching</h3>
<p>I <em>really</em> wanted to solve part two using graph theory, by representing the puzzle as a maximum bipartite matching problem.
However, I couldn’t quite get this to work.
My final solution is instead just a lot of leg work.</p>
<p>Essentially you solve the problem by hand and then encode the process programmatically.
Recognize that some digits have segments in common, or not in common, and use this to eliminate the possibilities.
I stored the solutions in a named vector, which I was able to use to look up the digits found so far.</p>
<p>The function <code>setdiff()</code> comes in useful.</p>
<pre class="r"><code>contains &lt;- function(strings, letters) {
vapply(strsplit(strings, &#39;&#39;),
function(s) all(strsplit(letters, &#39;&#39;)[[1]] %in% s),
logical(1))
}
output_value &lt;- function(vec) {
segments &lt;- c(&#39;abcefg&#39;, &#39;cf&#39;, &#39;acdeg&#39;, &#39;acdfg&#39;, &#39;bcdf&#39;,
&#39;abdfg&#39;, &#39;abdefg&#39;, &#39;acf&#39;, &#39;abcdefg&#39;, &#39;abcdfg&#39;)
nchars &lt;- setNames(nchar(segments), 0:9)
# Sort the strings
vec &lt;- sapply(strsplit(vec, &#39;&#39;), function(d) paste(sort(d), collapse = &#39;&#39;))
sgn &lt;- head(vec, 10)
out &lt;- tail(vec, 4)
# Store the known values
digits &lt;- setNames(character(10), 0:9)
unique &lt;- c(&#39;1&#39;, &#39;4&#39;, &#39;7&#39;, &#39;8&#39;)
digits[unique] &lt;- sgn[match(nchars[unique], nchar(sgn))]
# Remaining digits have 5 or 6 segments:
sgn &lt;- setdiff(sgn, digits)
digits[&#39;3&#39;] &lt;- sgn[nchar(sgn) == 5 &amp; contains(sgn, digits[&#39;1&#39;])]
digits[&#39;6&#39;] &lt;- sgn[nchar(sgn) == 6 &amp; !contains(sgn, digits[&#39;1&#39;])]
sgn &lt;- setdiff(sgn, digits)
digits[&#39;0&#39;] &lt;- sgn[nchar(sgn) == 6 &amp; !contains(sgn, digits[&#39;4&#39;])]
sgn &lt;- setdiff(sgn, digits)
digits[&#39;9&#39;] &lt;- sgn[nchar(sgn) == 6]
sgn &lt;- setdiff(sgn, digits)
digits[&#39;2&#39;] &lt;- sgn[
contains(sgn, do.call(setdiff,
unname(strsplit(digits[c(&#39;8&#39;, &#39;6&#39;)], &#39;&#39;))))
]
digits[&#39;5&#39;] &lt;- setdiff(sgn, digits)
# Combine four output digits:
as.numeric(paste(match(out, digits) - 1, collapse = &#39;&#39;))
}</code></pre>
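<p>Running this on the worked entry from the puzzle text should reproduce its stated output value of 5353:</p>
<pre class="r"><code>entry &lt;- c(&#39;acedgfb&#39;, &#39;cdfeb&#39;, &#39;gcdfa&#39;, &#39;fbcad&#39;, &#39;dab&#39;, &#39;cefabd&#39;, &#39;cdfgeb&#39;,
&#39;eafb&#39;, &#39;cagedb&#39;, &#39;ab&#39;, &#39;cdfeb&#39;, &#39;fcadb&#39;, &#39;cdfeb&#39;, &#39;cdbaf&#39;)
output_value(entry) # 5353</code></pre>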
</div>
</div>
<div id="day9" class="section level2">
<h2>Day 9 - <a href="https://adventofcode.com/2021/day/9">Smoke Basin</a></h2>
<div id="lowest-points" class="section level3">
<h3>Lowest points</h3>
<p>You can find all the lowest points with a one-liner:</p>
<pre class="r"><code>lowest &lt;- function(h) {
h &lt; cbind(h, Inf)[, -1] &amp; # right
h &lt; rbind(h, Inf)[-1, ] &amp; # down
h &lt; cbind(Inf, h[, -ncol(h)]) &amp; # left
h &lt; rbind(Inf, h[-nrow(h), ]) # up
}</code></pre>
<p>Then do <code>sum(h[lowest(h)] + 1)</code> to get the result (a low point’s risk level is one plus its height), where <code>h</code> is a numeric matrix of the input data.</p>
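<p>On the example height map from the puzzle, the four low points have heights 1, 0, 5 and 5; since each risk level is one plus the height, the total is 15.
Restating <code>lowest()</code> so the check runs on its own:</p>
<pre class="r"><code>lowest &lt;- function(h) {
h &lt; cbind(h, Inf)[, -1] &amp;
h &lt; rbind(h, Inf)[-1, ] &amp;
h &lt; cbind(Inf, h[, -ncol(h)]) &amp;
h &lt; rbind(Inf, h[-nrow(h), ])
}
rows &lt;- c(&#39;2199943210&#39;, &#39;3987894921&#39;, &#39;9856789892&#39;,
&#39;8767896789&#39;, &#39;9899965678&#39;)
h &lt;- do.call(rbind, lapply(strsplit(rows, &#39;&#39;), as.numeric))
sum(h[lowest(h)] + 1) # 15</code></pre>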
</div>
<div id="basins" class="section level3">
<h3>Basins</h3>
<p>The second part is harder and doesn’t immediately lead from the first.
Initially I thought of replacing each lowest point with <code>Inf</code>, then finding the new lowest points and repeating the process until all the basins are found.
However, the basins are simply all points where the height is <code>&lt; 9</code>, so you can find the basins in a single step.</p>
<p>The tricky part is labelling them separately, so you can count up their respective sizes.</p>
<p>The boring way of doing this is just to loop over the indices and label the points that neighbour already-labelled ones (starting with the lowest points as the initial labels), doing several passes until everything (except the 9s) is labelled.</p>
<pre class="r"><code>basins &lt;- function(h) {
l &lt;- lowest(h)
h[] &lt;- ifelse(h &lt; 9, NA, Inf)
h[l] &lt;- 1:sum(l)
while (anyNA(h)) {
for (i in 1:nrow(h)) for (j in 1:ncol(h)) {
if (is.na(h[i, j])) {
nbrs &lt;- h[cbind(c(max(i - 1, 1), min(i + 1, nrow(h)), i, i),
c(j, j, max(j - 1, 1), min(j + 1, ncol(h))))]
if (any(is.finite(nbrs)))
h[i, j] &lt;- nbrs[is.finite(nbrs)][1]
}
}
}
sizes &lt;- table(h[is.finite(h)])
head(sort(sizes, decreasing = TRUE), 3)
}</code></pre>
<p>To vectorize this in the same way as part one, we define a new binary (infix) operator <code>%c%</code>, analogous to <code>dplyr::coalesce()</code>.
What this does is replace an <code>NA</code> value (a basin not yet assigned a label) with its finite neighbour, while leaving <code>Inf</code>s (marking basin edges) alone.</p>
<pre class="r"><code>&quot;%c%&quot; &lt;- function(x, y) {
ifelse(is.infinite(x), x,
ifelse(!is.na(x), x,
ifelse(!is.infinite(y), y, x)))
}</code></pre>
<p>Then the new function for part two is as follows.
It is five times faster to run than the nested loop above.</p>
<pre class="r"><code>basins2 &lt;- function(h) {
l &lt;- lowest(h)
h[] &lt;- ifelse(h &lt; 9, NA, Inf)
h[l] &lt;- 1:sum(l)
while(anyNA(h)) {
h &lt;- h %c%
cbind(h, NA)[, -1] %c% # right
rbind(h, NA)[-1, ] %c% # down
cbind(NA, h[, -ncol(h)]) %c% # left
rbind(NA, h[-nrow(h), ]) # up
}
sizes &lt;- table(h[is.finite(h)])
head(sort(sizes, decreasing = TRUE), 3)
}</code></pre>
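<p>On the same example height map, the three largest basins have sizes 14, 9 and 9, whose product is 1134:</p>
<pre class="r"><code>rows &lt;- c(&#39;2199943210&#39;, &#39;3987894921&#39;, &#39;9856789892&#39;,
&#39;8767896789&#39;, &#39;9899965678&#39;)
h &lt;- do.call(rbind, lapply(strsplit(rows, &#39;&#39;), as.numeric))
prod(basins2(h)) # 14 * 9 * 9 = 1134</code></pre>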
<p>You can also <a href="https://twitter.com/rappa753/status/1468876602016735233">formulate this as an image analysis problem</a>, effectively treating each basin as an area of similar colour to select, or you can <a href="https://twitter.com/babeheim/status/1468898580408811525">treat it as a network theory problem and apply the <strong>igraph</strong> package</a> to find graph components.</p>
</div>
</div>
<div id="day10" class="section level2">
<h2>Day 10 - <a href="https://adventofcode.com/2021/day/10">Syntax Scoring</a></h2>
<div id="corrupt-characters" class="section level3">
<h3>Corrupt characters</h3>
<p>Whilst it’s probably possible to do the first part with some very fancy <a href="https://www.php.net/manual/en/regexp.reference.recursive.php">recursive regular expressions</a>, I don’t know how to use them.</p>
<p>Instead, my method of finding unmatched brackets is simply to search for empty pairs of brackets and successively strip them from the string.
Keep doing this until the strings stop changing.
Then, get the first closing bracket (if any), using <code>regmatches()</code>.
These are the illegal characters.</p>
<p>My function <a href="https://github.com/Selbosh/adventofcode2021/blob/main/R/day10.R#L122-L128"><code>syntax_score()</code></a> is implemented as follows:</p>
<pre class="r"><code>lines &lt;- readLines(&#39;input.txt&#39;)
old &lt;- &#39;&#39;
while (!identical(old, lines -&gt; old)) # loop until stripping stops changing lines
lines &lt;- gsub(r&#39;(\(\)|&lt;&gt;|\{\}|\[\])&#39;, &#39;&#39;, lines)
illegals &lt;- regmatches(lines, regexpr(r&#39;(\)|&gt;|\}|\])&#39;, lines))</code></pre>
<p>The syntax score is calculated using a named vector as a lookup table.</p>
<pre class="r"><code>illegal_score &lt;- c(&#39;)&#39; = 3, &#39;]&#39; = 57, &#39;}&#39; = 1197, &#39;&gt;&#39; = 25137)
sum(illegal_score[illegals])</code></pre>
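<p>The same pipeline on three short hypothetical lines (one valid, two corrupted) picks out <code>]</code> and <code>&gt;</code> as the first illegal characters:</p>
<pre class="r"><code>lines &lt;- c(&#39;()&#39;, &#39;(]&#39;, &#39;&lt;([]&gt;&#39;)
old &lt;- &#39;&#39;
while (!identical(old, lines -&gt; old))
lines &lt;- gsub(r&#39;(\(\)|&lt;&gt;|\{\}|\[\])&#39;, &#39;&#39;, lines)
illegals &lt;- regmatches(lines, regexpr(r&#39;(\)|&gt;|\}|\])&#39;, lines))
illegal_score &lt;- c(&#39;)&#39; = 3, &#39;]&#39; = 57, &#39;}&#39; = 1197, &#39;&gt;&#39; = 25137)
sum(illegal_score[illegals]) # 57 + 25137 = 25194</code></pre>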
</div>
<div id="autocomplete" class="section level3">
<h3>Autocomplete</h3>
<p>Part two starts the same, but instead of extracting the illegal characters we just throw away those lines that contain them.</p>
<pre class="r"><code>illegals &lt;- grep(r&#39;(\)|&gt;|\}|\])&#39;, lines)
chars &lt;- strsplit(lines[-illegals], &#39;&#39;)</code></pre>
<p>From here, we can calculate the scores using a <code>Reduce</code> operation (from right to left) with another lookup table.
The final answer is the median score.</p>
<pre class="r"><code>complete_score &lt;- c(&#39;(&#39; = 1, &#39;[&#39; = 2, &#39;{&#39; = 3, &#39;&lt;&#39; = 4)
scores &lt;- sapply(chars, Reduce, init = 0, right = TRUE,
f = \(c, s) 5 * s + complete_score[c])
median(scores)</code></pre>
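<p>For instance, the unclosed brackets <code>&lt;{([</code> (left over from the last incomplete line in the puzzle’s example) should be completed as <code>])}&gt;</code>, scoring 294:</p>
<pre class="r"><code>complete_score &lt;- c(&#39;(&#39; = 1, &#39;[&#39; = 2, &#39;{&#39; = 3, &#39;&lt;&#39; = 4)
opens &lt;- strsplit(&#39;&lt;{([&#39;, &#39;&#39;)[[1]]
Reduce(\(c, s) 5 * s + complete_score[c], opens, init = 0, right = TRUE) # 294</code></pre>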
<p>The function <a href="https://github.com/Selbosh/adventofcode2021/blob/main/R/day10.R#L132-L141"><code>autocomplete()</code></a> wraps it all together.</p>
</div>
</div>
<div id="day11" class="section level2">
<h2>Day 11 - <a href="https://adventofcode.com/2021/day/11">Dumbo Octopus</a></h2>
<div id="convoluted-octopuses" class="section level3">
<h3>Convoluted octopuses</h3>
<p>The process of updating the energy levels can be described using a <a href="https://en.wikipedia.org/wiki/Kernel_(image_processing)"><em>convolution matrix</em></a>.
It’s easy—<a href="https://selbydavid.com/2020/12/06/advent-2020/#day11">like on Day 11 last year</a>—to use a ready-made solution from an image analysis package for this, namely <code>OpenImageR::convolution()</code>.</p>
<p>The convolution matrix, or <em>kernel</em>, is
<span class="math display">\[\begin{bmatrix}1 &amp; 1 &amp; 1 \\1 &amp; 0 &amp; 1 \\1 &amp; 1 &amp; 1\end{bmatrix},\]</span>
which is applied to an indicator matrix of ‘flashing’ octopuses; the result is then added to the energy levels.
In R,</p>
<pre class="r"><code>kernel &lt;- matrix(1:9 != 5, 3, 3)</code></pre>
<p>So we define a function, <code>step1()</code>, that applies a single step of the energy level updating process.
Since each octopus can only flash once in a given step, we keep track of those that have already flashed, as well as those currently flashing.
A short <code>while()</code> loop repeats until no more octopuses flash.</p>
<pre class="r"><code>step1 &lt;- function(x) {
x &lt;- x + 1
flashing &lt;- flashed &lt;- x == 10
while (any(flashing)) {
x &lt;- x + OpenImageR::convolution(flashing, kernel)
flashing &lt;- x &gt; 9 &amp; !flashed
flashed &lt;- flashing | flashed
}
x[x &gt; 9] &lt;- 0
x
}</code></pre>
<p>However, there is a base R alternative to <code>OpenImageR::convolution()</code> that you can substitute in, with a negligible speed penalty (despite that package using compiled code).</p>
<pre class="r"><code>add_neighbours &lt;- function(x) {
I &lt;- nrow(x)
J &lt;- ncol(x)
cbind(x[, -1], 0) + # Right
rbind(x[-1, ], 0) + # Down
cbind(0, x[, -J]) + # Left