https://doi.org/10.1038/s41586-023-06924-6
Accelerated Article Preview
Mathematical discoveries from program
search with large language models
Received: 12 August 2023
Accepted: 30 November 2023
Accelerated Article Preview
Published online xx xx xxxx
Cite this article as: Romera-Paredes, B. et al.
Mathematical discoveries from program
search with large language models. Nature
https://doi.org/10.1038/s41586-023-06924-6
(2023)
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog,
M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang,
Omar Fawzi, Pushmeet Kohli & Alhussein Fawzi
This is a PDF file of a peer-reviewed paper that has been accepted for publication.
Although unedited, the content has been subjected to preliminary formatting. Nature
is providing this early version of the typeset paper as a service to our authors and
readers. The text and figures will undergo copyediting and a proof review before the
paper is published in its final form. Please note that during the production process
errors may be discovered which could affect the content, and all legal disclaimers
apply.
Nature | www.nature.com
Mathematical discoveries from program search with large
language models
Bernardino Romera-Paredes1∗, Mohammadamin Barekatain1∗, Alexander Novikov1∗, Matej Balog1∗, M. Pawan Kumar1∗, Emilien Dupont1∗, Francisco J. R. Ruiz1∗, Jordan S. Ellenberg2, Pengming Wang1, Omar Fawzi3, Pushmeet Kohli1, Alhussein Fawzi1∗

1 Google DeepMind, London, UK
2 University of Wisconsin-Madison, Madison, Wisconsin, USA
3 Université de Lyon (Inria, ENS Lyon, UCBL, LIP), Lyon, France
Abstract
Large Language Models (LLMs) have demonstrated tremendous capabilities in solving complex tasks, from quantitative reasoning to understanding natural language. However, LLMs
sometimes suffer from confabulations (or hallucinations) which can result in them making plausible but incorrect statements (Bang et al., 2023; Borji, 2023). This hinders the use of current
large models in scientific discovery. Here we introduce FunSearch (short for searching in the
function space), an evolutionary procedure based on pairing a pre-trained LLM with a systematic evaluator. We demonstrate the effectiveness of this approach to surpass the best known results in important problems, pushing the boundary of existing LLM-based approaches (Lehman
et al., 2022). Applying FunSearch to a central problem in extremal combinatorics — the cap
set problem — we discover new constructions of large cap sets going beyond the best known
ones, both in finite dimensional and asymptotic cases. This represents the first discoveries made
for established open problems using LLMs. We showcase the generality of FunSearch by applying it to an algorithmic problem, online bin packing, finding new heuristics that improve upon
widely used baselines. In contrast to most computer search approaches, FunSearch searches for
programs that describe how to solve a problem, rather than what the solution is. Beyond being
an effective and scalable strategy, discovered programs tend to be more interpretable than raw
solutions, enabling feedback loops between domain experts and FunSearch, and the deployment
of such programs in real-world applications.
Many problems in mathematical sciences are “easy to evaluate,” despite being typically “hard to
solve.” For example, in computer science, NP-complete optimization problems admit a polynomial-time evaluation procedure (measuring the quality of the solution), despite the widespread belief that
no polynomial-time algorithms to solve such problems exist. We focus in this paper on problems
admitting an efficient evaluate function, which measures the quality of a candidate solution. Prominent examples include the maximum independent set problem and maximum constraint satisfaction
problems (such as finding the ground state energy of a Hamiltonian). Our goal is to generate a
solve program, such that its outputs receive high scores from evaluate (when executed on inputs
of interest), and ultimately improve over the best known solutions.
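The evaluate/solve pairing described above can be made concrete with a small sketch, using the maximum independent set problem mentioned in the text. The function names follow the paper's evaluate/solve convention, but the toy graph representation and the naive greedy solve are our own illustration, not code from the paper:

```python
def evaluate(candidate: set, edges: list) -> float:
    """Score a candidate: its size if it is an independent set, else -inf.
    Checking this is cheap, even though finding an optimal set is NP-hard."""
    if any(u in candidate and v in candidate for u, v in edges):
        return float("-inf")
    return float(len(candidate))


def solve(n: int, edges: list) -> set:
    """A deliberately naive greedy solve program; FunSearch's goal is to
    evolve programs like this so that `evaluate` scores them highly."""
    chosen = set()
    for v in range(n):
        if all((v, u) not in edges and (u, v) not in edges for u in chosen):
            chosen.add(v)
    return chosen
```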
∗ Equal contributors.
While Large Language Models (LLMs) have recently seen dramatic improvements in their coding
capabilities [5–9], with applications including debugging [10, 11], solving code competitions [12, 13]
and improving code performance [14], synthesizing solve programs for open problems requires finding new ideas that are verifiably correct. This is very hard for LLMs, as they tend to confabulate or
ultimately fall short of going beyond existing results. To surpass the “nominal” capabilities of LLMs,
recent works [3] have combined them with evolutionary algorithms [15, 16], leading to important
improvements on diverse synthetic problems [17], searching for neural network architectures [18–20],
and solving puzzles [21]. Our proposed method, FunSearch, pushes the boundary of LLM-guided
evolutionary procedures to a new level: the discovery of new scientific results for established open
problems, and the discovery of new algorithms. Surpassing state-of-the-art results on established
open problems provides a clear indication that the discoveries are truly new, as opposed to being
retrieved from the LLM’s training data.
FunSearch (short for searching in the function space) combines a pre-trained (frozen) Large Language Model, whose goal is to provide creative solutions, with an evaluator, which guards against
confabulations and incorrect ideas. FunSearch iterates over these two components, evolving initial
low-scoring programs into high-scoring ones, discovering new knowledge. Key to the success of this simple procedure is a combination of multiple essential ingredients. First, we sample the best-performing programs and feed them back into prompts for the LLM to improve on; we refer to this as best-shot
prompting. Second, we start with a program in the form of a skeleton (containing boilerplate code
and potentially prior structure about the problem), and only evolve the part governing the critical
program logic. For example, by setting a greedy program skeleton, we evolve a priority function
used to make decisions at every step. Third, we maintain a large pool of diverse programs by using
an island-based evolutionary method that encourages exploration and avoids local optima. Finally,
leveraging the highly parallel nature of FunSearch, we scale it asynchronously, considerably broadening the scope of this approach to find new results, while keeping the overall cost of experiments
low.
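The four ingredients above can be sketched as a single evolutionary loop. This is a toy, single-process rendition: the asynchronous execution and the periodic island resets are omitted, and `llm_sample` and `evaluate` are stand-ins for the real components (the names are ours, not an API from the paper):

```python
import random


def funsearch_loop(initial_program, evaluate, llm_sample,
                   num_islands=10, iterations=1000, k=2):
    """Evolve programs via best-shot prompting over an island population."""
    # Each island holds (score, program) pairs, seeded with the initial program.
    islands = [[(evaluate(initial_program), initial_program)]
               for _ in range(num_islands)]
    for _ in range(iterations):
        island = random.choice(islands)          # sample one island
        parents = sorted(island)[-k:]            # best-shot: top-k programs
        prompt = "\n\n".join(prog for _, prog in parents)
        child = llm_sample(prompt)               # LLM proposes a new program
        score = evaluate(child)
        if score is not None:                    # incorrect programs discarded
            island.append((score, child))
    return max(pair for island in islands for pair in island)
```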
We show the surprising effectiveness of FunSearch on several use-cases. We consider a fundamental problem in extremal combinatorics, namely, the cap set problem [22, 23]. FunSearch demonstrates
the existence of hitherto unknown constructions that go beyond existing ones, including the largest
improvement in 20 years to the asymptotic lower bound. To the best of our knowledge, this shows
the first scientific discovery — a new piece of verifiable knowledge about a notorious scientific problem — using an LLM. Using FunSearch, we also find new algorithms for the online bin packing
problem that improve upon traditional ones on well-studied distributions of interest [24, 25], with
potential applications to improving job scheduling algorithms.
While most computer search techniques output directly what the solution is (e.g., a list of vectors
forming a cap set), FunSearch produces programs generating the solution. For structured problems,
such programs tend to be more interpretable — facilitating interactions with domain experts —
and concise — making it possible to scale to large instances — compared to a mere enumeration
of the solution. In addition, decision procedures (such as for bin packing) described by code in a
standard programming language are crucially easier to deploy compared to other types of descriptions
(e.g., neural networks), which typically require specialized hardware and for which verifying design
specifications is notoriously hard.
FunSearch
An overview of FunSearch is shown in Figure 1, and its components are described in more detail below. For more details and ablations showing the importance of each component, see Methods and Appendix A in Supplementary Information.

Specification. The input to FunSearch is a specification of the problem in the form of an evaluate function, which scores candidate solutions. In addition, we provide an initial program (which can be trivial) to evolve. While in principle these are the minimum requirements, we found that performance tends to improve significantly if we write the initial solve program in the form of a skeleton (containing boilerplate code and prior knowledge of the problem in the form of a program structure), and only use FunSearch to evolve the critical part that governs its logic. Figure 2 (a) shows an example where the skeleton takes the form of a simple greedy algorithm, and the crucial part to evolve by FunSearch is the priority function that is used to make the greedy decision at every step. This delegates to FunSearch precisely the part that is usually the hardest to come up with. While a fixed skeleton may constrain the space of programs that can be discovered, we find that it improves overall results because it focuses the LLM resources on evolving the critical part only, instead of also using the LLM to recreate already known program structures (with more opportunities for mistakes that would render the entire program incorrect). If available, the user can optionally provide additional known information about the problem at hand, in the form of docstrings, relevant primitive functions, or import packages, which FunSearch may use.

Pre-trained LLM. The LLM is the creative core of FunSearch, in charge of coming up with improvements to the functions presented in the prompt and sending these for evaluation. Perhaps surprisingly, we obtain our results with a pre-trained model, i.e., without any fine-tuning on our problems. We use Codey, an LLM built on top of the PaLM2 model family [26], which has been fine-tuned on a large corpus of code and is publicly accessible through its API [27]. Because FunSearch relies on sampling from an LLM extensively, an important performance-defining tradeoff is between the quality of the samples and the inference speed of the LLM. In practice, we have chosen to work with a fast-inference model (rather than a slower-inference, higher-quality one), and the results in the paper are obtained using a total number of samples on the order of 10^6. Beyond this tradeoff, we have empirically observed that the results obtained in this paper are not too sensitive to the exact choice of LLM, as long as it has been trained on a large enough corpus of code. See Appendix A in Supplementary Information for a comparison to StarCoder [7], a state-of-the-art open-source LLM for code.

Evaluation. Programs generated by the LLM are evaluated and scored on a set of inputs. For example, in the cap set problem (Section 2.1) the inputs are the values of the dimensionality n that we are interested in, and in combinatorial optimization (Section 2.2), the inputs correspond to different bin packing instances. The scores across different inputs are then combined into an overall score of the program using an aggregation function, such as the mean. Programs that are incorrect (did not execute within the imposed time and memory limits, or produced invalid outputs) are discarded, and the remaining scored programs are then sent to the programs database.
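A condensed sketch of such a skeleton in the spirit of Figure 2 (a), for the cap set problem discussed in Section 2.1. The exact signatures in the paper's figure may differ; the point is that only `priority` is meant to be evolved, while the greedy construction around it stays fixed:

```python
import itertools


def priority(el, n):
    """Trivial initial priority; this is the part FunSearch evolves."""
    return 0.0


def build_cap_set(n):
    """Greedily grow a cap set in Z_3^n, guided by `priority`."""
    vectors = sorted(itertools.product(range(3), repeat=n),
                     key=lambda v: priority(v, n), reverse=True)
    cap_set, blocked = [], set()
    for x in vectors:
        if x in blocked:
            continue
        # Adding x rules out, for every a already in the set, the third
        # point of the line through a and x, namely -(a + x) mod 3.
        for a in cap_set:
            blocked.add(tuple((-ai - xi) % 3 for ai, xi in zip(a, x)))
        cap_set.append(x)
    return cap_set
```

Even with the trivial constant priority, this greedy skeleton recovers an optimal cap set for n = 2.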
Programs database. The programs database keeps a population of correct programs, which are
then sampled to create prompts. Preserving and encouraging diversity of programs in the database is
crucial to enable exploration and avoid being stuck in local optima. To encourage diversity we adopt
an islands model, also known as multiple population and multiple-deme model [28, 29], a genetic
algorithm approach. A number of islands, or subpopulations, are created and evolved independently.
To sample from the program database, we first sample an island and then sample a program within
that island, favoring higher-scoring and shorter programs (see Methods for the exact mechanism).
Crucially, we let information flow between the islands by periodically discarding the programs in the
worst half of the islands (corresponding to the ones whose best individuals have the lowest scores).
We replace the programs in those islands with a new population, initialized by cloning one of the
best individuals from the surviving islands.
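The periodic island reset described above can be sketched as follows. The data layout (a list of populations of (score, program) pairs) is our own simplification of the programs database:

```python
import random


def reset_islands(islands):
    """Discard the worst half of the islands (ranked by their best score)
    and re-seed each from a champion of a surviving island. Mutates in place."""
    ranked = sorted(range(len(islands)),
                    key=lambda i: max(score for score, _ in islands[i]))
    half = len(islands) // 2
    worst, survivors = ranked[:half], ranked[half:]
    for i in worst:
        donor = islands[random.choice(survivors)]
        champion = max(donor)          # clone one of the best individuals
        islands[i] = [champion]        # fresh population for this island
```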
Prompt. New prompts are created by “best-shot prompting” from the programs database, and
are then fed to the LLM to generate a new program. We first sample k programs from a single island
in the programs database, according to the procedure described above. Sampled programs are then
sorted according to their score, and a version is assigned to each (v0 for the lowest scoring program,
v1 for the second lowest scoring, etc.). These programs are then combined into a single prompt —
with the version appended as a suffix to the function name; e.g., in the case of Figure 2 (a), this
would be priority_v0, priority_v1, ... — and the header of the function we wish to generate (e.g., priority_vk) is added to the end of the prompt. In practice, we set k = 2, as two functions
lead to better results compared to just one, with diminishing returns beyond that. Constructing a
prompt by combining several programs (as opposed to only one) enables the LLM to spot patterns
across the different programs and generalize those. Related approaches to prompt building have
been recently considered; e.g., [17], and were shown to perform well on different domains.
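Under these conventions, prompt assembly might look as follows. The renaming mechanics are our reconstruction of the description above, not code from the paper:

```python
def build_prompt(sampled_programs):
    """sampled_programs: (score, source) pairs sampled from one island.
    Returns a prompt with versioned copies plus the header to complete."""
    by_score = sorted(sampled_programs)   # v0 = lowest-scoring program
    parts = []
    for version, (_, source) in enumerate(by_score):
        parts.append(source.replace("def priority(",
                                    f"def priority_v{version}("))
    # Header of the next version, which the LLM is asked to generate:
    parts.append(f"def priority_v{len(by_score)}(el, n):")
    return "\n\n".join(parts)
```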
Distributed approach. We implement FunSearch as a distributed system that has three types of workers: a programs database, samplers, and evaluators, which communicate asynchronously. The programs database stores and serves programs, samplers generate new functions using the pre-trained LLM, while evaluators assess programs, as shown in Figure F.26 in Supplementary Information. In the example of Figure 2 (a), the programs database stores priority functions, samplers generate new implementations of priority, while evaluators score the proposals by executing the main function on user-specified inputs. Our distributed system offers several advantages: first, it naturally leverages parallelism across different tasks, e.g., LLM sampling and evaluation are performed concurrently. Second, it enables scaling to more than one sampler and evaluator; a single sampler and evaluator would be a very limiting setup, considering that evaluation can take minutes for many problems of interest. Running evaluators in parallel considerably broadens the scope of this approach to such problems. The distributed setting enables running many evaluator nodes on inexpensive CPU hardware, while a few samplers run on machines with accelerators for fast LLM inference; this keeps the overall cost and energy usage of experiments low. In our experiments, we typically use 15 samplers and 150 CPU evaluators (which can be served on 5 CPU servers, each running 32 evaluators in parallel). See Appendix A in Supplementary Information for more details. Also, due to the randomness of LLM sampling and of the evolutionary procedure, for some problems we run several experiments to get the best reported results. See Methods and Appendix A.3 in Supplementary Information for a full statistical analysis.

2 Results

We now describe some of the new discoveries made by FunSearch in two different fields: pure mathematics and applied computer science. Additional discoveries on other problems (namely, the corners problem and the Shannon capacity of cycle graphs) are presented in Appendix B in Supplementary Information. Full discovered programs are available in Appendix C in Supplementary Information.
2.1 Extremal combinatorics

We apply FunSearch to two related problems in extremal combinatorics — a branch of mathematics that studies the maximal (or minimal) possible sizes of sets satisfying certain properties.

Cap sets. The cap set problem [22], once described by Terence Tao as “perhaps my favourite open question” [30], refers to the task of finding the largest possible set of vectors in Z_3^n (known as a cap set) such that no three vectors sum to zero. Geometrically, no three points of a cap set lie on a line (see Figure 3 for an example with n = 2).

The problem has drawn much interest for a variety of reasons. For one, it is an analogue of the classical number theory problem of finding large subsets of primes in which no three are in arithmetic progression. For another, it differs from many problems in combinatorics in that there is no consensus among mathematicians regarding what the right answer should be. Finally, the problem serves as a model for the many other problems involving “three-way interactions.” For instance, progress towards improved upper bounds for the cap set problem [31, 32] immediately led to a series of other combinatorial results, e.g., on the Erdős–Rado sunflower problem [33].

The exact size of the largest possible cap set in n dimensions is known only for n ≤ 6. A brute-force approach is not practical, as the search space quickly becomes enormous with growing n: around 3^1600 for n = 8. Previous methods impose potentially suboptimal restrictions on the search space [34, 35]. In contrast, we search the full space via an algorithm skeleton that utilises a function priority : Z_3^n → R. Intuitively, this function provides a priority with which each x ∈ Z_3^n should be included in the cap set. Our algorithm starts with an empty set and iteratively adds the vector x ∈ Z_3^n with the highest priority that does not violate the cap set constraint; see Figure 2 (a). Starting from a trivial constant function, we evolve the crucial priority component of our approach to obtain large cap sets.

Using this approach we discovered cap sets of the sizes shown in Figure 4 (a). Notably, in dimension n = 8, FunSearch found a larger cap set than what was previously known, thus illustrating the power of FunSearch to discover novel constructions. This also shows the scalability of FunSearch to larger dimensions, where the previously best known construction relied on a complex combination of cap sets in lower dimensions [34, 35]. In contrast, FunSearch discovered a larger cap set from scratch, without having to be explicitly taught any way of combining cap sets. Moreover, we do not just discover the set of 512 8-dimensional vectors in itself, but a program that generates it: we show this program in Figure 4 (b). Through inspecting the code, we obtain a degree of understanding of what this set is: specifically, manual simplification of Figure 4 (b) provides the construction in Figure 4 (c). Some properties of this construction are strikingly similar to the construction of the Hill cap [36, 37], which results in the optimal 112-cap in Z_3^6.

Admissible sets. Beyond finding the size of the largest cap set c_n in dimension n, a fundamental problem in additive combinatorics [23] is determining the capacity C = sup_n c_n^{1/n}. The breakthrough result of [32] established an upper bound of C ≤ 2.756. In this work, we are interested in lower bounds on C. To this end, we use the framework of constant-weight admissible sets (or admissible sets for short) [35], which has established the current state of the art.

Formally, admissible sets A(n, w) are collections of vectors in {0, 1, 2}^n satisfying two properties: i) each vector has the same number w of non-zero elements but a unique support (thereby implying |A| ≤ (n choose w)); ii) for any three distinct vectors there is a coordinate in which their three respective values are {0, 1, 2}, {0, 0, 1}, or {0, 0, 2}. Informally, an admissible set describes how to combine cap sets in smaller dimensions into large cap sets in higher dimensions [35].
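The two defining properties can be checked mechanically. This validator is our own illustration of the definition, not part of the paper's pipeline:

```python
from itertools import combinations


def is_admissible(vectors):
    """Check the two defining properties of an admissible set A(n, w)."""
    # (i) constant weight w and pairwise distinct supports.
    supports = [tuple(i for i, x in enumerate(v) if x != 0) for v in vectors]
    if len({len(s) for s in supports}) > 1:
        return False                       # weights differ
    if len(set(supports)) != len(supports):
        return False                       # repeated support
    # (ii) every triple has a coordinate whose value multiset is
    # {0, 1, 2}, {0, 0, 1} or {0, 0, 2}.
    allowed = ([0, 1, 2], [0, 0, 1], [0, 0, 2])
    for a, b, c in combinations(vectors, 3):
        if not any(sorted((x, y, z)) in allowed for x, y, z in zip(a, b, c)):
            return False
    return True
```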
We denote the set of full-size admissible sets (with |A| = (n choose w)) as I(n, w). The current state-of-the-art [39] has relied on SAT solvers to construct large admissible sets.

As before, we evolve a function priority : {0, 1, 2}^n → R, which is used to iteratively grow admissible sets. Starting from a trivial constant function, we discover one that provides us with an I(12, 7) admissible set; the discovered program is shown in Figure 5 (b). This discovery alone already improves the lower bound on the cap set capacity from 2.2180 [39] to 2.2184. Yet, interpreting the program found by FunSearch (Figure 5 b) helps us significantly push the boundaries of what admissible sets we can construct. Specifically, we notice that the discovered priority function treats the n coordinates in a highly symmetric way, and indeed it turns out that the admissible set it constructs is preserved under independent cyclic permutations of coordinates within four disjoint groups of coordinate triples. Hereinafter we call such admissible sets symmetric (see Appendix D in Supplementary Information for a formal definition).

We now use FunSearch to directly search for symmetric admissible sets. Note that this is a more restricted but also much smaller search space, which allows for significantly higher dimensions and weights than were previously possible. This led us to discovering a full-size I(15, 10) admissible set (implying C ≥ 2.219486) and a partial admissible set in A(24, 17) of size 237 984, which implies a new lower bound on the cap set capacity of 2.2202 (see Figure 5 a). While this is the largest improvement to the lower bound in the last 20 years, we note it is still far from the upper bound, and we hope our results inspire future work on this problem.

Not only does FunSearch scale to much larger instances than traditional combinatorial solvers (see Appendix A.4 in Supplementary Information); it is also a unique feature of searching in function space that we were able to inspect the code discovered by FunSearch and infer a new insight into the problem, in the form of a new symmetry. The procedure we followed in this section is a concrete example of how LLM-based approaches can be used in mathematical sciences: FunSearch suggests a solution, which is examined by researchers, who may note features of interest. These features are used to refine the search, leading to better solutions. This process can be iterated, with both human and search consistently in the loop.

2.2 Bin packing

Combinatorial optimization is a subfield of mathematics which plays an important role across a wide range of areas, from theoretical computer science to practical problems in logistics and scheduling. While many combinatorial optimization problems are provably hard to solve for large instances, it is typically possible to achieve strong performance using heuristics to guide the search algorithm. The choice of a heuristic is crucial for obtaining strong performance, but designing a good heuristic is difficult in practice. In this section, we show that FunSearch can be used to discover effective heuristics for one of the central problems in combinatorial optimization: bin packing [4].

The goal of bin packing is to pack a set of items of various sizes into the smallest number of fixed-sized bins. Bin packing finds applications in many areas, from cutting materials to scheduling jobs on compute clusters. We focus on the online setting, where we pack an item as soon as it is received (as opposed to the offline setting, where we have access to all items in advance). Solving online bin packing problems then requires designing a heuristic for deciding which bin to assign an incoming item to.

Heuristics for online bin packing are well studied, and several variants exist with strong worst-case performance [40–45]. However, they often exhibit poor performance in practice [4]. Instead, the most commonly used heuristics for bin packing are first fit and best fit. First fit places the incoming item in the first bin with enough available space, while best fit places the item in the bin with the least available space where the item still fits. Here, we show that FunSearch discovers better heuristics
Table 1 | Fraction of excess bins used over the L2 lower bound of the optimal offline packing solution (lower is better).

             OR1     OR2     OR3     OR4     Weibull 5k   Weibull 10k   Weibull 100k
First Fit    6.42%   6.45%   5.74%   5.23%   4.23%        4.20%         4.00%
Best Fit     5.81%   6.06%   5.37%   4.94%   3.98%        3.90%         3.79%
FunSearch    5.30%   4.19%   3.11%   2.47%   0.68%        0.32%         0.03%
than first fit and best fit on simulated data.
To achieve this, we define a heuristic as a program that takes as input an item and an array
of bins (containing the remaining capacity of each bin) and returns a priority score for each bin.
The solve function picks the bin with the highest score according to the heuristic (see Figure 2 b).
FunSearch is then used to evolve this heuristic, starting from best fit.
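A condensed sketch in the spirit of Figure 2 (b). The signatures are our approximation of the figure, with best fit as the initial heuristic that FunSearch then evolves; items are assumed not to exceed the bin size:

```python
def heuristic(item, bins):
    """Best-fit scores: prefer the feasible bin with the least remaining
    capacity. This is the function FunSearch evolves."""
    return [-(cap - item) if cap >= item else float("-inf") for cap in bins]


def solve(items, bin_size=1.0):
    """Online packing: each item goes to the highest-scoring bin, opening a
    new bin when none fits. Returns the remaining capacities of used bins."""
    bins = []
    for item in items:
        scores = heuristic(item, bins)
        if not bins or max(scores) == float("-inf"):
            bins.append(bin_size)            # open a new bin
            scores = heuristic(item, bins)
        best = max(range(len(bins)), key=lambda i: scores[i])
        bins[best] -= item
    return bins
```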
We first evaluate FunSearch on the well-known OR-Library bin packing benchmarks [24], consisting of four datasets, OR1 to OR4, containing bin packing instances with an increasing number
of items (see Appendix E.4 in Supplementary Information for details). We evolve our heuristic on
a training set of generated bin packing instances with the same number of items as those in OR1
and, after the evolutionary process is concluded, test it on the OR1 to OR4 datasets. We measure
performance as the fraction of excess bins used over the L2 lower bound [46] of the optimal offline
packing solution (which is generally not achievable in the online setting).
As can be seen in Table 1, FunSearch outperforms both first fit and best fit across all datasets.
Further, the learned heuristic generalizes: even though it has only seen instances of the same size as
OR1 during training, it generalizes across problem sizes, performing even better on large instances
and widening the gap to best fit. In addition to the OR benchmarks, we also use FunSearch to evolve
heuristics on bin packing instances sampled from a Weibull distribution, as these closely follow many
real-world scheduling problems [25, 47] (see Appendix E.4 in Supplementary Information for details).
As shown in Table 1, the performance of FunSearch is very strong on this dataset, significantly
outperforming first fit and best fit across instances, as well as scaling gracefully to large instances
(being only 0.03% off the lower bound on the optimum for 100 000 items). In addition, FunSearch is
robust and consistently outperforms these baselines as shown in the statistical analysis in Appendix
A.3 in Supplementary Information.
We observed that several heuristics discovered by FunSearch use the same general strategy for
bin packing (see Figure 6 for an example). Instead of packing items into bins with the least capacity
(like best fit), the FunSearch heuristics assign items to least capacity bins only if the fit is very tight
after placing the item. Otherwise, the item is typically placed in another bin which would leave
more space after the item is placed. This strategy avoids leaving small gaps in bins that are unlikely
to ever be filled (see Appendix E.5 in Supplementary Information for example visualizations of such
packings).
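The strategy described above can be rendered as a toy priority function over bins. The threshold `tight` and all constants here are our own illustrative choices; this is not the discovered program shown in Figure 6:

```python
def tight_fit_priority(item, bins, tight=0.05):
    """Score bins for an incoming item: grab a near-perfect fit, otherwise
    prefer bins that would leave more usable space after placement."""
    scores = []
    for cap in bins:
        if cap < item:
            scores.append(float("-inf"))           # item does not fit
        elif cap - item <= tight:
            scores.append(1000.0 - (cap - item))   # very tight fit: take it
        else:
            scores.append(cap - item)              # avoid leaving small gaps
    return scores
```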
As this example demonstrates, the benefits of FunSearch extend beyond theoretical and mathematical results to practical problems like bin packing. Indeed, bin packing, and related combinatorial
optimization problems, are ubiquitous and find applications across a range of industries. We are
optimistic that FunSearch could be applied to several such use-cases with potential for real-world
impact.