ARC-AGI Challenge

This repo includes a few different approaches for the ARC-AGI challenge. These are exploratory attempts that seek to prototype a few ideas, more so than a full-fledged research project that goes in-depth on a single approach, due to time and budget constraints. Particularly, instead of running evaluation on all 400 public test cases, I've only run evaluation on randomly sampled 10 test cases (largely because I do not have a lot of OpenAI credits), so these results are not statistically significant and at best useful as preliminary results to guide proper experiments.

Below are the different approaches I've tried. For all approaches, I only used GPT-4o, solely because I only have credits on OpenAI platform.

Approaches

Evolutionary Method: This approach is based on Jeremy's approach, with some modifications based on DeepMind's FunSearch paper. For each task, I ran five generations of 5 programs each. For every generation, I took the best 2 programs evaluated on an equally weighted sum of pixel correctness and correctness (num of correct examples). I then prompted an LLM to generate new programs based on the 2 best programs by juxtaposing the two programs together as well as the results from executing those two programs on example pairs.
Evolutionary Method with Transduction: I ran the same evolutionary method as #1, but instead of prompting the LLM to generate python programs, I prompted it to directly generate the correct output grid. In this approach, given that I cannot evaluate each response on the example pairs with a verifier, I ranked the top 2 programs solely based on the pixel correctness of the output grid directly generated by the LLM. Then, I basically treated the LLM's reasoning output (which is enclosed in tags) of the two best scoring responses as natural language programs that can be juxtaposed to prompt the LLM to generate a new generation of responses the same way I did in #1. Likewise, I ran this for 5 generations with 5 programs per each generation.
Transduction & Induction Hybrid: Inspired by the results from the first place ARC Prize paper that Induction and Transduction are highly complementary, I prompted the LLM to freely choose at every step whether to generate a python program that would result in the correct transformation, or to directly return the correct output grid. I did not carry out this approach as an evolutionary method, but instead as a single-threaded LLM <> Verifier loop. If the LLM generates a python program, the verifier runs the code and returns back to the LLM the results of the code execution on example pairs. If the LLM generates a grid instead, the verifier checks the grid against the solution grid and returns back to the LLM just a binary indication of whether the grid was correct or incorrect. For both cases, the reasoning output of the LLM is included in the prompt returned by the verifier. I ran this for 25 iterations on each task.

Results

All approaches were tested on the same 10 randomly sampled test cases. The results are as follows:

Evolutionary Method: 30%
Evolutionary Method with Transduction: 0%
Transduction & Induction Hybrid: 10%

The full logs from running each method is included in the ./results folder for reference.

Discussion

1. Evolutionary Method

Even though it was tested on only 10 examples and on a different model (GPT 4o vs sonnet 3.5), the results from this approach are significantly lower than the 53% score Jeremy got with a nearly identical approach. I believe the main difference is the search time - due to time/budget constraints, I ran each generation to produce only 5 programs, which is too few to explore the large search space. In addition, another key difference is that Jeremy's approach feeds in the input grids to the LLM in various formats (e.g. ASCII, image, [list[list[int]]], etc.), whereas I only fed in the input grids in the form of [list[list[int]]] to the LLM. Lastly, Jeremy's approach combines the top 2 programs into a single revision prompt only some of the time, and also generates programs using a single parent program, whereas I combine the top 2 programs every time.

In addition, even for the tasks that the model failed to solve, I found that the average pixel correctness of each generation across all examples almost always monotically increases, which checks out with my expectation that each generation is getting the program closer to the solution.

Here is an example prompt for this method:

Here are the paired example inputs and outputs.

Example 1

Input:

[[0, 2, 2, 0],
[2, 0, 0, 0],
[0, 2, 0, 2],
[2, 2, 2, 2],
...

Here are two programs that you've previously generated, along with the results from running the programs on the examples. Both of them resulted in at least one incorrect output when they were applied to the test input.

Examine both programs carefully to determine what the issue is. Then, based on the provided programs, write a new version of the program that would result in a correct output.

You will need to carefully reason to determine the issue in both programs and determine how to fix the code. Start your response by doing this reasoning in <reasoning> tags. Then, implement the fixed transformation in code.

Make sure to wrap the code in triple backticks (e.g. ```python (code) ```).


Program 1
Code: 

def transform(grid):
...

Results from running the code:

Result for example 1:
✗ Transformation does not match expected output.

Input Grid:
[[0, 2, 2, 0],
[2, 0, 0, 0],
[0, 2, 0, 2],
[2, 2, 2, 2],
[0, 0, 2, 0],
[0, 0, 2, 2],
...

Program 2
Code: 

def transform(grid):
...

Results from running the code:

Result for example 1:
✗ Transformation does not match expected output.

Input Grid:
[[0, 2, 2, 0],
[2, 0, 0, 0],
[0, 2, 0, 2],
[2, 2, 2, 2],
[0, 0, 2, 0],
[0, 0, 2, 2],
...

Here are some things to try in the future:

Generate more programs per generation and run more generations.
Feed in the input grids in various formats to the LLM, although they are redundant.
Be concise with the prompts and prompt the model to only output the function code instead of relying on CoT with extensive reasoning. While this is contrary to the recent developments of reasoning models such as o3 and r1, interestingly, authors of the FunSearch paper do not feed in that much context about the problem into the LLM. They specifically mention:

"We believe that the LLM used within FunSearch does not use much context about the problem; the LLM should instead be seen as a source of diverse (syntactically correct) programs with occasionally interesting ideas."

I like this idea of treating the LLM as nothing but a code generator that just throws stuff at the wall (like the infinite monkey theorem, but LLMs as smarter monkeys that constrain the search space with better intuition) and offloading bulk of the search process to a verifier. I could run the evolutionary method with prompts that 1. tell the LLM to only output the code in their response without reasoning or any other tokens 2. do not include much context about the previously run programs beyond just the code (i.e. do not include the results obtained from running the previous code). The idea is to reduce the input & output tokens to the LLM so that I can generate more programs per task for a comparable cost.

2. Evolutionary Method with Transduction

This approach did not perform as well as #1, at least on the limited sample of 10. I hypothesize that the main reason this approach underperforms compared to the evolutionary method with induction is because "natural language program" that the model generates in form of "reasoning" is not as good at storing and propagating information learned across multiple generations as code. This is something I expected, but I expected that there would be some tasks that this approach would be able to solve that the evolutionary method with induction would not, based on the first place ARC Prize paper that states that induction and transduction are highly complementary. To be fair, I'm working with a severly limited sample size, but I did not see any tasks that this approach was able to solve that the evolutionary method with induction was not able to. Empirically, I also found that the model rarely draws information from the previous reasoning threads to generate a new output grid - this is something I could improve by experimenting further with prompts.

3. Transduction & Induction Hybrid

This approach also performed very poorly, not just quantatively (i.e. low score), but also qualitatively. From the logs and the model's reasoning, it appears that the constant flip-flop between the two different methods (transduction and induction) is causing the model to lose track of the solution and effectively resets the progress that it has made across multiple iterations. My original intuition was that the reasoning (enclosed in tags) from previous responses would still withold some information about the past iterations even if the format of the model's (code vs grid) output differ across iterations, but empirically, the model did not include much details about the reasoning threads from previous iterations. This led to a lot of information obtained from previous iterations being lost. However, the idea of combining induction and transduction is still compelling, and an approach that runs the two methods in parallel (as opposed to in tandem within the same iteration loop) is worth exploring in the future.

In addition, this single-threaded approach (in which the the LLM iterates on a single generated program for 25 iterations) is something worth trying with a purely inductive approach as opposed to an evolutionary approach that generates multiple programs per generation. In other words, you could run the evolutionary method with 25 generations of 1 program per each generation.

Here is an example prompt for this method:

Here are the paired example inputs and outputs, along with the additional input that you should predict the output for. 

I've also included an incorrect answer that you've returned for these examples during a previous attempt. If your previous answer was a transformation rule in code, I will also include the outputs of the transformation rule when it was applied on the examples.
If you've answered directly with an output grid instead of writing the transformation code, I will include only the incorrect output grid that you've previously returned.

You should first carefully read through your previous answer to figure out what was wrong. Explain your reasoning in detail why you think the previous answer was incorrect, and how you would fix it.

Make sure to enclose your reasoning in <reasoning> tag. Your reasoniong should NOT include the actual transformation code implementation in Python, and you should NOT try to predict the output grid based on the transformation code you wrote. I will be executing the transformation code you wrote and comparing the output to the expected output.

After your reasoning, return either the transformation code enclosed in triple backticks or an output grid enclosed in <output> tags.

Note that it is completely up to you at each attempt to decide whether to write transformation code or return the expected output grid, regardless of the format of your previous answer.
Even if your previous answer included a transformation code implementation, you may choose to answer directly with the output grid, and vice versa. However, make sure to choose only ONE of these options. DO NOT return both the code and the output grid in your answer.


A. Examples

Here are the paired example inputs and outputs.

Example 1

Input:

[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 2, 5, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 2, 8, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
...

Output:

[[0, 2, 0, 5, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0],
...


Here is the additional test input without a known output:

[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 6, 1, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 6, 7, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 3, 3, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

B. Incorrect Previous Answer

<reasoning>
To understand the transformation rule, let's analyze each example:
...
</reasoning>

<output>
[[0, 6, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
 [0, 0, 6, 0, 1, 0, 0, 1, 0, 0, 0, 0],
 [0, 0, 0, 6, 1, 1, 0, 0, 0, 0, 0, 0],
 [6, 6, 6, 6, 7, 1, 6, 6, 6, 6, 6, 6],
 [0, 0, 0, 3, 3, 1, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
</output>

C. Outputs produced by the previous answer

Result for example 1:
✗ Transformation does not match expected output.
...

Alternate approach I plan to try

DreamCoder-Style Abstraction Sleep

While running the evolutionary methods, I noticed that the LLM sometimes finds commonalities between the top 2 programs and implements modular functions that are called as subroutines within the transform function. I came across the DreamCoder paper by Ellis et al that includes an Abstraction Sleep phase, in which the authors generate higher-level program abstractions by refactoring the DSL code that the recognition model generates during its wake phase and adds them to a library of functions that the model can use.

I could modify the evolutionary method to include this Abstraction Sleep phase, in which I prompt the LLM to generate higher-level program abstractions from the programs that have been generated from the given generation and add them to a library of functions that the model can use. I could then pass down this library across generations and run the evolutionary method again with the updated library of functions. I would expect this approach to improve search efficiency by allowing the model to use the same subroutines across multiple generations as opposed to generating code from scratch. However, I would imagine this is something that would only pay off at scale, so implementing this method on the experimental scale of 10 examples is probably not going to result in much improvement.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
results		results
sample_data		sample_data
.gitignore		.gitignore
README.md		README.md
env		env
execute.py		execute.py
main.py		main.py
poetry.lock		poetry.lock
prompts.py		prompts.py
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ARC-AGI Challenge

Approaches

Results

Discussion

1. Evolutionary Method

2. Evolutionary Method with Transduction

3. Transduction & Induction Hybrid

Alternate approach I plan to try

DreamCoder-Style Abstraction Sleep

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ARC-AGI Challenge

Approaches

Results

Discussion

1. Evolutionary Method

2. Evolutionary Method with Transduction

3. Transduction & Induction Hybrid

Alternate approach I plan to try

DreamCoder-Style Abstraction Sleep

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages