GNN extension and improvements on CoqGym repository.
Note, this repo includes the codebase of Coq, SerAPI, CoqHammer, and the Coq projects in coq_projects.
- Notation
- Project Goals
- Main Contributions
- Setup and Installation
- Running train and test pipelines
- FAQ and Known Bugs
- Resources
Here is some notation used in the explanations below:
- $G = (V, E)$: graph notation, where $|V|$ is the number of vertices (nodes) and $|E|$ is the number of edges.
- `x`: $\mathbb{R}^{|V|}$, the list of node types, referenced by index into the non-terminal node information. Child-first ordering is enforced by `traverse_postorder`.
- `edge_index`: $\mathbb{R}^2 \times \mathbb{R}^{|E|}$, the edge list, referenced by index into `x`.
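To make the notation concrete, here is a minimal sketch of how a small tree could be turned into `x` and `edge_index` with child-first numbering. The `Node` class, the `tree_to_graph` helper, the toy vocabulary, and the child-to-parent edge direction are all assumptions made for illustration; the repository works on `lark.tree.Tree` objects and its own `traverse_postorder`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for an AST node (the repo uses lark.tree.Tree)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def tree_to_graph(root: Node, vocab: Dict[str, int]) -> Tuple[List[int], List[List[int]]]:
    """Return (x, edge_index) with nodes numbered in child-first (post-) order."""
    x: List[int] = []
    sources: List[int] = []
    targets: List[int] = []

    def visit(node: Node) -> int:
        # Children are numbered before their parent.
        child_ids = [visit(child) for child in node.children]
        node_id = len(x)
        x.append(vocab[node.label])
        # Assumed edge direction: child -> parent.
        for child_id in child_ids:
            sources.append(child_id)
            targets.append(node_id)
        return node_id

    visit(root)
    return x, [sources, targets]


# Toy vocabulary of node types (illustrative only).
vocab = {"app": 0, "const": 1, "var": 2}
tree = Node("app", [Node("const"), Node("var")])
x, edge_index = tree_to_graph(tree, vocab)
```

Here `x` has one entry per node and `edge_index` has two rows with one column per edge, matching the shapes in the notation above.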
Building on recent progress in graph neural networks (GNNs), we hope to improve on the original CoqGym results by replacing its TreeLSTM encoder module with various GNN implementations.
The main contributions of this repository are the pipeline modifications needed to support GNNs and the GNN implementation itself. The following sections present these modifications in detail, with comparisons to the original where relevant.
These implementations are split between many branches to provide easier management of different tests:
- `master`: Contains the original CoqGym code at the time of the fork.
- `bofb`: Contains the first implementation of graph batches.
- `rl-mods`: Modified decoder with a more expressive attention mechanism.
- `int-emb`: Contains the implementation of `IntegerFeatureEncoder` in the encoder and the modifications from `rl-mods`.
All of the above branches (except for master) have implementations of the GNN.
In order to learn using GNNs efficiently, x and edge_index information need to be extracted from the lark.tree.Tree representations of ASTs. This computation is very costly, so it is delegated to the proof extraction stage. Here are the pipeline modifications that facilitate this change:
- Modified proof step data representation from `dict` to `torch_geometric.data.Batch` objects
- Modified merge operation to facilitate `Batch` merging
- Used `torch_geometric.data.Dataset` object over `torch.utils.data.Dataset` object
- Changed saving protocol from `.pickle` to PyTorch-optimized `.pt` using `torch.save()`
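The `Batch` merging mentioned above can be pictured with plain Python lists. The sketch below mirrors the way `torch_geometric` batches graphs: node features are concatenated, each graph's edge indices are shifted by the number of nodes that came before it, and a `batch` vector records which graph each node belongs to. The `merge_graphs` helper is hypothetical and is not the repository's merge code.

```python
from typing import List, Tuple

# A graph is (x, edge_index), with edge_index = [sources, targets].
Graph = Tuple[List[int], List[List[int]]]


def merge_graphs(graphs: List[Graph]) -> Tuple[List[int], List[List[int]], List[int]]:
    """Merge several graphs into one disconnected graph plus a batch vector."""
    x: List[int] = []
    sources: List[int] = []
    targets: List[int] = []
    batch: List[int] = []
    for graph_id, (gx, (gs, gt)) in enumerate(graphs):
        offset = len(x)  # number of nodes already placed in the merged graph
        x.extend(gx)
        sources.extend(s + offset for s in gs)
        targets.extend(t + offset for t in gt)
        batch.extend([graph_id] * len(gx))
    return x, [sources, targets], batch


g1 = ([1, 2, 0], [[0, 1], [2, 2]])  # three nodes, two edges
g2 = ([2, 0], [[0], [1]])           # two nodes, one edge
merged_x, merged_edge_index, merged_batch = merge_graphs([g1, g2])
```

Because the merged graph has no edges between the original graphs, message passing over it treats each proof-step graph independently.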
More explicitly, a comparison of the proof step structures is outlined below:
CoqGym proof step:

    {
        file : str,
        proof_name : str,
        n_step : int,
        env : [
            {
                qualid : str,
                ast : lark.tree.Tree
            },
            ...
        ],
        local_context : [
            {
                ident : str,
                ast : lark.tree.Tree
            },
            ...
        ],
        goal : lark.tree.Tree,
        tactic_actions : list[int | str],
        tactic_str : str,
    }

CoqGym-GNN proof step:

    torch_geometric.data.Batch (
        x : torch.Tensor,
        edge_index : torch.Tensor,
        batch : torch.Tensor,
        # Some modifications to original CoqGym attributes
        file : str,
        proof_name : str,
        n_step : int,
        env : [
            {
                qualid : str,
                ast : lark.tree.Tree
            },
            ...
        ],
        local_context : [
            {
                ident : str,
                text : str,
                ast : lark.tree.Tree
            },
            ...
        ],
        goal : {
            id : int,
            text : str,
            ast : lark.tree.Tree
        },
        tactic_actions : int | str,
        tactic_str : str,
    )

Along with these changes, optimizations were made to the data generation process that make the extracted dataset easier to update and reduce computational resource requirements. Specifically:
- Added `filter_file` option in `iter_proofs` and its derivatives to skip loading a proof that is not being considered.
- Save files as data is generated, so a monolithic list is not needed to keep track of data.
- Added multi-processing script for both `extract_proof.py` and `evaluate.py` for more efficient use of computational resources.
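The first two points can be sketched together. The code below is an illustration under stated assumptions, not the repository's implementation: `iter_proof_files` and `extract_and_save` are hypothetical helpers, `filter_file` is assumed to be a predicate that returns `True` for files to skip, each input file is assumed to hold a JSON list of steps, and JSON output stands in for the `.pt` files written with `torch.save()`.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterator, Optional, Tuple


def iter_proof_files(data_root: Path,
                     filter_file: Optional[Callable[[Path], bool]] = None) -> Iterator[Path]:
    """Yield data files, skipping filtered ones before they are ever opened."""
    for path in sorted(data_root.rglob("*.json")):
        if filter_file is not None and filter_file(path):
            continue
        yield path


def extract_and_save(data_root: Path, out_dir: Path,
                     filter_file: Optional[Callable[[Path], bool]] = None) -> int:
    """Write each step to its own file as soon as it is produced."""
    out_dir.mkdir(parents=True, exist_ok=True)
    n_saved = 0
    for path in iter_proof_files(data_root, filter_file):
        steps = json.loads(path.read_text())
        for step in steps:
            # Saved immediately, so no monolithic list builds up in memory.
            (out_dir / f"{n_saved:08d}.json").write_text(json.dumps(step))
            n_saved += 1
    return n_saved


def demo() -> Tuple[int, list]:
    """Build a tiny fake dataset and extract it while skipping one project."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "data"
        (root / "projA").mkdir(parents=True)
        (root / "projB").mkdir(parents=True)
        (root / "projA" / "a.json").write_text(json.dumps([{"n_step": 0}, {"n_step": 1}]))
        (root / "projB" / "b.json").write_text(json.dumps([{"n_step": 0}]))
        out = Path(tmp) / "proof_steps"
        n = extract_and_save(root, out, filter_file=lambda p: p.parent.name == "projB")
        return n, sorted(p.name for p in out.iterdir())
```

Writing one file per step also means an interrupted run keeps everything extracted so far, and a single project can be regenerated without touching the rest.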
Some notable design modifications were made for testing purposes in both the new GNN encoder and the existing RL pipeline:
- 2-layer GNN with modular convolutions
  - Multi-headed Graph Attention (GAT) or GraphSAGE convolutions
- Used `torch_geometric.graphgym.models.encoder.IntegerFeatureEncoder` over one-hot encodings of node types
- Increased expressiveness of attention module in the decoder
  - Added an extra layer
  - Used PReLU activations between each layer
  - Used batch normalization within each layer
CoqGym has many dependencies and is nontrivial to set up correctly. The following instructions detail how to obtain the CoqGym dataset and build the interaction environment natively.
- OCaml Package Manager (OPAM) is used to install OCaml and the corresponding packages.
- Lightning Memory-Mapped Database (LMDB) is used to store S-expressions in `*.json` files.
- Create an OPAM switch for OCaml 4.07.1+flambda: `opam switch create 4.07.1+flambda && eval $(opam env)`
- Clone the repository: `git clone https://github.com/danjenson/CoqGym-GNN.git`
- Install Coq, SerAPI and CoqHammer: `cd CoqGym && source install.sh`
- Build the Coq projects (can take a while): `cd coq_projects && make && cd ..`
- Set up the Python environment (see requirements.txt for version details):
  - `curl https://pyenv.run | bash`
  - `pyenv install 3.7.1 && pyenv local 3.7.1`
  - `pip install numpy ipython lark-parser==0.6.5 lmdb==0.94 pandas==0.24.2 pexpect==4.6.0 sexpdata==0.0.3 progressbar2`
  - `pip install torch pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv torch_geometric -f https://data.pyg.org/whl/torch-1.13.0+cu117.html`
Note: Coq, SerAPI, CoqHammer, and the Coq projects in the coq_projects directory are independent software projects with their own code repositories, but please follow the instructions above to build the specific versions we need.
- Download the CoqGym dataset here
- Unzip the data and set the paths: `python unzip_data.py`
Caveat: The second step sets the absolute paths in the data. You have to re-do it whenever the absolute path of the data/ directory changes (e.g. after moving the entire repo to another directory).
Run `python eval_env.py` and check that it terminates normally without raising an error.
Now you are ready to interact with CoqGym!
Our encoder-decoder models are trained on individual proof steps rather than entire proofs. This allows us to use teacher forcing directly.
To extract proof steps from the CoqGym dataset, run `python extract_proof_steps.py` from the ASTactic directory. Note that this can take a while (8-12 hours). To speed this up, we provide an alternate multiprocessing script that parallelizes extraction across proof libraries (Coq projects): `python multiprocess_extract.py`.
The extracted proof steps are in proof_steps/. You can double-check the number of proof steps to make sure everything works as expected:
| Directory | # files |
|---|---|
| proof_steps/train | 121,644 |
| proof_steps/valid | 68,180 |
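The counts in the table above can be checked with a few lines of Python. This is a small sketch that assumes the extracted files sit directly inside `proof_steps/train` and `proof_steps/valid`; the helper names are hypothetical.

```python
from pathlib import Path
from typing import Dict

# Expected file counts, taken from the table above.
EXPECTED = {"train": 121_644, "valid": 68_180}


def count_proof_steps(root: Path) -> Dict[str, int]:
    """Count the regular files directly inside each split directory."""
    counts = {}
    for split in EXPECTED:
        split_dir = root / split
        if split_dir.is_dir():
            counts[split] = sum(1 for p in split_dir.iterdir() if p.is_file())
        else:
            counts[split] = 0  # a missing directory counts as empty
    return counts


def matches_expected(root: Path) -> Dict[str, bool]:
    """Report, per split, whether the count matches the expected number."""
    counts = count_proof_steps(root)
    return {split: counts[split] == EXPECTED[split] for split in EXPECTED}


if __name__ == "__main__":
    print(count_proof_steps(Path("proof_steps")))
    print(matches_expected(Path("proof_steps")))
```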
We also provide pre-extracted download tarballs for train and valid proof steps here.
To train on the proof steps in training + validation set, run the following command from the ASTactic directory:
python main.py --no_validation --exp_id <model_id> --model_type <model_type> --heads <num_heads>
Model checkpoints will be saved to runs/astactic/checkpoints/. See options.py for command line options.
CoqGym's pre-trained ASTactic model can be downloaded here.
Our pre-trained GNN models can be downloaded here.
To test a trained model on unseen proof libraries, run the following command from the ASTactic directory:
python evaluate.py ours <model_id> --path runs/<model_id>/checkpoints/model_<epoch#>.pth --filter <proof_library_name>
- To test just a single proof (e.g. `get_set_same` from `../data/StructTact/Assoc.json`): `python evaluate.py ours ours-TEST --path runs/astactic/checkpoints/model_003.pth --file ../data/StructTact/Assoc.json --proof "get_set_same"`
- Testing an automated tactic X (may be "auto", "trivial", "easy", "intuition", or "hammer"): `python -u evaluate.py X X-TEST --file ../data/StructTact/Assoc.json --proof "get_set_same"`
- Testing ASTactic+X: `python -u evaluate.py ours+X ours+X-TEST --path runs/astactic/checkpoints/model_003.pth --file ../data/StructTact/Assoc.json --proof "get_set_same"`
Caveat: Testing is computationally expensive, but the workloads are very parallelizable. We provide the code for this in multiprocess_test.py.
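As a rough picture of how such a workload can be spread out, the sketch below builds one `evaluate.py` command per proof library, using the `--path` and `--filter` options shown earlier, and runs them concurrently. It is not the contents of `multiprocess_test.py`; the helper names, the worker count, and the second project name are assumptions for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List


def build_eval_command(model_id: str, checkpoint: str, project: str) -> List[str]:
    """Build the evaluation command for a single proof library."""
    return ["python", "evaluate.py", "ours", model_id,
            "--path", checkpoint, "--filter", project]


def run_all(commands: List[List[str]], max_workers: int = 4) -> List[int]:
    """Run commands concurrently and return their exit codes in order.

    Threads are enough here because each one only waits on a child process.
    """
    def run(cmd: List[str]) -> int:
        return subprocess.run(cmd, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


# "another_project" is a placeholder, not a real library name.
projects = ["StructTact", "another_project"]
commands = [
    build_eval_command("ours-TEST", "runs/astactic/checkpoints/model_003.pth", p)
    for p in projects
]
# Call run_all(commands) from the ASTactic directory to launch the evaluations.
```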
- Error in `source install.sh`
  - Double check requirements
- Failed to build `coqhammer`
  - `make clean` in `ASTactic/coqhammer` and remake `coqhammer`
- `make` in `coq_projects` fails
  - `make clean` and remake the entire folder
  - If that doesn't work, double check requirements and reset the directory to its fresh state
- `pip` failed to install `lmdb==0.94`
  - Try installing `lmdb==1.0`
- `EOF` error while loading `.pt` objects
  - Rebuild that project/file specifically
Data can be obtained from the original CoqGym repo here
Our pre-trained models and pre-extracted proof steps for training can be obtained here
