Takeru Miyato · Bernhard Jaeger · Max Welling · Andreas Geiger
Official code for reproducing the experiments in "GTA: A Geometry-Aware Attention Mechanism for Multi-view Transformers".
⭐ (12/27/2025) ⭐ GTA-style camera encoding has recently been adopted in a number of works, particularly for improved camera control:
- PRoPE: extends GTA with intrinsics-aware camera encoding.
- UCPE: further extends GTA and PRoPE to support non-pinhole camera models.
- Kaleido: a large generative model for scene-level neural rendering.
- WorldPlay: a generative world model with real-time interaction.
- ReDirector: GTA-like camera encoding for video diffusion models.
This repository hosts several codebases, each on its own branch:
- NVS experiments on CLEVR-TR and MSN-Hard (this branch)
- NVS experiments on ACID and RealEstate (link)
- ImageNet generation with Diffusion transformers (DiT) (link)
You can find the core implementation of GTA for multi-view ViTs here and for image ViTs here.
Please feel free to reach out to us if you have any questions!
conda create -n gta python=3.9
conda activate gta
pip3 install -r requirements.txt
export DATADIR=<path_to_datadir>
mkdir -p $DATADIR
Download the CLEVR-TR dataset from this link and place it under $DATADIR (the training command expects it at ${DATADIR}/clevrtr)
gsutil -m cp -r gs://kubric-public/tfds/kubric_frames/multi_shapenet_conditional/2.8.0/ ${DATADIR}/multi_shapenet_frames/
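Before launching training, it can help to confirm that both datasets are where the commands in this README expect them. The following is a minimal sketch, assuming the layout implied by those commands (`${DATADIR}/clevrtr` for CLEVR-TR and `${DATADIR}/multi_shapenet_frames` for MSN-Hard). The function name `check_gta_data` is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of this repository): report whether the
# dataset directories referenced by the commands in this README exist.
# Usage: check_gta_data [datadir]   (defaults to $DATADIR)
check_gta_data() {
    local datadir="${1:-${DATADIR:-}}"
    local sub
    local missing=0
    if [ -z "${datadir}" ]; then
        echo "DATADIR is not set" >&2
        return 2
    fi
    # Directory names are taken from the download and training commands.
    for sub in clevrtr multi_shapenet_frames; do
        if [ -d "${datadir}/${sub}" ]; then
            echo "found:   ${datadir}/${sub}"
        else
            echo "missing: ${datadir}/${sub}" >&2
            missing=1
        fi
    done
    # 0 if every directory exists, 1 otherwise.
    return "${missing}"
}
```

Calling `check_gta_data` with no argument checks `$DATADIR`. It only tests that the directories exist, not that the downloads are complete.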
torchrun --standalone --nnodes 1 --nproc_per_node 4 train.py runs/clevrtr/GTA/gta/config.yaml ${DATADIR}/clevrtr --seed=0
torchrun --standalone --nnodes 1 --nproc_per_node 4 train.py runs/msn/GTA/gta_so3/config.yaml ${DATADIR} --seed=0
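The training commands above hard-code four processes (`--nproc_per_node 4`) and a single seed. As a sketch for machines with a different GPU count, or for multi-seed runs, the wrapper below builds the same `torchrun` invocation from variables. `NPROC`, `SEEDS`, `DRY_RUN`, and the function name `train_gta` are illustrative names, not options of this repository. How the configs treat batch size when the process count changes is not stated here, so runs with `NPROC` other than 4 may not match the reported results.

```shell
# Hypothetical wrapper (not part of this repository) around the training
# command shown above.
#   NPROC   : processes per node, i.e. GPUs to use (default 4, as above)
#   SEEDS   : space-separated list of seeds (default "0")
#   DRY_RUN : set to 1 to print the commands instead of running them
# Usage: train_gta <config.yaml> <data_path>
train_gta() {
    local config="$1"
    local data="$2"
    local seed
    local run=""
    [ "${DRY_RUN:-0}" = "1" ] && run="echo"
    # SEEDS is intentionally unquoted so that it splits on whitespace.
    for seed in ${SEEDS:-0}; do
        ${run} torchrun --standalone --nnodes 1 --nproc_per_node "${NPROC:-4}" \
            train.py "${config}" "${data}" "--seed=${seed}" || return 1
    done
}
```

For example, `DRY_RUN=1 NPROC=2 SEEDS="0 1 2" train_gta runs/clevrtr/GTA/gta/config.yaml "${DATADIR}/clevrtr"` prints the three commands that would be launched, one per seed, without starting training.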
python evaluate.py runs/clevrtr/GTA/gta/config.yaml ${DATADIR}/clevrtr $PATH_TO_CHECKPOINT # CLEVR-TR
python evaluate.py runs/msn/GTA/gta_so3/config.yaml ${DATADIR} $PATH_TO_CHECKPOINT # MSN-Hard
This repository is built on top of SRT and OSRT created by @stelzner. We would like to thank him for his open-source contribution of the SRT models. We also thank @lucidrains for providing the values of J matrices, which are needed to compute the irreps of SO(3) efficiently.
@inproceedings{Miyato2024GTA,
title={GTA: A Geometry-Aware Attention Mechanism for Multi-View Transformers},
author={Miyato, Takeru and Jaeger, Bernhard and Welling, Max and Geiger, Andreas},
booktitle={International Conference on Learning Representations (ICLR)},
year={2024}
}


