situated-vision

Probing CLIP's cultural situatedness through 993 dance-related objects from the Pitt Rivers Museum online collection.

"The knowing self is partial in all its guises." — Donna Haraway

Research Question

How does CLIP's visual understanding diverge from the archival logic of colonial museum collections — and what does this reveal about the cultural situatedness of CLIP?

Dataset

993 dance-related objects from the Pitt Rivers Museum Collections Online, University of Oxford. The initial selection contained 1,000 objects; 7 were excluded because no images were available.

Data Access

Images were retrieved via the IIIF image server at dams.prm.ox.ac.uk. IIIF is an open standard designed for machine access to cultural heritage collections. Please respect the Pitt Rivers Museum's terms of use and rate limits when using this code. Neither the images nor any outputs that contain images are included in this repository.
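To illustrate how an image URL can be read out of a IIIF manifest, here is a minimal sketch. It assumes the manifest follows the IIIF Presentation API 2.x layout (sequences, canvases, images, resource); the museum's server may use a different version, and the repository's own scripts may work differently. The function name `image_urls_from_manifest` and the example manifest, including its host and identifier, are hypothetical placeholders.

```python
from typing import Any


def image_urls_from_manifest(manifest: dict[str, Any]) -> list[str]:
    """Collect image resource URLs from a IIIF Presentation 2.x manifest.

    Assumes the 2.x layout: sequences -> canvases -> images -> resource["@id"].
    Missing keys are skipped rather than raising, so an unexpected manifest
    shape simply yields an empty list.
    """
    urls: list[str] = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                url = image.get("resource", {}).get("@id")
                if url:
                    urls.append(url)
    return urls


# Hypothetical manifest trimmed to the fields used above. The host and
# identifier are placeholders, not real Pitt Rivers Museum values.
example_manifest = {
    "@type": "sc:Manifest",
    "sequences": [
        {
            "canvases": [
                {
                    "images": [
                        {
                            "resource": {
                                "@id": "https://example.org/iiif/object-1/full/full/0/default.jpg"
                            }
                        }
                    ]
                }
            ]
        }
    ],
}

print(image_urls_from_manifest(example_manifest))
```

In a real run the manifest would be fetched over HTTP and parsed from JSON before being passed in; adding a delay between requests is one way to respect the rate limits mentioned above.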

Method

1. Data Collection. Downloaded records for 1,000 dance-related objects from Pitt Rivers Museum Collections Online as CSV. Retrieved images by reverse-engineering the IIIF image server (dams.prm.ox.ac.uk): each accession number maps directly to a IIIF manifest containing the image URL.

2. Image Embeddings. Passed the 993 images through the CLIP (ViT-B/32) image encoder. Each image becomes a 512-dimensional vector representing its position in CLIP's semantic space.

3. Text Embeddings. Encoded each object's museum description with CLIP's text encoder. The text vectors live in the same 512-dimensional space, which allows direct comparison with the image vectors.

4. Dimensionality Reduction. Applied UMAP to reduce the 512-dimensional vectors to 2D and 3D for visualization. UMAP was run separately on the image embeddings, the text embeddings, and both combined.

5. Gap Analysis. Computed the cosine similarity between each object's image embedding and its text embedding. Low similarity indicates a large gap between what CLIP sees and what the museum says.

6. Zero-Shot Retrieval. Used CLIP's text encoder to query the image embeddings with natural-language prompts (e.g. "face", "death and ritual"). The results show how CLIP's visual ontology maps onto, or diverges from, museum categories.
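The gap analysis in step 5 can be sketched as follows. This is a dependency-free illustration, not the code from 10_text_image_gap.py: the function names, the object IDs, and the 3-dimensional toy vectors are all assumptions made for the example, whereas the real embeddings are 512-dimensional and would more likely be handled as PyTorch or NumPy arrays.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_gap(
    image_vecs: dict[str, list[float]],
    text_vecs: dict[str, list[float]],
) -> list[tuple[str, float]]:
    """Pair each object's image and text vector and sort by similarity.

    The lowest similarity (largest image/text gap) comes first. Objects
    missing from either dictionary are skipped.
    """
    sims = {
        object_id: cosine_similarity(image_vecs[object_id], text_vecs[object_id])
        for object_id in image_vecs
        if object_id in text_vecs
    }
    return sorted(sims.items(), key=lambda item: item[1])


# Toy data with hypothetical object IDs: obj_a's image and text agree,
# obj_b's image and text are orthogonal.
image_vecs = {"obj_a": [1.0, 0.0, 0.0], "obj_b": [0.0, 1.0, 0.0]}
text_vecs = {"obj_a": [1.0, 0.0, 0.0], "obj_b": [1.0, 0.0, 0.0]}

ranked = rank_by_gap(image_vecs, text_vecs)
print(ranked)
```

Sorting in ascending order puts the objects with the widest gap at the top, which is the end of the list a gap table would typically start from.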

Key Findings

CLIP clusters objects by visual form, not cultural origin — masks from Africa, Oceania and Mexico group together regardless of provenance
Image and text embeddings occupy largely separate regions of the shared embedding space
The gap is largest where museum language uses cultural interpretation ("devil-dances") rather than visual description
Zero-shot queries reveal CLIP's cultural bias directly: "dance" returns Western ballet figures as top results despite the dataset consisting entirely of non-Western dance objects; "fear" finds grimacing masks; "love" finds red shoes and fans
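The zero-shot queries behind these findings come down to ranking image vectors by their similarity to a prompt vector. The sketch below shows that ranking step only, under stated assumptions: the query vector stands in for the output of CLIP's text encoder, the 2-dimensional vectors and the labels "mask", "fan", and "shoe" are toy values invented for the example, and `top_k` is a hypothetical helper rather than a function from the repository.

```python
import heapq
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(
    query_vec: list[float],
    image_vecs: dict[str, list[float]],
    k: int = 3,
) -> list[tuple[str, float]]:
    """Return the k image IDs most similar to the query, best match first."""
    query = normalize(query_vec)
    scored = (
        (sum(q * v for q, v in zip(query, normalize(vec))), image_id)
        for image_id, vec in image_vecs.items()
    )
    best = heapq.nlargest(k, scored)
    return [(image_id, score) for score, image_id in best]


# Toy 2-D stand-ins for CLIP embeddings; the labels are hypothetical.
images = {"mask": [0.9, 0.1], "fan": [0.1, 0.9], "shoe": [0.5, 0.5]}
query = [1.0, 0.0]  # stand-in for an encoded text prompt

for image_id, score in top_k(query, images, k=2):
    print(f"{image_id}: {score:.3f}")
```

Because every vector is normalized first, the ranking depends only on direction, not on vector length, which matches the cosine similarity used in the gap analysis.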

Outputs

UMAP scatter plots (regional, 3D), image grids, zero-shot retrieval grids, interactive HTML visualizations (run 09_umap_3d.py and 11_text_image_umap.py locally). Images not included — run the pipeline locally.

Stack

Python · CLIP · UMAP · PyTorch · rembg · plotly · pandas

Note

Parts of the code were generated with AI assistance (Claude, Sonnet 4.6, Anthropic).

Folders and files

01_fetch_images.py
02_download_images.py
03_clip_embeddings.py
04_umap.py
05_image_grid.py
06_image_grid_nooverlap.py
07_remove_bg.py
08_grid_form.py
09_umap_3d.py
10_text_image_gap.py
11_text_image_umap.py
12_gap_table.py
13_gap_table_closest.py
14_zero_shot_viz.py
LICENSE
README.md
text_image_umap.png
umap_plot.png
umap_regions.png
