georgeblck/art-datasets

Art Datasets for Machine Learning

A curated list of publicly available art datasets for machine learning research, covering classification, object detection, visual question answering, aesthetics, generative models, sketches, and more. Includes both scientific benchmark datasets and museum open-access collections.

Contributions welcome via pull request.

Availability: ✅ freely downloadable (direct download, Zenodo, Kaggle, HuggingFace, GitHub, etc.); 📩 requires application or approval; 📄 paper only, no public data link found; 🔒 browse only, no bulk download.


Research Datasets

Sorted by year (newest first) within each category.

Classification, Style & Attribution

| Dataset | Size | Year | Avail. | Links | Notes |
| --- | --- | --- | --- | --- | --- |
| Stylebreeder | 6.8M images, 1.8M prompts | 2024 | | paper, huggingface | AI-generated images from Artbreeder with style clusters. NeurIPS 2024, CC0 |
| fruit-SALAD | 10,000 images | 2024 | | paper, code | Synthetic benchmark for style vs content similarity in embeddings |
| StyleBabel | 135,000 artworks | 2022 | 📄 | paper (ECCV) | Free-form style tags and captions from art/design students on Behance |
| ArtBench-10 | 60,000 images | 2022 | | paper, code, kaggle | Class-balanced benchmark, 10 styles, clean annotations |
| WikiArtVectors | 68,094 artworks | 2022 | 📄 | paper | Precomputed CLIP embeddings for WikiArt, 132 styles |
| DELAUNAY | 11,503 images | 2022 | | paper, code | Abstract and non-figurative art, 53 artists |
| contempArt | 14,398 images, 441 artists | 2020 | | paper, zenodo, github | Contemporary art from German art schools with social network data (456K edges) and demographics |
| DomainNet | 600,000 images | 2019 | | website | 345 classes across 6 domains (painting, clipart, sketch, photo, etc.) |
| NoisyArt | 90,000+ images | 2019 | | paper, code | Webly-supervised artwork recognition with noisy web labels, 3,000+ classes |
| Multitask Painting Collection | ~100,000 images | 2019 | | paper, data | Multitask learning: artist, style, genre, period |
| Best Artworks of All Time | ~8,000 images | 2019 | | kaggle | 50 most influential artists, popular starter dataset |
| BAM! | 2.5M+ images | 2017 | 📄 | paper | Behance Artistic Media: content, emotion, and media labels. ICCV 2017 |
| Art 500K | ~500,000 images | 2017 | 📄 | paper, data | Large-scale art retrieval. Download link dead (404 as of 2026-04). Compiled from WikiArt + WGA + Rijks + Google Arts |
| Pandora 18k | 18,038 images | 2016 | 📄 | paper | 18 art styles, expert-labeled, higher annotation quality than WikiArt |
| Painter by Numbers | 103,250 images | 2016 | | kaggle | Kaggle competition: predict whether two paintings are by the same artist |
| Rijksmuseum Challenge | 112,039 images (3,593 paintings) | 2014 | | paper, data, code | Artist, material, type prediction |
| Painting-91 | 4,266 images | 2014 | | paper, data | 91 painters, style classification |
| PRINTART | 988 images | 2012 | | paper, data | Print art classification |
| WikiArt | ~250,000 images | varies | | website, crawler, source 1, source 2, kaggle | 3,000+ artists, most widely used art dataset. API signups currently disabled |
| WikiArt 215K (HuggingFace) | 215,000 images | varies | | huggingface | Preprocessed WikiArt with image URLs and captions (title, artist, year, genre, style). 150K+ with parseable dates |
| Web Gallery of Art | ~19,000 images | ongoing | | website | European fine art, encyclopedic scope |

Object Detection, Pose & Iconography

| Dataset | Size | Year | Avail. | Links | Notes |
| --- | --- | --- | --- | --- | --- |
| Poses of People in Art | 2,454 images, 10,749 figures | 2024 | | paper, data | First openly licensed pose estimation dataset for art, 22 depiction styles |
| Human-Art | 50,000 images, 123,000 person instances | 2023 | 📩 | paper, website | Natural and artificial scenes (paintings, cartoons, sculptures). CVPR 2023 |
| DEArt | 15,000+ images | 2022 | | paper, data | 69 object classes, 12 pose categories, European paintings 12th-18th c. |
| ArtDL 2.0 | 42,479 images | 2021 | | paper, data, code | Iconographic classification, 19 Iconclass classes, Renaissance art |
| Materials In Paintings | 19,325 paintings, 227,810 bboxes | 2021 | | paper, data | Material perception (fabric, metal, wood, etc.) with fine-grained labels |
| IconArt | 5,955 images | 2018 | | paper, huggingface, data | Weakly supervised iconographic element detection (angels, saints, etc.) |
| People-Art | images from 41 art movements | 2016 | | paper, code | Cross-depiction person detection across photos, cartoons, art |
| Oxford VGG Paintings | 210,000+ paintings (10K annotated) | 2014 | | paper, data | Object retrieval in paintings, crowdsourced object tags |

Aesthetics & Emotion

| Dataset | Size | Year | Avail. | Links | Notes |
| --- | --- | --- | --- | --- | --- |
| ArtELingo-28 | ~200,000 annotations, 28 languages | 2024 | 📄 | paper | Multilingual art emotion annotations. EMNLP 2024 |
| APDDv2 | 10,023 images, 85,191 scores | 2024 | | paper, code | Expert aesthetic scores and language comments, 10 attributes. NeurIPS 2024 |
| BAID | 60,337 images, 360,000+ votes | 2023 | | paper, code | BoldBrush artistic image aesthetics with user votes. CVPR 2023 |
| ArtELingo | ~1.24M annotations (EN/AR/ZH/ES) | 2022 | | paper, website | Multilingual emotion annotations on WikiArt. EMNLP 2022 |
| ArtEmis v2 | 260,000 contrastive instances | 2022 | | website | Contrastive extension balancing positive/negative emotion pairs |
| TAD66K | 66,000 images, 47 themes | 2022 | | paper, huggingface | Theme-oriented aesthetics, 1,200+ annotations per image. IJCAI 2022 |
| ArtEmis | 455,000 annotations on 80,000 artworks | 2021 | | paper, website | Emotion attributions + verbal explanations for WikiArt. CVPR 2021 |
| WikiArt Emotions | 4,105 images, 20 emotion categories | 2018 | | paper, data | Crowdsourced emotions across four Western art periods |
| AVA | 250,000+ images | 2012 | | paper, kaggle | Photography aesthetics from DPChallenge, scores + style labels |

Faces & Portraits in Art

| Dataset | Size | Year | Avail. | Links | Notes |
| --- | --- | --- | --- | --- | --- |
| WikiArt Face | 6,095 face images | 2021 | | data, code | Faces cropped from portraits across art movements |
| AAHQ | ~25,000 images | 2021 | | code | Artistic portrait faces from Artstation, various painting styles |
| MetFaces | 1,336 face images (1024x1024) | 2020 | | code | Faces from Met artworks, aligned and cropped for GAN training |
| Artistic Faces Dataset | from 103,250 artworks | 2019 | | website | 68 facial landmarks plus artist/style metadata |

Multimodal & Visual QA

| Dataset | Size | Year | Avail. | Links | Notes |
| --- | --- | --- | --- | --- | --- |
| CognArtive | LLM art analyses | 2025 | | paper, website | LLM-generated formal art analyses and aesthetic descriptions |
| MELArt | annotations over Wikimedia art | 2024 | | paper, code | Multimodal entity linking in paintings |
| AQUA | QA pairs over SemArt | 2020 | | paper, code | Visual and knowledge-based question answering on art |
| Artpedia | 2,930 paintings | 2019 | 📄 | paper | 28,212 text sentences (visual + contextual), cross-modal retrieval |
| OmniArt | 2,050,017 images | 2018 | 📄 | paper, data | Multi-task, multi-label, metadata-rich. Download links dead (as of 2026-04). Compiled from Rijks + Met + WGA |
| SemArt | 21,383 images | 2018 | | paper, data | Semantic art descriptions paired with images |

Sketches & Drawings

| Dataset | Size | Year | Avail. | Links | Notes |
| --- | --- | --- | --- | --- | --- |
| Creative Birds / Creatures | 10,000 sketches each | 2021 | | paper, code | Part-annotated creative sketch datasets. ICLR 2021 |
| ImageNet-Sketch | 50,000 images, 1,000 classes | 2019 | | code, kaggle | Sketch versions of ImageNet classes for domain shift evaluation |
| OpenSketch | 400+ sketches, 12 objects | 2019 | | paper, website | Product design sketches from professional designers |
| SketchyScene | 29,056 scene sketches | 2018 | | paper (ECCV), website | Scene-level sketches with instance annotations |
| Quick, Draw! | 50M drawings, 345 categories | 2017 | | website, code, huggingface | Google crowd-sourced sketch game, CC BY 4.0 |
| Sketchy Database | 75,471 sketches of 12,500 objects | 2016 | | paper, website | First large-scale paired sketch-photo dataset. SIGGRAPH 2016 |
| TU-Berlin Sketches | 20,000 sketches, 250 categories | 2012 | | huggingface | Human sketches for sketch recognition research |

Forgery & Authentication

| Dataset | Size | Year | Avail. | Links | Notes |
| --- | --- | --- | --- | --- | --- |
| Van Gogh Authentication | 338+ images | 2024 | | paper, data | Originals, human forgeries, and AI-generated fakes |
| DeepfakeArt Challenge | 32,000+ image pairs | 2023 | | paper, kaggle, code | AI art forgery and data poisoning detection benchmark |

Generative AI & Diffusion

| Dataset | Size | Year | Avail. | Links | Notes |
| --- | --- | --- | --- | --- | --- |
| Art-fm | 650K art images (training set) | 2025 | 📄 | paper, code | Flow matching for art generation trained on 650K curated images (WikiArt + 7 museum sources, SSCD-deduped from 950K). LMU Munich, ICCV 2025 |
| SCFlow | N/A (model only) | 2025 | | paper, code | Style/content disentanglement via conditional flow matching in CLIP space. Same lab as Art-fm. ICCV 2025 |
| Danbooru2023 | 5M+ images, 162M+ tags | 2023 | | data (2021), huggingface (2023) | Crowdsourced anime/illustration, ~30 tags per image |
| CommonCanvas | ~70M CC images | 2023 | | paper, models | Copyright-safe training data for diffusion models. CVPR 2024 |
| TWIGMA | 800,000+ images | 2023 | | paper, data | AI-generated images from Twitter with metadata. NeurIPS 2023 |
| Pick-a-Pic | 1M+ preference pairs | 2023 | | paper, huggingface | Human preferences for text-to-image, used for RLHF |
| JourneyDB | 4,429,295 images | 2023 | | paper, data, huggingface | Midjourney images with prompts, captions, VQA. NeurIPS 2023 |
| AI-ArtBench | 185,015 images | 2023 | | data (IEEE), kaggle | 60K human + 125K AI-generated, real vs AI art detection |
| LAION-Aesthetics | ~120M images (score >7) | 2022 | | info, data | Aesthetic-filtered subset of LAION-5B, used to train Stable Diffusion v1 |
| DiffusionDB | 14M images, 1.8M prompts | 2022 | | paper, code, huggingface | Stable Diffusion prompt-image pairs from Discord |
| COYO-700M | 747M image-text pairs | 2022 | | data | Large-scale image-text pairs, CC-BY-4.0 |

Cartoon, Manga & Illustration

| Dataset | Size | Year | Avail. | Links | Notes |
| --- | --- | --- | --- | --- | --- |
| iCartoonFace | 389,678 images, 5,013 characters | 2020 | 📄 | paper | Large-scale cartoon face detection and recognition |
| Manga109 | 109 volumes, 21,142 pages | 2020 | 📩 | paper | Japanese manga with annotated frames, faces, text, and characters |

Cultural Heritage & Archaeology

| Dataset | Size | Year | Avail. | Links | Notes |
| --- | --- | --- | --- | --- | --- |
| CULTURE3D | 41,006 drone images, 20 sites | 2025 | | paper, code | Cultural landmarks 3D reconstruction (pyramids, Forbidden City, etc.) |
| MuralDH | 5,000+ images | 2024 | | paper, code | Dunhuang mural restoration: segmentation, inpainting, super-resolution |
| WikiScenes | landmark photo collections | 2021 | | code | Architectural landmarks with captions and 3D geometry. ICCV 2021 |
| ArchAIDE | 435 sketches, 381 photos | 2020 | 📄 | paper | Archaeological pottery classification via shape and decoration |

Street Art & Graffiti

| Dataset | Size | Year | Avail. | Links | Notes |
| --- | --- | --- | --- | --- | --- |
| 17K-Graffiti | ~17,000 images | 2022 | | code | Graffiti classification. VISAPP 2022 |

Knowledge Graphs & Metadata

| Dataset | Size | Year | Avail. | Links | Notes |
| --- | --- | --- | --- | --- | --- |
| PainterPalette | ~10,000 painters | 2023 | | code | WikiArt + Art500k + Wikidata painter metadata, network analysis |
| ArtGraph | 135,038 resources, 875,416 facts | 2022 | | paper, data | Knowledge graph linking WikiArt and DBpedia |

Museum & Gallery Collections

Open-access collections from cultural institutions. Sorted by collection size (largest first) where size is known, then alphabetically for collections without published counts.

Large Collections (100,000+ objects)

| Collection | Size | Avail. | License | Links | Notes |
| --- | --- | --- | --- | --- | --- |
| Smithsonian Open Access | 11M+ records, 2.8M+ images | | CC0 | github, website, AWS | 19 museums + research centers |
| Victoria and Albert Museum | 1M+ records, 500,000+ images | | personal/educational | API | Decorative arts, fashion, design. IIIF, OpenAPI spec |
| Te Papa (New Zealand) | 1M+ objects, 200,000+ downloadable | | CC varies | website | First Australasian large-scale open access museum |
| Rijksmuseum | 800,000+ objects | | CC0 (public domain works) | data portal | API and bulk downloads, high-resolution images |
| Louvre | 500,000+ works | | varies | website, JSON docs | Append .json to any artwork URL. CSV export of searches |
| The Metropolitan Museum of Art | 470,000+ artworks | | CC0 | github | Comprehensive metadata CSV, regularly updated |
| iMet Collection | 375,000 images | | varies | paper, kaggle | Fine-grained attribute recognition challenge |
| Cooper Hewitt | 215,000+ objects | | CC0 | github, API | Smithsonian design museum, JSON per object |
| Paris Musees | 150,000+ images | | CC0 | website, API | 14 Paris city museums. GraphQL API with free account (session cookie auth). ~8K paintings + ~63K drawings with dates and images |
| National Gallery of Art (DC) | 130,000+ artworks | | CC0 | github | CSV format with Wikidata identifiers |
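
The Louvre note above ("Append .json to any artwork URL") can be sketched as a small URL helper. This is only an illustration: the `collections.louvre.fr` host and the artwork identifier below are assumptions made for the example and do not come from this list, and the helper name `louvre_json_url` is hypothetical. No request is sent; the function only rewrites the URL.

```python
from urllib.parse import urlsplit, urlunsplit


def louvre_json_url(artwork_url: str) -> str:
    """Return the JSON variant of an artwork page URL by appending ".json".

    Query strings and fragments are dropped, which is a simplifying
    assumption of this sketch rather than documented behaviour.
    """
    parts = urlsplit(artwork_url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


# Placeholder artwork URL (assumed host and made-up identifier).
example = "https://collections.louvre.fr/en/ark:/53355/cl000000000"
print(louvre_json_url(example))
```

The resulting URL could then be fetched with any HTTP client and parsed as JSON; the field names in the response are not described in this list, so they would need to be checked against the Louvre's own JSON docs.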

Medium Collections (10,000-100,000 objects)

| Collection | Size | Avail. | License | Links | Notes |
| --- | --- | --- | --- | --- | --- |
| SMK (National Gallery of Denmark) | 88,000+ works | | CC0 | API, website | Leading OpenGLAM institution |
| National Palace Museum (Taiwan) | 70,000+ digitized images | | open gov | data | Chinese art, calligraphy, ceramics, bronzes. IIIF compliant |
| Yale Center for British Art | ~70,000 IIIF images | | public domain | website | Linked Open Data via RDF |
| Cleveland Museum of Art | 61,000+ artworks | | CC0 | github, API | CSV and JSON data plus public API |
| Museo del Prado | ~40,000 artworks | | non-commercial | knowledge graph | Linked Open Data, SPARQL-queryable |
| Finnish National Gallery | 36,000+ artworks | | CC0 | github | Ateneum, Kiasma, Sinebrychoff collections |
| Getty Museum | 30,000+ high-res images | | no known restrictions | website | Open Content Program, IIIF access |
| Whitney Museum | 17,000+ works | | CC0 | github, API | CSV updated nightly. 20th/21st century American art |
| Williams College Museum of Art | ~15,600 records | | CC0 | github | CSV format with thumbnails |
| Walters Art Museum | 10,000+ records | | CC0 | github | Static data files (API v1 retired 2023) |

API / Full Collection Access (size unspecified)

| Collection | Avail. | License | Links | Notes |
| --- | --- | --- | --- | --- |
| Art Institute of Chicago | | CC0 | github, API, website | REST API + bulk JSON dumps on AWS S3 |
| Harvard Art Museums | | varies | API docs, website | REST API refreshed daily, IIIF-compatible |
| Minneapolis Institute of Art | | CC0 | github | JSON metadata, updated approximately daily |
| Brooklyn Museum | 📩 | varies | API, examples | REST API, registration required |
| British Museum | | non-commercial | website, github | SPARQL/Linked Data. Export up to 10K records per search |
| National Gallery (London) | | CC BY-NC-ND 4.0 | API | ~2,300 paintings. Elasticsearch-based API |
| ColBase (National Museums of Japan) | | varies | website | Tokyo, Kyoto, Nara, Kyushu national museums |
| Science Museum Group (UK) | | CC varies | API, github | JSON/CSV exports, 37 GitHub repos |
| Auckland War Memorial Museum | | CC varies | API | API plus Linked Open Data |
| QAGOMA (Queensland, Australia) | | CC varies | data | CSV, refreshed monthly. Australian and Asia-Pacific art |

Metadata Only / Limited Access

| Collection | Avail. | License | Links | Notes |
| --- | --- | --- | --- | --- |
| MoMA Collection | | varies | github | Artist, artwork, exhibition data. Artworks.csv (via LFS) includes an ImageURL column with direct JPEG links for ~64K works |
| Carnegie Museum of Art | | CC0 | github | Pittsburgh collection metadata |
| The Tate Collection | | CC-BY-NC-ND 3.0 | github | Metadata CSV includes thumbnail URLs. Images downloadable via CDN (swap _8.jpg for _10.jpg for 1536px). ~38K paintings/drawings with dates |
| Nationalmuseum Sweden | | CC0 | github | Wikidata-linked metadata |
| ArtUK | 🔒 | restricted | website | UK public art, browsable only. An MDS extract API exists, but aggressive bot protection blocks it (403 on all programmatic access, including Selenium, as of 2026-04) |
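
Two of the notes above describe small mechanical steps: swapping `_8.jpg` for `_10.jpg` in Tate thumbnail URLs, and using the `ImageURL` column of MoMA's Artworks.csv. The sketch below illustrates both on made-up data. The function names, the `example.org` URLs, and the `Title` column are assumptions for illustration only; the only details taken from the table are the suffix swap and the `ImageURL` column name.

```python
import csv
import io


def tate_large_image_url(thumbnail_url: str) -> str:
    """Swap the "_8.jpg" thumbnail suffix for "_10.jpg" (the larger size)."""
    suffix = "_8.jpg"
    if not thumbnail_url.endswith(suffix):
        raise ValueError(f"not a '_8.jpg' thumbnail URL: {thumbnail_url}")
    return thumbnail_url[: -len(suffix)] + "_10.jpg"


def rows_with_image_url(csv_text: str) -> list[dict[str, str]]:
    """Keep only the CSV rows whose ImageURL column is non-empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if (row.get("ImageURL") or "").strip()]


# Made-up sample data; real files have more columns and different URLs.
SAMPLE_CSV = (
    "Title,ImageURL\n"
    "Work A,https://example.org/a.jpg\n"
    "Work B,\n"
)

print(tate_large_image_url("https://example.org/images/N00001_8.jpg"))
print([row["Title"] for row in rows_with_image_url(SAMPLE_CSV)])
```

Both collections carry non-CC0 licenses in the table above, so any bulk download built on helpers like these should respect the stated terms.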

Aggregated & Cross-Institution Datasets

| Dataset | Size | Avail. | License | Links | Notes |
| --- | --- | --- | --- | --- | --- |
| Europeana | 50M+ cultural heritage objects | | varies | data | Aggregator across thousands of European institutions |
| Wikidata: Sum of All Paintings | hundreds of thousands of paintings | | CC0 | project | Structured data linking paintings across museums. SPARQL queryable |
| art-museums-pd-440k | ~440,000 images | | CC-BY-4.0 | huggingface | Public domain museum art with bilingual captions, WebDataset format |
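
Since the Wikidata painting data is described above as SPARQL queryable, a minimal query might look like the sketch below. It is not taken from the Sum of All Paintings project pages: the query shape, the helper names, and the choice of fields are assumptions for illustration. It relies on the public Wikidata Query Service endpoint and on the Wikidata identifiers `Q3305213` (painting), `P31` (instance of), and `P18` (image). The code only builds the query and request URL and sends nothing.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def paintings_query(limit: int = 10) -> str:
    """Build a SPARQL query for paintings that have an image."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return f"""
SELECT ?painting ?paintingLabel ?image WHERE {{
  ?painting wdt:P31 wd:Q3305213 ;   # instance of: painting
            wdt:P18 ?image .        # has an image
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()


def query_url(sparql: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


print(paintings_query(5))
print(query_url(paintings_query(5)))
```

When actually sending the request, it is good practice to set a descriptive User-Agent header and keep `LIMIT` small while experimenting, since broad queries over all paintings can time out.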


About

A curated list of art datasets for machine learning: 90+ research datasets and 35+ museum open-access collections
