Ceres

Semantic search engine for open data portals

Quick Start • Features • Usage • Roadmap

Ceres harvests metadata from CKAN open data portals and indexes it with vector embeddings, enabling semantic search across fragmented data sources.

Named after the Roman goddess of harvest and agriculture.

Why Ceres?

Open data portals are everywhere, but finding the right dataset is still painful:

Keyword search fails: "public transport" won't find "mobility data" or "bus schedules"
Portals are fragmented: Italy alone has 20+ regional portals with different interfaces
No cross-portal search: You can't query Milano and Roma datasets together

Ceres solves this by creating a unified semantic index. Search by meaning, not just keywords.

$ ceres harvest

INFO [Portal 1/5] milano (https://dati.comune.milano.it)
INFO Found 2575 dataset(s) on portal
INFO Progress: 2575/2575 (100%) - 0 new, 0 updated, 2575 unchanged, 0 failed
INFO [Portal 1/5] milano completed: 2575 dataset(s)

INFO [Portal 2/5] sicilia (https://dati.regione.sicilia.it)
INFO Found 186 dataset(s) on portal
INFO [Portal 2/5] sicilia completed: 186 dataset(s)

INFO [Portal 3/5] trentino (https://dati.trentino.it)
INFO Found 1388 dataset(s) on portal
INFO [Portal 3/5] trentino completed: 1388 dataset(s) (1388 created)

INFO [Portal 4/5] aragon (https://opendata.aragon.es/ckan)
INFO Found 2879 dataset(s) on portal
INFO [Portal 4/5] aragon completed: 2879 dataset(s) (2879 created)

INFO [Portal 5/5] nrw (https://ckan.open.nrw.de)
INFO Found 10926 dataset(s) on portal
INFO [Portal 5/5] nrw completed: 10926 dataset(s) (10926 created)

═══════════════════════════════════════════════════════
BATCH HARVEST COMPLETE
═══════════════════════════════════════════════════════
  Portals processed:   5
  Successful:          5
  Failed:              0
  Total datasets:      17954
═══════════════════════════════════════════════════════

$ ceres search "trasporto pubblico" --limit 5

🔍 Search Results for: "trasporto pubblico"

Found 5 matching datasets:

1. [████████░░] [78%] TPL - Percorsi linee di superficie
   📍 https://dati.comune.milano.it
   🔗 https://dati.comune.milano.it/dataset/ds534-tpl-percorsi-linee-di-superficie
   📝 Il dataset contiene i tracciati delle linee di trasporto pubblico di superficie...

2. [████████░░] [76%] TPL - Fermate linee di superficie
   📍 https://dati.comune.milano.it
   🔗 https://dati.comune.milano.it/dataset/ds535-tpl-fermate-linee-di-superficie
   📝 Il dataset contiene le fermate delle linee di trasporto pubblico di superficie...

3. [███████░░░] [72%] Mobilità: flussi veicolari rilevati dai spire
   📍 https://dati.comune.milano.it
   🔗 https://dati.comune.milano.it/dataset/ds418-mobilita-flussi-veicolari
   📝 Dati sul traffico veicolare rilevati dalle spire elettromagnetiche...

$ ceres stats

📊 Database Statistics

  Total datasets:        17954
  With embeddings:       17954
  Unique portals:        5
  Last update:           2025-12-31 15:09:48 UTC

Features

CKAN Harvester — Fetch datasets from any CKAN-compatible portal
Multi-portal Batch Harvest — Configure multiple portals in portals.toml and harvest them all at once
Delta Harvesting — Only regenerate embeddings for changed datasets (99.8% API cost savings)
Real-time Progress — Live progress reporting during harvest with batch timestamp updates
Semantic Search — Find datasets by meaning using Gemini embeddings
Multi-format Export — Export to JSON, JSON Lines, or CSV
Database Statistics — Monitor indexed datasets and portals
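As a rough illustration of delta harvesting, the sketch below classifies a harvested dataset as new, updated, or unchanged by comparing a content hash against the one stored from the previous run. It is not taken from the Ceres source: the names `Change`, `content_hash`, and `classify`, the choice of hashing title plus description, and the in-memory map are all assumptions made for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Outcome of comparing a harvested dataset with what is already indexed.
#[derive(Debug, PartialEq, Eq)]
pub enum Change {
    New,
    Updated,
    Unchanged,
}

/// Hash the text that would feed the embedding.
/// Assumption: title + description; the real field selection may differ.
/// `DefaultHasher` is not guaranteed stable across Rust releases, so a
/// persisted index would want a fixed algorithm such as SHA-256 instead.
pub fn content_hash(title: &str, description: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    title.hash(&mut hasher);
    description.hash(&mut hasher);
    hasher.finish()
}

/// Decide whether a dataset needs a fresh embedding.
/// `stored` maps dataset id to the hash recorded at the previous harvest.
pub fn classify(stored: &HashMap<String, u64>, id: &str, hash: u64) -> Change {
    match stored.get(id) {
        None => Change::New,
        Some(previous) if *previous == hash => Change::Unchanged,
        Some(_) => Change::Updated,
    }
}
```

Under this scheme only `New` and `Updated` datasets would be sent to the embedding API, which is where the cost savings of a mostly unchanged portal would come from.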

Pre-configured Portals

Ceres comes with verified CKAN portals ready to use:

| Portal   | Region  | Datasets |
|----------|---------|----------|
| Milano   | Italy   | ~2,575   |
| Sicilia  | Italy   | ~186     |
| Trentino | Italy   | ~1,388   |
| Aragón   | Spain   | ~2,879   |
| NRW      | Germany | ~10,926  |

See examples/portals.toml for the full list. Want to add more? Check issue #19.

Cost-Effectiveness

Embedding costs with the Gemini embedding model are low enough that Ceres stays affordable for personal projects and small budgets.

The main cost is generating the initial vector embeddings. The breakdown below covers a large catalog.

Cost Analysis for Initial Indexing

This scenario estimates the one-time cost to index a catalog of 50,000 datasets.

| Metric | Detail |
|--------|--------|
| Cost per 1M input tokens | ~$0.15 USD (standard rate for Google's `text-embedding-004` model) |
| Estimated tokens per dataset | 500 tokens (a generous estimate for title, description, and tags) |
| Total tokens | `50,000 datasets * 500 tokens/dataset = 25,000,000 tokens` |
| Total initial cost | `(25,000,000 / 1,000,000) * $0.15 = $3.75` |

As shown, the initial cost to index a substantial number of datasets is just a few dollars. Monthly maintenance for incremental updates would be even lower, typically amounting to a few cents.
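The arithmetic in the table can be reproduced with a small helper, which makes it easy to plug in a different catalog size or rate. The function name `embedding_cost_usd` is made up for this sketch, and the rate is whatever you pass in, not a value queried from Google.

```rust
/// Estimated embedding cost in USD for `datasets` items.
/// All three inputs are estimates supplied by the caller.
pub fn embedding_cost_usd(
    datasets: u64,
    tokens_per_dataset: u64,
    usd_per_million_tokens: f64,
) -> f64 {
    let total_tokens = datasets * tokens_per_dataset;
    (total_tokens as f64 / 1_000_000.0) * usd_per_million_tokens
}
```

Calling `embedding_cost_usd(50_000, 500, 0.15)` gives about 3.75, matching the table. With the same assumptions, re-embedding a hypothetical 100 changed datasets comes to 50,000 tokens, or well under one cent.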

Tech Stack

| Component       | Technology                        |
|-----------------|-----------------------------------|
| Language        | Rust (async with Tokio)           |
| Database        | PostgreSQL 16+ with pgvector      |
| Embeddings      | Google Gemini text-embedding-004  |
| Portal Protocol | CKAN API v3                       |

Quick Start

Prerequisites

Rust 1.87+
Docker & Docker Compose
Google Gemini API key (get one free)

Installation

# Install from crates.io
cargo install ceres-search

# Or build from source
git clone https://github.com/AndreaBozzo/Ceres.git
cd Ceres
cargo build --release

Setup

# Start PostgreSQL with pgvector
docker-compose up -d

# Run database migrations
make migrate

# Or manually with psql if you prefer
# psql postgresql://ceres_user:password@localhost:5432/ceres_db \
#   -f migrations/202511290001_init.sql

# Configure environment
cp .env.example .env
# Edit .env with your Gemini API key

💡 Tip: This project includes a Makefile with convenient shortcuts. Run make help to see all available commands.

Usage

Harvest datasets from a CKAN portal

ceres harvest https://dati.comune.milano.it
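For orientation, a harvest of this kind ultimately talks to the portal's CKAN Action API v3. The sketch below only builds paginated `package_search` URLs, an endpoint that CKAN v3 exposes with `start` and `rows` parameters. Whether Ceres uses this endpoint or another one (such as `package_list`) is an assumption, and the helper names `package_search_url` and `page_offsets` are hypothetical.

```rust
/// Build a CKAN Action API v3 `package_search` URL for one page of results.
/// A trailing slash on the portal URL is tolerated.
pub fn package_search_url(portal: &str, start: usize, rows: usize) -> String {
    let base = portal.trim_end_matches('/');
    format!("{base}/api/3/action/package_search?start={start}&rows={rows}")
}

/// Offsets needed to page through `total` datasets, `rows` at a time.
pub fn page_offsets(total: usize, rows: usize) -> Vec<usize> {
    assert!(rows > 0, "rows must be positive");
    (0..total).step_by(rows).collect()
}
```

For a portal reporting 2,575 datasets and a page size of 1,000, `page_offsets` yields three offsets (0, 1000, 2000), each of which maps to one request URL.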

Search indexed datasets

ceres search "trasporto pubblico" --limit 10
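Conceptually, semantic search embeds the query and ranks datasets by how close their vectors are to it. The sketch below shows that idea in memory with cosine similarity. It is an illustration only: Ceres stores vectors in PostgreSQL with pgvector, and the exact distance metric it uses is not stated here. The function names are hypothetical, and the rounding rule in `score_bar` is inferred from the sample output (78% shown as eight filled cells, 72% as seven).

```rust
/// Cosine similarity between two embedding vectors.
/// Returns 0.0 for mismatched lengths or zero-length vectors.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Rank `(title, embedding)` candidates by similarity to `query`,
/// highest first, keeping at most `limit` results.
pub fn top_matches<'a>(
    query: &[f32],
    candidates: &'a [(&'a str, Vec<f32>)],
    limit: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = candidates
        .iter()
        .map(|(title, embedding)| (*title, cosine_similarity(query, embedding)))
        .collect();
    scored.sort_by(|left, right| right.1.total_cmp(&left.1));
    scored.truncate(limit);
    scored
}

/// Render a ten-cell bar for a score in the range 0.0..=1.0.
pub fn score_bar(score: f32) -> String {
    let filled = (score.clamp(0.0, 1.0) * 10.0).round() as usize;
    format!("[{}{}]", "█".repeat(filled), "░".repeat(10 - filled))
}
```

A linear scan like this is fine for a handful of vectors. At the scale of tens of thousands of datasets, pushing the comparison into the database is the more practical design.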

Export datasets

# JSON Lines (default)
ceres export > datasets.jsonl

# JSON array
ceres export --format json > datasets.json

# CSV
ceres export --format csv > datasets.csv

# Filter by portal
ceres export --portal https://dati.comune.milano.it

View statistics

ceres stats

CLI Reference

ceres <COMMAND>

Commands:
  harvest  Harvest datasets from a CKAN portal or batch harvest from portals.toml
  search   Search indexed datasets using semantic similarity
  export   Export indexed datasets to various formats
  stats    Show database statistics
  help     Print help information

Environment Variables:
  DATABASE_URL     PostgreSQL connection string
  GEMINI_API_KEY   Google Gemini API key for embeddings
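Both variables are required before any command can do useful work, so a tool like this would typically validate them at startup. The sketch below shows one way to do that with the standard library. The `Settings` struct and both function names are invented for the example and do not describe how Ceres actually loads its configuration.

```rust
use std::env;

/// The two settings named in the CLI reference.
#[derive(Debug, PartialEq, Eq)]
pub struct Settings {
    pub database_url: String,
    pub gemini_api_key: String,
}

/// Read both variables through `lookup`, so tests can pass a fake environment.
/// Empty or whitespace-only values are treated as missing.
pub fn settings_from(lookup: impl Fn(&str) -> Option<String>) -> Result<Settings, String> {
    let required = |name: &str| {
        lookup(name)
            .filter(|value| !value.trim().is_empty())
            .ok_or_else(|| format!("missing environment variable {name}"))
    };
    Ok(Settings {
        database_url: required("DATABASE_URL")?,
        gemini_api_key: required("GEMINI_API_KEY")?,
    })
}

/// Convenience wrapper over the real process environment.
pub fn settings_from_env() -> Result<Settings, String> {
    settings_from(|name| env::var(name).ok())
}
```

Failing early with the name of the missing variable is usually friendlier than a connection error or an HTTP 4xx response halfway through a harvest.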

Development

The project includes a Makefile with convenient shortcuts for common development tasks:

# Start development environment (starts PostgreSQL with docker-compose)
make dev

# Run database migrations
make migrate

# Build the project
make build

# Build in release mode
make release

# Run tests
make test

# Format code
make fmt

# Run lints
make clippy

# See all available commands
make help

Architecture

Roadmap

v0.0.1 — Initial Release ✅

CKAN harvester with concurrent processing
Gemini embeddings (text-embedding-004, 768 dimensions)
CLI with harvest, search, export, stats commands
PostgreSQL + pgvector backend
Multi-format export (JSON, JSONL, CSV)

v0.1 — Enhancements ✅

Portals configuration from portals.toml
Delta harvesting
Improved error handling and retry logic

v0.2 — Multi-portal & API

Incremental harvesting (time-based metadata filtering)
REST API
Graceful shutdown

Future

Multilingual embeddings (E5-multilingual)
Cross-language search
data.europa.eu integration
Socrata support
DCAT-AP harvester (EU portals)
Switchable embedding providers
Schema-level search
Data quality scoring

Contributing

Contributions are welcome! This project is in early stages, so there's plenty of room to shape its direction.

# Run tests
cargo test

# Run with debug logging
RUST_LOG=debug cargo run -- harvest https://dati.comune.milano.it

See CONTRIBUTING.md for guidelines.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Acknowledgments

pgvector — vector similarity for Postgres
Google Gemini — embeddings API
CKAN — the open source data portal platform
