Ceres

Semantic search engine for open data portals

Quick Start • Features • Usage • Roadmap


Ceres harvests metadata from CKAN open data portals and indexes it with vector embeddings, enabling semantic search across fragmented data sources.

Named after the Roman goddess of harvest and agriculture.

Why Ceres?

Open data portals are everywhere, but finding the right dataset is still painful:

  • Keyword search fails: "public transport" won't find "mobility data" or "bus schedules"
  • Portals are fragmented: Italy alone has 20+ regional portals with different interfaces
  • No cross-portal search: You can't query Milano and Roma datasets together

Ceres solves this by creating a unified semantic index. Search by meaning, not just keywords.
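To make "search by meaning" concrete, here is a minimal sketch of similarity ranking over embedding vectors. It is an illustration, not Ceres' code: the function names are hypothetical, cosine similarity is an assumed metric (this README does not state which one Ceres uses), and the toy vectors in the usage note stand in for real embeddings. Ceres itself uses pgvector, so its ranking presumably runs inside PostgreSQL.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns None for mismatched lengths, empty input, or a zero-magnitude vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Rank (title, embedding) pairs by similarity to the query, best match first.
/// Entries whose embedding cannot be compared with the query are skipped.
fn rank<'a>(query: &[f32], docs: &'a [(&'a str, Vec<f32>)]) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = docs
        .iter()
        .filter_map(|(title, emb)| cosine_similarity(query, emb).map(|s| (*title, s)))
        .collect();
    // Sort descending by score; total_cmp gives a total order on floats.
    scored.sort_by(|x, y| y.1.total_cmp(&x.1));
    scored
}
```

With toy vectors such as `("bus schedules", vec![0.9, 0.1, 0.0])` and `("air quality", vec![0.0, 0.2, 0.9])`, a query vector of `[1.0, 0.2, 0.0]` ranks the transport-related entry first even though the two share no keywords.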

$ ceres harvest

INFO [Portal 1/5] milano (https://dati.comune.milano.it)
INFO Found 2575 dataset(s) on portal
INFO Progress: 2575/2575 (100%) - 0 new, 0 updated, 2575 unchanged, 0 failed
INFO [Portal 1/5] milano completed: 2575 dataset(s)

INFO [Portal 2/5] sicilia (https://dati.regione.sicilia.it)
INFO Found 186 dataset(s) on portal
INFO [Portal 2/5] sicilia completed: 186 dataset(s)

INFO [Portal 3/5] trentino (https://dati.trentino.it)
INFO Found 1388 dataset(s) on portal
INFO [Portal 3/5] trentino completed: 1388 dataset(s) (1388 created)

INFO [Portal 4/5] aragon (https://opendata.aragon.es/ckan)
INFO Found 2879 dataset(s) on portal
INFO [Portal 4/5] aragon completed: 2879 dataset(s) (2879 created)

INFO [Portal 5/5] nrw (https://ckan.open.nrw.de)
INFO Found 10926 dataset(s) on portal
INFO [Portal 5/5] nrw completed: 10926 dataset(s) (10926 created)

═══════════════════════════════════════════════════════
BATCH HARVEST COMPLETE
═══════════════════════════════════════════════════════
  Portals processed:   5
  Successful:          5
  Failed:              0
  Total datasets:      17954
═══════════════════════════════════════════════════════
$ ceres search "trasporto pubblico" --limit 5

πŸ” Search Results for: "trasporto pubblico"

Found 5 matching datasets:

1. [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘] [78%] TPL - Percorsi linee di superficie
   πŸ“ https://dati.comune.milano.it
   πŸ”— https://dati.comune.milano.it/dataset/ds534-tpl-percorsi-linee-di-superficie
   πŸ“ Il dataset contiene i tracciati delle linee di trasporto pubblico di superficie...

2. [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘] [76%] TPL - Fermate linee di superficie
   πŸ“ https://dati.comune.milano.it
   πŸ”— https://dati.comune.milano.it/dataset/ds535-tpl-fermate-linee-di-superficie
   πŸ“ Il dataset contiene le fermate delle linee di trasporto pubblico di superficie...

3. [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘] [72%] MobilitΓ : flussi veicolari rilevati dai spire
   πŸ“ https://dati.comune.milano.it
   πŸ”— https://dati.comune.milano.it/dataset/ds418-mobilita-flussi-veicolari
   πŸ“ Dati sul traffico veicolare rilevati dalle spire elettromagnetiche...
$ ceres stats

📊 Database Statistics

  Total datasets:        17954
  With embeddings:       17954
  Unique portals:        5
  Last update:           2025-12-31 15:09:48 UTC

Features

  • CKAN Harvester: Fetch datasets from any CKAN-compatible portal
  • Multi-portal Batch Harvest: Configure multiple portals in portals.toml and harvest them all at once
  • Delta Harvesting: Only regenerate embeddings for changed datasets (99.8% API cost savings)
  • Real-time Progress: Live progress reporting during harvest with batch timestamp updates
  • Semantic Search: Find datasets by meaning using Gemini embeddings
  • Multi-format Export: Export to JSON, JSON Lines, or CSV
  • Database Statistics: Monitor indexed datasets and portals
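The delta harvesting idea can be sketched as follows. This is one possible approach under stated assumptions, not Ceres' actual implementation: the README does not describe how changes are detected, so the fingerprint over title and description, the `Change` enum, and both function names are hypothetical. The three outcomes mirror the "new, updated, unchanged" counters in the harvest log.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Debug, PartialEq)]
enum Change {
    New,
    Updated,
    Unchanged,
}

/// Fingerprint of the text that would be embedded (hypothetical field choice).
/// DefaultHasher output may differ between Rust releases, so a persisted
/// index would want a stable hash such as SHA-256 instead.
fn fingerprint(title: &str, description: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    title.hash(&mut hasher);
    description.hash(&mut hasher);
    hasher.finish()
}

/// Compare a freshly harvested dataset with the stored fingerprint and record
/// the new one. Only New and Updated datasets need an embedding API call.
fn classify(
    stored: &mut HashMap<String, u64>,
    id: &str,
    title: &str,
    description: &str,
) -> Change {
    let fresh = fingerprint(title, description);
    match stored.insert(id.to_string(), fresh) {
        None => Change::New,
        Some(old) if old != fresh => Change::Updated,
        Some(_) => Change::Unchanged,
    }
}
```

On a re-harvest where nothing changed, every dataset classifies as `Unchanged` and no embeddings are regenerated, which is where the API savings come from.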

Pre-configured Portals

Ceres comes with verified CKAN portals ready to use:

| Portal   | Region  | Datasets |
|----------|---------|----------|
| Milano   | Italy   | ~2,575   |
| Sicilia  | Italy   | ~186     |
| Trentino | Italy   | ~1,388   |
| Aragón   | Spain   | ~2,879   |
| NRW      | Germany | ~10,926  |

See examples/portals.toml for the full list. Want to add more? Check issue #19.

Cost-Effectiveness

API costs for the Gemini embedding model are low enough that Ceres is practical even for personal projects or limited budgets.

The main cost is for the initial creation of vector embeddings. Below is a cost breakdown for a large catalog.

Cost Analysis for Initial Indexing

This scenario estimates the one-time cost to index a catalog of 50,000 datasets.

| Metric | Detail |
|--------|--------|
| Cost per 1M input tokens | ~$0.15 USD (standard rate for Google's text-embedding-004 model) |
| Estimated tokens per dataset | 500 tokens (a generous estimate for title, description, and tags) |
| Total tokens | 50,000 datasets * 500 tokens/dataset = 25,000,000 tokens |
| Total initial cost | (25,000,000 / 1,000,000) * $0.15 = $3.75 |

As shown, the initial cost to index a substantial number of datasets is just a few dollars. Monthly maintenance for incremental updates would be even lower, typically amounting to a few cents.
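The same arithmetic as a small helper, in case you want to plug in your own catalog size. The function name is hypothetical, and the rate and tokens-per-dataset values are the estimates quoted above, not measured figures.

```rust
/// One-time embedding cost in USD for indexing a catalog.
/// All three inputs are estimates supplied by the caller.
fn indexing_cost_usd(datasets: u64, tokens_per_dataset: u64, usd_per_million_tokens: f64) -> f64 {
    let total_tokens = datasets * tokens_per_dataset;
    total_tokens as f64 / 1_000_000.0 * usd_per_million_tokens
}
```

`indexing_cost_usd(50_000, 500, 0.15)` reproduces the $3.75 figure above. Under the same assumptions, the 17,954 datasets from the five-portal harvest shown earlier would come to roughly $1.35.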

Tech Stack

| Component       | Technology                        |
|-----------------|-----------------------------------|
| Language        | Rust (async with Tokio)           |
| Database        | PostgreSQL 16+ with pgvector      |
| Embeddings      | Google Gemini text-embedding-004  |
| Portal Protocol | CKAN API v3                       |

Quick Start

Prerequisites

  • Rust 1.87+
  • Docker & Docker Compose
  • Google Gemini API key (get one free)

Installation

# Install from crates.io
cargo install ceres-search

# Or build from source
git clone https://github.com/AndreaBozzo/Ceres.git
cd Ceres
cargo build --release

Setup

# Start PostgreSQL with pgvector
docker-compose up -d

# Run database migrations
make migrate

# Or manually with psql if you prefer
# psql postgresql://ceres_user:password@localhost:5432/ceres_db \
#   -f migrations/202511290001_init.sql

# Configure environment
cp .env.example .env
# Edit .env with your Gemini API key

💡 Tip: This project includes a Makefile with convenient shortcuts. Run make help to see all available commands.

Usage

Harvest datasets from a CKAN portal

ceres harvest https://dati.comune.milano.it

Search indexed datasets

ceres search "trasporto pubblico" --limit 10

Export datasets

# JSON Lines (default)
ceres export > datasets.jsonl

# JSON array
ceres export --format json > datasets.json

# CSV
ceres export --format csv > datasets.csv

# Filter by portal
ceres export --portal https://dati.comune.milano.it

View statistics

ceres stats

CLI Reference

ceres <COMMAND>

Commands:
  harvest  Harvest datasets from a CKAN portal or batch harvest from portals.toml
  search   Search indexed datasets using semantic similarity
  export   Export indexed datasets to various formats
  stats    Show database statistics
  help     Print help information

Environment Variables:
  DATABASE_URL     PostgreSQL connection string
  GEMINI_API_KEY   Google Gemini API key for embeddings
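
If you embed Ceres-style configuration in your own tooling, the two variables above can be validated up front. The sketch below is an assumption about how one might do that, not Ceres' code: the `Settings` struct and `load_settings` function are hypothetical, and the lookup is passed in as a closure so the logic can be exercised without touching the real process environment.

```rust
/// The two settings listed in the CLI reference.
#[derive(Debug, PartialEq)]
struct Settings {
    database_url: String,
    gemini_api_key: String,
}

/// Build Settings from any lookup function. A missing or blank value
/// produces an error that names the offending variable.
fn load_settings(lookup: impl Fn(&str) -> Option<String>) -> Result<Settings, String> {
    let require = |name: &str| {
        lookup(name)
            .filter(|value| !value.trim().is_empty())
            .ok_or_else(|| format!("missing environment variable {name}"))
    };
    Ok(Settings {
        database_url: require("DATABASE_URL")?,
        gemini_api_key: require("GEMINI_API_KEY")?,
    })
}
```

In a real binary the lookup would be `|name| std::env::var(name).ok()`, so a missing key fails fast at startup rather than midway through a harvest.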

Development

The project includes a Makefile with convenient shortcuts for common development tasks:

# Start development environment (starts PostgreSQL with docker-compose)
make dev

# Run database migrations
make migrate

# Build the project
make build

# Build in release mode
make release

# Run tests
make test

# Format code
make fmt

# Run lints
make clippy

# See all available commands
make help

Architecture

Ceres Architecture Diagram

Roadmap

v0.0.1: Initial Release ✅

  • CKAN harvester with concurrent processing
  • Gemini embeddings (text-embedding-004, 768 dimensions)
  • CLI with harvest, search, export, stats commands
  • PostgreSQL + pgvector backend
  • Multi-format export (JSON, JSONL, CSV)

v0.1: Enhancements ✅

  • Portals configuration from portals.toml
  • Delta harvesting
  • Improved error handling and retry logic

v0.2: Multi-portal & API

  • Incremental harvesting (time-based metadata filtering)
  • REST API
  • Graceful shutdown

Future

  • Multilingual embeddings (E5-multilingual)
  • Cross-language search
  • data.europa.eu integration
  • Socrata support
  • DCAT-AP harvester (EU portals)
  • Switchable embedding providers
  • Schema-level search
  • Data quality scoring

Contributing

Contributions are welcome! This project is in early stages, so there's plenty of room to shape its direction.

# Run tests
cargo test

# Run with debug logging
RUST_LOG=debug cargo run -- harvest https://dati.comune.milano.it

See CONTRIBUTING.md for guidelines.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Acknowledgments

  • pgvector: vector similarity for Postgres
  • Google Gemini: embeddings API
  • CKAN: the open source data portal platform