Ceres harvests metadata from CKAN open data portals and indexes it with vector embeddings, enabling semantic search across fragmented data sources.
Named after the Roman goddess of harvest and agriculture.
Open data portals are everywhere, but finding the right dataset is still painful:
- Keyword search fails: "public transport" won't find "mobility data" or "bus schedules"
- Portals are fragmented: Italy alone has 20+ regional portals with different interfaces
- No cross-portal search: You can't query Milano and Roma datasets together
Ceres solves this by creating a unified semantic index. Search by meaning, not just keywords.
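Searching by meaning comes down to comparing embedding vectors rather than matching strings. The sketch below illustrates the idea with cosine similarity over toy 3-dimensional vectors. The names `cosine_similarity` and `rank` are hypothetical helpers, the vectors are invented, and cosine similarity is an assumption: in Ceres the comparison runs inside PostgreSQL through pgvector, and the exact distance metric is not stated here.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Score every document against the query and sort by descending similarity.
fn rank<'a>(query: &[f32], docs: &[(&'a str, Vec<f32>)]) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = docs
        .iter()
        .map(|(title, emb)| (*title, cosine_similarity(query, emb)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}

fn main() {
    // Toy vectors, invented for illustration only.
    let docs: Vec<(&str, Vec<f32>)> = vec![
        ("bus schedules", vec![0.9, 0.1, 0.0]),
        ("air quality", vec![0.0, 0.2, 0.9]),
    ];
    // Stands in for the embedding of the query "public transport".
    let query: [f32; 3] = [1.0, 0.0, 0.0];
    for (title, score) in rank(&query, &docs) {
        println!("{score:.2}  {title}");
    }
}
```

With these toy vectors, "bus schedules" ranks first even though it shares no keyword with the query, which is the behaviour keyword search cannot provide.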
$ ceres harvest
INFO [Portal 1/5] milano (https://dati.comune.milano.it)
INFO Found 2575 dataset(s) on portal
INFO Progress: 2575/2575 (100%) - 0 new, 0 updated, 2575 unchanged, 0 failed
INFO [Portal 1/5] milano completed: 2575 dataset(s)
INFO [Portal 2/5] sicilia (https://dati.regione.sicilia.it)
INFO Found 186 dataset(s) on portal
INFO [Portal 2/5] sicilia completed: 186 dataset(s)
INFO [Portal 3/5] trentino (https://dati.trentino.it)
INFO Found 1388 dataset(s) on portal
INFO [Portal 3/5] trentino completed: 1388 dataset(s) (1388 created)
INFO [Portal 4/5] aragon (https://opendata.aragon.es/ckan)
INFO Found 2879 dataset(s) on portal
INFO [Portal 4/5] aragon completed: 2879 dataset(s) (2879 created)
INFO [Portal 5/5] nrw (https://ckan.open.nrw.de)
INFO Found 10926 dataset(s) on portal
INFO [Portal 5/5] nrw completed: 10926 dataset(s) (10926 created)
═══════════════════════════════════════════════════════
BATCH HARVEST COMPLETE
═══════════════════════════════════════════════════════
Portals processed: 5
Successful: 5
Failed: 0
Total datasets: 17954
═══════════════════════════════════════════════════════
$ ceres search "trasporto pubblico" --limit 5
Search Results for: "trasporto pubblico"
Found 5 matching datasets:
1. [78%] TPL - Percorsi linee di superficie
   https://dati.comune.milano.it
   https://dati.comune.milano.it/dataset/ds534-tpl-percorsi-linee-di-superficie
   Il dataset contiene i tracciati delle linee di trasporto pubblico di superficie...
2. [76%] TPL - Fermate linee di superficie
   https://dati.comune.milano.it
   https://dati.comune.milano.it/dataset/ds535-tpl-fermate-linee-di-superficie
   Il dataset contiene le fermate delle linee di trasporto pubblico di superficie...
3. [72%] Mobilità: flussi veicolari rilevati dai spire
   https://dati.comune.milano.it
   https://dati.comune.milano.it/dataset/ds418-mobilita-flussi-veicolari
   Dati sul traffico veicolare rilevati dalle spire elettromagnetiche...
$ ceres stats
Database Statistics
Total datasets: 17954
With embeddings: 17954
Unique portals: 5
Last update: 2025-12-31 15:09:48 UTC
- CKAN Harvester: Fetch datasets from any CKAN-compatible portal
- Multi-portal Batch Harvest: Configure multiple portals in portals.toml and harvest them all at once
- Delta Harvesting: Only regenerate embeddings for changed datasets (99.8% API cost savings)
- Real-time Progress: Live progress reporting during harvest with batch timestamp updates
- Semantic Search: Find datasets by meaning using Gemini embeddings
- Multi-format Export: Export to JSON, JSON Lines, or CSV
- Database Statistics: Monitor indexed datasets and portals
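Delta harvesting means the embedding API is only called for datasets whose metadata actually changed. How Ceres detects a change is not described here, so the following is a minimal sketch under one plausible assumption: a content hash of the text fields is stored per dataset and compared on each harvest. `Change`, `content_hash`, and `classify` are hypothetical names chosen for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Debug, PartialEq)]
enum Change {
    New,
    Updated,
    Unchanged,
}

/// Hash the text fields that would feed the embedding.
/// DefaultHasher keeps the sketch dependency-free, but its output is not
/// guaranteed stable across Rust releases; a persisted index would want a
/// fixed algorithm such as SHA-256 instead.
fn content_hash(title: &str, description: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    title.hash(&mut hasher);
    description.hash(&mut hasher);
    hasher.finish()
}

/// Compare a harvested dataset with the stored hash and record the new hash.
fn classify(
    stored: &mut HashMap<String, u64>,
    id: &str,
    title: &str,
    description: &str,
) -> Change {
    let hash = content_hash(title, description);
    match stored.insert(id.to_string(), hash) {
        None => Change::New,
        Some(old) if old == hash => Change::Unchanged,
        Some(_) => Change::Updated,
    }
}

fn main() {
    let mut stored = HashMap::new();
    // First harvest: the dataset is new and needs an embedding.
    println!("{:?}", classify(&mut stored, "ds534", "TPL - Percorsi", "v1"));
    // Same metadata again: the embedding call can be skipped.
    println!("{:?}", classify(&mut stored, "ds534", "TPL - Percorsi", "v1"));
    // Description changed: only this dataset is re-embedded.
    println!("{:?}", classify(&mut stored, "ds534", "TPL - Percorsi", "v2"));
}
```

The three outcomes mirror the `new`, `updated`, and `unchanged` counters in the harvest progress line; only the first two would trigger an API call.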
Ceres comes with verified CKAN portals ready to use:
| Portal | Region | Datasets |
|---|---|---|
| Milano | Italy | ~2,575 |
| Sicilia | Italy | ~186 |
| Trentino | Italy | ~1,388 |
| Aragón | Spain | ~2,879 |
| NRW | Germany | ~10,926 |
See examples/portals.toml for the full list. Want to add more? Check issue #19.
Embedding costs with the Gemini model are low enough that Ceres stays practical for personal projects and limited budgets.
The main cost is the one-time creation of vector embeddings. The breakdown below estimates that cost for a large catalog of 50,000 datasets.
| Metric | Detail |
|---|---|
| Cost per 1M Input Tokens | ~$0.15 USD (Standard rate for Google's text-embedding-004 model) |
| Estimated Tokens per Dataset | 500 tokens (A generous estimate for title, description, and tags) |
| Total Tokens | 50,000 datasets * 500 tokens/dataset = 25,000,000 tokens |
| Total Initial Cost | (25,000,000 / 1,000,000) * $0.15 = $3.75 |
As shown, the initial cost to index a substantial number of datasets is just a few dollars. Monthly maintenance for incremental updates would be even lower, typically amounting to a few cents.
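The same arithmetic can be rerun for a catalog of any size. This is a small sketch using the figures from the table (500 tokens per dataset, $0.15 per million tokens); `initial_cost_usd` is a hypothetical helper, and both inputs are estimates that should be replaced with current pricing and measured token counts.

```rust
/// One-time embedding cost in USD for a catalog.
fn initial_cost_usd(datasets: u64, tokens_per_dataset: u64, usd_per_million_tokens: f64) -> f64 {
    let total_tokens = datasets * tokens_per_dataset;
    total_tokens as f64 / 1_000_000.0 * usd_per_million_tokens
}

fn main() {
    // Figures from the table: 50,000 datasets, 500 tokens each, $0.15 per 1M tokens.
    let cost = initial_cost_usd(50_000, 500, 0.15);
    println!("Initial indexing cost: ${cost:.2}"); // prints "Initial indexing cost: $3.75"

    // The same estimate for the 17,954 datasets harvested from the five portals.
    println!("Five portals: ${:.2}", initial_cost_usd(17_954, 500, 0.15));
}
```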
| Component | Technology |
|---|---|
| Language | Rust (async with Tokio) |
| Database | PostgreSQL 16+ with pgvector |
| Embeddings | Google Gemini text-embedding-004 |
| Portal Protocol | CKAN API v3 |
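For orientation, CKAN's Action API v3 exposes its operations under `/api/3/action/<action>`. The sketch below only builds such a URL; it makes no network call. Which actions and parameters Ceres actually uses is not stated here, so `ckan_action_url` and the `package_search` example are illustrative assumptions.

```rust
/// Build a CKAN Action API v3 URL: `<portal>/api/3/action/<action>?k=v&...`.
/// No percent-encoding is applied, so keys and values must already be URL-safe.
fn ckan_action_url(portal: &str, action: &str, params: &[(&str, &str)]) -> String {
    // Avoid a double slash when the portal URL ends with '/'.
    let base = portal.trim_end_matches('/');
    let mut url = format!("{base}/api/3/action/{action}");
    for (i, (key, value)) in params.iter().enumerate() {
        url.push(if i == 0 { '?' } else { '&' });
        url.push_str(&format!("{key}={value}"));
    }
    url
}

fn main() {
    // Page through a portal's datasets 100 at a time (paging values are illustrative).
    let url = ckan_action_url(
        "https://dati.comune.milano.it/",
        "package_search",
        &[("rows", "100"), ("start", "0")],
    );
    println!("{url}");
}
```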
- Rust 1.87+
- Docker & Docker Compose
- Google Gemini API key (get one free)
# Install from crates.io
cargo install ceres-search
# Or build from source
git clone https://github.com/AndreaBozzo/Ceres.git
cd Ceres
cargo build --release
# Start PostgreSQL with pgvector
docker-compose up -d
# Run database migrations
make migrate
# Or manually with psql if you prefer
# psql postgresql://ceres_user:password@localhost:5432/ceres_db \
# -f migrations/202511290001_init.sql
# Configure environment
cp .env.example .env
# Edit .env with your Gemini API key
Tip: This project includes a Makefile with convenient shortcuts. Run make help to see all available commands.
ceres harvest https://dati.comune.milano.it
ceres search "trasporto pubblico" --limit 10
# JSON Lines (default)
ceres export > datasets.jsonl
# JSON array
ceres export --format json > datasets.json
# CSV
ceres export --format csv > datasets.csv
# Filter by portal
ceres export --portal https://dati.comune.milano.it
ceres stats
ceres <COMMAND>
Commands:
harvest Harvest datasets from a CKAN portal or batch harvest from portals.toml
search Search indexed datasets using semantic similarity
export Export indexed datasets to various formats
stats Show database statistics
help Print help information
Environment Variables:
DATABASE_URL PostgreSQL connection string
GEMINI_API_KEY Google Gemini API key for embeddings
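Both variables have to be present before any command can reach the database or the embeddings API. As an illustration of validating them at startup, here is a minimal sketch; `Config` and `load_config` are hypothetical names, and the error wording is invented rather than taken from Ceres.

```rust
use std::env;

#[derive(Debug)]
struct Config {
    database_url: String,
    gemini_api_key: String,
}

/// Build a Config from a lookup function, so the logic can be exercised
/// without touching the real process environment.
fn load_config(lookup: impl Fn(&str) -> Option<String>) -> Result<Config, String> {
    // Treat unset and empty values the same way.
    let require = |name: &str| {
        lookup(name)
            .filter(|value| !value.is_empty())
            .ok_or_else(|| format!("{name} is not set"))
    };
    Ok(Config {
        database_url: require("DATABASE_URL")?,
        gemini_api_key: require("GEMINI_API_KEY")?,
    })
}

fn main() {
    match load_config(|name| env::var(name).ok()) {
        // Print lengths only, so secrets never end up in logs.
        Ok(cfg) => println!(
            "configuration loaded (url: {} chars, key: {} chars)",
            cfg.database_url.len(),
            cfg.gemini_api_key.len()
        ),
        Err(e) => eprintln!("configuration error: {e}"),
    }
}
```

Passing the lookup as a closure keeps the check testable and makes it easy to swap in values read from a `.env` file.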
The project includes a Makefile with convenient shortcuts for common development tasks:
# Start development environment (starts PostgreSQL with docker-compose)
make dev
# Run database migrations
make migrate
# Build the project
make build
# Build in release mode
make release
# Run tests
make test
# Format code
make fmt
# Run lints
make clippy
# See all available commands
make help
- CKAN harvester with concurrent processing
- Gemini embeddings (text-embedding-004, 768 dimensions)
- CLI with harvest, search, export, stats commands
- PostgreSQL + pgvector backend
- Multi-format export (JSON, JSONL, CSV)
- Portals configuration from portals.toml
- Delta harvesting
- Improved error handling and retry logic
- Incremental harvesting (time-based metadata filtering)
- REST API
- Graceful shutdown
- Multilingual embeddings (E5-multilingual)
- Cross-language search
- data.europa.eu integration
- Socrata support
- DCAT-AP harvester (EU portals)
- Switchable embedding providers
- Schema-level search
- Data quality scoring
Contributions are welcome! This project is in early stages, so there's plenty of room to shape its direction.
# Run tests
cargo test
# Run with debug logging
RUST_LOG=debug cargo run -- harvest https://dati.comune.milano.it
See CONTRIBUTING.md for guidelines.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
- pgvector: vector similarity for Postgres
- Google Gemini: embeddings API
- CKAN: the open source data portal platform

