Open
Participants
- Romain PETER @Rom1Peter
- Amadou SOW @akumq
- Gaspard BAUBY @GaspardBBY
Description
Currently, the MByte project allows users to store and manage files. As with #8, to enhance the user experience and move towards a professional cloud storage solution, we need to implement a powerful search engine. The goal is to index the content of stored files (PDF, DOCX, TXT, etc.) using Elasticsearch to allow full-text search across the user's store.
Additionally, we want to support advanced metadata extraction, specifically YAML Frontmatter, a standard widely used in note-taking apps like Obsidian.
User Stories
- As a user, I want to search for a keyword and find all documents containing that word in their body text.
- As a user, I want to filter my notes/files based on the metadata (tags, dates, aliases) defined in their YAML frontmatter.
- As a developer, I want a scalable indexing system that stays synchronized with the PostgreSQL database.
Technical Requirements & Tasks
1. Infrastructure & Setup
- Add an Elasticsearch service to the docker-compose.yml.
- Configure the Manager or Store application to communicate with the Elasticsearch cluster (using the Quarkus Elasticsearch extension).
- Ensure the service is discoverable via Consul.
2. Indexing Pipeline
- Implement a listener or a background job to detect file uploads in the Store service.
- Integrate Apache Tika (already mentioned in the project resources) to extract text content from various file formats.
- Create an Elasticsearch Index Mapping optimized for full-text search (using N-grams or specific Analyzers).
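To make the N-gram option concrete, here is a small illustration of what an edge n-gram analyzer emits for a single token. This is plain Java, not Elasticsearch code, and the class name `EdgeNgrams` plus the gram sizes are assumptions chosen for the example. Indexing these prefixes is what lets a partial query such as "sea" match a document containing "search".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: mimics the output of an edge n-gram tokenizer
// followed by a lowercase filter, for illustration only.
final class EdgeNgrams {

    /** Returns the lowercase prefixes of token with lengths minGram..maxGram. */
    static List<String> of(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        String lower = token.toLowerCase(Locale.ROOT);
        int longest = Math.min(maxGram, lower.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(lower.substring(0, len));
        }
        return grams;
    }
}
```

For example, `EdgeNgrams.of("Search", 2, 4)` yields `[se, sea, sear]`. The trade-off is index size: every extra gram length adds terms, so the real mapping would probably cap the maximum gram length and use a standard analyzer at query time.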
3. YAML Frontmatter Support (Optional/Bonus)
- Implement a parser to detect YAML blocks at the beginning of Markdown files:
---
tags: [project, miage]
status: draft
---
- Map these YAML fields to specific attributes in the Elasticsearch document to allow filtered searches (e.g., find all files where status is draft).
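A minimal sketch of the detection step, assuming we only need flat `key: value` pairs and inline lists like the example above. `FrontmatterParser` is a hypothetical name, and this is deliberately not a full YAML parser (nested maps, multi-line lists, and quoting are ignored); a real implementation would likely hand the extracted block to a proper YAML library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reads only flat "key: value" pairs and inline [a, b] lists.
final class FrontmatterParser {

    private static final String DELIMITER = "---";

    /** Returns the frontmatter fields, or an empty map if there is no leading block. */
    static Map<String, Object> parse(String markdown) {
        Map<String, Object> fields = new LinkedHashMap<>();
        String[] lines = markdown.split("\\R");
        if (lines.length == 0 || !lines[0].trim().equals(DELIMITER)) {
            return fields; // frontmatter must start on the very first line
        }
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.equals(DELIMITER)) {
                return fields; // closing delimiter reached
            }
            int colon = line.indexOf(':');
            if (line.isEmpty() || line.startsWith("#") || colon <= 0) {
                continue; // skip blanks, comments, and lines this subset cannot read
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            fields.put(key, parseValue(value));
        }
        return new LinkedHashMap<>(); // block never closed: treat as no frontmatter
    }

    /** Turns "[a, b]" into a list of strings; anything else stays a plain string. */
    private static Object parseValue(String value) {
        if (value.startsWith("[") && value.endsWith("]")) {
            List<String> items = new ArrayList<>();
            for (String item : value.substring(1, value.length() - 1).split(",")) {
                if (!item.trim().isEmpty()) {
                    items.add(item.trim());
                }
            }
            return items;
        }
        return value;
    }
}
```

Parsing the example block gives a map with `tags` as the list `[project, miage]` and `status` as the string `draft`, which can then be copied into dedicated fields of the Elasticsearch document.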
4. Search API
- Create a new REST endpoint /search?q=keyword in the Store API.
- Implement the search logic using the Elasticsearch Query DSL.
- Return relevant results with snippets or highlights if possible.
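As a starting point for discussion, the request body behind `/search` could look like the sketch below: a `bool` query with a full-text `match`, an optional `term` filter on a frontmatter field, and a `highlight` section for snippets. The class name `SearchQueryBuilder` and the field names `content` and `frontmatter.status` are assumptions, since the index mapping is not defined yet. The JSON is assembled by hand only to keep the example dependency-free; in the service we would more likely use the client's own builders.

```java
// Hypothetical builder; field names "content" and "frontmatter.status" are assumptions.
final class SearchQueryBuilder {

    /** Builds a Query DSL body: full-text match, optional status filter, highlights. */
    static String build(String keyword, String statusFilter) {
        StringBuilder json = new StringBuilder();
        json.append("{\"query\":{\"bool\":{");
        json.append("\"must\":[{\"match\":{\"content\":\"")
            .append(escape(keyword)).append("\"}}]");
        if (statusFilter != null) {
            // Exact-value filter; assumes the field is mapped as a keyword.
            json.append(",\"filter\":[{\"term\":{\"frontmatter.status\":\"")
                .append(escape(statusFilter)).append("\"}}]");
        }
        json.append("}},");
        // Ask Elasticsearch to return highlighted fragments of the matched content.
        json.append("\"highlight\":{\"fields\":{\"content\":{}}}}");
        return json.toString();
    }

    /** Escapes quotes, backslashes, and control characters for a JSON string. */
    private static String escape(String raw) {
        StringBuilder out = new StringBuilder();
        for (char c : raw.toCharArray()) {
            if (c == '"') {
                out.append("\\\"");
            } else if (c == '\\') {
                out.append("\\\\");
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Calling `SearchQueryBuilder.build("report", "draft")` produces a body that matches "report" in the file content and keeps only documents whose frontmatter status is draft; passing `null` as the second argument drops the filter.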
Proposed Architecture
We should decide between:
- Application-level indexing: The Store service sends data to Elasticsearch after saving to Postgres.
- Asynchronous indexing: Using a Message Oriented Middleware (MOM) to decouple the indexing process from the upload process.
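To help compare the two options, here is a rough sketch of how each would shape the upload path. Every name is hypothetical, the Postgres write is reduced to a comment, and the in-memory queue only stands in for a real broker; the point is just to show where the indexing call happens relative to the upload returning.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical abstraction over "send this file to Elasticsearch".
interface Indexer {
    void index(String fileId);
}

// Test double that records which files were indexed.
final class RecordingIndexer implements Indexer {
    final List<String> indexed = new ArrayList<>();

    @Override
    public void index(String fileId) {
        indexed.add(fileId);
    }
}

// Option 1: application-level indexing, done inside the upload call.
final class SyncUpload {
    private final Indexer indexer;

    SyncUpload(Indexer indexer) {
        this.indexer = indexer;
    }

    void upload(String fileId) {
        // ... save file metadata to Postgres here ...
        indexer.index(fileId); // upload latency now includes indexing
    }
}

// Option 2: asynchronous indexing; the queue stands in for the MOM.
final class AsyncUpload {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Indexer indexer;

    AsyncUpload(Indexer indexer) {
        this.indexer = indexer;
    }

    void upload(String fileId) {
        // ... save file metadata to Postgres here ...
        pending.add(fileId); // publish an event and return immediately
    }

    /** Consumer side: indexes everything queued so far and returns the count. */
    int consumePending() {
        int count = 0;
        String fileId;
        while ((fileId = pending.poll()) != null) {
            indexer.index(fileId);
            count++;
        }
        return count;
    }
}
```

With the first option a failed or slow Elasticsearch call affects the upload itself, while with the second the search index lags behind Postgres until the consumer catches up, so results are only eventually consistent.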
Resources