Integration of Typesense for Full-Text Search and Obsidian-style YAML Frontmatter indexing #24

@GaspardBBY

Description

Currently, the MByte project allows users to store and manage files. As with #8, to enhance the user experience and move towards a professional cloud storage solution, we need to implement a powerful search engine. The goal is to index the content of stored files (PDF, DOCX, TXT, etc.) using Elasticsearch to allow full-text search across the user's store.

Additionally, we want to support advanced metadata extraction, specifically YAML Frontmatter, a standard widely used in note-taking apps like Obsidian.

User Stories

  • As a user, I want to search for a keyword and find all documents containing that word in their body text.
  • As a user, I want to filter my notes/files based on the metadata (tags, dates, aliases) defined in their YAML frontmatter.
  • As a developer, I want a scalable indexing system that stays synchronized with the PostgreSQL database.

Technical Requirements & Tasks

1. Infrastructure & Setup

  • Add an Elasticsearch service to the docker-compose.yml.
  • Configure the Manager or Store application to communicate with the Elasticsearch cluster (using the Quarkus Elasticsearch extension).
  • Ensure the service is discoverable via Consul.

2. Indexing Pipeline

  • Implement a listener or a background job to detect file uploads in the Store service.
  • Integrate Apache Tika (already mentioned in the project resources) to extract text content from various file formats.
  • Create an Elasticsearch Index Mapping optimized for full-text search (using N-grams or specific Analyzers).
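
As a starting point for the mapping task, here is a rough sketch of an index definition kept as a Java text block. The index name `mbyte-files`, the field names, the analyzer names, and the gram sizes are all assumptions for illustration; none of them is defined in this issue. It uses an `edge_ngram` tokenizer at index time and the `standard` analyzer at search time, which is one common way to get prefix matching.

```java
/** Hypothetical holder for the index definition; names and sizes are illustrative. */
final class FileIndexMapping {

    /** Index name is an assumption, not something this issue defines. */
    static final String INDEX_NAME = "mbyte-files";

    /**
     * Settings and mappings body for creating the index (PUT /mbyte-files).
     * edge_ngram indexes word prefixes ("re", "rep", "repo", ...) so partial
     * words can match; the standard search_analyzer keeps the query text
     * from being split into grams as well.
     */
    static final String BODY = """
        {
          "settings": {
            "analysis": {
              "tokenizer": {
                "prefix_tokenizer": {
                  "type": "edge_ngram",
                  "min_gram": 2,
                  "max_gram": 15,
                  "token_chars": ["letter", "digit"]
                }
              },
              "analyzer": {
                "prefix_analyzer": {
                  "type": "custom",
                  "tokenizer": "prefix_tokenizer",
                  "filter": ["lowercase"]
                }
              }
            }
          },
          "mappings": {
            "properties": {
              "fileId":   { "type": "keyword" },
              "filename": { "type": "text" },
              "content":  {
                "type": "text",
                "analyzer": "prefix_analyzer",
                "search_analyzer": "standard"
              }
            }
          }
        }
        """;
}
```

Whether prefix matching is actually wanted, or whether a plain language analyzer is enough, is still open; the sketch only shows where that choice would live.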

3. YAML Frontmatter Support (Optional/Bonus)

  • Implement a parser to detect YAML blocks at the beginning of Markdown files:
---
tags: [project, miage]
status: draft
---
  • Map these YAML fields to specific attributes in the Elasticsearch document to allow filtered searches (e.g., find all files where status is draft).
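
The frontmatter detection described above could be sketched as follows. This is a minimal illustration, not a full YAML parser: it only handles flat `key: value` pairs and inline `[a, b]` lists like the example, and the class name `FrontmatterParser` is hypothetical. A real implementation would probably hand the block to a YAML library instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: extracts a leading YAML frontmatter block from Markdown text. */
final class FrontmatterParser {

    private static final String DELIMITER = "---";

    /**
     * Returns the frontmatter fields, or an empty map when the text does not
     * start with a "---" line or the block is never closed.
     * Only flat "key: value" pairs and inline "[a, b]" lists are handled.
     */
    static Map<String, Object> parse(String markdown) {
        String[] lines = markdown.split("\\R", -1);
        if (!lines[0].stripTrailing().equals(DELIMITER)) {
            return Map.of(); // no frontmatter on the very first line
        }
        Map<String, Object> fields = new LinkedHashMap<>();
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.equals(DELIMITER)) {
                return fields; // closing delimiter found
            }
            int colon = line.indexOf(':');
            if (line.isEmpty() || line.startsWith("#") || colon <= 0) {
                continue; // skip blanks, comments, and unsupported syntax
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            fields.put(key, parseValue(value));
        }
        return Map.of(); // unterminated block: treat the file as plain content
    }

    /** Turns "[a, b]" into a list of strings; anything else stays a string. */
    private static Object parseValue(String value) {
        if (value.startsWith("[") && value.endsWith("]")) {
            List<String> items = new ArrayList<>();
            for (String item : value.substring(1, value.length() - 1).split(",")) {
                if (!item.isBlank()) {
                    items.add(item.trim());
                }
            }
            return items;
        }
        return value;
    }
}
```

For the example block, `parse` would yield `tags` as a two-element list and `status` as the string `draft`, which could then be copied into the indexed document as separate fields.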

4. Search API

  • Create a new REST endpoint /search?q=keyword in the Store API.
  • Implement the search logic using the Elasticsearch Query DSL.
  • Return relevant results with snippets or highlights if possible.
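
To make the Query DSL task more concrete, here is a rough sketch of the request body such an endpoint might send to Elasticsearch's `_search` API. The class name `SearchQueryBuilder` and the field names `content` and `status` are assumptions, and the `term` filter presumes `status` is mapped as a `keyword` field. The JSON is built by hand only to keep the example dependency-free; a real service would more likely use a client library or a JSON mapper.

```java
/** Hypothetical builder for the JSON body sent to POST /<index>/_search. */
final class SearchQueryBuilder {

    /**
     * Builds a bool query: a full-text match on "content", an optional exact
     * filter on the frontmatter "status" field, and highlighted snippets.
     * Field names are assumptions about the index mapping.
     */
    static String build(String keyword, String statusFilter) {
        String filter = statusFilter == null
                ? ""
                : ",\"filter\":[{\"term\":{\"status\":\"" + escape(statusFilter) + "\"}}]";
        return "{\"query\":{\"bool\":{\"must\":[{\"match\":{\"content\":\""
                + escape(keyword) + "\"}}]" + filter + "}},"
                + "\"highlight\":{\"fields\":{\"content\":{}}}}";
    }

    /** Minimal JSON string escaping so user input cannot break the body. */
    static String escape(String raw) {
        StringBuilder out = new StringBuilder();
        for (char c : raw.toCharArray()) {
            switch (c) {
                case '"' -> out.append("\\\"");
                case '\\' -> out.append("\\\\");
                case '\n' -> out.append("\\n");
                case '\r' -> out.append("\\r");
                case '\t' -> out.append("\\t");
                default -> {
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
                }
            }
        }
        return out.toString();
    }
}
```

A call such as `SearchQueryBuilder.build("report", "draft")` would combine the keyword match with the `status` filter, while passing `null` as the second argument leaves the filter out entirely.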

Proposed Architecture

We should decide between:

  1. Application-level indexing: The Store service sends data to Elasticsearch after saving to Postgres.
  2. Asynchronous indexing: Using a Message Oriented Middleware (MOM) to decouple the indexing process from the upload process.
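
To illustrate what the second option buys us, here is a small sketch in which an in-process queue stands in for the message broker. Everything in it is hypothetical (the `AsyncIndexingSketch` class, the `FileUploaded` event, the single worker thread); it only shows the decoupling, not a real MOM integration, and it offers none of the durability or retry guarantees a broker would.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical in-process stand-in for a message broker, to show the decoupling only. */
final class AsyncIndexingSketch {

    /** Event that would be published once the file is saved to Postgres. */
    record FileUploaded(String fileId, String path) {}

    private final BlockingQueue<FileUploaded> queue = new LinkedBlockingQueue<>();
    private final Thread worker;

    /** The indexer callback is where text extraction and indexing would happen. */
    AsyncIndexingSketch(Consumer<FileUploaded> indexer) {
        worker = new Thread(() -> {
            try {
                while (true) {
                    indexer.accept(queue.take()); // blocks until an event arrives
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop() was called
            }
        }, "indexing-worker");
        worker.setDaemon(true);
        worker.start();
    }

    /** Upload path: enqueue the event and return without waiting for indexing. */
    void onUpload(String fileId, String path) {
        queue.add(new FileUploaded(fileId, path));
    }

    /** Stops the worker thread; events still in the queue are dropped. */
    void stop() {
        worker.interrupt();
    }
}
```

With option 1, the body of the `indexer` callback would instead run inline in the upload request, which is simpler but makes upload latency depend on Elasticsearch being available.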

Resources
