🧠 Artifaix: AI-Powered Research Artifact Extraction

Artifaix is a professional-grade research platform designed to optimize the process of gathering and reusing knowledge from scientific literature. Developed at RWTH Aachen, the name is a tribute to our core focus and our city: a fusion of Artifacts (the heart of our system), AI (our engine), and AIX (the historical name for Aachen). It helps ML researchers and engineers filter out noise and extract high-value Knowledge Artifacts (Algorithms, Code, Terminology) into a curated, searchable repository.

🚀 The Core Vision: Scaling Research

In the status quo, researchers face information overload. Artifaix addresses this by automating the "Search, Evaluate, and Retain" loop:

AI-Driven Evaluation: Automatically filters out low-quality sources based on project-specific ML problem definitions.
Precision Extraction: Uses specialized agents to find "Declarative" (Algos), "Procedural" (Patterns), and "Executable" (Code) artifacts.
Collaborative Repository: A centralized hub where team "Explorers" can reuse curated knowledge without re-reading entire papers.

🛠️ Advanced Intelligence Pipeline

The system leverages state-of-the-art NLP techniques to handle complex academic documents:

📄 Smart Document Processing

MapReduce Strategy: Large papers are split into chunks, processed in parallel, and reduced into a clean, deduplicated knowledge graph.
Docling Integration: High-fidelity structural parsing of PDFs, images, and HTML into unified JSON representations.
Metadata Enrichment: Integrates with the OpenAlex API to verify and enrich bibliographic data (citations, DOIs, and venues).

🤖 LLM-Assisted Reasoning

Persona-Based Agents: Specialized personas (Academic Reviewer, Technical Specialist) drive the extraction logic.
Iterative Feedback Loops: Each extraction is self-assessed for completeness and hallucinations, refining the prompt until the output is optimal.
Trust Checker: A dedicated agent that cross-references extracted metadata with document content to ensure data integrity.

🔍 Semantic Search (RAG)

Embedding Model: BAAI/bge-small-en-v1.5 (optimized for semantic similarity).
Vector Storage: ChromaDB (local-first, python-native) for instant retrieval.
Retrieval-Augmented Generation: Direct chat with your knowledge base using the gpt-oss-120b model (hosted on RWTH Aachen HPC).

🧱 Software Architecture

Built with a modular, microservice-first approach for maximum flexibility.

Frontend: React + TypeScript + shadcn/ui.
Data Service: FastAPI + SQLite (Relational metadata tracking, inspired by Zotero).
LLM API: Specialized endpoints for Chatter, Extraction, and Trust Checking.
Knowledge Manager: Handles TF-IDF calculation and semantic indexing.

🗺️ System Walkthrough: A Guided Tour

This repository is organized as a modular microservice ecosystem. Each component plays a specific role in the information foraging lifecycle:

1. 📡 Data Service (The Central Nervous System)

The backbone of the system. It manages the relational state, project hierarchies, and coordinates between the UI and the intelligence layer.

Key Tech: FastAPI, SQLAlchemy, SQLite.
Focus: Domain-driven design, CRUD operations, and service orchestration.

2. 🤖 LLM Service (The Intelligence Layer)

Where the heavy lifting happens. This service handles PDF parsing, semantic embedding, and iterative AI reasoning.

Key Tech: ChromaDB, Docling, OpenAI-compatible APIs.
Focus: RAG pipelines, MapReduce document processing, and self-evaluating AI agents.

3. 💻 Frontend (The User Experience)

A modern, responsive dashboard where researchers manage their projects and interact with their curated knowledge.

Key Tech: React, TypeScript, Tailwind CSS, shadcn/ui.
Focus: Clean data visualization and interactive AI chat interfaces.

🛠️ CLI & Research Utilities

Beyond the web interface, the llm-service/LLM API contains specialized scripts for deep research:

tfidf.py: Performs keyword analysis to identify distinctive terms in documents compared to the rest of the repository.
prompt_test.py: An interactive CLI for testing Trust Checking, Entity Extraction, and Relevance Evaluation on specific documents.
prompt_batch_extract.py: Automates the full extraction pipeline across all processed documents with iterative self-assessment.
llm_list.py: Utility to list and verify available models on the HPC cluster.

🚦 Getting Started

1. Environment Setup

You'll need Python 3.11+ and Node.js.

# Clone the repository
git clone https://github.com/Sahand-rdl/Artifaix.git
cd Artifaix

2. Configure Secrets

Copy the template .env files and add your API keys:

data-service/.env.example -> data-service/.env
llm-service/.env.example -> llm-service/.env

Note on Storage: On the first run, the system automatically creates a storage directory at ~/Documents/AIForaging/papers (configured in .env). This pathing supports macOS and Linux; for Windows users, please update the PAPERS_STORAGE_PATH in .env to a valid Windows path.

3. Master Launch Script

chmod +x start_services.sh
./start_services.sh

Access the Frontend at http://localhost:8080.

📜 Academic References & Acknowledgments

This project was developed at the RWTH Aachen University (Software Construction Lab) under the supervision of Selin Coban.

MapReduce Implementation: Inspired by Dean & Ghemawat (2008).
LLM Infrastructure: Supported by the HPC Aachen gpt-oss cluster.
Semantic Engine: Powered by BAAI's bge-small and ChromaDB.

Transforming information into insight, one artifact at a time.

Name		Name	Last commit message	Last commit date
Latest commit History 239 Commits
UI (nano banana)		UI (nano banana)
data-service		data-service
documents		documents
frontend		frontend
llm-service		llm-service
.gitignore		.gitignore
AI Lab Domain Story.jpeg		AI Lab Domain Story.jpeg
OpenSourceTools.pages		OpenSourceTools.pages
Presentation of the tool.pdf		Presentation of the tool.pdf
README.md		README.md
domain model.png		domain model.png
processing pipeline.JPG		processing pipeline.JPG
start_services.sh		start_services.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🧠 Artifaix: AI-Powered Research Artifact Extraction

🚀 The Core Vision: Scaling Research

🛠️ Advanced Intelligence Pipeline

📄 Smart Document Processing

🤖 LLM-Assisted Reasoning

🔍 Semantic Search (RAG)

🧱 Software Architecture

🗺️ System Walkthrough: A Guided Tour

1. 📡 Data Service (The Central Nervous System)

2. 🤖 LLM Service (The Intelligence Layer)

3. 💻 Frontend (The User Experience)

🛠️ CLI & Research Utilities

🚦 Getting Started

1. Environment Setup

2. Configure Secrets

3. Master Launch Script

📜 Academic References & Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🧠 Artifaix: AI-Powered Research Artifact Extraction

🚀 The Core Vision: Scaling Research

🛠️ Advanced Intelligence Pipeline

📄 Smart Document Processing

🤖 LLM-Assisted Reasoning

🔍 Semantic Search (RAG)

🧱 Software Architecture

🗺️ System Walkthrough: A Guided Tour

1. 📡 Data Service (The Central Nervous System)

2. 🤖 LLM Service (The Intelligence Layer)

3. 💻 Frontend (The User Experience)

🛠️ CLI & Research Utilities

🚦 Getting Started

1. Environment Setup

2. Configure Secrets

3. Master Launch Script

📜 Academic References & Acknowledgments

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages