
andrestubbe/FastAIModel


FastAIModel 0.1.0 [ALPHA]: Native Local Inference Runtime for Java



💡 Ultra-fast local LLM and embedding inference directly inside your JVM process: zero-copy, zero HTTP overhead, native C++ speed.

FastAIModel is a retained-memory local inference engine for Java. It wraps llama.cpp (for GGUF models) and ONNX Runtime (for ONNX models) through direct JNI bindings. It is the engine behind offline execution in the FastJava Ecosystem, giving Java developers native LLM and embedding capabilities without having to keep heavyweight external applications such as LM Studio or Ollama running.




Quick Start

import fastaimodel.FastAIModel;

public class Demo {
    public static void main(String[] args) {
        // Load local GGUF model directly into memory
        try (FastAIModel model = new FastAIModel("models/qwen2.5-coder-1.5b.gguf", 2048, 0.7f)) {
            model.generate("Write a quicksort in Java:", token -> {
                System.out.print(token);
                System.out.flush();
            });
        }
    }
}
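If a model path is wrong or points at a non-GGUF file, it can help to fail fast in Java before the native loader runs. The sketch below is not part of FastAIModel: `GgufPreflight` is a hypothetical helper that reads the first 8 bytes of a file and checks for the `GGUF` magic followed by a 4-byte format version, as laid out in the GGUF specification. It assumes a little-endian GGUF file, which is the common case.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical pre-flight check; not part of the FastAIModel API. */
public final class GgufPreflight {
    // Every GGUF file starts with the four ASCII bytes "GGUF".
    private static final byte[] MAGIC = {'G', 'G', 'U', 'F'};

    private GgufPreflight() {}

    /** Returns the GGUF format version, or -1 if the file does not start with the GGUF magic. */
    public static long ggufVersion(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            byte[] header = in.readNBytes(8); // 4-byte magic + 4-byte version
            if (header.length < 8) {
                return -1; // too short to be a GGUF file
            }
            for (int i = 0; i < MAGIC.length; i++) {
                if (header[i] != MAGIC[i]) {
                    return -1;
                }
            }
            // Assumes a little-endian file (the usual GGUF layout).
            int version = ByteBuffer.wrap(header, 4, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
            return Integer.toUnsignedLong(version);
        }
    }

    public static void main(String[] args) throws IOException {
        Path model = Path.of(args.length > 0 ? args[0] : "models/qwen2.5-coder-1.5b.gguf");
        if (!Files.isRegularFile(model)) {
            System.err.println("Model file not found: " + model);
            return;
        }
        long version = ggufVersion(model);
        System.out.println(version < 0
                ? "Not a GGUF file: " + model
                : "GGUF v" + version + ": " + model);
    }
}
```

Calling `GgufPreflight.ggufVersion(...)` before `new FastAIModel(...)` turns a typo in the path or a wrong file type into a clear Java-side error, rather than relying on whatever the native loader reports.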

Why FastAIModel?

Running LLMs locally in Java typically requires invoking external subprocesses or running local HTTP servers. FastAIModel eliminates this bloat by running the model directly inside your Java process:

  • True In-Process Execution: Runs the model in the same process as your application, avoiding inter-process communication over network sockets.
  • Zero HTTP/JSON Overhead: Text and tokens pass directly between Java and C++ memory, with no request serialization in between.
  • Low Memory Overhead: No need to keep a GUI-based desktop inference server running in the background.

Key Features

  • 🚀 Native llama.cpp Performance: Uses llama.cpp's optimized CPU paths (AVX2/AVX-512) and GPU computation (Vulkan/CUDA).
  • 🌊 Direct Token Streaming: Native callbacks stream tokens back to your Java consumer as they are generated.
  • 📦 GGUF Support: Compatible with GGUF-quantized models (Llama, Qwen, Mistral, Gemma).
  • 🧠 Zero-Copy Memory: Shared token handling minimizes garbage-collection pressure on the JVM.
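The streaming callback can also be used to build up the full response while still printing it live. This is a minimal sketch, assuming `generate` is synchronous and takes a `java.util.function.Consumer<String>` as its second argument; the Quick Start lambda suggests this shape, but the exact signature is not documented here. `StreamCollector` and its `Result` record are hypothetical names, and a stand-in generator is used so the example runs without a model (Java 16+ for the record).

```java
import java.util.function.BiConsumer;
import java.util.function.Consumer;

/** Hypothetical helper; not part of the FastAIModel API. */
public final class StreamCollector {
    /** Full generated text, number of token callbacks, and wall-clock time. */
    public record Result(String text, int tokenCount, long elapsedNanos) {}

    private StreamCollector() {}

    /**
     * Runs a generator shaped like generate(prompt, onToken), echoing each token
     * while also accumulating the whole response. Assumes callbacks arrive on the
     * calling thread before the generator returns.
     */
    public static Result collect(BiConsumer<String, Consumer<String>> generator,
                                 String prompt,
                                 Consumer<String> echo) {
        StringBuilder text = new StringBuilder();
        int[] count = {0}; // effectively-final holder so the lambda can increment it
        long start = System.nanoTime();
        generator.accept(prompt, token -> {
            text.append(token);
            count[0]++;
            echo.accept(token); // keep live output, e.g. System.out::print
        });
        return new Result(text.toString(), count[0], System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Stand-in generator so this runs without a model; with FastAIModel you
        // would pass model::generate instead (assumed signature).
        BiConsumer<String, Consumer<String>> fake = (prompt, onToken) -> {
            for (String t : new String[] {"Hello", ",", " world"}) {
                onToken.accept(t);
            }
        };
        Result r = collect(fake, "Say hello", System.out::print);
        System.out.println();
        System.out.printf("%d tokens in %.2f ms%n", r.tokenCount(), r.elapsedNanos() / 1e6);
    }
}
```

Counting callbacks gives a rough tokens-per-second figure, but only if each callback carries exactly one token, which is another assumption about the native side.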

Installation

Option 1: Maven (via JitPack)

Add the JitPack repository and the dependencies to your pom.xml:

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

<dependencies>
    <!-- FastAIModel Engine -->
    <dependency>
        <groupId>com.github.andrestubbe</groupId>
        <artifactId>FastAIModel</artifactId>
        <version>0.1.0</version>
    </dependency>

    <!-- FastCore (Mandatory Native DLL Loader) -->
    <dependency>
        <groupId>com.github.andrestubbe</groupId>
        <artifactId>FastCore</artifactId>
        <version>0.1.0</version>
    </dependency>
</dependencies>

Option 2: Gradle (via JitPack)

repositories {
    maven { url 'https://jitpack.io' }
}

dependencies {
    implementation 'com.github.andrestubbe:FastAIModel:0.1.0'
    implementation 'com.github.andrestubbe:FastCore:0.1.0'
}
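To confirm the dependency actually resolved, you can probe the classpath for the class imported in the Quick Start. `ClasspathProbe` is a hypothetical helper; it calls `Class.forName` with `initialize = false`, so the check does not run the class's static initializers (which, for a JNI wrapper, may try to load native libraries).

```java
/** Hypothetical helper; not part of FastAIModel or FastCore. */
public final class ClasspathProbe {
    private ClasspathProbe() {}

    /** Returns true if the named class can be located, without initializing it. */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize = false: locate and load the class but skip static initializers
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the Quick Start import
        String name = "fastaimodel.FastAIModel";
        System.out.println(name + (isOnClasspath(name)
                ? " found"
                : " NOT found; check the JitPack repository and dependency entries"));
    }
}
```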



Platform Support

Platform             Status
Windows 10/11 (x64)  ✅ Fully supported
Linux                🚧 Planned
macOS                🚧 Planned
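Because only Windows x64 is supported in 0.1.0, an application that ships on several platforms may want to check before constructing a model. A minimal sketch using the standard `os.name` and `os.arch` system properties; `PlatformGuard` is a hypothetical helper, not part of FastAIModel.

```java
import java.util.Locale;

/** Hypothetical helper; not part of the FastAIModel API. */
public final class PlatformGuard {
    private PlatformGuard() {}

    /** True on the platform listed as fully supported: Windows on x64. */
    public static boolean isSupported(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);
        // The JVM reports x64 as "amd64" on Windows; "x86_64" is accepted for completeness.
        return os.startsWith("windows") && (arch.equals("amd64") || arch.equals("x86_64"));
    }

    public static boolean isCurrentPlatformSupported() {
        return isSupported(System.getProperty("os.name", ""), System.getProperty("os.arch", ""));
    }

    public static void main(String[] args) {
        if (!isCurrentPlatformSupported()) {
            System.err.println("FastAIModel 0.1.0 supports Windows x64 only; found "
                    + System.getProperty("os.name") + " / " + System.getProperty("os.arch"));
            return;
        }
        System.out.println("Platform OK");
    }
}
```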

License

MIT License. See the LICENSE file for details.


Part of the FastJava Ecosystem: making the JVM faster. Small package. Maximum speed. Zero bloat. 🚀📋

About

🤖 High-performance local inference runtime for Java (GGUF/ONNX), built on native llama.cpp and ONNX Runtime bindings.
