llemb: Unified Embedding Extraction from Decoder-only LLMs

llemb is a lightweight framework designed to extract high-quality sentence embeddings from Decoder-only Large Language Models (LLMs) like Llama, Mistral, and others. It unifies various state-of-the-art pooling strategies and efficiency optimizations into a simple, coherent interface.

With llemb, you can easily leverage powerful LLMs for embedding tasks using advanced techniques like PromptEOL and PCoTEOL, with built-in support for batch processing and a vLLM backend for high-throughput inference.

Origin of the Name

llemb is a coined name derived from LLM and Embedding. In Japanese pronunciation, the letter M in LLM is read as /ɛm/ (emu), and Embedding begins with Em, pronounced /ɛm/ as well. The shared sound reflects the idea of merging LLMs and embeddings.

Features

  • Flexible Backends: Seamless support for Hugging Face Transformers and vLLM (for high-speed inference).
  • Advanced Pooling & Prompting:
    • Pooling Methods: mean, last_token, eos_token
    • Prompt Templates: prompteol, pcoteol (Pretended Chain of Thought), ke (Knowledge Enhancement)
    • Independent Control: Combine any pooling method with any prompt template
  • High-Performance:
    • Batch Processing: Efficiently handles large datasets with automatic CPU offloading and progress bars (tqdm).
    • Quantization: Native support for 4-bit and 8-bit via bitsandbytes (Transformers) or fp8/awq/gptq (vLLM).
  • Granular Control: Extract embeddings from any layer (defaults to recommended layers based on research).

Installation

Install via PyPI using pip or uv.

Basic Installation (includes quantization support)

pip install llemb
# or
uv add llemb

Starting from v0.2.2, bitsandbytes for quantization is included by default.

With vLLM Support

To enable the vLLM backend for faster inference:

pip install "llemb[vllm]"
# or
uv add "llemb[vllm]"

Getting Started

Initialize the encoder and start extracting embeddings in just a few lines of code.

Basic Usage

import llemb

# 1. Initialize the encoder (defaults to auto-device detection)
enc = llemb.Encoder("meta-llama/Llama-3.1-8B")

# 2. Extract embeddings using mean pooling
embeddings = enc.encode("Hello world", pooling_method="mean")

print(embeddings.shape)
# => (1, 4096)
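
Because encode returns an array-like result (as the shape output above suggests), the embeddings drop straight into standard similarity computations. A minimal sketch using NumPy; if your backend returns a torch tensor instead, convert with .cpu().numpy() first:

import numpy as np

# Hypothetical follow-up to the example above
a = np.asarray(enc.encode("I enjoy machine learning", pooling_method="mean")).ravel()
b = np.asarray(enc.encode("Deep learning fascinates me", pooling_method="mean")).ravel()

# Cosine similarity between the two sentence embeddings
cos = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos)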

Batch Processing (Recommended for Large Data)

Process lists of texts efficiently. llemb handles batching, progress tracking with tqdm, and memory management by automatically offloading computed embeddings to the CPU to prevent GPU VRAM saturation.

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is fascinating.",
    # ... thousands of texts ...
]

# Process in batches of 32
embeddings = enc.encode(texts, batch_size=32, pooling_method="mean")

print(embeddings.shape)
# => (N, 4096)
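
With the (N, 4096) matrix in hand, a common next step is nearest-neighbor retrieval. A minimal sketch with NumPy, reusing enc, texts, and embeddings from the example above (illustrative only, not part of llemb's API):

import numpy as np

# L2-normalize rows so dot products equal cosine similarities
matrix = np.asarray(embeddings)
matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

query = np.asarray(enc.encode("A fast fox leaps over a dog.", pooling_method="mean")).ravel()
query = query / np.linalg.norm(query)

scores = matrix @ query          # cosine similarity against every text
top_k = np.argsort(-scores)[:5]  # indices of the 5 most similar texts
for i in top_k:
    print(f"{scores[i]:.3f}  {texts[i]}")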

Using vLLM Backend

For maximum throughput on supported hardware, use the vLLM backend. You can pass vLLM-specific arguments (like tensor_parallel_size or gpu_memory_utilization) directly to the encoder.

enc = llemb.Encoder(
    "meta-llama/Llama-3.1-8B",
    backend="vllm",
    tensor_parallel_size=1,       # Number of GPUs
    gpu_memory_utilization=0.9    # vLLM memory setting
)

embeddings = enc.encode("Hello vLLM", pooling_method="last_token")

Using Prompt Templates

Leverage research-backed prompt templates to improve embedding quality. When you specify a template, pooling_method automatically defaults to last_token.

import llemb

enc = llemb.Encoder("meta-llama/Llama-3.1-8B")

# Simple usage - pooling_method automatically defaults to last_token
embeddings = enc.encode(
    "Hello world",
    prompt_template="prompteol"
)

# Use PCoTEOL (Pretended Chain of Thought) template
# Note: Automatically uses layer -2 when prompt_template is "pcoteol" or "ke"
embeddings = enc.encode(
    "Hello world",
    prompt_template="pcoteol"
)

# Or override the layer index
embeddings = enc.encode(
    "Hello world",
    prompt_template="pcoteol",
    layer_index=-1  # Override default -2
)
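
For intuition, PromptEOL (Jiang et al., 2024) wraps the input so the model must compress the sentence's meaning into the very next token, which is why last_token pooling pairs with it. The template is roughly of the following form; llemb's exact wording may differ, so treat this as an illustration rather than the library's literal template:

# Approximate PromptEOL wrapper from the cited paper (illustrative)
template = 'This sentence : "{text}" means in one word:"'
print(template.format(text="Hello world"))
# => This sentence : "Hello world" means in one word:"

pcoteol and ke prepend additional reasoning or context instructions in the same spirit; see the cited papers for the exact wording.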

Advanced Usage (Quantization)

Use quantization to reduce memory usage.

import llemb

# Initialize with 4-bit quantization and force CUDA
enc = llemb.Encoder(
    model_name="meta-llama/Llama-3.1-8B",
    backend="transformers",
    device="cuda",
    quantization="4bit"
)

embeddings = enc.encode(
    "Hello world",
    pooling_method="mean",
    prompt_template="ke"  # Knowledge Enhancement template
)
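
Why quantization helps, as a back-of-the-envelope calculation of weight memory for an 8B-parameter model (weights only; activations and framework overhead add more):

params = 8e9                                  # approximate parameter count
print(f"bf16 : {params * 2 / 1e9:.0f} GB")    # 2 bytes per weight -> ~16 GB
print(f"4-bit: {params * 0.5 / 1e9:.0f} GB")  # 0.5 bytes per weight -> ~4 GB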

Configuration & Optimization

llemb passes arguments directly to the backend, allowing for deep customization.

Using Flash Attention 2

import torch

encoder = llemb.Encoder(
    model_name="meta-llama/Llama-3.1-8B",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16
)

Custom Quantization Config

from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

encoder = llemb.Encoder(
    model_name="meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config
)

Supported Pooling Methods & Prompt Templates

Pooling Methods (pooling_method parameter)

| Method | Description | Default Layer |
| --- | --- | --- |
| mean | Average pooling over all tokens (excluding padding). | -1 (Last) |
| last_token | Hidden state of the final token in the input sequence. | -1 (Last) |
| eos_token | Hidden state at the EOS token position. | -1 (Last) |

Prompt Templates (prompt_template parameter)

| Template | Description | Default Layer |
| --- | --- | --- |
| prompteol | Simple prompt template targeting the last token. | -1 (Last) |
| pcoteol | "Pretended Chain of Thought": wraps the input in a reasoning template. | -2 |
| ke | "Knowledge Enhancement": wraps the input in a context-aware template. | -2 |

You can combine any pooling method with any prompt template. When using pcoteol or ke templates, the default layer is automatically set to -2 unless explicitly overridden.

Smart Defaults (v0.2.2+):

  • When prompt_template is specified, pooling_method automatically defaults to last_token
  • When no prompt_template is specified, pooling_method defaults to mean
  • Explicit values always take precedence
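
Composed together, the rules resolve as in the following sketch. This is a hypothetical illustration of the documented behavior, not llemb's actual source; resolve_defaults is an invented name:

def resolve_defaults(pooling_method=None, prompt_template=None, layer_index=None):
    # Explicit values always win; None means "apply the documented default".
    if pooling_method is None:
        pooling_method = "last_token" if prompt_template else "mean"
    if layer_index is None:
        layer_index = -2 if prompt_template in ("pcoteol", "ke") else -1
    return pooling_method, prompt_template, layer_index

assert resolve_defaults() == ("mean", None, -1)
assert resolve_defaults(prompt_template="pcoteol") == ("last_token", "pcoteol", -2)
assert resolve_defaults(prompt_template="pcoteol", layer_index=-1)[2] == -1  # override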

Migration Guide

If you're upgrading from an earlier version of llemb, note that the API has been refactored to separate pooling methods and prompt templates into orthogonal parameters.

Breaking Changes in v0.2.2

  1. API Refactoring: pooling parameter split into pooling_method and prompt_template
  2. Smart Defaults: pooling_method automatically set to last_token when using templates
  3. Dependencies: bitsandbytes is now a core dependency (no longer optional)
  4. Removed: index pooling strategy has been removed

Old API (deprecated):

# Old: pooling parameter mixed strategies and templates
enc.encode("text", pooling="mean")
enc.encode("text", pooling="pcoteol")  # Mixed template + pooling
enc.encode("text", pooling="index", token_index=0)  # Index strategy

New API (v0.2.2+):

# New: separate parameters with smart defaults
enc.encode("text", pooling_method="mean")
enc.encode("text", prompt_template="pcoteol")  # pooling_method auto-set to last_token
# Index strategy has been removed

Migration Examples

| Old Code | New Code (Explicit) | New Code (Smart Default) |
| --- | --- | --- |
| enc.encode(text, pooling="mean") | enc.encode(text, pooling_method="mean") | enc.encode(text) |
| enc.encode(text, pooling="last_token") | enc.encode(text, pooling_method="last_token") | |
| enc.encode(text, pooling="eos_token") | enc.encode(text, pooling_method="eos_token") | |
| enc.encode(text, pooling="prompteol") | enc.encode(text, pooling_method="last_token", prompt_template="prompteol") | enc.encode(text, prompt_template="prompteol") |
| enc.encode(text, pooling="pcoteol") | enc.encode(text, pooling_method="last_token", prompt_template="pcoteol") | enc.encode(text, prompt_template="pcoteol") |
| enc.encode(text, pooling="ke") | enc.encode(text, pooling_method="last_token", prompt_template="ke") | enc.encode(text, prompt_template="ke") |
| enc.encode(text, pooling="index", token_index=0) | Removed; use last_token or custom logic (see the sketch below) | |

Benefits of the New API

  1. Orthogonality: You can now combine any pooling method with any prompt template.
  2. Clarity: The separation makes it clear what each parameter does.
  3. Flexibility: Easier to experiment with different combinations.
  4. Smart Defaults:
    • pooling_method automatically set to last_token when using templates
    • Layer indices automatically set based on prompt template (can be overridden)
  5. Simpler Usage: Less boilerplate for common use cases with templates.

Development

Clone the repository and sync dependencies using uv:

git clone https://github.com/j341nono/llemb.git
cd llemb
uv sync --all-extras --dev

Run Tests

uv run pytest

Static Analysis

uv run ruff check src
uv run mypy src

Citations

If you use the advanced pooling strategies implemented in this library, please cite the respective original papers:

PromptEOL:

@inproceedings{jiang-etal-2024-scaling,
    title = "Scaling Sentence Embeddings with Large Language Models",
    author = "Jiang, Ting and Huang, Shaohan and Luan, Zhongzhi and Wang, Deqing and Zhuang, Fuzhen",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    year = "2024"
}

PCoTEOL and KE:

@article{zhang2024simple,
    title={Simple Techniques for Enhancing Sentence Embeddings in Generative Language Models},
    author={Zhang, Bowen and Chang, Kehua and Li, Chunping},
    journal={arXiv preprint arXiv:2404.03921},
    year={2024}
}

License

This project is open source and available under the Apache-2.0 license.
