llemb: Unified Embedding Extraction from Decoder-only LLMs

llemb is a lightweight framework designed to extract high-quality sentence embeddings from Decoder-only Large Language Models (LLMs) like Llama, Mistral, and others. It unifies various state-of-the-art pooling strategies and efficiency optimizations into a simple, coherent interface.

With llemb, you can easily leverage powerful LLMs for embedding tasks using advanced techniques like PromptEOL and PCoTEOL, with built-in support for Batch Processing and vLLM for high-throughput inference.

Origin of the Name llemb is a coined name derived from LLM and Embedding. In Japanese pronunciation, the letter M in LLM is read as /ɛm/ (emu), and Embedding begins with Em, which is pronounced /ɛm/ as well. Overlapping this sound reflects the idea of merging LLMs and embeddings.

Features

Flexible Backends: Seamless support for Hugging Face Transformers and vLLM (for high-speed inference).
Advanced Pooling & Prompting:
- Pooling Methods: mean, last_token, eos_token
- Prompt Templates: prompteol, pcoteol (Pretended Chain of Thought), ke (Knowledge Enhancement)
- Independent Control: Combine any pooling method with any prompt template
High-Performance:
- Batch Processing: Efficiently handles large datasets with automatic CPU offloading and progress bars (tqdm).
- Quantization: Native support for 4-bit and 8-bit via bitsandbytes (Transformers) or fp8/awq/gptq (vLLM).
Granular Control: Extract embeddings from any layer (defaults to recommended layers based on research).

Installation

Install via PyPI using pip or uv.

Basic Installation (includes quantization support)

pip install llemb
# or
uv add llemb

Starting from v0.2.2, bitsandbytes for quantization is included by default.

With vLLM Support

To enable the vLLM backend for faster inference:

pip install "llemb[vllm]"
# or
uv add llemb[vllm]

Getting Started

Initialize the encoder and start extracting embeddings in just a few lines of code.

Basic Usage

import llemb

# 1. Initialize the encoder (defaults to auto-device detection)
enc = llemb.Encoder("meta-llama/Llama-3.1-8B")

# 2. Extract embeddings using mean pooling
embeddings = enc.encode("Hello world", pooling_method="mean")

print(embeddings.shape)
# => (1, 4096)

Batch Processing (Recommended for Large Data)

Process lists of texts efficiently. llemb handles batching, progress tracking with tqdm, and memory management by automatically offloading computed embeddings to the CPU to prevent GPU VRAM saturation.

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is fascinating.",
    # ... thousands of texts ...
]

# Process in batches of 32
embeddings = enc.encode(texts, batch_size=32, pooling_method="mean")

print(embeddings.shape)
# => (N, 4096)

Using vLLM Backend

For maximum throughput on supported hardware, use the vLLM backend. You can pass vLLM-specific arguments (like tensor_parallel_size or gpu_memory_utilization) directly to the encoder.

enc = llemb.Encoder(
    "meta-llama/Llama-3.1-8B",
    backend="vllm",
    tensor_parallel_size=1,       # Number of GPUs
    gpu_memory_utilization=0.9    # vLLM memory setting
)

embeddings = enc.encode("Hello vLLM", pooling_method="last_token")

Using Prompt Templates

Leverage research-backed prompt templates to improve embedding quality. When you specify a template, pooling_method automatically defaults to last_token.

import llemb

enc = llemb.Encoder("meta-llama/Llama-3.1-8B")

# Simple usage - pooling_method automatically defaults to last_token
embeddings = enc.encode(
    "Hello world",
    prompt_template="prompteol"
)

# Use PCoTEOL (Pretended Chain of Thought) template
# Note: Automatically uses layer -2 when prompt_template is "pcoteol" or "ke"
embeddings = enc.encode(
    "Hello world",
    prompt_template="pcoteol"
)

# Or override the layer index
embeddings = enc.encode(
    "Hello world",
    prompt_template="pcoteol",
    layer_index=-1  # Override default -2
)

Advanced Usage (Quantization)

Use quantization to reduce memory usage.

import llemb

# Initialize with 4-bit quantization and force CUDA
enc = llemb.Encoder(
    model_name="meta-llama/Llama-3.1-8B",
    backend="transformers",
    device="cuda",
    quantization="4bit"
)

embeddings = enc.encode(
    "Hello world",
    pooling_method="mean",
    prompt_template="ke"  # Knowledge Enhancement template
)

Configuration & Optimization

llemb passes arguments directly to the backend, allowing for deep customization.

Using Flash Attention 2

import torch

encoder = llemb.Encoder(
    model_name="meta-llama/Llama-3.1-8B",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16
)

Custom Quantization Config

from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

encoder = llemb.Encoder(
    model_name="meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config
)

Supported Pooling Methods & Prompt Templates

Pooling Methods (`pooling_method` parameter)

Method	Description	Default Layer
`mean`	Average pooling of all tokens (excluding padding).	-1 (Last)
`last_token`	Vector of the last generated token.	-1 (Last)
`eos_token`	Vector corresponding to the EOS token position.	-1 (Last)

Prompt Templates (`prompt_template` parameter)

Template	Description	Default Layer
`prompteol`	Simple prompt template targeting the last token.	-1 (Last)
`pcoteol`	"Pretended Chain of Thought" - wraps input in a reasoning template.	-2
`ke`	"Knowledge Enhancement" - wraps input in a context-aware template.	-2

You can combine any pooling method with any prompt template. When using pcoteol or ke templates, the default layer is automatically set to -2 unless explicitly overridden.

Smart Defaults (v0.2.2+):

When prompt_template is specified, pooling_method automatically defaults to last_token
When no prompt_template is specified, pooling_method defaults to mean
Explicit values always take precedence

Migration Guide

If you're upgrading from an earlier version of llemb, the API has been refactored to separate pooling methods and prompt templates into orthogonal parameters.

Breaking Changes in v0.2.2

API Refactoring: pooling parameter split into pooling_method and prompt_template
Smart Defaults: pooling_method automatically set to last_token when using templates
Dependencies: bitsandbytes is now a core dependency (no longer optional)
Removed: index pooling strategy has been removed

Old API (deprecated):

# Old: pooling parameter mixed strategies and templates
enc.encode("text", pooling="mean")
enc.encode("text", pooling="pcoteol")  # Mixed template + pooling
enc.encode("text", pooling="index", token_index=0)  # Index strategy

New API (v0.2.2+):

# New: separate parameters with smart defaults
enc.encode("text", pooling_method="mean")
enc.encode("text", prompt_template="pcoteol")  # pooling_method auto-set to last_token
# Index strategy has been removed

Migration Examples

Old Code	New Code (Explicit)	New Code (Smart Default)
`enc.encode(text, pooling="mean")`	`enc.encode(text, pooling_method="mean")`	`enc.encode(text)`
`enc.encode(text, pooling="last_token")`	`enc.encode(text, pooling_method="last_token")`	—
`enc.encode(text, pooling="eos_token")`	`enc.encode(text, pooling_method="eos_token")`	—
`enc.encode(text, pooling="prompteol")`	`enc.encode(text, pooling_method="last_token", prompt_template="prompteol")`	`enc.encode(text, prompt_template="prompteol")`
`enc.encode(text, pooling="pcoteol")`	`enc.encode(text, pooling_method="last_token", prompt_template="pcoteol")`	`enc.encode(text, prompt_template="pcoteol")`
`enc.encode(text, pooling="ke")`	`enc.encode(text, pooling_method="last_token", prompt_template="ke")`	`enc.encode(text, prompt_template="ke")`
`enc.encode(text, pooling="index", token_index=0)`	Removed - use `last_token` or implement custom logic	—

Benefits of the New API

Orthogonality: You can now combine any pooling method with any prompt template.
Clarity: The separation makes it clear what each parameter does.
Flexibility: Easier to experiment with different combinations.
Smart Defaults:
- pooling_method automatically set to last_token when using templates
- Layer indices automatically set based on prompt template (can be overridden)
Simpler Usage: Less boilerplate for common use cases with templates.

Development

Clone the repository and sync dependencies using uv:

git clone https://github.com/j341nono/llemb.git
cd llemb
uv sync --all-extras --dev

Run Tests

uv run pytest

Static Analysis

uv run ruff check src
uv run mypy src

Citations

If you use the advanced pooling strategies implemented in this library, please cite the respective original papers:

PromptEOL:

@inproceedings{jiang-etal-2024-scaling,
    title = "Scaling Sentence Embeddings with Large Language Models",
    author = "Jiang, Ting and Huang, Shaohan and Luan, Zhongzhi and Wang, Deqing and Zhuang, Fuzhen",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    year = "2024"
}

PCoTEOL and KE:

@article{zhang2024simple,
    title={Simple Techniques for Enhancing Sentence Embeddings in Generative Language Models},
    author={Zhang, Bowen and Chang, Kehua and Li, Chunping},
    journal={arXiv preprint arXiv:2404.03921},
    year={2024}
}

License

This project is open source and available under the Apache-2.0 license.

Name		Name	Last commit message	Last commit date
Latest commit History 104 Commits
.github/workflows		.github/workflows
docs		docs
scripts		scripts
src/llemb		src/llemb
tests		tests
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

llemb: Unified Embedding Extraction from Decoder-only LLMs

Features

Installation

Getting Started

Basic Usage

Batch Processing (Recommended for Large Data)

Using vLLM Backend

Using Prompt Templates

Advanced Usage (Quantization)

Configuration & Optimization

Supported Pooling Methods & Prompt Templates

Pooling Methods (`pooling_method` parameter)

Prompt Templates (`prompt_template` parameter)

Migration Guide

Breaking Changes in v0.2.2

Migration Examples

Benefits of the New API

Development

Citations

License

About

Uh oh!

Releases 5

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

llemb: Unified Embedding Extraction from Decoder-only LLMs

Features

Installation

Getting Started

Basic Usage

Batch Processing (Recommended for Large Data)

Using vLLM Backend

Using Prompt Templates

Advanced Usage (Quantization)

Configuration & Optimization

Supported Pooling Methods & Prompt Templates

Pooling Methods (pooling_method parameter)

Prompt Templates (prompt_template parameter)

Migration Guide

Breaking Changes in v0.2.2

Migration Examples

Benefits of the New API

Development

Citations

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Pooling Methods (`pooling_method` parameter)

Prompt Templates (`prompt_template` parameter)

Packages