llemb is a lightweight framework designed to extract high-quality sentence embeddings from Decoder-only Large Language Models (LLMs) like Llama, Mistral, and others. It unifies various state-of-the-art pooling strategies and efficiency optimizations into a simple, coherent interface.
With llemb, you can easily leverage powerful LLMs for embedding tasks using advanced techniques like PromptEOL and PCoTEOL, with built-in support for Batch Processing and vLLM for high-throughput inference.
Origin of the Name llemb is a coined name derived from LLM and Embedding. In Japanese pronunciation, the letter M in LLM is read as /ɛm/ (emu), and Embedding begins with Em, which is pronounced /ɛm/ as well. Overlapping this sound reflects the idea of merging LLMs and embeddings.
- Flexible Backends: Seamless support for Hugging Face Transformers and vLLM (for high-speed inference).
- Advanced Pooling & Prompting:
- Pooling Methods:
mean,last_token,eos_token - Prompt Templates:
prompteol,pcoteol(Pretended Chain of Thought),ke(Knowledge Enhancement) - Independent Control: Combine any pooling method with any prompt template
- Pooling Methods:
- High-Performance:
- Batch Processing: Efficiently handles large datasets with automatic CPU offloading and progress bars (
tqdm). - Quantization: Native support for 4-bit and 8-bit via
bitsandbytes(Transformers) orfp8/awq/gptq(vLLM).
- Batch Processing: Efficiently handles large datasets with automatic CPU offloading and progress bars (
- Granular Control: Extract embeddings from any layer (defaults to recommended layers based on research).
Install via PyPI using pip or uv.
Basic Installation (includes quantization support)
pip install llemb
# or
uv add llembStarting from v0.2.2, bitsandbytes for quantization is included by default.
With vLLM Support
To enable the vLLM backend for faster inference:
pip install "llemb[vllm]"
# or
uv add llemb[vllm]Initialize the encoder and start extracting embeddings in just a few lines of code.
import llemb
# 1. Initialize the encoder (defaults to auto-device detection)
enc = llemb.Encoder("meta-llama/Llama-3.1-8B")
# 2. Extract embeddings using mean pooling
embeddings = enc.encode("Hello world", pooling_method="mean")
print(embeddings.shape)
# => (1, 4096)Process lists of texts efficiently. llemb handles batching, progress tracking with tqdm, and memory management by automatically offloading computed embeddings to the CPU to prevent GPU VRAM saturation.
texts = [
"The quick brown fox jumps over the lazy dog.",
"Machine learning is fascinating.",
# ... thousands of texts ...
]
# Process in batches of 32
embeddings = enc.encode(texts, batch_size=32, pooling_method="mean")
print(embeddings.shape)
# => (N, 4096)For maximum throughput on supported hardware, use the vLLM backend. You can pass vLLM-specific arguments (like tensor_parallel_size or gpu_memory_utilization) directly to the encoder.
enc = llemb.Encoder(
"meta-llama/Llama-3.1-8B",
backend="vllm",
tensor_parallel_size=1, # Number of GPUs
gpu_memory_utilization=0.9 # vLLM memory setting
)
embeddings = enc.encode("Hello vLLM", pooling_method="last_token")Leverage research-backed prompt templates to improve embedding quality. When you specify a template, pooling_method automatically defaults to last_token.
import llemb
enc = llemb.Encoder("meta-llama/Llama-3.1-8B")
# Simple usage - pooling_method automatically defaults to last_token
embeddings = enc.encode(
"Hello world",
prompt_template="prompteol"
)
# Use PCoTEOL (Pretended Chain of Thought) template
# Note: Automatically uses layer -2 when prompt_template is "pcoteol" or "ke"
embeddings = enc.encode(
"Hello world",
prompt_template="pcoteol"
)
# Or override the layer index
embeddings = enc.encode(
"Hello world",
prompt_template="pcoteol",
layer_index=-1 # Override default -2
)Use quantization to reduce memory usage.
import llemb
# Initialize with 4-bit quantization and force CUDA
enc = llemb.Encoder(
model_name="meta-llama/Llama-3.1-8B",
backend="transformers",
device="cuda",
quantization="4bit"
)
embeddings = enc.encode(
"Hello world",
pooling_method="mean",
prompt_template="ke" # Knowledge Enhancement template
)llemb passes arguments directly to the backend, allowing for deep customization.
Using Flash Attention 2
import torch
encoder = llemb.Encoder(
model_name="meta-llama/Llama-3.1-8B",
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16
)Custom Quantization Config
from transformers import BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16
)
encoder = llemb.Encoder(
model_name="meta-llama/Llama-3.1-8B",
quantization_config=bnb_config
)| Method | Description | Default Layer |
|---|---|---|
mean |
Average pooling of all tokens (excluding padding). | -1 (Last) |
last_token |
Vector of the last generated token. | -1 (Last) |
eos_token |
Vector corresponding to the EOS token position. | -1 (Last) |
| Template | Description | Default Layer |
|---|---|---|
prompteol |
Simple prompt template targeting the last token. | -1 (Last) |
pcoteol |
"Pretended Chain of Thought" - wraps input in a reasoning template. | -2 |
ke |
"Knowledge Enhancement" - wraps input in a context-aware template. | -2 |
You can combine any pooling method with any prompt template. When using pcoteol or ke templates, the default layer is automatically set to -2 unless explicitly overridden.
Smart Defaults (v0.2.2+):
- When
prompt_templateis specified,pooling_methodautomatically defaults tolast_token - When no
prompt_templateis specified,pooling_methoddefaults tomean - Explicit values always take precedence
If you're upgrading from an earlier version of llemb, the API has been refactored to separate pooling methods and prompt templates into orthogonal parameters.
- API Refactoring:
poolingparameter split intopooling_methodandprompt_template - Smart Defaults:
pooling_methodautomatically set tolast_tokenwhen using templates - Dependencies:
bitsandbytesis now a core dependency (no longer optional) - Removed:
indexpooling strategy has been removed
Old API (deprecated):
# Old: pooling parameter mixed strategies and templates
enc.encode("text", pooling="mean")
enc.encode("text", pooling="pcoteol") # Mixed template + pooling
enc.encode("text", pooling="index", token_index=0) # Index strategyNew API (v0.2.2+):
# New: separate parameters with smart defaults
enc.encode("text", pooling_method="mean")
enc.encode("text", prompt_template="pcoteol") # pooling_method auto-set to last_token
# Index strategy has been removed| Old Code | New Code (Explicit) | New Code (Smart Default) |
|---|---|---|
enc.encode(text, pooling="mean") |
enc.encode(text, pooling_method="mean") |
enc.encode(text) |
enc.encode(text, pooling="last_token") |
enc.encode(text, pooling_method="last_token") |
— |
enc.encode(text, pooling="eos_token") |
enc.encode(text, pooling_method="eos_token") |
— |
enc.encode(text, pooling="prompteol") |
enc.encode(text, pooling_method="last_token", prompt_template="prompteol") |
enc.encode(text, prompt_template="prompteol") |
enc.encode(text, pooling="pcoteol") |
enc.encode(text, pooling_method="last_token", prompt_template="pcoteol") |
enc.encode(text, prompt_template="pcoteol") |
enc.encode(text, pooling="ke") |
enc.encode(text, pooling_method="last_token", prompt_template="ke") |
enc.encode(text, prompt_template="ke") |
enc.encode(text, pooling="index", token_index=0) |
Removed - use last_token or implement custom logic |
— |
- Orthogonality: You can now combine any pooling method with any prompt template.
- Clarity: The separation makes it clear what each parameter does.
- Flexibility: Easier to experiment with different combinations.
- Smart Defaults:
pooling_methodautomatically set tolast_tokenwhen using templates- Layer indices automatically set based on prompt template (can be overridden)
- Simpler Usage: Less boilerplate for common use cases with templates.
Clone the repository and sync dependencies using uv:
git clone https://github.com/j341nono/llemb.git
cd llemb
uv sync --all-extras --devRun Tests
uv run pytestStatic Analysis
uv run ruff check src
uv run mypy srcIf you use the advanced pooling strategies implemented in this library, please cite the respective original papers:
PromptEOL:
@inproceedings{jiang-etal-2024-scaling,
title = "Scaling Sentence Embeddings with Large Language Models",
author = "Jiang, Ting and Huang, Shaohan and Luan, Zhongzhi and Wang, Deqing and Zhuang, Fuzhen",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
year = "2024"
}PCoTEOL and KE:
@article{zhang2024simple,
title={Simple Techniques for Enhancing Sentence Embeddings in Generative Language Models},
author={Zhang, Bowen and Chang, Kehua and Li, Chunping},
journal={arXiv preprint arXiv:2404.03921},
year={2024}
}This project is open source and available under the Apache-2.0 license.