You can attach custom metadata to documents during ingestion, and then use that metadata to filter results during retrieval. Custom metadata supports the following:
- Document categorization
- Temporal filtering
- Custom document properties
- Filtering search results
For basic usage examples and implementation details, refer to the following:
- Ingestion API Usage Notebook - Demonstrates how to add custom metadata during document ingestion.
- Retriever API Usage Notebook - Demonstrates how to use metadata filtering during document retrieval.
The following are limitations when you use custom metadata:
- Metadata fields must be consistent across documents in the same collection.
- Only the string and datetime data types are currently supported for metadata fields.
- Complex filter expressions may impact retrieval performance.
- Timestamp filtering requires strict ISO 8601 format compliance.
- Metadata updates require re-ingestion of documents.
When creating a collection using the /v1/collection endpoint, you can define a metadata schema to enforce data validation during document ingestion. The schema helps ensure that metadata fields are properly typed and consistent across all documents in the collection.
The metadata schema is defined as a list of field definitions when creating a collection. Each field definition must specify:
- name: The name of the metadata field.
- type: The data type of the field (currently supported: "string" or "datetime").
- description: A description of the field's purpose.
Example schema definition:
[
    {
        "name": "timestamp",
        "type": "datetime",
        "description": "Timestamp of when the document was created"
    },
    {
        "name": "category",
        "type": "string",
        "description": "Document category for classification"
    }
]
When creating a collection using the /v1/collection endpoint, include the metadata schema in the request:
data = {
    "collection_name": "my_collection",
    "embedding_dimension": 2048,
    "metadata_schema": [
        {
            "name": "timestamp",
            "type": "datetime",
            "description": "Timestamp of when the document was created"
        },
        {
            "name": "category",
            "type": "string",
            "description": "Document category for classification"
        }
    ]
}
The system validates metadata during ingestion:
- Datetime fields use ISO 8601 format
- String fields accept any text
- Invalid metadata causes document rejection
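Because invalid metadata causes a document to be rejected, it can help to check metadata locally before uploading. The following is a minimal sketch of such a check, not the service's own validator. The helper name `validate_metadata` is hypothetical, and treating a missing schema field as a problem is an assumption based on the consistency requirement stated earlier.

```python
from datetime import datetime

# Schema in the same shape as the metadata_schema shown above.
METADATA_SCHEMA = [
    {"name": "timestamp", "type": "datetime", "description": "Creation time"},
    {"name": "category", "type": "string", "description": "Document category"},
]


def validate_metadata(metadata, schema):
    """Return a list of problems; an empty list means no problems were found.

    Hypothetical client-side helper that mirrors the documented rules.
    """
    errors = []
    for field in schema:
        name, field_type = field["name"], field["type"]
        if name not in metadata:
            # Assumption: every schema field is expected on every document.
            errors.append(f"missing field: {name}")
            continue
        value = metadata[name]
        if not isinstance(value, str):
            errors.append(f"{name}: expected a string, got {type(value).__name__}")
        elif field_type == "datetime":
            try:
                # Note: before Python 3.11, fromisoformat accepts only a
                # subset of ISO 8601 (for example, "2024-03-15T10:23:00").
                datetime.fromisoformat(value)
            except ValueError:
                errors.append(f"{name}: not an ISO 8601 timestamp: {value!r}")
    return errors


problems = validate_metadata(
    {"timestamp": "15/03/2024", "category": "technical"}, METADATA_SCHEMA
)
```

Here `problems` contains a single entry that flags the `timestamp` value, so the document can be corrected before it is sent for ingestion.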
You can add custom metadata during the document ingestion process by using the /v1/documents endpoint.
You can specify metadata for each file, and you can specify different metadata for different documents in the same ingestion batch.
You specify custom metadata as a list of objects, where each object contains the following:
- filename: The name of the document.
- metadata: A dictionary that contains key-value pairs of metadata.
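If your metadata starts out as native Python values, the list-of-objects structure can be assembled programmatically. The following is a minimal sketch that assumes, as the examples in this document do, that every value is sent as a string. The helper names `to_metadata_string` and `build_custom_metadata` are hypothetical and are not part of the API.

```python
from datetime import datetime


def to_metadata_string(value):
    """Convert a native Python value to its string form."""
    # Check bool before anything else: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, datetime):
        # Second precision, for example "2024-03-15T10:23:00".
        return value.replace(microsecond=0).isoformat()
    return str(value)


def build_custom_metadata(entries):
    """Map {filename: {field: native value}} to the list-of-objects form."""
    return [
        {
            "filename": filename,
            "metadata": {k: to_metadata_string(v) for k, v in fields.items()},
        }
        for filename, fields in entries.items()
    ]


custom_metadata = build_custom_metadata({
    "technical_doc.pdf": {
        "timestamp": datetime(2024, 3, 15, 10, 23),
        "category": "technical",
        "priority": 1,
        "is_active": True,
    },
})
```

Converting in one place keeps the string format of each field consistent across documents.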
The following example contains metadata fields timestamp, category, and department.
You can create whatever metadata is helpful for your scenario.
[
    {
        "filename": "document1.pdf",
        "metadata": {
            "timestamp": "2024-03-15T10:23:00",
            "category": "technical",
            "department": "engineering"
        }
    },
    {
        "filename": "document2.pdf",
        "metadata": {
            "timestamp": "2024-03-16T14:30:00",
            "category": "marketing",
            "department": "sales"
        }
    }
]
The following example adds custom metadata during ingestion.
CUSTOM_METADATA = [
    {
        "filename": "technical_doc.pdf",
        "metadata": {
            "timestamp": "2024-03-15T10:23:00",  # ISO 8601 format as string
            "category": "technical",  # string
            "department": "engineering",  # string
            "priority": "1",  # numeric as string
            "is_active": "true"  # boolean as string
        }
    },
    {
        "filename": "marketing_doc.pdf",
        "metadata": {
            "timestamp": "2024-03-16T14:30:00",
            "category": "marketing",
            "department": "sales",
            "priority": "2",
            "is_active": "false"
        }
    }
]
# Include in upload request
data = {
    "collection_name": "my_collection",
    "blocking": False,
    "split_options": {
        "chunk_size": 512,
        "chunk_overlap": 150
    },
    "custom_metadata": CUSTOM_METADATA
}
Consider the following before you create your custom metadata.
- Metadata types: You can specify strings, numeric values, boolean values, and timestamps, but you must specify all values as strings. For example, specify "priority": "1" and "is_active": "true".
- Timestamp format: Specify timestamps in ISO 8601 format. For example, "timestamp": "2024-03-15T10:23:00".
You can use custom metadata to filter documents during retrieval operations
by using the filter_expr parameter in both the /v1/search and /v1/generate endpoints.
Use filter expressions that follow the Milvus boolean expression syntax. For more information, refer to Filtering Explained.
Use the following information to write filter expressions:
- Access metadata fields by using content_metadata["field_name"].
- You can use the following operators:
  - Comparison: ==, !=, >, >=, <, <=
  - Logical: AND, OR, NOT
  - Range: LIKE, IN
- Since all metadata values are strings, comparisons are done with string values. For example, content_metadata["priority"] == "1".
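Writing filter expressions by hand makes it easy to misplace a quote. The following is a minimal sketch of composing expressions from the rules above. The helper names `compare` and `all_of` are hypothetical. Rejecting values that contain a double quote or a backslash is a conservative assumption, made because this document does not describe how such characters are escaped.

```python
COMPARISON_OPERATORS = {"==", "!=", ">", ">=", "<", "<="}


def compare(field_name, operator, value):
    """Build one comparison such as content_metadata["priority"] == "1"."""
    if operator not in COMPARISON_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    text = str(value)  # all metadata values are compared as strings
    if '"' in text or "\\" in text:
        # Assumption: avoid characters whose escaping rules are not documented here.
        raise ValueError(f"value contains a quote or backslash: {text!r}")
    return f'content_metadata["{field_name}"] {operator} "{text}"'


def all_of(*clauses):
    """Join clauses with 'and', parenthesizing each one for clarity."""
    return " and ".join(f"({clause})" for clause in clauses)


filter_expr = all_of(
    compare("category", "==", "technical"),
    compare("priority", "==", 1),
)
```

The resulting `filter_expr` is `(content_metadata["category"] == "technical") and (content_metadata["priority"] == "1")`. Note that the integer `1` is converted to the string `"1"` before it is quoted.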
The following example filters results by category.
filter_expr = 'content_metadata["category"] == "technical"'
The following example filters results by time range.
filter_expr = 'content_metadata["timestamp"] >= "2024-03-01T00:00:00" and content_metadata["timestamp"] <= "2024-03-31T23:59:59"'
The following example filters by category and uses multiple logical operators.
filter_expr = '(content_metadata["department"] == "engineering" and content_metadata["priority"] == "high") or content_metadata["category"] == "critical"'
The following example uses a filter expression in a request payload for the /v1/search endpoint.
payload = {
    "query": "What are the technical specifications?",
    "reranker_top_k": 10,
    "vdb_top_k": 100,
    "collection_names": ["my_collection"],
    "enable_query_rewriting": True,
    "enable_reranker": True,
    "filter_expr": 'content_metadata["category"] == "technical" and content_metadata["priority"] == "high"'
}
The following example uses a filter expression in a request payload for the /v1/generate endpoint.
payload = {
    "messages": [
        {
            "role": "user",
            "content": "What are the latest engineering updates?"
        }
    ],
    "use_knowledge_base": True,
    "collection_names": ["my_collection"],
    "filter_expr": 'content_metadata["department"] == "engineering" and content_metadata["timestamp"] >= "2024-03-01T00:00:00"'
}
The following are the best practices when you work with custom metadata:
- Metadata Design
  - Plan metadata structure before ingestion.
  - Use consistent naming conventions.
  - Include essential filtering fields.
  - Keep metadata values consistent across documents.
  - Document the expected string format for each metadata field.
- Timestamp Usage
  - Consider time zone implications.
  - Use consistent timestamp precision.
- Filter Expressions
  - Test filter expressions with small datasets first.
  - Use parentheses to clarify complex expressions.
  - Consider performance implications of complex filters.
- Error Handling
  - Validate metadata during ingestion.
  - Handle missing metadata fields gracefully.
  - Log invalid filter expressions.
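The timestamp practices above matter most if timestamps are compared as strings, which this document suggests but does not state outright for datetime fields. Under that assumption, string order matches chronological order only when every timestamp uses the same time zone and precision. The following is a minimal sketch of normalizing timestamps before ingestion. The helper name `normalize_timestamp` is hypothetical, and treating timestamps without an offset as already being in UTC is an assumption.

```python
from datetime import datetime, timezone


def normalize_timestamp(value):
    """Return an ISO 8601 string in UTC, at second precision, without an offset."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is not None:
        # Convert offset-aware timestamps to UTC, then drop the offset.
        parsed = parsed.astimezone(timezone.utc).replace(tzinfo=None)
    # Assumption: timestamps without an offset are already in UTC.
    return parsed.replace(microsecond=0).isoformat()


raw = ["2024-03-16T01:00:00+05:00", "2024-03-15T22:00:00"]
normalized = [normalize_timestamp(t) for t in raw]
```

As raw strings, the second timestamp sorts first, although the first one is actually earlier (20:00 UTC on March 15). After normalization, string order and chronological order agree.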
The current implementation doesn't support array-type metadata fields (such as tags), but you can implement tag-like functionality by using one boolean-style metadata field per tag. This is particularly useful when you need to categorize documents with multiple attributes.
For example, instead of using an array of tags like:
{
    "category": ["finance", "earnings"]
}
You can define boolean metadata fields during collection creation:
{
    "name": "is_finance",
    "type": "string",
    "description": "Indicates if document is related to finance"
},
{
    "name": "is_earnings",
    "type": "string",
    "description": "Indicates if document is related to earnings"
}
Then during ingestion, set these fields to "yes" or "no":
{
    "filename": "financial_report.pdf",
    "metadata": {
        "is_finance": "yes",
        "is_earnings": "yes"
    }
}
During retrieval, you can filter using boolean logic:
filter_expr = 'content_metadata["is_finance"] == "yes" and content_metadata["is_earnings"] == "yes"'
Note: This approach requires defining all possible tags at collection creation time, as the metadata schema cannot be modified after collection creation.
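The tag pattern above can be generated from a fixed tag list, so that the schema, the per-document metadata, and the filters stay in sync. The following is a minimal sketch. The tag list `ALL_TAGS` and the helper names are hypothetical, and the `is_<tag>` naming follows the example above.

```python
# Hypothetical full tag set, fixed when the collection is created.
ALL_TAGS = ["finance", "earnings", "legal"]


def tag_schema(all_tags):
    """One string field per tag, for the metadata schema."""
    return [
        {
            "name": f"is_{tag}",
            "type": "string",
            "description": f"Indicates if document is related to {tag}",
        }
        for tag in all_tags
    ]


def tags_to_metadata(doc_tags, all_tags):
    """Set every tag field to "yes" or "no" so fields stay consistent."""
    unknown = set(doc_tags) - set(all_tags)
    if unknown:
        # The schema cannot change after collection creation, so fail early.
        raise ValueError(f"tags not in schema: {sorted(unknown)}")
    return {f"is_{tag}": "yes" if tag in doc_tags else "no" for tag in all_tags}


def tags_filter(required_tags):
    """Filter expression that matches documents carrying all required tags."""
    return " and ".join(
        f'content_metadata["is_{tag}"] == "yes"' for tag in required_tags
    )


metadata = tags_to_metadata(["finance", "earnings"], ALL_TAGS)
filter_expr = tags_filter(["finance", "earnings"])
```

Setting unused tags to "no" explicitly, rather than omitting them, keeps the metadata fields consistent across documents in the collection.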
The following are some issues that might arise when you work with custom metadata:
- Filter Expression Errors
  - Verify that the metadata field names are correct.
  - Verify that all values are correctly enclosed in quotes.
  - Verify that all metadata values are strings in filter expressions.
  - Verify the operator syntax. For valid expression syntax, refer to Milvus Filtering Documentation.
- Timestamp Filtering Issues
  - Verify that the metadata uses the ISO 8601 format.
  - Verify that the time zones are consistent.
  - Validate the date range logic.
- Missing Metadata
  - Verify that the metadata was added during ingestion.
  - Verify that you specified the correct document filename.
  - Validate the metadata structure.
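A misspelled field name is one of the easier filter problems to detect before a request is sent. The following is a minimal sketch that extracts the field names referenced in a filter expression and compares them with the collection's metadata schema. The helper name `unknown_filter_fields` is hypothetical, and the regular expression recognizes only the `content_metadata["field_name"]` form shown in this document.

```python
import re

# Matches content_metadata["field_name"] and captures field_name.
FIELD_PATTERN = re.compile(r'content_metadata\["([^"]+)"\]')


def unknown_filter_fields(filter_expr, schema):
    """Return field names used in filter_expr that are not in the schema."""
    known = {field["name"] for field in schema}
    referenced = FIELD_PATTERN.findall(filter_expr)
    # dict.fromkeys drops duplicates while preserving first-seen order.
    return [name for name in dict.fromkeys(referenced) if name not in known]


schema = [
    {"name": "timestamp", "type": "datetime", "description": "Creation time"},
    {"name": "category", "type": "string", "description": "Document category"},
]
unknown = unknown_filter_fields(
    'content_metadata["category"] == "technical" and content_metadata["priorty"] == "1"',
    schema,
)
```

Here `unknown` is `["priorty"]`, which points to the misspelled field name in the expression.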