fix: address unresolved review comments from PyPDF File Processor PR#4743#5173
RobuRishabh wants to merge 15 commits into llamastack:main
Hi @RobuRishabh! Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
…lamastack#4743

- Remove legacy chunking fallback and _legacy_chunk_file from vector store mixin; raise RuntimeError if FileProcessor API is not configured
- Wire file_processor_api through all vector_io providers (registry, factories, adapter constructors)
- Make files_api required in PyPDF adapter and processor constructors
- Implement chunked file reading (64KB) for direct uploads to cap memory usage
- Add size check on file_id retrieval path against max_file_size_bytes
- Wrap openai_retrieve_file in try/except to surface clear ValueError for missing file_id, with test coverage
- Make .strip() page filter conditional on clean_text config
- Remove unused file_processor_api field from VectorStoreWithIndex
- Clean up dead imports (make_overlapped_chunks) from mixin
- Fix lint and formatting issues via pre-commit checks
- Fix pypdf to handle .txt files

Signed-off-by: roburishabh <roburishabh@outlook.com>
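The chunked-read item above can be illustrated with a minimal sketch. It assumes a synchronous file-like object and a hypothetical helper name (`read_capped`); the actual adapter may read an async upload object and differ in detail. Only the 64KB chunk size comes from the commit message.

```python
import io
from typing import BinaryIO

CHUNK_SIZE = 64 * 1024  # 64KB, the chunk size named in the commit message


def read_capped(stream: BinaryIO, max_bytes: int, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Read a stream in fixed-size chunks, failing once max_bytes is exceeded.

    Hypothetical helper for illustration; not the PR's actual function.
    """
    chunks: list[bytes] = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            # Stop early so an oversized upload is never fully buffered.
            raise ValueError(f"File exceeds maximum size of {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)
```

Because the size check runs after each chunk, at most `max_bytes` plus one chunk is ever held in memory before the read is rejected.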
force-pushed from 43b1105 to 1eb5352
        with pytest.raises(ValueError, match="Cannot provide both file and file_id"):
            await processor.process_file(file=upload_file, file_id="test_id")

    async def test_file_id_without_files_api(self, processor: PyPDFFileProcessor):
looks like you removed a test here, is that purposeful?
Yes. Following up on this review #4743 (comment), I needed to make files_api a required parameter in both PyPDFFileProcessor and PyPDFFileProcessorAdapter, so you must now provide it when creating a processor. Since it's always there, the "if there's no files_api" check was pointless, so I removed that test and replaced it with one for what happens when a user gives a file ID that doesn't exist.
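A minimal sketch of the missing-file-ID behavior described in this reply, under stated assumptions: `FakeFilesApi` and `retrieve_or_raise` are hypothetical names, the `openai_retrieve_file` signature is simplified, and catching a broad `Exception` stands in for whatever narrower error the real Files API raises.

```python
import asyncio


class FakeFilesApi:
    """Hypothetical stand-in for a Files API; not the real llama-stack class."""

    def __init__(self, files: dict[str, bytes]) -> None:
        self._files = files

    async def openai_retrieve_file(self, file_id: str) -> bytes:
        # Assumed behavior: an unknown ID surfaces as an exception (KeyError here).
        return self._files[file_id]


async def retrieve_or_raise(files_api: FakeFilesApi, file_id: str) -> bytes:
    """Translate a lookup failure into a clear ValueError."""
    try:
        return await files_api.openai_retrieve_file(file_id)
    except Exception as exc:  # the real code may catch a narrower error type
        raise ValueError(f"File with id '{file_id}' not found") from exc
```

A test for this path would then assert on the message, for example with `pytest.raises(ValueError, match="not found")`.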
cdoern left a comment
I think you are missing additions to VectorIORouter so all the tests are failing bc the args to the router are mismatched.
    def __init__(self, config: FaissVectorIOConfig, inference_api: Inference, files_api: Files | None) -> None:
        super().__init__(inference_api=inference_api, files_api=files_api, kvstore=None)
    def __init__(
        self, config: FaissVectorIOConfig, inference_api: Inference, files_api: Files | None, file_processor_api=None
-        self, config: FaissVectorIOConfig, inference_api: Inference, files_api: Files | None, file_processor_api=None
+        self, config: FaissVectorIOConfig, inference_api: Inference, files_api: Files | None, file_processor_api: FileProcessor | None
I think
same for all other vector IO providers you added this to
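The annotation suggested in this thread can be sketched as follows. This is an illustration only: `FileProcessor` is modeled here as a small hypothetical `Protocol`, and `ExampleAdapter` and `require_file_processor` are made-up names rather than any of the real vector_io providers. The fail-fast `RuntimeError` mirrors the behavior the PR describes for an unconfigured processor.

```python
from __future__ import annotations

from typing import Protocol


class FileProcessor(Protocol):
    """Hypothetical minimal interface; the real FileProcessor API is larger."""

    async def process_file(self, file_id: str) -> list[str]: ...


class ExampleAdapter:
    """Illustrative adapter, not one of the real vector_io providers."""

    def __init__(self, file_processor_api: FileProcessor | None) -> None:
        # Explicit annotation instead of an untyped `file_processor_api=None`.
        self.file_processor_api = file_processor_api

    def require_file_processor(self) -> FileProcessor:
        # Fail fast rather than silently falling back to another code path.
        if self.file_processor_api is None:
            raise RuntimeError("FileProcessor API is not configured")
        return self.file_processor_api
```

With the explicit annotation, a type checker can flag call sites that use the processor without first handling the `None` case.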
…s-Unresolved-Reviews
Signed-off-by: roburishabh <roburishabh@outlook.com>
…s-Unresolved-Reviews
…hub.com/RobuRishabh/llama-stack into RHAIENG-1823-Address-Unresolved-Reviews
…s-Unresolved-Reviews
- Add MIME type parsing safety check to prevent IndexError
- Document chunked file reading approach and rationale
- Make file_processors a hard dependency for all vector_io providers
- Add unit test for missing file_processor_api error handling

Signed-off-by: roburishabh <roburishabh@outlook.com>
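One way the MIME type safety check in this commit might look, as a hedged sketch: `split_mime_type` is a hypothetical helper, and the choice to return empty strings for missing parts is an assumption rather than the PR's actual behavior.

```python
from __future__ import annotations


def split_mime_type(content_type: str | None) -> tuple[str, str]:
    """Split 'type/subtype; params' without assuming a '/' is present.

    Hypothetical helper for illustration; not the PR's actual function.
    """
    if not content_type:
        return ("", "")
    # Drop parameters such as '; charset=utf-8' before splitting.
    essence = content_type.split(";", 1)[0].strip().lower()
    main, sep, sub = essence.partition("/")
    if not sep:
        # A value like 'pdf' has no subtype; content_type.split('/')[1]
        # would raise IndexError here.
        return (main, "")
    return (main, sub)
```

Using `str.partition` avoids indexing into a list whose length depends on the input.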
This pull request has merge conflicts that must be resolved before it can be merged. @RobuRishabh please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
Remove duplicate legacy chunking code that was incorrectly merged alongside the new FileProcessor API path, and fix incomplete RuntimeError syntax. Also remove unused make_overlapped_chunks import.

Signed-off-by: roburishabh <roburishabh@outlook.com>
…hub.com/RobuRishabh/llama-stack into RHAIENG-1823-Address-Unresolved-Reviews
…PI changes

Add missing docstrings to FaissVectorIOAdapter and WeaviateVectorIOAdapter to fix ruff D101. Replace f-string logging with structured key-value style in openai_vector_store_mixin. Update test_openai_vector_store_mixin to implement new abstract methods and use renamed openai_attach_file_to_vector_store API with proper mock setup.

Signed-off-by: roburishabh <roburishabh@outlook.com>
Made-with: Cursor
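The switch from f-string logging to key-value style can be sketched with the standard library alone. This is an assumption-heavy illustration: the project's own logger may accept fields differently, and the logger name, message, and field names below are invented for the example.

```python
import logging

logger = logging.getLogger("vector_store_example")  # hypothetical logger name


def log_attach(file_id: str, vector_store_id: str) -> None:
    """Log a constant message and carry the values as separate fields."""
    # Instead of: logger.info(f"Attaching {file_id} to {vector_store_id}")
    logger.info(
        "Attaching file to vector store",
        extra={"file_id": file_id, "vector_store_id": vector_store_id},
    )
```

Keeping the message constant makes log lines easier to group, while the fields stay available to handlers and formatters as attributes on the log record.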
This pull request has merge conflicts that must be resolved before it can be merged. @RobuRishabh please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
…s-Unresolved-Reviews
What does this PR do?
Addresses remaining unresolved review comments from PR #4743 (PyPDF File Processor integration) to ensure the file processing pipeline is consistent, correctly typed, and aligned with API expectations.
Key changes:
- Remove the _legacy_chunk_file method and all fallback paths from OpenAIVectorStoreMixin. The system now raises a clear RuntimeError if file_processor_api is not configured, instead of silently degrading to legacy inline parsing.
- Wire file_processor_api through all vector_io providers: Add Api.file_processors to optional_api_dependencies in the registry, pass it through all 12 factory functions, and accept/forward it in all 9 adapter constructors.
- Make files_api required in PyPDF constructors: Remove the default None from both PyPDFFileProcessorAdapter and PyPDFFileProcessor, and use deps[Api.files] (bracket access) in the factory to fail fast if it is somehow missing.
- Add a size check on the file_id retrieval path against max_file_size_bytes.
- Clear error for missing file_id: Wrap openai_retrieve_file in a try/except that surfaces a ValueError("File with id '...' not found"), with a new test covering this case.
- Make the .strip() whitespace-only page filter conditional on the clean_text config setting.
- Remove the unused file_processor_api field from VectorStoreWithIndex and the now-unused make_overlapped_chunks import from the mixin.

Closes #4743
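The clean_text item in the list above can be illustrated with a short sketch. `filter_pages` is a hypothetical helper, and it assumes that pages pass through unchanged when clean_text is disabled; the PR's actual handling of that case is not shown in this description.

```python
def filter_pages(pages: list[str], clean_text: bool) -> list[str]:
    """Drop whitespace-only pages only when clean_text is enabled.

    Hypothetical helper; assumes pages are kept as-is when clean_text is False.
    """
    if not clean_text:
        return list(pages)
    return [page for page in pages if page.strip()]
```

For example, `filter_pages(["intro", "   ", ""], clean_text=True)` keeps only `"intro"`, while `clean_text=False` returns all three entries.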
Test Plan
Automated tests
1. Unit tests (mixin + vector_io)

Covers all test_contextual_retrieval.py (16 tests) and test_vector_store_config_registration.py tests, which exercise the refactored OpenAIVectorStoreMixin.

2. PyPDF file processor tests (20/20 pass)

    uv run --group test pytest -sv tests/integration/file_processors/test_pypdf_processor.py

3. Full integration suite (replay mode)

    uv run --group test pytest -sv tests/integration/ --stack-config=starter

Result: 4 failed, 54 passed, 639 skipped, 1 xfailed

All 4 failures are pre-existing and unrelated:

- test_safety_with_image: Pydantic schema mismatch (type: 'image' vs 'image_url')
- test_starter_distribution_config_loads_and_resolves / test_postgres_demo_distribution_config_loads: relative path FileNotFoundError
- test_mcp_tools_list_with_schemas: no local MCP server (Connection refused)

No regressions in vector_io, file_search, or ingestion workflows.
Manual E2E verification (with starter distro)
1. Verify route is registered:

Expected:

    {
      "route": "/v1alpha/file-processors/process",
      "method": "POST",
      "provider_types": [
        "inline::pypdf"
      ]
    }

2. Verify OpenAPI contains the endpoint:

3. Direct file upload:

Expected: chunks response with metadata.processor = "pypdf".

4. Via file_id:

Expected: chunks response with metadata.processor = "pypdf" and file_id in chunk metadata.