Trial edit by bethh0rn · Pull Request #1 · ritza-co/airweave

bethh0rn · 2026-04-22T15:32:35Z

No description provided.

…eave-ai#1750) * feat(search): add ACL filtering to V2 search + fix SP Online access control - Add acl_principals param to VectorDB protocol and Vespa client - Executor resolves user principals via AccessBroker, builds ACL YQL clause - Admin search-as-user endpoints for all 3 tiers (instant/classic/agentic) - Fail-closed transformer: AC sources without access data default to invisible - SP Online: set access on site, drive, and page entities (drive root permissions) - SP Online: remove dead build_item_entity code - SP token exchange: implement OAuthTokenProvider.get_token_for_resource() - OAuth2Service.exchange_token_for_scope() for resource-scoped token exchange - Fix OAuth callback to pass stored config_fields to lifecycle.validate() - Fix pre-existing D301 ruff error in vespa transformer * feat(sharepoint): enforce required site_url + track Entra groups on all entity types - Make site_url required in SharePointOnlineConfig (one site per connection) - Fix validate_config to validate required fields even when config is empty - Add _track_entity_groups for site, drive, and page entities (enables Entra group expansion) - Update config SSRF tests for new required site_url behavior * feat(sharepoint): switch to application-level client credentials auth - Remove OAuth auth methods, use DIRECT only with client credentials - Add SharePointOnlineAppAuthConfig (tenant_id, client_id, client_secret, private_key) - Implement client_credentials token exchange for Graph API - Implement certificate-based JWT assertion for SP REST API tokens - Use getAllSites for complete site discovery (app permissions) - Add Prefer: deltashowsharingchanges headers for permission change tracking in delta - Make site_url optional (empty = sync all sites in tenant) - Simplify all token providers, group expander, and _get() to single auth path * fix(sharepoint): fix file downloads for client-credentials auth DirectCredentialProvider has no get_token(), so FileService raised "No access token available" for every file download. Two fixes: 1. Recognize SharePoint tempauth= URLs as pre-signed (skip auth header) 2. Fall back to a Graph bearer token via StaticTokenProvider when the download URL is a Graph API content endpoint Tested: 18/18 files now sync successfully with client credentials. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(sharepoint): split into OAuth + client credentials sources with shared base Refactors the single SharePointOnlineSource into two separate sources: - sharepoint_online (OAuth, delegated user auth) — original behavior - sharepoint_online_app (client credentials, app-only auth) — new Both inherit from SharePointOnlineBase which contains all shared sync, browse tree, file download, and ACL logic. Subclasses only override auth-specific hooks: token acquisition, 401 handling, SP REST token exchange, file download auth, and site discovery strategy. Also fixes SP site group expansion by using the actual uploaded certificate PEM for x5t thumbprint computation instead of generating a new self-signed cert each call (which caused thumbprint mismatch with Azure AD). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Daan Manneke <daan@airweave.ai> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Updated the links for various connectors in the overview documentation to point to the correct `/docs/connectors/` paths instead of the previous `/connectors/` paths. This change ensures that users can access the appropriate documentation for each connector.

…or-links fix(docs): update connector links to use the correct documentation paths

…version chore: update Node.js version in GitHub Actions workflow from 18 to 20

…er (airweave-ai#1719) * refactor: consolidate SyncLifecycleService + SyncRecordService into unified SyncService * fix: restore Temporal workflow start for manual sync runs SourceConnectionService.run was missing the Temporal workflow invocation after creating a PENDING sync job, causing API-triggered syncs to remain stuck in PENDING. Also fixes ruff, mypy, and test failures from the unified SyncService refactor. * test: add unit tests for run, get_jobs, cancel_job, count_by_organization Covers the new sync lifecycle proxy methods on SourceConnectionService to improve diff-coverage for the PR. * test: cover force_full_sync, _resolve_collection, _resolve_connection Adds tests for the remaining uncovered paths in SourceConnectionService including error cases for missing collection/connection. * test: add comprehensive unit tests for SyncService lifecycle/CRUD methods Covers get, pause, resume, delete, resolve_destination_ids, trigger_run (validation paths), get_jobs, cancel_job (success/not-found/wrong-status/ temporal-failure/workflow-not-found), validate_force_full_sync, create (federated/no-schedule/with-cron), _resolve_cron, _validate_cron_for_source, _cancel_active_syncs, _wait_for_terminal, and _schedule_cleanup. * fix: use enum value for DB status update + add SyncService unit tests SyncStateMachine.transition was assigning the SyncStatus enum member (name='PAUSED') to the ORM column, but PostgreSQL expects the lowercase value ('paused'). Use .value for correct serialization. Also adds comprehensive unit tests covering get, pause, resume, delete, trigger_run, get_jobs, cancel_job, validate_force_full_sync, create, _resolve_cron, _validate_cron_for_source, and all private helpers. * fix: pass entity count fields through SourceConnectionJob mapping The refactored get_jobs() and cancel_job() methods were constructing SourceConnectionJob objects without entity count fields (entities_inserted, entities_updated, entities_deleted), causing them to always default to 0. This broke E2E tests that assert entities_inserted > 0 after a completed sync. * Complete SourceConnectionJob field mapping with duration_seconds, entities_failed, error_category * fix: add missing mock fields to _make_sync_job_schema The MagicMock-based sync job fixture was missing error_category and entities_skipped, causing Pydantic validation to fail when the service mapped the mock to SourceConnectionJob. * chore: remove audit script and output files from PR * fix: address PR airweave-ai#1719 review comments across sync and source_connection domains - SyncService.create: raise ValueError instead of returning None for federated sources and no-schedule cases - SyncService.delete: accept single sync_id instead of List[UUID] - SyncService.trigger_run: consolidate Temporal workflow start internally, accept collection/connection/force_full_sync params - SyncService.get: raise HTTPException(404) instead of ValueError - SyncService.get_jobs: push limit to DB query instead of in-memory slice - SyncService.cancel_job: add retry logic for Temporal cancellation - SourceConnectionService: remove temporal_workflow_service dependency, remove try/except around event bus publish, use NotFoundException in _resolve helpers - SourceConnectionUpdateService: delegate sync creation to sync_service.create instead of direct repo access - SourceConnectionDeletionService: simplify collection field access - CollectionService.delete: use asyncio.gather for per-sync deletion - OAuthCallbackService: fix resume reason string - FakeSyncService: add explicit SyncServiceProtocol inheritance - Update all protocols, fakes, and test files to match new signatures * Update backend/airweave/domains/collections/service.py Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com> * Update backend/airweave/domains/source_connections/service.py Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com> * fix: guard against scheduling syncs for federated search sources in update service Add upfront federated_search check in SourceConnectionUpdateService before calling sync_service.create, returning HTTP 400 instead of letting the ValueError propagate from SyncService. * fix: remove stray paren and lint violation breaking CI The asyncio.gather → sequential loop conversion left an unmatched ')' in CollectionService.delete, causing a SyntaxError that cascaded across import-linter, test, and test-public-api checks. Also breaks the _duration_seconds signature to satisfy ruff E501 (line too long). * fix: resolve mypy Column[UUID] and Liskov substitution violations - collections/service: use schema result.id (UUID) instead of ORM db_obj.id (Column[UUID]) for sync_service.delete and repo.remove - delete/update: validate ORM collection to CollectionRecord schema before passing .id to sync_service methods - jobs/repository: widen ctx parameter from ApiContext to BaseContext in get_all_by_sync_id to satisfy the protocol contract * fix: guard sync creation for null-schedule case and fix test mocks - create.py: skip sync_service.create when schedule.cron is explicitly null and sync_immediately is false (prevents ValueError → 500) - delete.py: inline model_validate call to satisfy ruff format - test_delete/test_update: add required CollectionRecord fields to mock collection objects so model_validate succeeds * fix: ruff format and mypy str|None for sync_service.create name arg Inline resolve_destination_ids calls (ruff format) and use `obj_in.name or entry.name` to guarantee str for the name parameter. * fix: remove redundant sync_id ownership check from cancel_job The job lookup in SyncService.cancel_job is already org-scoped via ctx, so the extra sync_id check in SourceConnectionService was unnecessary. Simplifies cancel_job to delegate directly to the sync service. --------- Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

* feat: add Custom auth provider for customer-hosted token endpoints Adds a new "custom" auth provider that calls a user-hosted HTTP endpoint to fetch fresh access tokens before each sync. This enables clients who manage their own token lifecycle (e.g. Mozilla) to expose a single endpoint instead of injecting expiring OAUTH_TOKEN values. Backend: - CustomAuthConfig with endpoint_url (SSRF-validated), auth method, and auth value fields stored encrypted on the provider connection - CustomAuthProvider implementation with POST-based token fetching, configurable auth (bearer/api_key_header/none), and full error mapping - Registered in ALL_AUTH_PROVIDERS with empty BLOCKED_SOURCES - Unit tests for create, headers, get_creds_for_source, and validate Frontend: - Fix AuthProviderTable to support multiple connections per provider type (was using .find(), now uses .filter() with list view) - New AuthProviderConnectionsList component for browsing connections - Connection count badge on AuthProviderButton when count > 1 - Custom provider icon (plug + arrow) for light and dark themes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: add request-time SSRF protection, simplify validate contract, clean up debug logs Add validate_url() check before every HTTP request in CustomAuthProvider to prevent DNS rebinding attacks against the user-supplied endpoint URL. Simplify validate() to just check for 2xx at base URL instead of requiring access_token at /__validate__. Document the full endpoint contract in CustomAuthConfig. Remove console.log debug statements and unused import from AuthProviderTable. * fix: remove remaining debug console.log statements from auth provider UI * fix: route auth provider connections through correct token provider and error classification Two bugs found during Custom auth provider e2e testing: 1. lifecycle._get_auth_configuration required both readable_auth_provider_id AND auth_provider_config to route to the auth provider path. Custom provider has no config, so it fell through to the DB credential path and failed with "no integration credential". 2. lifecycle._resolve_token_provider only used AuthProviderTokenProvider for OAuth sources. Non-OAuth sources (GitHub, Stripe) got a DirectCredentialProvider even when credentials came from an auth provider, causing errors to be classified as API_KEY_INVALID instead of AUTH_PROVIDER_CREDENTIALS_INVALID. 3. AuthProviderTokenProvider._fetch_token hardcoded "access_token" as the expected field. Non-OAuth sources use different field names (e.g. personal_access_token for GitHub). Now extracts the first matching runtime auth field from the response. * fix: improve error handling for auth provider credential failures - Handle 404 from Custom endpoint as AuthProviderMissingFieldsError with a clear message instead of leaking raw httpx error - Catch remaining HTTP status codes as AuthProviderConfigError instead of re-raising raw httpx exceptions - Wrap AuthProviderError from credential fetching (step 2) as SourceValidationError so the error classifier can process it - Add AuthProviderError catch-all in classifier to map all auth provider failures to AUTH_PROVIDER_CREDENTIALS_INVALID * fix: scope Custom auth provider credentials by source connection ID Custom endpoint now calls GET {base_url}/{source_connection_id} instead of GET {base_url}/{source_short_name}, allowing customers to return different credentials per source connection. Added optional source_connection_id parameter through the base class, lifecycle, and token provider chain. Custom provider requires it and errors if missing. * feat: normalize Custom auth provider field names and simplify customer contract Customers return {"access_token": "..."} or {"api_key": "..."} — the Custom provider maps to Airweave-internal field names (e.g. personal_access_token for GitHub, api_token for Document360). Also treat refresh_token, client_id, client_secret as optional for auth provider flows since the provider handles token lifecycle. * fix: treat OAuth lifecycle fields as optional in token provider path too The previous fix only applied _AUTH_PROVIDER_OPTIONAL in the lifecycle path. The token provider's _call_provider_with_retry also calls get_creds_for_source with the raw optional fields, causing Dropbox syncs to fail demanding refresh_token. * fix: harden Custom auth provider security, clean up debug logs, improve UI - Block SSRF via redirect by setting follow_redirects=False on httpx clients - Fix stale docstrings referencing source_short_name instead of source_connection_id - Deduplicate AUTH_PROVIDER_OPTIONAL_FIELDS into shared constant in _base.py - Type AuthProviderConnectionsList props with proper AuthProvider/AuthProviderConnection types - Remove 19 debug console.log statements across frontend auth provider components - Add "View endpoint contract" tooltip in Custom provider configure UI - Add tests verifying follow_redirects=False is enforced * feat: gate Custom auth provider behind custom_auth_provider feature flag Add feature_flag support to the @auth_provider decorator, mirroring the existing pattern for sources. Providers with a feature flag are hidden from list_metadata unless the organization has the flag enabled. * fix: use set() instead of list for runtime_auth_optional_fields in test The | operator requires set-like types on both sides. The test was using [] but AUTH_PROVIDER_OPTIONAL_FIELDS is a frozenset. * refactor: extract HTTP status classification to reduce complexity Moves the HTTP status code → exception mapping out of get_creds_for_source into _raise_for_http_status to satisfy ruff C901 (complexity 13 > 10). * fix: resolve mypy violations in diff - Declare endpoint_url/api_key as instance attrs on CustomAuthProvider - Type credentials param as Optional[Dict] instead of Optional[Any] - Cast BaseContext to ApiContext at _get_auth_configuration call site - Use str() to satisfy return type on credential extraction - Remove dead _replace_fields call in test * fix: block ctti source in Custom provider and update e2e test allowlist Custom provider had empty BLOCKED_SOURCES, so ctti (internal test source) showed it as a supported provider. Also add 'custom' to the valid auth provider names in the smoke test. * fix: add source_connection_id param to Composio and Pipedream providers The base class and callers now pass source_connection_id as a keyword argument. Composio and Pipedream need to accept it in their signatures even though they don't use it. * refactor: rename endpoint_url to base_endpoint_url and type create() Address PR review: rename the Custom auth provider's URL field to be more descriptive, and parse credentials into CustomAuthConfig inside create() for typed attribute access. --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mypy and import-linter used virtualenvs-create: false, so poetry run couldn't find console scripts like diff-quality. Switch all three jobs to the same pattern: in-project venv with caching. Broken since Mar 16 when diff-quality was added to the mypy job without updating its venv config.

* feat: add Mistral LLM adapter with Large and Small model support Adds MistralLLM provider using the native mistralai SDK with json_schema structured output and OpenAI-compatible tool/function calling. * fix: register MistralLLM in container factory provider_classes * feat: add thinking/reasoning support to Mistral adapter - Parse ThinkChunk content blocks from Magistral and Mistral Small 4 - Pass reasoning_effort param for adjustable reasoning models - Add MAGISTRAL_SMALL model, update MISTRAL_SMALL thinking config - Add 4 new tests for thinking extraction and reasoning_effort * fix: suppress thinking output when caller did not request it * refactor: address review feedback on Mistral adapter - Type `_extract_text` / `_extract_text_and_thinking` with `AssistantMessageContent | Unset | None` instead of `Any`. - Move the empty-content guard to run after text extraction so ThinkChunk-only responses (reasoning models) are caught by the same transient-retry path. - Drop the `str(raw_content)` fallback that could hand non-JSON to the parser. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ity-venv fix(ci): use in-project venvs for mypy and import-linter jobs

The state machine forbids PENDING → CANCELLING, but the cancel flow assumed every job passed through CANCELLING on the way to CANCELLED. A PENDING job hitting the API `cancel_job` endpoint would attempt an invalid transition. Split the cancel path by current status: PENDING jobs now go straight to CANCELLED, while RUNNING jobs keep the two-phase CANCELLING → CANCELLED flow. The orchestrator and workflow both attempt the same transitions defensively — the orchestrator acts eagerly inside the activity, the workflow retries via a shielded activity as a safety net. Duplicate arrivals are harmless because the state machine treats same-state writes as no-ops. Shielded (cancel-path) transition failures are now logged at debug rather than warning, since the orchestrator already handled them. The CANCELLED transition failure in the orchestrator keeps warning level — it means the job reached an unexpected terminal state (COMPLETED/FAILED) during cancellation.

The `scan-and-attest` job used implicit `if: success()` on every step after Trivy, so a critical vulnerability finding would skip SBOM generation, Grype scanning, attestations, and artifact uploads entirely. Chain each step off its real dependencies with `!cancelled()` and outcome checks instead, so the pipeline produces as many supply-chain artifacts as possible regardless of individual scan failures. Also fix the Trivy SARIF upload condition — the previous `steps.trivy.outputs.sarif` reference checked an output the action never exposes, silently skipping every upload. Use `steps.trivy.outcome != 'skipped'` to match the file-based contract documented upstream. Other changes: add Grype SARIF upload to Code Scanning, add `only-fixed` to Grype so unfixable criticals do not gate the build, and let `package-vespa` run when scans fail since it is independent of container vulnerability results.

…ng-sync-jobs fix: cancel path for pending sync jobs

…attest-resilience fix(ci): harden scan-and-attest resilience

- Fix regression where _expand_sp_site_groups and _fetch_sp_group_viewers silently early-return when self._site_url is empty, breaking the multi-site sync mode (getAllSites). Track SP groups per site URL. - Drop role principals (PrincipalType=16, rolemanager|spo-grid-all-users/...) instead of emitting them as fake users. - Emit nested Entra groups (PrincipalType=4, federateddirectoryclaimprovider) as group-to-group memberships (entra:{guid}) rather than flattening to the group's email. - Parse LoginName via a strict regex for the i:0#.f|membership|<email> pattern only, removing the naive split("|")[-1] that accepted role claims. Cursor format for tracked_sp_groups migrated from List[str] to Dict[site_url, List[str]]; legacy list values are discarded defensively and re-collected on next full sync. Incremental and targeted sync paths still pass an empty site_url (noted as follow-up work).

Let deployments boot without any chain-provider API key: when the chain resolves to zero providers, inject an UnavailableLLM null-object instead of raising. Instant search keeps working; classic and agentic search surface LLMUnavailableError on first use, mapped to HTTP 503 via the existing handler. Also expose LLM_FALLBACK_CHAIN as an env var (format: "provider:model,provider:model") so private-cloud deployers can override the chain without forking. Unset falls back to the current hardcoded [together:zai-glm-5, anthropic:claude-sonnet-4.6] default.

Felix flagged the legacy single-site SP token provider as having no callers. Confirmed via grep — only the base declaration and two subclass overrides existed, nothing calling them. Removed all three, plus the likewise-unused _derive_sp_resource_scope helper on the OAuth subclass and the stale base-class docstring reference. _make_sp_token_provider_for_site(site_url) is the sole entry point.

- Derive env-var list in LLMUnavailableError from PROVIDER_API_KEY_SETTINGS so the default message stays in sync when providers are added. - Extract _unavailable() helper in _build_llm_chain to dedupe the two UnavailableLLM return paths and their log lines. - Narrow UnavailableLLM.model_spec return type from Any to LLMModelSpec. - Hoist provider/model lookup dicts to module scope in search config, surface accepted values in enum-declaration order in error messages, and note that SearchConfig.LLM_FALLBACK_CHAIN is evaluated once at class-definition time.

close() is typed -> None, so 'assert await llm.close() is None' trips mypy's func-returns-value check. The test's intent is that close() is a safe no-op (does not raise), which a bare await already covers.

pyproject.toml declared both weaviate (^0.1.2) and weaviate-client (^4.10.2). The former is a PyPI placeholder ("A placeholder package for the Weaviate name") that ships its own weaviate/__init__.py, colliding with the real client. Poetry 2.x now fails the Docker image build with 'Installing weaviate/__init__.py over existing file'; 1.x silently tolerated the overwrite. No code imports weaviate — the only repo reference is a URL string in a test. Removing the placeholder unbreaks the container build.

…oup-expansion fix(sharepoint_online): correct SP site group expansion

- parse_llm_fallback_chain now validates (provider, model) pairs against MODEL_REGISTRY, so combos like together:mistral-large fail at startup with an actionable error instead of crashing later in the factory. - Move the detailed env-var message out of core.exceptions (which had to import from adapters.llm.registry) into UnavailableLLM itself. The exception keeps a generic default pointing at the docs. - Drop the unreachable empty-chain guard in _create_search_services — the parser always returns a non-empty list. - Test the env-var-name assertion dynamically against PROVIDER_API_KEY_SETTINGS so it stays in sync with the registry. - Add a 'Configuring the LLM provider chain' section to the Search docs covering API keys, LLM_FALLBACK_CHAIN format, and fallback semantics, so the exception's doc reference has a real destination.

@hiddeco

Apply @hiddeco's suggestions: - Lift 'self-hosted only' caveat into a Fern Callout. - Tighten 'Out of the box, Airweave...' phrasing. - Rework 'first provider with an API key set that responds...' for clarity. - Reword 'Misconfiguration is caught at startup...'. - Capitalize 'Classic/Agentic' in the fallback-semantics bullet to match the rest of the doc.

…urable-llm-chain feat: lazy + configurable LLM fallback chain

An "anyone in your organization with the link" share is a capability, not an ACL: a user must possess the link URL to access the file. Previously, ``extract_access_control`` mapped ``link.scope == "organization"`` to ``AccessControl.is_public = True``, causing the search broker to skip all viewer/group checks for any file with such a link. Net effect of the bug: any file that ever had an org-scoped share link created became visible in search to every authenticated user in the tenant — including files on private sites with restricted folder permissions, where only specific users / groups should have access. Reported by Mistral on a private site whose files appeared in search results for unrelated tenant users. Fix: only ``link.scope == "anonymous"`` flips ``is_public``. Org-scoped link permissions are now skipped — their audience is already represented via the ``SharingLinks.<itemId>.<scope><role>.<linkId>`` SP site group that SharePoint creates per link, which the connector tracks separately and resolves through ACL membership at search time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…g-link-not-public fix(sharepoint_online): organization-scoped sharing links no longer bypass access control

…nket-attach The SharePoint Online connector blanket-attached every site group returned by ``/_api/web/sitegroups`` to every file in the site (``_fetch_sp_group_viewers`` plus the merge block in ``_full_sync``). That over-granted in three ways: 1. ``SharingLinks.<otherFile>.…`` system groups (one per sharing link in the site) attached to every file, so a user who redeemed a link for *one* file gained access to *every* file in the site. 2. Custom site groups (e.g. an admin's ``engineering-test`` group granted only on a specific subfolder) attached to every file, so members saw files outside the folder they were granted on. 3. Default ``Members`` / ``Owners`` / ``Visitors`` and any ``Limited Access System Group`` re-attached to files where the admin had broken inheritance and removed those very groups, undoing deliberate access tightening. Fix --- Trust the per-item Graph permissions response, which already returns the correct set of principals for both intact and broken inheritance. The one thing Graph does not represent as a ``siteGroup`` grant is the sharing-link audience: it returns a ``link`` permission instead, while SharePoint internally tracks redeemers as members of a ``SharingLinks.<itemId>.<scopeRole>.<linkId>`` site group. We translate that ``link`` permission into the matching SharingLinks group viewer on that one file, scoped by the file's SharePoint UniqueId. Changes ------- - ``acl.py``: new ``link_permission_to_sp_group_viewer`` helper. Empirically verified scope/type → group-suffix mapping (organization+edit → OrganizationEdit, organization+view → OrganizationView, users+edit / users+view both → Flexible). ``extract_access_control`` grows an optional ``sp_unique_id`` parameter; when present, non-anonymous link permissions are appended as the per-link site-group viewer. - ``client.py``: new ``GraphClient.get_item_sp_unique_id`` — fetches just ``sharepointIds.listItemUniqueId`` for an item. - ``source.py``: deletes ``_fetch_sp_group_viewers`` and the merge block. Each file-build path now does a conditional UniqueId lookup only when the item's permissions actually contain a ``link`` block (most files have none, so the per-file cost is zero by default). - ``builders.py``: ``build_file_entity`` plumbs ``sp_unique_id`` through. Tests ----- 20 unit tests (existing + new) covering scope/type → group-suffix mapping, the anonymous → ``is_public`` short-circuit (no derived viewer), missing sp_unique_id / link_id / unknown scope returning ``None``, and the combined "link + explicit grants" payload. Verification ------------ End-to-end against neenacorp.sharepoint.com with five test fixtures designed to expose each over-grant mode independently. Pre/post viewer shrink confirmed (e.g. jean-doc.txt 8 viewers → 3, eng-doc.txt 7 → 3). Search-as-user matrix shows the SharingLinks redeemer (acl_test_user1) went from seeing all 4 fixture files to seeing only the file the link points to; the custom-group member (acl_test_user3) went from all 4 to just the file the group was granted on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…anket-attach fix(sharepoint_online): scope sharing-link viewers per-file, drop blanket-attach

… into per-user memberships The "Everyone except external users" claim (``c:0-.f|rolemanager|spo-grid-all-users/<tenantId>``) is the default audience SharePoint adds to most sites' Members and Visitors groups — i.e. the standard permission model on classic team sites and communication sites. SharePoint returns it as a member of site groups with ``PrincipalType=4`` (same as Entra federated groups), but with the ``rolemanager`` claim provider rather than ``federateddirectoryclaimprovider``. Our existing PT=4 branch in ``_parse_sp_group_member`` only matched the Entra shape, so the claim hit ``return None`` and was silently dropped. Net effect: any file accessible only via a site group that contained this claim was invisible in search to internal users without an alternate access path. No log line, no warning — just a silent drop. Fix --- Translate the claim into a synthetic group ``claim:everyone_except_external``, populated once per sync with the tenant's internal members (``userType eq 'Member'`` filter excludes B2B guests, preserving the claim's semantics). The broker's existing recursive group expansion then resolves user → claim:everyone_except_external → sp:<X> → file at search time with no broker, schema, or search-path changes. Also: log a single info-level line whenever we encounter a PT=4 LoginName we don't recognize, so future unknown claim shapes (custom rolemanager roles, legacy Windows claims, etc.) surface in operator logs without breaking sync. Changes ------- - ``source.py``: - ``EVERYONE_EXCEPT_EXTERNAL_PRINCIPAL`` synthetic group constant. - ``_EVERYONE_EXCEPT_EXTERNAL_LOGIN_RE`` regex. - ``_parse_sp_group_member`` recognizes the claim and returns the sentinel; docstring corrected (PT=4 with rolemanager, not PT=16). - ``_is_unrecognized_pt4_login`` classifier for diagnostic logging. - ``_expand_sp_site_groups`` flips ``_needs_internal_user_enum`` when the sentinel is yielded. - New ``_expand_everyone_except_external`` enumerates internal users via Graph and yields user → claim memberships once per sync. - ``generate_access_control_memberships`` runs the enumeration only when the flag was set — zero cost on tenants that don't use the claim. - ``client.py``: new ``GraphClient.list_internal_tenant_users`` paginates ``/users?$filter=userType eq 'Member'``. Tests ----- 6 new unit tests in ``test_sharepoint_online_group_expansion.py`` covering the claim regex (lower- and uppercase tenant GUIDs), the synthetic sentinel return, ``_is_unrecognized_pt4_login`` for both known and unknown PT=4 shapes, and the non-PT=4 case. Existing 488 source-unit tests still pass. Verification ------------ End-to-end against neenacorp.sharepoint.com on a communication site (M365-group-backed sites refuse the claim by org policy, so the bug surface needs a non-group site). Set up: a custom site group ``claim-test-group`` containing only the claim; a folder with broken inheritance granted only to that group; a file ``claim-doc.txt`` inside. Pre-fix: 0 membership rows for ``claim-test-group`` in the DB despite the claim being present in SharePoint. Three internal users (``lennert@``, ``acl_test_user1@``, ``acl_test_user5@``) returned 0 hits on a search for the file's canary token. Post-fix: 49 user → claim rows + 1 claim → SP-group row in the DB. Same three users now return 1 hit each. PR airweave-ai#2's regression matrix on the access-control-tests collection remains unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eryone-except-external fix(sharepoint_online): expand "Everyone except external users" claim into per-user memberships

orhanrauf and others added 30 commits April 10, 2026 17:24

Merge pull request airweave-ai#1753 from airweave-ai/docs/fix-connect…

e54db77

…or-links fix(docs): update connector links to use the correct documentation paths

chore: update Node.js version in GitHub Actions workflow from 18 to 20

c21ba64

Merge pull request airweave-ai#1755 from airweave-ai/chore/bump-node-…

fee1fb1

…version chore: update Node.js version in GitHub Actions workflow from 18 to 20

Merge pull request airweave-ai#1757 from airweave-ai/fix/ci-diff-qual…

27b448a

…ity-venv fix(ci): use in-project venvs for mypy and import-linter jobs

Merge pull request airweave-ai#1759 from airweave-ai/fix/cancel-pendi…

ca49bc2

…ng-sync-jobs fix: cancel path for pending sync jobs

Merge pull request airweave-ai#1761 from airweave-ai/fix/ci-scan-and-…

5ac45f6

…attest-resilience fix(ci): harden scan-and-attest resilience

fix(tests): drop redundant assertion on UnavailableLLM.close()

8c4bdd9

close() is typed -> None, so 'assert await llm.close() is None' trips mypy's func-returns-value check. The test's intent is that close() is a safe no-op (does not raise), which a bare await already covers.

Merge pull request airweave-ai#1760 from airweave-ai/fix/sp-online-gr…

f25ead8

…oup-expansion fix(sharepoint_online): correct SP site group expansion

Merge pull request airweave-ai#1762 from airweave-ai/feat/lazy-config…

32cfea0

…urable-llm-chain feat: lazy + configurable LLM fallback chain

Merge pull request airweave-ai#1774 from airweave-ai/fix/sp-online-or…

b1583b7

…g-link-not-public fix(sharepoint_online): organization-scoped sharing links no longer bypass access control

Merge pull request airweave-ai#1775 from airweave-ai/fix/sp-online-bl…

fbd6566

…anket-attach fix(sharepoint_online): scope sharing-link viewers per-file, drop blanket-attach

Merge pull request airweave-ai#1776 from airweave-ai/fix/sp-online-ev…

27b294f

…eryone-except-external fix(sharepoint_online): expand "Everyone except external users" claim into per-user memberships

bethh0rn added 5 commits May 11, 2026 15:29

Edit: Welcome

988035e

Edit: Quickstart

d3be983

Proofread: Concepts

02e0e63

Define RAG abbreviation (used later in Concepts)

6dc42a3

Proofread: Search

8fff722

sixhobbits force-pushed the trial-edit branch from 9150a5d to 8fff722 Compare May 11, 2026 13:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trial edit#1

Trial edit#1
bethh0rn wants to merge 35 commits into
mainfrom
trial-edit

bethh0rn commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

bethh0rn commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants