Skip to content

Trial edit#1

Open
bethh0rn wants to merge 35 commits into
mainfrom
trial-edit
Open

Trial edit#1
bethh0rn wants to merge 35 commits into
mainfrom
trial-edit

Conversation

@bethh0rn
Copy link
Copy Markdown

No description provided.

orhanrauf and others added 30 commits April 10, 2026 17:24
…eave-ai#1750)

* feat(search): add ACL filtering to V2 search + fix SP Online access control

- Add acl_principals param to VectorDB protocol and Vespa client
- Executor resolves user principals via AccessBroker, builds ACL YQL clause
- Admin search-as-user endpoints for all 3 tiers (instant/classic/agentic)
- Fail-closed transformer: AC sources without access data default to invisible
- SP Online: set access on site, drive, and page entities (drive root permissions)
- SP Online: remove dead build_item_entity code
- SP token exchange: implement OAuthTokenProvider.get_token_for_resource()
- OAuth2Service.exchange_token_for_scope() for resource-scoped token exchange
- Fix OAuth callback to pass stored config_fields to lifecycle.validate()
- Fix pre-existing D301 ruff error in vespa transformer

* feat(sharepoint): enforce required site_url + track Entra groups on all entity types

- Make site_url required in SharePointOnlineConfig (one site per connection)
- Fix validate_config to validate required fields even when config is empty
- Add _track_entity_groups for site, drive, and page entities (enables Entra group expansion)
- Update config SSRF tests for new required site_url behavior

* feat(sharepoint): switch to application-level client credentials auth

- Remove OAuth auth methods, use DIRECT only with client credentials
- Add SharePointOnlineAppAuthConfig (tenant_id, client_id, client_secret, private_key)
- Implement client_credentials token exchange for Graph API
- Implement certificate-based JWT assertion for SP REST API tokens
- Use getAllSites for complete site discovery (app permissions)
- Add Prefer: deltashowsharingchanges headers for permission change tracking in delta
- Make site_url optional (empty = sync all sites in tenant)
- Simplify all token providers, group expander, and _get() to single auth path

* fix(sharepoint): fix file downloads for client-credentials auth

DirectCredentialProvider has no get_token(), so FileService raised
"No access token available" for every file download. Two fixes:

1. Recognize SharePoint tempauth= URLs as pre-signed (skip auth header)
2. Fall back to a Graph bearer token via StaticTokenProvider when the
   download URL is a Graph API content endpoint

Tested: 18/18 files now sync successfully with client credentials.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(sharepoint): split into OAuth + client credentials sources with shared base

Refactors the single SharePointOnlineSource into two separate sources:
- sharepoint_online (OAuth, delegated user auth) — original behavior
- sharepoint_online_app (client credentials, app-only auth) — new

Both inherit from SharePointOnlineBase which contains all shared sync,
browse tree, file download, and ACL logic. Subclasses only override
auth-specific hooks: token acquisition, 401 handling, SP REST token
exchange, file download auth, and site discovery strategy.

Also fixes SP site group expansion by using the actual uploaded certificate
PEM for x5t thumbprint computation instead of generating a new self-signed
cert each call (which caused thumbprint mismatch with Azure AD).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Daan Manneke <daan@airweave.ai>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Updated the links for various connectors in the overview documentation to point to the correct `/docs/connectors/` paths instead of the previous `/connectors/` paths. This change ensures that users can access the appropriate documentation for each connector.
…or-links

fix(docs): update connector links to use the correct documentation paths
…version

chore: update Node.js version in GitHub Actions workflow from 18 to 20
…er (airweave-ai#1719)

* refactor: consolidate SyncLifecycleService + SyncRecordService into unified SyncService

* fix: restore Temporal workflow start for manual sync runs

SourceConnectionService.run was missing the Temporal workflow invocation
after creating a PENDING sync job, causing API-triggered syncs to remain
stuck in PENDING. Also fixes ruff, mypy, and test failures from the
unified SyncService refactor.

* test: add unit tests for run, get_jobs, cancel_job, count_by_organization

Covers the new sync lifecycle proxy methods on SourceConnectionService
to improve diff-coverage for the PR.

* test: cover force_full_sync, _resolve_collection, _resolve_connection

Adds tests for the remaining uncovered paths in SourceConnectionService
including error cases for missing collection/connection.

* test: add comprehensive unit tests for SyncService lifecycle/CRUD methods

Covers get, pause, resume, delete, resolve_destination_ids, trigger_run
(validation paths), get_jobs, cancel_job (success/not-found/wrong-status/
temporal-failure/workflow-not-found), validate_force_full_sync, create
(federated/no-schedule/with-cron), _resolve_cron, _validate_cron_for_source,
_cancel_active_syncs, _wait_for_terminal, and _schedule_cleanup.

* fix: use enum value for DB status update + add SyncService unit tests

SyncStateMachine.transition was assigning the SyncStatus enum member
(name='PAUSED') to the ORM column, but PostgreSQL expects the lowercase
value ('paused'). Use .value for correct serialization.

Also adds comprehensive unit tests covering get, pause, resume, delete,
trigger_run, get_jobs, cancel_job, validate_force_full_sync, create,
_resolve_cron, _validate_cron_for_source, and all private helpers.

* fix: pass entity count fields through SourceConnectionJob mapping

The refactored get_jobs() and cancel_job() methods were constructing
SourceConnectionJob objects without entity count fields (entities_inserted,
entities_updated, entities_deleted), causing them to always default to 0.
This broke E2E tests that assert entities_inserted > 0 after a completed sync.

* Complete SourceConnectionJob field mapping with duration_seconds, entities_failed, error_category

* fix: add missing mock fields to _make_sync_job_schema

The MagicMock-based sync job fixture was missing error_category and
entities_skipped, causing Pydantic validation to fail when the service
mapped the mock to SourceConnectionJob.

* chore: remove audit script and output files from PR

* fix: address PR airweave-ai#1719 review comments across sync and source_connection domains

- SyncService.create: raise ValueError instead of returning None for
  federated sources and no-schedule cases
- SyncService.delete: accept single sync_id instead of List[UUID]
- SyncService.trigger_run: consolidate Temporal workflow start internally,
  accept collection/connection/force_full_sync params
- SyncService.get: raise HTTPException(404) instead of ValueError
- SyncService.get_jobs: push limit to DB query instead of in-memory slice
- SyncService.cancel_job: add retry logic for Temporal cancellation
- SourceConnectionService: remove temporal_workflow_service dependency,
  remove try/except around event bus publish, use NotFoundException
  in _resolve helpers
- SourceConnectionUpdateService: delegate sync creation to
  sync_service.create instead of direct repo access
- SourceConnectionDeletionService: simplify collection field access
- CollectionService.delete: use asyncio.gather for per-sync deletion
- OAuthCallbackService: fix resume reason string
- FakeSyncService: add explicit SyncServiceProtocol inheritance
- Update all protocols, fakes, and test files to match new signatures

* Update backend/airweave/domains/collections/service.py

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

* Update backend/airweave/domains/source_connections/service.py

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

* fix: guard against scheduling syncs for federated search sources in update service

Add upfront federated_search check in SourceConnectionUpdateService
before calling sync_service.create, returning HTTP 400 instead of
letting the ValueError propagate from SyncService.

* fix: remove stray paren and lint violation breaking CI

The asyncio.gather → sequential loop conversion left an unmatched ')'
in CollectionService.delete, causing a SyntaxError that cascaded across
import-linter, test, and test-public-api checks. Also breaks the
_duration_seconds signature to satisfy ruff E501 (line too long).

* fix: resolve mypy Column[UUID] and Liskov substitution violations

- collections/service: use schema result.id (UUID) instead of ORM
  db_obj.id (Column[UUID]) for sync_service.delete and repo.remove
- delete/update: validate ORM collection to CollectionRecord schema
  before passing .id to sync_service methods
- jobs/repository: widen ctx parameter from ApiContext to BaseContext
  in get_all_by_sync_id to satisfy the protocol contract

* fix: guard sync creation for null-schedule case and fix test mocks

- create.py: skip sync_service.create when schedule.cron is explicitly
  null and sync_immediately is false (prevents ValueError → 500)
- delete.py: inline model_validate call to satisfy ruff format
- test_delete/test_update: add required CollectionRecord fields to mock
  collection objects so model_validate succeeds

* fix: ruff format and mypy str|None for sync_service.create name arg

Inline resolve_destination_ids calls (ruff format) and use
`obj_in.name or entry.name` to guarantee str for the name parameter.

* fix: remove redundant sync_id ownership check from cancel_job

The job lookup in SyncService.cancel_job is already org-scoped via ctx,
so the extra sync_id check in SourceConnectionService was unnecessary.
Simplifies cancel_job to delegate directly to the sync service.

---------

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
* feat: add Custom auth provider for customer-hosted token endpoints

Adds a new "custom" auth provider that calls a user-hosted HTTP endpoint
to fetch fresh access tokens before each sync. This enables clients who
manage their own token lifecycle (e.g. Mozilla) to expose a single
endpoint instead of injecting expiring OAUTH_TOKEN values.

Backend:
- CustomAuthConfig with endpoint_url (SSRF-validated), auth method, and
  auth value fields stored encrypted on the provider connection
- CustomAuthProvider implementation with POST-based token fetching,
  configurable auth (bearer/api_key_header/none), and full error mapping
- Registered in ALL_AUTH_PROVIDERS with empty BLOCKED_SOURCES
- Unit tests for create, headers, get_creds_for_source, and validate

Frontend:
- Fix AuthProviderTable to support multiple connections per provider type
  (was using .find(), now uses .filter() with list view)
- New AuthProviderConnectionsList component for browsing connections
- Connection count badge on AuthProviderButton when count > 1
- Custom provider icon (plug + arrow) for light and dark themes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: add request-time SSRF protection, simplify validate contract, clean up debug logs

Add validate_url() check before every HTTP request in CustomAuthProvider
to prevent DNS rebinding attacks against the user-supplied endpoint URL.
Simplify validate() to just check for 2xx at base URL instead of requiring
access_token at /__validate__. Document the full endpoint contract in
CustomAuthConfig. Remove console.log debug statements and unused import
from AuthProviderTable.

* fix: remove remaining debug console.log statements from auth provider UI

* fix: route auth provider connections through correct token provider and error classification

Two bugs found during Custom auth provider e2e testing:

1. lifecycle._get_auth_configuration required both readable_auth_provider_id
   AND auth_provider_config to route to the auth provider path. Custom
   provider has no config, so it fell through to the DB credential path
   and failed with "no integration credential".

2. lifecycle._resolve_token_provider only used AuthProviderTokenProvider
   for OAuth sources. Non-OAuth sources (GitHub, Stripe) got a
   DirectCredentialProvider even when credentials came from an auth
   provider, causing errors to be classified as API_KEY_INVALID instead
   of AUTH_PROVIDER_CREDENTIALS_INVALID.

3. AuthProviderTokenProvider._fetch_token hardcoded "access_token" as the
   expected field. Non-OAuth sources use different field names (e.g.
   personal_access_token for GitHub). Now extracts the first matching
   runtime auth field from the response.

* fix: improve error handling for auth provider credential failures

- Handle 404 from Custom endpoint as AuthProviderMissingFieldsError
  with a clear message instead of leaking raw httpx error
- Catch remaining HTTP status codes as AuthProviderConfigError instead
  of re-raising raw httpx exceptions
- Wrap AuthProviderError from credential fetching (step 2) as
  SourceValidationError so the error classifier can process it
- Add AuthProviderError catch-all in classifier to map all auth
  provider failures to AUTH_PROVIDER_CREDENTIALS_INVALID

* fix: scope Custom auth provider credentials by source connection ID

Custom endpoint now calls GET {base_url}/{source_connection_id} instead
of GET {base_url}/{source_short_name}, allowing customers to return
different credentials per source connection. Added optional
source_connection_id parameter through the base class, lifecycle, and
token provider chain. Custom provider requires it and errors if missing.

* feat: normalize Custom auth provider field names and simplify customer contract

Customers return {"access_token": "..."} or {"api_key": "..."} — the
Custom provider maps to Airweave-internal field names (e.g.
personal_access_token for GitHub, api_token for Document360).

Also treat refresh_token, client_id, client_secret as optional for auth
provider flows since the provider handles token lifecycle.

* fix: treat OAuth lifecycle fields as optional in token provider path too

The previous fix only applied _AUTH_PROVIDER_OPTIONAL in the lifecycle
path. The token provider's _call_provider_with_retry also calls
get_creds_for_source with the raw optional fields, causing Dropbox syncs
to fail demanding refresh_token.

* fix: harden Custom auth provider security, clean up debug logs, improve UI

- Block SSRF via redirect by setting follow_redirects=False on httpx clients
- Fix stale docstrings referencing source_short_name instead of source_connection_id
- Deduplicate AUTH_PROVIDER_OPTIONAL_FIELDS into shared constant in _base.py
- Type AuthProviderConnectionsList props with proper AuthProvider/AuthProviderConnection types
- Remove 19 debug console.log statements across frontend auth provider components
- Add "View endpoint contract" tooltip in Custom provider configure UI
- Add tests verifying follow_redirects=False is enforced

* feat: gate Custom auth provider behind custom_auth_provider feature flag

Add feature_flag support to the @auth_provider decorator, mirroring the
existing pattern for sources. Providers with a feature flag are hidden
from list_metadata unless the organization has the flag enabled.

* fix: use set() instead of list for runtime_auth_optional_fields in test

The | operator requires set-like types on both sides. The test was using
[] but AUTH_PROVIDER_OPTIONAL_FIELDS is a frozenset.

* refactor: extract HTTP status classification to reduce complexity

Moves the HTTP status code → exception mapping out of get_creds_for_source
into _raise_for_http_status to satisfy ruff C901 (complexity 13 > 10).

* fix: resolve mypy violations in diff

- Declare endpoint_url/api_key as instance attrs on CustomAuthProvider
- Type credentials param as Optional[Dict] instead of Optional[Any]
- Cast BaseContext to ApiContext at _get_auth_configuration call site
- Use str() to satisfy return type on credential extraction
- Remove dead _replace_fields call in test

* fix: block ctti source in Custom provider and update e2e test allowlist

Custom provider had empty BLOCKED_SOURCES, so ctti (internal test source)
showed it as a supported provider. Also add 'custom' to the valid
auth provider names in the smoke test.

* fix: add source_connection_id param to Composio and Pipedream providers

The base class and callers now pass source_connection_id as a keyword
argument. Composio and Pipedream need to accept it in their signatures
even though they don't use it.

* refactor: rename endpoint_url to base_endpoint_url and type create()

Address PR review: rename the Custom auth provider's URL field to be
more descriptive, and parse credentials into CustomAuthConfig inside
create() for typed attribute access.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mypy and import-linter used virtualenvs-create: false, so poetry run
couldn't find console scripts like diff-quality. Switch all three jobs
to the same pattern: in-project venv with caching. Broken since Mar 16
when diff-quality was added to the mypy job without updating its venv
config.
* feat: add Mistral LLM adapter with Large and Small model support

Adds MistralLLM provider using the native mistralai SDK with json_schema
structured output and OpenAI-compatible tool/function calling.

* fix: register MistralLLM in container factory provider_classes

* feat: add thinking/reasoning support to Mistral adapter

- Parse ThinkChunk content blocks from Magistral and Mistral Small 4
- Pass reasoning_effort param for adjustable reasoning models
- Add MAGISTRAL_SMALL model, update MISTRAL_SMALL thinking config
- Add 4 new tests for thinking extraction and reasoning_effort

* fix: suppress thinking output when caller did not request it

* refactor: address review feedback on Mistral adapter

- Type `_extract_text` / `_extract_text_and_thinking` with
  `AssistantMessageContent | Unset | None` instead of `Any`.
- Move the empty-content guard to run after text extraction so
  ThinkChunk-only responses (reasoning models) are caught by the
  same transient-retry path.
- Drop the `str(raw_content)` fallback that could hand non-JSON
  to the parser.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ity-venv

fix(ci): use in-project venvs for mypy and import-linter jobs
The state machine forbids PENDING → CANCELLING, but the cancel
flow assumed every job passed through CANCELLING on the way to
CANCELLED. A PENDING job hitting the API `cancel_job` endpoint
would attempt an invalid transition.

Split the cancel path by current status: PENDING jobs now go
straight to CANCELLED, while RUNNING jobs keep the two-phase
CANCELLING → CANCELLED flow. The orchestrator and workflow both
attempt the same transitions defensively — the orchestrator acts
eagerly inside the activity, the workflow retries via a shielded
activity as a safety net. Duplicate arrivals are harmless because
the state machine treats same-state writes as no-ops.

Shielded (cancel-path) transition failures are now logged at
debug rather than warning, since the orchestrator already handled
them. The CANCELLED transition failure in the orchestrator keeps
warning level — it means the job reached an unexpected terminal
state (COMPLETED/FAILED) during cancellation.
The `scan-and-attest` job used implicit `if: success()` on
every step after Trivy, so a critical vulnerability finding
would skip SBOM generation, Grype scanning, attestations,
and artifact uploads entirely. Chain each step off its real
dependencies with `!cancelled()` and outcome checks instead,
so the pipeline produces as many supply-chain artifacts as
possible regardless of individual scan failures.

Also fix the Trivy SARIF upload condition — the previous
`steps.trivy.outputs.sarif` reference checked an output the
action never exposes, silently skipping every upload. Use
`steps.trivy.outcome != 'skipped'` to match the file-based
contract documented upstream.

Other changes: add Grype SARIF upload to Code Scanning, add
`only-fixed` to Grype so unfixable criticals do not gate the
build, and let `package-vespa` run when scans fail since it
is independent of container vulnerability results.
…ng-sync-jobs

fix: cancel path for pending sync jobs
…attest-resilience

fix(ci): harden scan-and-attest resilience
- Fix regression where _expand_sp_site_groups and _fetch_sp_group_viewers
  silently early-return when self._site_url is empty, breaking the
  multi-site sync mode (getAllSites). Track SP groups per site URL.
- Drop role principals (PrincipalType=16, rolemanager|spo-grid-all-users/...)
  instead of emitting them as fake users.
- Emit nested Entra groups (PrincipalType=4, federateddirectoryclaimprovider)
  as group-to-group memberships (entra:{guid}) rather than flattening to
  the group's email.
- Parse LoginName via a strict regex for the i:0#.f|membership|<email>
  pattern only, removing the naive split("|")[-1] that accepted role
  claims.

Cursor format for tracked_sp_groups migrated from List[str] to
Dict[site_url, List[str]]; legacy list values are discarded defensively
and re-collected on next full sync.

Incremental and targeted sync paths still pass an empty site_url (noted
as follow-up work).
Let deployments boot without any chain-provider API key: when the chain resolves
to zero providers, inject an UnavailableLLM null-object instead of raising.
Instant search keeps working; classic and agentic search surface
LLMUnavailableError on first use, mapped to HTTP 503 via the existing handler.

Also expose LLM_FALLBACK_CHAIN as an env var (format:
"provider:model,provider:model") so private-cloud deployers can override the
chain without forking. Unset falls back to the current hardcoded
[together:zai-glm-5, anthropic:claude-sonnet-4.6] default.
Felix flagged the legacy single-site SP token provider as having no
callers. Confirmed via grep — only the base declaration and two
subclass overrides existed, nothing calling them. Removed all three,
plus the likewise-unused _derive_sp_resource_scope helper on the OAuth
subclass and the stale base-class docstring reference.

_make_sp_token_provider_for_site(site_url) is the sole entry point.
- Derive env-var list in LLMUnavailableError from PROVIDER_API_KEY_SETTINGS
  so the default message stays in sync when providers are added.
- Extract _unavailable() helper in _build_llm_chain to dedupe the two
  UnavailableLLM return paths and their log lines.
- Narrow UnavailableLLM.model_spec return type from Any to LLMModelSpec.
- Hoist provider/model lookup dicts to module scope in search config,
  surface accepted values in enum-declaration order in error messages,
  and note that SearchConfig.LLM_FALLBACK_CHAIN is evaluated once at
  class-definition time.
close() is typed -> None, so 'assert await llm.close() is None' trips
mypy's func-returns-value check. The test's intent is that close() is a
safe no-op (does not raise), which a bare await already covers.
pyproject.toml declared both weaviate (^0.1.2) and weaviate-client
(^4.10.2). The former is a PyPI placeholder ("A placeholder package
for the Weaviate name") that ships its own weaviate/__init__.py,
colliding with the real client. Poetry 2.x now fails the Docker image
build with 'Installing weaviate/__init__.py over existing file'; 1.x
silently tolerated the overwrite.

No code imports weaviate — the only repo reference is a URL string in
a test. Removing the placeholder unbreaks the container build.
…oup-expansion

fix(sharepoint_online): correct SP site group expansion
- parse_llm_fallback_chain now validates (provider, model) pairs against
  MODEL_REGISTRY, so combos like together:mistral-large fail at startup
  with an actionable error instead of crashing later in the factory.
- Move the detailed env-var message out of core.exceptions (which had
  to import from adapters.llm.registry) into UnavailableLLM itself. The
  exception keeps a generic default pointing at the docs.
- Drop the unreachable empty-chain guard in _create_search_services —
  the parser always returns a non-empty list.
- Test the env-var-name assertion dynamically against
  PROVIDER_API_KEY_SETTINGS so it stays in sync with the registry.
- Add a 'Configuring the LLM provider chain' section to the Search docs
  covering API keys, LLM_FALLBACK_CHAIN format, and fallback semantics,
  so the exception's doc reference has a real destination.
Apply @hiddeco's suggestions:
- Lift 'self-hosted only' caveat into a Fern Callout.
- Tighten 'Out of the box, Airweave...' phrasing.
- Rework 'first provider with an API key set that responds...' for
  clarity.
- Reword 'Misconfiguration is caught at startup...'.
- Capitalize 'Classic/Agentic' in the fallback-semantics bullet to
  match the rest of the doc.
…urable-llm-chain

feat: lazy + configurable LLM fallback chain
An "anyone in your organization with the link" share is a capability,
not an ACL: a user must possess the link URL to access the file.
Previously, ``extract_access_control`` mapped ``link.scope == "organization"``
to ``AccessControl.is_public = True``, causing the search broker to skip
all viewer/group checks for any file with such a link.

Net effect of the bug: any file that ever had an org-scoped share link
created became visible in search to every authenticated user in the
tenant — including files on private sites with restricted folder
permissions, where only specific users / groups should have access.

Reported by Mistral on a private site whose files appeared in search
results for unrelated tenant users.

Fix: only ``link.scope == "anonymous"`` flips ``is_public``. Org-scoped
link permissions are now skipped — their audience is already represented
via the ``SharingLinks.<itemId>.<scope><role>.<linkId>`` SP site group
that SharePoint creates per link, which the connector tracks separately
and resolves through ACL membership at search time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…g-link-not-public

fix(sharepoint_online): organization-scoped sharing links no longer bypass access control
…nket-attach

The SharePoint Online connector blanket-attached every site group returned
by ``/_api/web/sitegroups`` to every file in the site (``_fetch_sp_group_viewers``
plus the merge block in ``_full_sync``). That over-granted in three ways:

1. ``SharingLinks.<otherFile>.…`` system groups (one per sharing link in the
   site) attached to every file, so a user who redeemed a link for *one*
   file gained access to *every* file in the site.
2. Custom site groups (e.g. an admin's ``engineering-test`` group granted
   only on a specific subfolder) attached to every file, so members saw
   files outside the folder they were granted on.
3. Default ``Members`` / ``Owners`` / ``Visitors`` and any
   ``Limited Access System Group`` re-attached to files where the admin
   had broken inheritance and removed those very groups, undoing
   deliberate access tightening.

Fix
---
Trust the per-item Graph permissions response, which already returns the
correct set of principals for both intact and broken inheritance.

The one thing Graph does not represent as a ``siteGroup`` grant is the
sharing-link audience: it returns a ``link`` permission instead, while
SharePoint internally tracks redeemers as members of a
``SharingLinks.<itemId>.<scopeRole>.<linkId>`` site group. We translate
that ``link`` permission into the matching SharingLinks group viewer
on that one file, scoped by the file's SharePoint UniqueId.

Changes
-------
- ``acl.py``: new ``link_permission_to_sp_group_viewer`` helper. Empirically
  verified scope/type → group-suffix mapping
  (organization+edit → OrganizationEdit, organization+view → OrganizationView,
  users+edit / users+view both → Flexible). ``extract_access_control`` grows
  an optional ``sp_unique_id`` parameter; when present, non-anonymous link
  permissions are appended as the per-link site-group viewer.
- ``client.py``: new ``GraphClient.get_item_sp_unique_id`` — fetches just
  ``sharepointIds.listItemUniqueId`` for an item.
- ``source.py``: deletes ``_fetch_sp_group_viewers`` and the merge block.
  Each file-build path now does a conditional UniqueId lookup only when
  the item's permissions actually contain a ``link`` block (most files
  have none, so the per-file cost is zero by default).
- ``builders.py``: ``build_file_entity`` plumbs ``sp_unique_id`` through.

Tests
-----
20 unit tests (existing + new) covering scope/type → group-suffix mapping,
the anonymous → ``is_public`` short-circuit (no derived viewer), missing
sp_unique_id / link_id / unknown scope returning ``None``, and the
combined "link + explicit grants" payload.

Verification
------------
End-to-end against neenacorp.sharepoint.com with five test fixtures
designed to expose each over-grant mode independently. Pre/post viewer
shrink confirmed (e.g. jean-doc.txt 8 viewers → 3, eng-doc.txt 7 → 3).
Search-as-user matrix shows the SharingLinks redeemer (acl_test_user1)
went from seeing all 4 fixture files to seeing only the file the link
points to; the custom-group member (acl_test_user3) went from all 4 to
just the file the group was granted on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anket-attach

fix(sharepoint_online): scope sharing-link viewers per-file, drop blanket-attach
… into per-user memberships

The "Everyone except external users" claim
(``c:0-.f|rolemanager|spo-grid-all-users/<tenantId>``) is the default
audience SharePoint adds to most sites' Members and Visitors groups —
i.e. the standard permission model on classic team sites and
communication sites. SharePoint returns it as a member of site groups
with ``PrincipalType=4`` (same as Entra federated groups), but with the
``rolemanager`` claim provider rather than ``federateddirectoryclaimprovider``.

Our existing PT=4 branch in ``_parse_sp_group_member`` only matched the
Entra shape, so the claim hit ``return None`` and was silently dropped.
Net effect: any file accessible only via a site group that contained
this claim was invisible in search to internal users without an
alternate access path. No log line, no warning — just a silent drop.

Fix
---
Translate the claim into a synthetic group
``claim:everyone_except_external``, populated once per sync with the
tenant's internal members (``userType eq 'Member'`` filter excludes
B2B guests, preserving the claim's semantics). The broker's existing
recursive group expansion then resolves
user → claim:everyone_except_external → sp:<X> → file at search time
with no broker, schema, or search-path changes.

Also: log a single info-level line whenever we encounter a PT=4
LoginName we don't recognize, so future unknown claim shapes
(custom rolemanager roles, legacy Windows claims, etc.) surface
in operator logs without breaking sync.

Changes
-------
- ``source.py``:
  - ``EVERYONE_EXCEPT_EXTERNAL_PRINCIPAL`` synthetic group constant.
  - ``_EVERYONE_EXCEPT_EXTERNAL_LOGIN_RE`` regex.
  - ``_parse_sp_group_member`` recognizes the claim and returns the
    sentinel; docstring corrected (PT=4 with rolemanager, not PT=16).
  - ``_is_unrecognized_pt4_login`` classifier for diagnostic logging.
  - ``_expand_sp_site_groups`` flips ``_needs_internal_user_enum`` when
    the sentinel is yielded.
  - New ``_expand_everyone_except_external`` enumerates internal users
    via Graph and yields user → claim memberships once per sync.
  - ``generate_access_control_memberships`` runs the enumeration only
    when the flag was set — zero cost on tenants that don't use the claim.
- ``client.py``: new ``GraphClient.list_internal_tenant_users`` paginates
  ``/users?$filter=userType eq 'Member'``.

Tests
-----
6 new unit tests in ``test_sharepoint_online_group_expansion.py``
covering the claim regex (lower- and uppercase tenant GUIDs), the
synthetic sentinel return, ``_is_unrecognized_pt4_login`` for both
known and unknown PT=4 shapes, and the non-PT=4 case. Existing 488
source-unit tests still pass.

Verification
------------
End-to-end against neenacorp.sharepoint.com on a communication site
(M365-group-backed sites refuse the claim by org policy, so the bug
surface needs a non-group site). Set up: a custom site group
``claim-test-group`` containing only the claim; a folder with broken
inheritance granted only to that group; a file ``claim-doc.txt`` inside.

Pre-fix: 0 membership rows for ``claim-test-group`` in the DB despite
the claim being present in SharePoint. Three internal users
(``lennert@``, ``acl_test_user1@``, ``acl_test_user5@``) returned 0 hits
on a search for the file's canary token.

Post-fix: 49 user → claim rows + 1 claim → SP-group row in the DB.
Same three users now return 1 hit each. PR airweave-ai#2's regression matrix on
the access-control-tests collection remains unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eryone-except-external

fix(sharepoint_online): expand "Everyone except external users" claim into per-user memberships
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants