
Security audit logging#85

Open
tombentley wants to merge 12 commits into kroxylicious:main from tombentley:audit-logging

Conversation

@tombentley (Member)

No description provided.

@tombentley tombentley requested a review from a team as a code owner December 3, 2025 00:59
@tombentley tombentley changed the title from "Initial proposal for audit logging" to "Security audit logging" Dec 3, 2025

Goals:

* enable users to _easily_ collect a _complete_ log of security-related events
@k-wall (Member), Dec 3, 2025

We should be clear that a complete log should include both the actions performed by the Kafka client and any (async) operations caused by the filters themselves.

@tombentley (Member Author), Dec 3, 2025

This is a good point.

I suppose for the purpose of being able to correlate with broker logs it would be better to know that a certain request originated in the proxy rather than with a client accessing the proxy. The alternative, of not audit logging proxy-originated requests, would be confusing at best, and possibly indistinguishable from log tampering to someone who was looking at the broker logs closely enough.

It should be noted that there can be things like queries to Authorizers which should not be logged, because they're not an attempt to perform the action being queried (e.g. implementing IncludeTopicAuthorizedOperations in a Metadata request).

So the answer to the question of "what to log?" isn't always "everything". I think if we tried to make it "everything" we could end up in a mire of event modelling for the many edge cases which in theory someone might care about distinguishing from each other, but in practice someone or something has to analyse those logs and draw conclusions. The closer we model the complex and evolving reality, the harder it is for someone to draw the correct conclusions, and the more we end up being constrained by the API aspect of this proposal.

Member Author

How do we allow for logging of events within plugins? The Authorization plugin provides a great example. The runtime doesn't really know about Authorizers in a deep way (to the runtime they're just plugins), but they're actually implementing logic which deserves specific audit logging. And ideally that logging would be consistent across Authorizer implementations (e.g. a Deny from the AclAuthorizer is the same as a Deny from an OpaAuthorizer).

One way to do this, I think, is for the Filter API to provide a method for logging an event. At the level of the Filter API we don't need to be prescriptive about what those events look like (we could just say java.lang.Record, so we know they're Jackson serializable). We're just promising that they'll be emitted to the same places as the events generated natively by the runtime, and with the right attributes (like the event time, the sessionId and, I guess, the filterId). The Authorization filter would then take on responsibility for calling that method. Crucially, the event classes could be defined alongside the Authorizer API, which is how we'd end up with consistency of the event schema across different Authorizer impls.
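To make the idea concrete, here is a minimal, hypothetical sketch (none of these names exist in the proposal or codebase) of an event record defined alongside the Authorizer API, plus a stand-in for the Filter-API logging method:

```java
// Hypothetical sketch only: illustrates event records defined alongside the
// Authorizer API, so every Authorizer implementation emits the same schema.
public class AuditEventSketch {

    // A shared, Jackson-serializable event schema (all names are illustrative).
    public record AuthorizationDenied(String principal, String op,
                                      String resourceType, String resourceName) {
    }

    // Stand-in for the method the Filter API might expose; the runtime would
    // attach the event time, sessionId and filterId before emitting.
    public interface AuditSink {
        void logAuditEvent(Record event);
    }

    public static void main(String[] args) {
        AuditSink sink = event -> System.out.println("audit: " + event);
        // An AclAuthorizer and an OpaAuthorizer would both emit this same record type:
        sink.logAuditEvent(new AuthorizationDenied("User:eve", "READ", "Topic", "my-topic"));
    }
}
```

The point of the sketch is that the event schema lives with the Authorizer API, not with any one implementation.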

- `resourceType` — The type of the resource (e.g. `Topic`)
- `resourceName` — The name of the resource (e.g. `my-topic`)
* `Read` — Emitted when a client successfully reads records from a topic. It is called `Read` rather than `Fetch` because it covers reads generally, including the `ShareFetch` API key. It will be possible to disable these events, because of the potential for high volume.
- `topicName` — The name of the topic.
Member

what about the audit of client ids, group names and possibly, transactional ids?

Member Author

None of those pertain to the record data itself. I suppose a bad actor might try (and possibly succeed) to use a transactional id of some other service to cause a kind of denial of service attack by fencing off the legitimate producer. Likewise with groups, maybe Eve can prevent processing of some partitions by getting them assigned to her rogue app. But those things just seem a bit far-fetched, so I'm not super-keen to go adding them up-front.

Member

None of those pertain to the record data itself.

why aren't we considering events such as resetting a consumer group offset a security event? Causing a consumer to skip a record or fetch a record twice seems very interesting.

Member Author

On the one hand you're right. Someone could use that as an attack vector in the right circumstances.

But I think there are lots of reasons not to go over-broad on what we're trying to cover:

  • For offset commit... well, it doesn't look like a terribly strong signal of something security-related going on. Clients commit offsets all the time. Re-processing happens and is not unusual. The one thing I can think of which could be a bit more specific is fetching from the start of the log. A data exfiltration might look like that. But even that is quite weak: consumers don't have to store their offsets in Kafka at all, so such a check is easily evaded.
  • The Kafka broker's logging already covers requests which get as far as the broker. All we need in the proxy is logging which allows correlation with that. This might be enough of a reason to scale back parts of this proposal.
  • I think we could come up with a security angle for most RPCs. People expect systems to work and anything which makes them not work could hypothetically manifest, at least, as an attack on the availability of the system/DoS. So then we would end up with an audit log that is really more like a protocol trace.
  • The more types of event you define the more API you're committing to. That inhibits our ability to evolve things in the future.
  • If we were going to implement something in the proxy we should use normal logging for protocol tracing. It doesn't really need to be an API, as is proposed here.
  • The more types of event we define, and the more data produced, the harder it is to analyse.
  • If you research what sorts of events and event categories SIEM systems are interested in, you find they're relatively coarse-grained.
  • We can always add more events in the future: We don't need to achieve perfect coverage in this proposal, so long as it's not too inflexible for the future.

Member Author

This might be enough of a reason to scale back parts of this proposal.

@k-wall I was thinking about what this would look like if we took the position of not logging all the details of requests and responses in the proxy, on the basis that those should be logged on the broker cluster if you want that kind of depth. We would still log all the runtime-local things, like connections, authentications, authorizations and so on, as described in this proposal. I think if we did that we could model events like this:

  1. RequestIngress (from client)
  2. RequestEgress (to broker)
  3. RequestInject (originator is a filter)
  4. RequestShortcircuit
  5. ResponseEgress (to the client)

If we took that position then we'd only need to log the correlationId, sessionId (and maybe the API key) for RequestEgress, because you could recover what was sent by correlation with the broker's kafka.request.logger logger. We could reduce the scope of this proposal, because we'd not end up with a "higher level" API that, for example, was trying to have a single read event which covered Fetch and ShareFetch. This seems to me to be a better decomposition into event types than what I've proposed.

Aside: This starts to feel like OTel traces and spans. However, it doesn't seem to be compatible with OTel. OTel (i.e. app-level) "requests" would tend to correspond with Kafka records. But you can't meaningfully propagate an OTel context kept within records with the events above because records can be batched together, so there's no single "parent span".
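A sketch of what the minimal RequestEgress event might carry under this scaled-back model, with just enough to correlate against the broker's request log (all names are hypothetical, not part of the proposal as written):

```java
public class EgressEventSketch {
    // Hypothetical minimal RequestEgress event: only the fields needed to
    // correlate with the broker's kafka.request.logger output.
    public record RequestEgress(String sessionId, int correlationId, short apiKey) {
    }

    public static void main(String[] args) {
        // In the Kafka protocol, apiKey 1 is Fetch.
        RequestEgress event = new RequestEgress("session-42", 17, (short) 1);
        System.out.println(event);
    }
}
```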


k-wall commented Dec 3, 2025

@tombentley thanks for getting this ball rolling.

@SamBarker (Member) left a comment

The scoped event model (proxysession) is efficient to produce but requires stateful stream processing to interpret — a consumer must join session-scoped events against ProxyStartup (and potentially a VirtualClusterStartup, so that we could log SASL inspection vs termination) to reconstruct basic context like wall-clock time. Standard SIEM ingest pipelines perform lookup enrichment against static reference data; they cannot perform this kind of temporal join across a live event stream.

This has compliance implications. NIST SP 800-92 and PCI DSS Requirement 10.3 both require that each audit record independently contain the event's timestamp, origin, and identity context. An event whose timestamp can only be derived by joining against a prior event in the same stream does not satisfy this requirement without preprocessing.

An audit log that requires a stream processor to be interpretable also has a more fundamental problem: who audits the stream processor? The integrity of the audit trail becomes dependent on the correctness of the join logic, which is an unappealing property for a security primitive.


Comment on lines +75 to +78
* `ClientSaslAuthFailure` — Emitted when a client completes SASL authentication unsuccessfully
- `attemptedAuthorizedId` — The authorized id the client attempted to use, if known.
* `ClientSaslAuthSuccess` — Emitted when a client completes SASL authentication successfully
- `authorizedId` — The authorised id
Member

What about mTLS auth events?

I can see there is a distinction between each in that security teams might want to know about each, but are they really different types of event?

Member Author

The model I'm aiming for is that we use different events which each have a well-defined schema. It seems to me that the schemas for mTLS and SASL are different. TLS certificates are quite complex things and are often controlled by a whole different set of processes within an org than SASL credentials are. There's also the fact that both can be in use at the same time. For these reasons I think it's clearer to treat them as separate things.

Comment on lines +79 to +86
* `OperationAllowed` — Emitted when an `Authorizer` allows access to a resource.
- `op` — The operation that was allowed (e.g. `READ`)
- `resourceType` — The type of the resource (e.g. `Topic`)
- `resourceName` — The name of the resource (e.g. `my-topic`)
* `OperationDenied` — Emitted when an `Authorizer` denies access to a resource.
- `op` — The operation that was denied (e.g. `READ`)
- `resourceType` — The type of the resource (e.g. `Topic`)
- `resourceName` — The name of the resource (e.g. `my-topic`)
Member

The proxy can only authoritatively attest to denials. When the proxy's authorizer permits an operation, the request is forwarded to the broker, which may still deny it. OperationDenied is therefore a definitive statement about the operation's fate, but the symmetrical event is not OperationAllowed — it is OperationForwarded. A complete audit trail requires correlating proxy OperationForwarded events with broker-side authorization logs.

Member Author

I see the point you're making and it seems reasonable. But I don't really like the terminology of "forwarding operations". We forward requests. Some requests represent (collections of) operations.

Comment on lines +75 to +78
* `ClientSaslAuthFailure` — Emitted when a client completes SASL authentication unsuccessfully
- `attemptedAuthorizedId` — The authorized id the client attempted to use, if known.
* `ClientSaslAuthSuccess` — Emitted when a client completes SASL authentication successfully
- `authorizedId` — The authorised id
Member

This pair of events conflates two very distinct things, given that the proxy works in two modes: SASL inspector and SASL terminator.

The audit event is either a witness (SASL inspection) or a decision (SASL termination). While those events look identical from a logging point of view, they would be interpreted quite differently by users of SIEM systems.

Member Author

This pair of events

I think you're referring to SASL success vs SASL failure?

Do we really need to export these distinctions outside the system, though? Once the user has chosen which mode they're going to use I don't really see that the distinction matters. Or at least not to the extent that the thing on the receiving end needs to tell the difference (because its handling will be different). What I mean is: I assume the user knows they decided to use inspector and can make their own inferences about the consequences of SASL events, or they know they chose terminator and can make different inferences.

Member

I think you're referring to SASL success vs SASL failure?

Yes.

I assume the user knows they decided to use inspector and can make their own inferences about the consequences of SASL events, or they know they chose terminator and can make different inferences.

I was assuming that the people consuming the audit log / SIEM system were distinct from the people administering the proxy, and thus don't know how the proxy is configured; any understanding of what was being reported based on the deployment context is gone.

The guidance from the NIST and PCI DSS specs is that events should be self-contained, so I think the witness vs arbiter distinction is important to the semantic understanding of the event.


k-wall commented Mar 2, 2026

I'm happy with the overall direction of the proposal. I think the idea of focusing on a key set of auditable events (read, write, etc.) rather than the full set of API keys is the right one. I do wonder if users might want the audit trail to include the allow/deny decisions of the broker too. This would be useful in the case where the user cannot change the Kafka broker (they are using a Kafka service of some kind), but we can always add this at a later point.

Signed-off-by: Tom Bentley <tbentley@redhat.com>
@tombentley tombentley requested a review from a team as a code owner March 9, 2026 05:07
@tombentley (Member Author)

@k-wall @SamBarker thanks for your reviews so far. I've substantially rewritten the proposal based on some of your feedback, and also more research. Please could you take another look?

Signed-off-by: Tom Bentley <tbentley@redhat.com>
@robobario (Member) left a comment

Thanks Tom, looks close

How did you get on with researching secure formats/signatures etc.? JWS was looking ugly since it was just a base64 blob, right? Do we have any requirements to make the messages tamperproof, or to provide evidence of authorship through signatures?

Comment on lines +255 to +261
* `addr`: A network address.
* `vc`: A virtual cluster, as defined in the proxy's configuration.
* `tc`: A target cluster, as defined in the proxy's configuration.
* `nodeId`: A Kafka server within a cluster.
* `topicName` and `topicId`: A topic within a cluster. Either can be used. Both should be included if known.
* `groupId`: A group within a cluster.
* `transactionalId`: A transactional id within a cluster.
Member

Are these going to be part of the API?

I wonder if we should include an ObjectRefBuilder interface on the AuditableActionBuilder so Filters could do something like:

aab.objectRefBuilder().withVirtualCluster(context.virtualClusterName())
       .withGroupId(groupId)
       .withCustomScopes(Map.of("x", "y")) // requires that "x" is not a well-known scope.
       .withCustomScopes("a", "b") // requires that "a" is not a well-known scope.
       .build()

Though if we did that, what if the Proxy introduces new well-known scopes in future? Maybe the packaging rule for custom scopes could/should be enforced harshly like:

logger = context.auditLogger("my.custom.package");
...
aab.objectRefBuilder()
       .withCustomScopes("a", "b") // explodes with error because a is not prefixed with "my.custom.package"
       .build()

Similar for the Context.

Member Author

In practice I expect it will be uncommon to need to use custom scopes. I think we could add a builder like you suggest later on if it turned out that they were common.

Member

I think I'm poking at a related thought with https://github.com/kroxylicious/design/pull/85/changes#r2928363739

Comment on lines +569 to +573
/**
* Allows the emitter to short-circuit the creation of an audit event
* if it has no interest in routing it.
*/
boolean isInterested(String action, @Nullable String status);
Member

just wondering what this is for. It sounds a bit like the AuditLogger should be able to interact with this to prevent constructing AuditEvents if no emitter is interested.

Maybe we should also offer Supplier style options like:

actionBuilder.withContext(this::buildContext)

to help Filters benefit from this short-circuiting

Member Author

The idea is that if no emitters are interested (based on the action and status) then we can avoid instantiating the event entirely. The JIT is likely to apply inlining and dead code elimination, so we should avoid the cost of the with*() and log() methods in the builder, together with the cost of creating their arguments. So for a relatively small fee we can make the auditing lighter weight when it's not used.

Member

I don't quite follow how this is meant to work. The proposal only defines isInterested on the emitter, but plugins don't interact with the emitter they interact with the AuditLogger which doesn't define an isInterested or related method.

if we surface isInterested or maybe isAuditable on the audit logger interface we can do

if (filterContext.auditLogger().isInterested("Write", null)) {
      filterContext.auditLogger().action("Write")
          .withObjectRef(...)
          .log();
  }

Which makes some sense to me. and the aggregated decision across emitters gets done by the audit logger.

However if we just leave it defined on AuditEmitter then I don't think it carries enough weight; avoiding serialisation costs etc. is purely an internal concern of the specific emitter.


* The null alternative: Do Nothing. This means users continue to have a poor and fragile experience which in itself could be grounds to not adopt the proxy.

* Just use the existing SLF4J application logging (e.g., with a logger named `audit` where all these events get logged). This approach would not:
Member

also gives you Kafka emission options (I saw this was debated earlier). You can wire up a log4j2 kafka appender and away you go.

Member Author

That would work for little effort. I dropped the Kafka emitter idea for now to make the proposal more focussed and easier to approve. If we need it, it can be added later.

@tombentley (Member Author)

How did you get on with researching secure formats/signatures etc.? JWS was looking ugly since it was just a base64 blob, right? Do we have any requirements to make the messages tamperproof, or to provide evidence of authorship through signatures?

We don't currently have that requirement, so it's "Future Work". This probably isn't the best place to go down the rabbit hole of options and trade-offs.

Signed-off-by: Tom Bentley <tbentley@redhat.com>
The items in the `audit` array work exactly like `filterDefinitions`.


### The `LoggingAuditEmitter` implementation
Member Author

@robobario I revised this bit last night, and added a metric emitter, if you want to take a closer look.

Comment on lines +680 to +683
* `kroxylicious_audit_ClientAuthenticate_success[virtual_cluster=my-cluster]`
* `kroxylicious_audit_ClientAuthenticate_failure[virtual_cluster=my-cluster]`
* `kroxylicious_audit_Write_success[virtual_cluster=my-cluster, topic_name=my-topic]`
* `kroxylicious_audit_Write_failure[virtual_cluster=my-cluster, topic_name=my-topic]`
Member Author

These examples show a tension: as described, the actions and object scopes use camel case, but for a metrics emitter we ought to follow Prometheus conventions and use snake case. That means this emitter would need to transform the former into the latter, which isn't always possible to do nicely if people use upper-case initialisms in their identifiers.

Assuming we did that then a further complication exists for the user: They see kroxylicious_audit_client_authenticate_failure increasing so they want to look at the logs to see what's going on. They need to know what they're searching for. They will need to convert kroxylicious_audit_client_authenticate_failure into "action":"ClientAuthenticate".*"status":, which is not as trivial a transformation as it could be.

So I wonder whether action names should be snake case?
That would make the package prefixing look a bit ugly, but I think that's more of a corner case TBH.
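For illustration, a naive camel-to-snake conversion (my sketch, not any proposed API) shows the initialism problem described above:

```java
public class MetricNames {
    // Naive camelCase -> snake_case conversion: insert '_' before each
    // upper-case letter (except the first) and lower-case everything.
    // Works for "ClientAuthenticate" but mangles initialisms like "OIDCLogin".
    public static String toSnakeCase(String action) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < action.length(); i++) {
            char c = action.charAt(i);
            if (Character.isUpperCase(c)) {
                if (i > 0) {
                    sb.append('_');
                }
                sb.append(Character.toLowerCase(c));
            }
            else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSnakeCase("ClientAuthenticate")); // client_authenticate
        System.out.println(toSnakeCase("OIDCLogin"));          // o_i_d_c_login -- the awkward case
    }
}
```

Handling initialisms "nicely" would need extra rules (run-of-capitals detection), which is exactly the kind of lossy transformation the comment is worried about.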


Using the above mapping, and the example `ClientAuthenticate` and `Write` actions described earlier, we could have meters like this:

* `kroxylicious_audit_ClientAuthenticate_success[virtual_cluster=my-cluster]`
Member

Maybe we should use labels for action and outcome?

kroxylicious_audit_outcome[action="ClientAuthenticate", outcome="success", virtual_cluster=my-cluster]

This would make it easier to build visuals on kroxylicious_audit_outcome breaking down by outcome or action.

Member

edit: might dodge the casing concerns too

Member Author

That has some appeal, but:

  • I find it a bit weird that the value of the action tag determines the other tags (but I'm not an SRE, so my opinion might not count for much).
  • It would prevent a call site from using an objectRef with a scope of action or outcome. I guess we could prefix all the scope tags with scope_, or postfix with _scope (which reads more nicely), at the cost of a few more bytes.

Member

I think Rob's suggestion of using labels for action and outcome is worth pursuing. Embedding the action name in the metric name is less idiomatic Prometheus and makes common queries unnecessarily awkward.

With the proposed approach, calculating the authentication failure rate as a proportion of all authentication requests requires summing the success and failure metrics for the denominator:

rate(kroxylicious_audit_ClientAuthenticate_failure[5m])
/ (rate(kroxylicious_audit_ClientAuthenticate_failure[5m]) + rate(kroxylicious_audit_ClientAuthenticate_success[5m]))

With action and outcome as labels:

rate(kroxylicious_audit_outcome{action="ClientAuthenticate", outcome="failure"}[5m])
/ rate(kroxylicious_audit_outcome{action="ClientAuthenticate"}[5m])

More importantly, cross-action queries become possible. For example, "what is the overall failure rate across all audit actions?":

rate(kroxylicious_audit_outcome{outcome="failure"}[5m])
/ rate(kroxylicious_audit_outcome[5m])

That query is impossible with action names embedded in metric names without knowing them all in advance.

The camelCase vs snake_case tension Tom identifies also largely disappears — label values don't need to follow Prometheus naming conventions, so action="ClientAuthenticate" is fine as-is.

On the tag name collision concern: suffixing resource-derived labels (e.g. vc_resource, topicName_resource) would sidestep any collision with action or outcome cleanly — though the exact suffix is dependent on resolving the naming question raised separately.

One thing worth calling out regardless of the naming approach: topicName as an objectRef scope in the metrics emitter is a cardinality risk. A cluster with many topics would produce a metric series per topic, which could cause real problems for Prometheus. The warning in the proposal about high-cardinality scopes is easy to miss — it might be worth making topicName explicitly opt-in rather than opt-out.

- action: ClientAuthenticate
logFailureAt: WARN
- action: Write
logAt: DEBUG
Member

so logAt is sugar for logFailureAt + logSuccessAt?

Member Author

Yes


* At a high level, each action is a fact describing **who** has attempted to do **what** on **which** object, **when** they did this, and **what** was the outcome.
Having a common structure makes it easy to perform common queries against actions without having to know the schema or semantics of each type of action. For example, it makes it easier to ask "what has Eve done?" if every action represents Eve's principal in the same way.
* Each audit action represents _one_ action on _one_ resource.
Member

Something Gemini spotted: we have a theoretical possibility of emitting events that are identical but describe two distinct actions. An external system wouldn't know if it was duplicated data or logically distinct events. Maybe we should consider including a unique identifier per audit action like:

UUID id()

Member Author

That's true. The time is likely, but definitely not certain, to be different. I'm also not sure, if all other fields are identical, whether we really need uniqueness. I mean, often you're looking for evidence that some event happened at all, or never happened. Or you're interested in how many times it happened. You don't need a unique ID for any of those things. In fact, I struggle to see why you would need it if you're not able to correlate an event's unique id with any other thing. Also it feels like something that an emitter could add itself. It doesn't seem to be necessary to make it part of the API.

So right now I'm inclined to leave it out. It can be added later if we need it.

@robobario (Member) left a comment

Looks good to me. Thanks @tombentley.

@SamBarker (Member) left a comment

Thanks for the rewrite Tom, this is a much stronger proposal and the who/what/which/when structure gives a solid foundation for a robust audit framework.

One meta thought: should we be publishing JSON Schema definitions for the audit events? It feels like a natural extension of the goal that events are a versioned API — a published schema makes the compatibility guarantees concrete and independently verifiable, and gives SIEM integrators something to validate against without needing to inspect the Java types.

There's a follow-on question of whether the schema should be authoritative over the Java implementation — types generated from schema rather than schema derived from types, as we do with the Kubernetes API. That's worth deciding here rather than deferring, as it would affect how the implementation is structured.
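For illustration only, a published schema for an action might look something like this hypothetical fragment (the field names follow the who/what/which/when structure; none of this is a schema actually defined in the proposal):

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "AuditableAction",
  "type": "object",
  "required": ["action", "time", "actor"],
  "properties": {
    "action": { "type": "string" },
    "time": { "type": "string", "format": "date-time" },
    "actor": { "type": "object" },
    "objectRef": { "type": "object" },
    "status": { "type": ["string", "null"] }
  }
}
```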

* When the event happened.
* @return The instant that the action happened.
*/
Instant time();
Member

I worry time is a bit ambiguous, especially for ingestion pipelines that might want to add their own times. How about

Suggested change
Instant time();
Instant occurredAt();

to make explicit that this is the time the action occurred at the proxy, disambiguating it from other timestamps that emitter implementations may wish to add (e.g. emittedAt, ingestedAt).

* machine-readable indication of the reason why the action was unsuccessful.
* This could be an exception class name or a Kafka error code
*/
String status();

Using null to signal success feels problematic: null typically means "absent" or "unknown" rather than a positive outcome, and it forces consumers to special-case null rather than matching on a meaningful value. It also leaks into the isInterested emitter API, where filtering for successes means checking for null rather than a positive value.

The conditionality here (status and reason are only meaningful on failure) is a natural fit for a sealed type:

sealed interface Outcome permits Success, Failure {}

record Success() implements Outcome {}

record Failure(String status, String reason) implements Outcome {}

AuditableAction would then have a single Outcome outcome() method:

interface AuditableAction {
    Outcome outcome();
    // ...
}

This:

  • encodes the success/failure distinction in the type system rather than in a nullability convention
  • eliminates the need for @Nullable annotations on status and reason
  • makes pattern matching on outcomes clean and exhaustive
  • serialises naturally with Jackson using @JsonTypeInfo
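To make the consumer side of that concrete, here is a minimal, self-contained sketch of the sealed Outcome type and how an emitter might render it; the class and method names are illustrative, not existing Kroxylicious API, and the Jackson annotations are omitted to keep it stdlib-only:

```java
// Hypothetical sketch of the proposed Outcome type. Names are illustrative
// and not taken from the Kroxylicious codebase.
public class OutcomeSketch {

    sealed interface Outcome permits Success, Failure {}

    record Success() implements Outcome {}

    record Failure(String status, String reason) implements Outcome {}

    // Render an outcome for a log line. The sealed hierarchy means a
    // consumer can handle both cases without any null checks.
    static String render(Outcome outcome) {
        if (outcome instanceof Failure f) {
            return "failure status=" + f.status() + " reason=" + f.reason();
        }
        return "success";
    }

    public static void main(String[] args) {
        System.out.println(render(new Success()));
        System.out.println(render(new Failure("SaslAuthenticationException", "invalid credentials")));
    }
}
```

Because the interface is sealed, adding a third outcome kind later is a compile-time visible change for every consumer that matches on it.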

The `ProxyActor` can be useful for actions which happen outside of the context of a Kafka request or response.
This could be used as the actor for applications startup and shutdown events.

The `ClientActor` can minimally represent a TCP client connected to the proxy by using the `attr` and `session` components.

I'm slightly confused, what attr and session components? They aren't in the interface as far as I can see and I don't remember seeing them in the codebase.

/**
* An actor representing this proxy instance
*/
public non-sealed interface ProxyActor extends Actor {

Seems a bit empty.

ProxyActor carries no identifying information, which means a consumer receiving a ProxyStart action has no way to identify which proxy instance it came from.

At minimum it should carry:

  • instanceName() — the user-configured name for this proxy instance
  • hostName() — the host it is running on

And arguably a process identifier (UUID or PID) to distinguish restarts on the same host. The earlier version of this proposal had processUuid on ProxyStartup for exactly this purpose — correlating all events from a single process lifetime.

*
* <p>Plugins providing their own coordinates should package-prefix their scope names.</p>
*/
Map<String, String> objectRef();

The flat Map<String, String> representation seems to me to conflate two concerns: the internal representation of object coordinates and the external representation emitted to consumers. You describe the scopes as nested (e.g. topicName is only meaningful within a vc) but the type doesn't enforce this — a plugin could emit { "topicName": "payments" } without a cluster scope and it would be silently ambiguous.

A tree structure would capture the containment relationship accurately in the Java API:

interface ObjectRef {
    String scope();
    String value();
    List<ObjectRef> children();
}

The emitter is then responsible for rendering this appropriately for its target. The LoggingAuditEmitter could flatten by joining parent and child scope keys with a separator:

{
  "vc": "my-cluster",
  "vc.topicName": "payments",
  "vc.topicId": "abc-123"
}

This gives SIEM consumers flat key-value pairs that are easy to query, while the internal representation correctly models the containment relationship. Other emitters could render the tree differently if they had reason to.
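The flattening step is simple enough to sketch; the ObjectRef record and the dotted-key format below are assumptions from this comment, not existing API:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of flattening an ObjectRef tree into the dotted keys
// shown in the JSON example above. ObjectRef is an assumed type, not
// existing Kroxylicious API.
public class ObjectRefFlattener {

    record ObjectRef(String scope, String value, List<ObjectRef> children) {}

    // Depth-first walk, joining parent and child scope names with '.'.
    static Map<String, String> flatten(ObjectRef ref) {
        Map<String, String> out = new LinkedHashMap<>();
        flattenInto(ref, "", out);
        return out;
    }

    private static void flattenInto(ObjectRef ref, String prefix, Map<String, String> out) {
        String key = prefix.isEmpty() ? ref.scope() : prefix + "." + ref.scope();
        out.put(key, ref.value());
        for (ObjectRef child : ref.children()) {
            flattenInto(child, key, out);
        }
    }

    public static void main(String[] args) {
        ObjectRef vc = new ObjectRef("vc", "my-cluster", List.of(
                new ObjectRef("topicName", "payments", List.of()),
                new ObjectRef("topicId", "abc-123", List.of())));
        System.out.println(flatten(vc));
        // {vc=my-cluster, vc.topicName=payments, vc.topicId=abc-123}
    }
}
```

Note the tree makes the ambiguous case unrepresentable: a topicName node can only exist as a child of some parent scope, so the flattened key always carries its containment.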

Wrapping in an ObjectRef type rather than exposing Map<String, String> directly on AuditableAction also preserves API evolution room, consistent with the approach taken for
Correlation and Context.

Comment on lines +606 to +608
A point of difference from other plugins is that the `AuditEmitter` has the `close()` method, and the factory does not.
This is because the emitters are intentionally instantiated once on application startup and will not be dynamically reconfigurable.
This absence of reconfigurability makes it harder for an attacker to turn the logging off during an attack.

While I think saying that audit logging is not re-configurable is a reasonable property, it's a fairly weak guarantee — anyone with enough access to re-configure the proxy probably has enough access to re-configure the underlying logging framework too. I'm not sure there's a solution that fully addresses that threat model, but I don't think non-reconfigurability alone gets us there.

Since the number of actions is expected to be small, it will use a different logger for each action, simply by prefixing the action name with
`audit.`.
For example `audit.ClientAuthenticate` will be the logger for all `ClientAuthenticate` actions.
This will allow users to easily use their logging configuration to filter out any actions they're not interested in.
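The per-action logger naming can be sketched as follows; the real emitter would resolve loggers through the proxy's logging stack (SLF4J/Log4j), and java.util.logging is used here only to keep the sketch self-contained:

```java
import java.util.logging.Logger;

// Minimal sketch of the per-action audit logger naming described above.
// loggerNameFor is a hypothetical helper, not existing Kroxylicious API.
public class PerActionLoggers {

    static String loggerNameFor(String action) {
        return "audit." + action;
    }

    public static void main(String[] args) {
        Logger logger = Logger.getLogger(loggerNameFor("ClientAuthenticate"));
        System.out.println(logger.getName()); // audit.ClientAuthenticate
        // A user could then silence a single action in logging configuration,
        // e.g. by setting the audit.ClientClose logger's level to OFF.
    }
}
```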

The proposal says the LoggingAuditEmitter will "render the actions as JSON and log them using the proxy's application logging stack" — but it's not clear whether this means pre-serializing to a JSON string and logging that, or using the SLF4J 2.0 structured logging API (LoggingEventBuilder.addKeyValue()) to pass structured key-value pairs to the appender.

The distinction matters in practice. If events are pre-serialized as a string, a JSON appender will wrap them as an opaque string value inside its own envelope — SIEM consumers end up with JSON-within-JSON that requires nested parsing to query. If structured key-value pairs are passed via the SLF4J API, a JSON appender can embed the fields natively and they become directly queryable.

There is a tension with the self-contained event goal however — structured SLF4J logging means the final schema is partly determined by the appender configuration, which could vary between deployments, whereas pre-serialization gives the emitter full control over the schema.

Which way are you leaning? I think the trade-offs of either approach are worth calling out here, whichever way we choose.

*
* <p>Plugins providing their own coordinates should package-prefix their scope names.</p>
*/
Map<String, String> objectRef();

Separately to the type discussion the naming here isn't consistent within the proposal itself — the field and type use objectRef, the configuration uses scopes (objectScopeMapping), and the prose alternates between "scopes", "coordinates" and "object references". Given these will all be public APIs and configuration, it's worth settling on a single term before they're committed to.

For what it's worth, resource has some appeal — it's the term Kafka itself uses in its authorisation model, it's unambiguous in an audit context, and resource_vc, resource_topicName as metric label names would read naturally to anyone familiar with Kafka.


Using the above mapping, and the example `ClientAuthenticate` and `Write` actions described earlier, we could have meters like this:

* `kroxylicious_audit_ClientAuthenticate_success[virtual_cluster=my-cluster]`

I think Rob's suggestion of using labels for action and outcome is worth pursuing. Embedding the action name in the metric name is less idiomatic Prometheus and makes common queries unnecessarily awkward.

With the proposed approach, calculating the authentication failure rate as a proportion of all authentication requests requires summing the success and failure metrics for the denominator:

rate(kroxylicious_audit_ClientAuthenticate_failure[5m])
/ (rate(kroxylicious_audit_ClientAuthenticate_failure[5m]) + rate(kroxylicious_audit_ClientAuthenticate_success[5m]))

With action and outcome as labels:

rate(kroxylicious_audit_outcome{action="ClientAuthenticate", outcome="failure"}[5m])
/ rate(kroxylicious_audit_outcome{action="ClientAuthenticate"}[5m])

More importantly, cross-action queries become possible. For example, "what is the overall failure rate across all audit actions?":

rate(kroxylicious_audit_outcome{outcome="failure"}[5m])
/ rate(kroxylicious_audit_outcome[5m])

That query is impossible with action names embedded in metric names without knowing them all in advance.

The camelCase vs snake_case tension Tom identifies also largely disappears — label values don't need to follow Prometheus naming conventions, so action="ClientAuthenticate" is fine as-is.

On the tag name collision concern: suffixing resource-derived labels (e.g. vc_resource, topicName_resource) would sidestep any collision with action or outcome cleanly — though the exact suffix is dependent on resolving the naming question raised separately.

One thing worth calling out regardless of the naming approach: topicName as an objectRef scope in the metrics emitter is a cardinality risk. A cluster with many topics would produce a metric series per topic, which could cause real problems for Prometheus. The warning in the proposal about high-cardinality scopes is easy to miss — it might be worth making topicName explicitly opt-in rather than opt-out.
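A toy model makes the aggregation argument concrete. In a real deployment Prometheus (or Micrometer) would hold these series; this self-contained sketch just shows why cross-action queries fall out naturally once action and outcome are labels rather than parts of the metric name:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the label-based metric shape suggested above: one counter
// family keyed by (action, outcome) labels, instead of one metric name per
// action. Class and method names are illustrative only.
public class LabelledCounters {

    record Labels(String action, String outcome) {}

    private final Map<Labels, Long> counts = new HashMap<>();

    void increment(String action, String outcome) {
        counts.merge(new Labels(action, outcome), 1L, Long::sum);
    }

    // Equivalent of the PromQL query above: sum over all actions without
    // needing to know their names in advance.
    double overallFailureRatio() {
        long failures = 0, total = 0;
        for (Map.Entry<Labels, Long> e : counts.entrySet()) {
            total += e.getValue();
            if (e.getKey().outcome().equals("failure")) {
                failures += e.getValue();
            }
        }
        return total == 0 ? 0.0 : (double) failures / total;
    }

    public static void main(String[] args) {
        LabelledCounters c = new LabelledCounters();
        c.increment("ClientAuthenticate", "success");
        c.increment("ClientAuthenticate", "failure");
        c.increment("Write", "success");
        c.increment("Write", "success");
        System.out.println(c.overallFailureRatio()); // 0.25
    }
}
```

With per-action metric names, the same computation would require enumerating every action name in the query, which is exactly the problem identified above.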

* `ClientClose` -- a client connection being closed
* `ServerConnect` -- the proxy connecting to a server
* `ServerClose` -- a server connection being closed
* `ClientAuthenticate` -- a client's subjects changing (e.g. via `FilterContext.clientSaslAuthenticationSuccess()` or successful TLS handshake) or an authentication failure (e.g. via `FilterContext.clientSaslAuthenticationFailure()` or a failed TLS handshake). Note that SASL Inspectors and SASL Terminators do not need to use `FilterContext.auditLogger()`; instead audit logging happens as a side-effect of calling `FilterContext.clientSaslAuthenticationSuccess()` and `FilterContext.clientSaslAuthenticationFailure()`.

This side-effect approach is problematic. clientSaslAuthenticationSuccess() is called by both SASL terminators and SASL inspectors, but the resulting ClientAuthenticate audit event looks identical in both cases. In the terminator case the proxy is the authentication authority and the event is an authoritative record of its decision. In the inspector case the proxy is witnessing the broker's decision — emitting the same event implies an authority the proxy does not have.

The issue isn't whether the inspector should emit a ClientAuthenticate audit event — it may well be appropriate for it to do so. The issue is that the implicit side-effect removes any opportunity for the call site to declare the epistemological status of the event. Making audit emission explicit — through the AuditLogger API — would allow the inspector to annotate the event appropriately, making clear that it represents an observed outcome rather than an authoritative proxy decision, which directly supports the goal that the proxy logs its own decisions.
