231 changes: 231 additions & 0 deletions cilium/CFP-30984-dns-proxy-ha.md
# CFP-30984: toFQDN DNS proxy HA

**SIG: SIG-POLICY**

**Begin Design Discussion:** 2024-04-08

**Cilium Release:** 1.17

**Authors:** Hemanth Malla <hemanth.malla@datadoghq.com>, Vipul Singh <singhvipul@microsoft.com>

## Summary

Cilium agent uses a proxy to intercept all DNS queries and obtain the information necessary for enforcing toFQDN network policies. However, the lifecycle of this proxy is coupled with the cilium agent. When an endpoint has a toFQDN network policy in place, cilium installs a redirect to capture all DNS traffic, so when the agent is unavailable all DNS requests time out, including those for names whose DNS name to IP address mappings are already in place. DNS policy unload on shutdown can be enabled on the agent, but it only works when the L7 policy is set to `*` and the agent is shut down gracefully.

This CFP introduces a standalone DNS proxy that can be run alongside the cilium agent, which should eliminate the hard dependency on the agent for names that already have policy map entries in place.

## Motivation

Users rely on toFQDN policies to enforce network policies against traffic to remote destinations outside the cluster, typically to blob storage / other services on the internet. Rolling out the cilium agent should not result in traffic being dropped. Introducing a high availability (HA) mode will allow for adoption of toFQDN network policies in critical environments.

## Goals

* Introduce a streaming gRPC API for exchanging FQDN policy related information.
* Introduce a standalone DNS proxy (SDP) that binds to the same port as the built-in proxy with SO_REUSEPORT and uses the above-mentioned API to notify the agent of new DNS resolutions.
Comment:
Will there be a queue to renotify in case of failure to notify the agent ?

Comment:
This is for redirecting packets either to SDP or CA dnsproxy. For updating dns->ip mappings it uses grpc channel and there will be retry on failures i guess

Comment (Member):
Did this part get some more thought? I don't think it necessarily has to be in scope to solve this problem in the initial CFP, but I can see you're already thinking about this. It would be nice to at least clarify what the intended behavior is right now even if a better solution is planned for later. This is also what I'm thinking about from this thread below: https://github.com/cilium/design-cfps/pull/32/files/616ae893539fcab4a47e15de023215ddae46eec9#r1710516516 .

* Leverage the bpf maps to resolve IP address to endpoint ID and identity ID for enforcing L7 DNS policy.
Comment (Member):
Discussion(upgrade): What are the backwards compatibility expectations imposed by this goal?

The reason I ask is that for the most part we assume that the bpf maps keys and values can be modified upon upgrade. For conntrack we typically don't do this as changing the map can be lossy and we don't currently have a good way to migrate that data, but for other map types we can and do delete bpf maps upon upgrade and then repopulate them from userspace, sometimes even with different key/value types.

I recognize that upgrade for SDP is marked as a future milestone so we may not need to resolve that in this current CFP as-is before merging as "Implementable", but then part of the question is---what does it mean for upgrade or mixed agent/SDP versions to be not a valid configuration? Do we require minor version matches and how will the SDP be designed to properly interpret bpf map content other than by first detecting the format (maybe BTF can play a role here?) and then subsequently either handling the content or failing out due to version mismatch.

Comment:
afaik, envoy also reads bpf maps directly. How is it being handled there? We can follow the same approach.

Comment (Member):
I guess I ended up with the same concern down in the other thread below, maybe we can converge there: https://github.com/cilium/design-cfps/pull/32/files/616ae893539fcab4a47e15de023215ddae46eec9#r1714521132


## Non-Goals

* Updating new DNS <> IP mappings when agent is down is out of scope

## Proposal

### Overview

![Standalone DNS proxy Overview](./images/standalone-dns-proxy-overview.png)
Comment (Member):
Fixup: Is step 8 accurate in this diagram?

Comment:
You are right, the step 8 should say Response for the request sent by the pod not DNS response.
I think step 8 can be removed too as policy calculation for a request/response to a particular fqdn is not something we are changing as a part of this CFP. (We are only considering the DNS request.)


There are two parts to enforcing toFQDN network policy: L4 policy enforcement against IP addresses resolved from an FQDN, and policy enforcement on DNS requests (L7 DNS policy). In order to enforce L4 policy, per-endpoint policy bpf maps need to be updated. We'd like to avoid multiple processes writing entries to policy maps, so the standalone DNS proxy (SDP) needs a mechanism to notify the agent of newly resolved FQDN <> IP address mappings. This CFP proposes exposing a new gRPC streaming API from the cilium agent to do this. Since the connection is bi-directional, the cilium agent can re-use the same connection to notify the SDP of L7 DNS policy changes.
Additionally, the SDP needs to translate IP addresses to endpoint IDs and identities in order to enforce policy, reusing the logic from the agent's DNS proxy. Our proposal involves retrieving the endpoint and identity data directly from the `cilium_lxc` and `cilium_ipcache` BPF maps, respectively.

### RPC Methods
Comment (Member):
Fixup: This section would benefit from describing the expected call path behaviour. That is to say, which component is calling this RPC on which other component? There is also the nuts and bolts about how the gRPC stream is opened, who initiates, security protections for the socket and so on. This can be a brief sentence for each call.

Comment:
Make sense, will add the brief description.


Method: UpdateMappings

_rpc UpdateMappings(stream FQDNMapping) returns (Result) {}_

Request:
```
message FQDNMapping {
  string FQDN = 1;
  repeated bytes IPS = 2;
```
Comment (Member):
Question: I'm not super familiar with gRPC types here but I assume there's away that the bytes lists can encode variable length and hence IPv4 + IPv6 mappings?

Comment:
Yes, you are right. It can encode ipv4 + ipv6 mappings.
We might need to address how we read from the stream in case there is huge chunk of data. But that would be more of implementation details.

Comment (Member):
OK. Not sure if we specifically need to note this down for implementation phase or you'll track it anyway - feel free to either add a note into the CFP or mark this comment as resolved.

Comment:
Will add a note to the CFP itself.

```
  int32 TTL = 3;
  bytes client_ip = 4;
```
Comment (Member):
Fixup: I note that the other API below shares endpoint_id between cilium-agent and dnsproxy, but this API goes for client_ip instead which I assume has a 1:1 mapping. Is there some context behind that?

(I suspect we do need the client context associated with mappings to keep them properly separated, though I couldn't specifically explain why right now, I just recall that's how we structure it in cilium-agent... I'm sure there's good reason)

Comment:
The other API is sending the endpoint_id as that is used as a key in dns rules map lookup.(at the dns proxy end)
We can send the endpoint_id as well here and retrieve the ip from the endpoint id. We need both ip and endpoint id at the cilium agent end so either way should work.
We kept it as client_ip as that is what we are getting we get the DNS response. If we need to send the endpoint_id, there will a subsequent call either in local cache/bpf map to get the ip to endpoint id mapping.

```
  int32 response_code = 5;
}
```
Response:
```
message Result {
  bool success = 1;
}
```

Method: UpdatesDNSRules

_rpc UpdatesDNSRules(stream DNSPolicyRules) returns (Result) {}_

Request:
```
message FQDNSelector {
  string match_name = 1;
  string match_pattern = 2;
}

message DNSPolicyRule {
  string selector_string = 1;
  repeated FQDNSelector port_rules = 2;
```
Comment (Member):
fixup: Is this the FQDNSelector or port_rules? typo?

Comment (Member):
related, what's the difference between this and the selector_string?

Comment:
Yes, we should rename the FQDNSelector to PortRules. (since they FqdnSelector is used for rules in cilium code base, I believe that is why it was kept like that)
The selector_string is the result of the function String()(method used to create a string based on MatchName/MatchPattern) . It is used as a key for the map store in dnsproxy codebase.

Comment (Member):
Brainstorm: Something I'm struggling a little bit with these fields is whether they are the logical parts of rules at a specific level of abstraction, or whether it's taking Cilium internals and converting them into public API. In the latter case the concern I have is that we may end up constraining the way that future Cilium versions work because the API is too tied to the implementation details from today.

Comment:
valid concern. we should name in generic and not tied to cilium internals

```
  repeated string match_labels = 3;
  repeated uint32 selections = 4;
```
Comment (Member):
Question: I can see some terminology here leaking over from Cilium internals. It may be the case that the API makes sense to export these things but it does make me wonder exactly how well abstracted the underlying mechanisms are, and to what degree this API is baking in expectations about the Cilium implementation. For instance how is match_labels different from selections? Are they both needed? What assumptions are we making about how Cilium's internals behaves and what it will need to do in order to properly inform the FQDN proxy about what it should do? How did you come up with this specific list of parameters, and have you compared it with how Envoy handles L7 policy?

Comment (Member):
To some degree, this is getting into implementation details and we don't necessarily need to resolve these prior to merging the CFP as "implementable". I mainly raise these because the API is provided here in the CFP. Ultimately though the API design probably needs some dedicated consideration and some of the implementation details may not be known until a PR is opened, so I don't know whether it makes sense to drill deeper on these aspects here in the CFP or defer to the Code PR.

Comment:
We can name them in more generic way so that we don't tie it with the cilium internals. Let me update those.

```
}

message DNSPolicyRules {
  uint64 endpoint_id = 1;
  uint32 port = 2;
  repeated DNSPolicyRule rules = 3;
}
```

Response:
```
message Result {
  string message = 1;
}
```
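
To make the messages above concrete, here is a small illustrative sketch (not actual Cilium code) of how the SDP could populate an `FQDNMapping` from a parsed DNS response before sending it on the stream. The `pb` import path and the Go field names assume stubs generated from the messages above and are placeholders only; it also shows how the `repeated bytes` field carries both 4-byte IPv4 and 16-byte IPv6 addresses, which is the mixed address family question raised in the thread above.

```go
package main

import (
	"net"

	pb "example.com/standalonednsproxy/api" // hypothetical generated stubs
)

// newFQDNMapping builds an FQDNMapping from a parsed DNS response. Field
// names on pb.FQDNMapping are illustrative; the real generated names depend
// on the final .proto.
func newFQDNMapping(qname string, ips []net.IP, ttl uint32, clientIP net.IP, rcode int) *pb.FQDNMapping {
	raw := make([][]byte, 0, len(ips))
	for _, ip := range ips {
		// net.IP is a []byte: IPv4 entries are 4 bytes and IPv6 entries are
		// 16 bytes, so a single repeated bytes field can carry both families.
		raw = append(raw, ip)
	}
	return &pb.FQDNMapping{
		FQDN:         qname,
		IPS:          raw,
		TTL:          int32(ttl),
		ClientIp:     clientIP,
		ResponseCode: int32(rcode),
	}
}
```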

### Tracking policy updates to SDP instances

Since the SDP can be deployed out of band, and users can choose to completely disable the built-in proxy and run multiple instances of SDP, the agent should be prepared to handle multiple instances. In order to ensure all instances have up-to-date policy revisions, the agent will maintain a mapping of ACKed policy revision numbers against stream ID.
Comment (Member; thread marked as resolved by hemanthmalla):
run multiple instances of SDP

How will multiple instances on the same node be managed? If we are assuming daemonsets, AFAIK you cannot scale them to multiple instances on the same node.

Comment:
multiple instances refers to CA dnsproxy and SDP. In another instance, while upgrading SDP, there can be 2 instances of SDP running at same time and still design allows that to be handled.

Comment (Member):
Discussion(use-case): I'm largely assuming that handling the DNS traffic directly in the agent will on average provide better performance characteristics, particularly because it is able to more easily trigger and handle policy reaction events and does not require encoding into the gRPC channel or the related scheduling to receive, handle, and respond to the messages.

With that in mind, what is the use case for two external proxies and disabling the in-built one? I ask because multiplexing DNS agents is additional complication to the implementation so it would be good to understand why that complexity is worthwhile.

Since policy revision numbers are reset when the agent restarts, we need to unconditionally send policy updates to the SDP on agent restart.
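
A minimal sketch (assumed structure, not actual Cilium code) of the bookkeeping this implies on the agent side: track the last ACKed revision per SDP stream and push anything newer; after an agent restart the table starts empty, so the current rules are pushed to every stream unconditionally.

```go
package main

import "sync"

// sdpStreamTracker records the last policy revision ACKed by each connected
// SDP stream. Illustrative only; naming and storage are implementation details.
type sdpStreamTracker struct {
	mu       sync.Mutex
	ackedRev map[string]uint64 // stream ID -> last ACKed policy revision
}

func newSDPStreamTracker() *sdpStreamTracker {
	return &sdpStreamTracker{ackedRev: make(map[string]uint64)}
}

// Ack records that a stream has applied a given policy revision.
func (t *sdpStreamTracker) Ack(streamID string, rev uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if rev > t.ackedRev[streamID] {
		t.ackedRev[streamID] = rev
	}
}

// NeedsUpdate reports whether a stream is behind the current revision and
// should be sent a fresh snapshot of the DNS rules.
func (t *sdpStreamTracker) NeedsUpdate(streamID string, currentRev uint64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.ackedRev[streamID] < currentRev
}
```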

Comment (Member):
How do you anticipate that DNS packets will flow at stable state? Do you have specific ideas in mind, prefer the agent / split evenly / something more fancy?

Comment (Member Author):
We can start with split evenly (default with SO_REUSEPORT). Eventually we can consider adding a bpf prog to control the socket selection from the reuseport group. I have a standalone PoC for this here https://github.com/hemanthmalla/reuseport_ebpf/blob/main/bpf/reuseport_select.c

Comment (Member):
Would be good to mention default balanced in the CFP and add reuseport idea to future milestones?

Comment (Member Author):
Sounds good. Will update the CFP to include this.
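
As a concrete illustration of the default even-split behaviour discussed in this thread, a minimal Go sketch of how the SDP's UDP listener could join the reuseport group (Linux only; the listen address is illustrative):

```go
package main

import (
	"context"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// listenReusePort binds a UDP socket with SO_REUSEPORT so that the built-in
// proxy and the SDP can share the same port; the kernel then distributes
// incoming DNS queries across the sockets in the reuseport group.
func listenReusePort(addr string) (net.PacketConn, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.ListenPacket(context.Background(), "udp", addr)
}
```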

## Impacts / Key Questions

### Getting the DNS Rules

We need the DNS rules for the standalone DNS proxy to enforce L7 DNS policy. The policies are the source of truth for the rules and are propagated to the agent when a policy is applied. Below we explore the options for getting the DNS rules to the DNS proxy.

#### Running the gRPC server in the agent

We can run a gRPC server in the agent to serve the DNS rules to the DNS proxy. The SDP is responsible for creating the connection with the agent. Once the SDP establishes a connection, the agent can keep track of the stream and send the DNS rules to the SDP, and then reuse the same stream to send subsequent updates to the DNS rules.
If the cilium agent is not yet up, the SDP will keep trying to connect to the agent until the connection is established.
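
A rough sketch of the SDP side of this connection handling (illustrative only: the socket path, `pb` package, and client/helper names are placeholders, and the exact set of streams follows whatever the final API looks like):

```go
package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.com/standalonednsproxy/api" // hypothetical generated stubs
)

// runAgentConnection keeps dialing the agent's local gRPC endpoint and, once
// connected, exchanges DNS rules and FQDN mappings until the stream breaks
// (for example because the agent restarted), then retries from scratch.
func runAgentConnection(ctx context.Context) {
	const target = "unix:///var/run/cilium/standalone-dns-proxy.sock" // illustrative path
	for ctx.Err() == nil {
		conn, err := grpc.Dial(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err == nil {
			client := pb.NewFQDNDataClient(conn) // hypothetical client
			exchangeWithAgent(ctx, client)
			conn.Close()
		}
		// Constant backoff between attempts; exponential backoff is an option
		// discussed later in this document.
		time.Sleep(time.Second)
	}
}

// exchangeWithAgent is a placeholder: open the rule/mapping streams, apply
// DNSPolicyRules updates to the local cache, and send FQDNMapping messages.
func exchangeWithAgent(ctx context.Context, client pb.FQDNDataClient) {}
```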

##### Pros

* SDP instances have the responsibility to connect to the agent.
Comment (Member):
nit: Why is this a pro? Seems like just an implementation detail. (I don't necessarily think it's a bad thing, the text just doesn't identify the assumption about why this is a benefit vs an alternative.)

Comment:
We thought about different ways of sending the DNS rules to SDP.
One way was sending rules to a particular ip:port and in that case, Cilium agent has the responsibility to find if SDP is running and then send out the rules.

In this case, where SDP is connecting to the agent. It is the responsibility of the SDP to make sure it connects to agent rather than other way around.
This reduces the overhead for the cilium agent to look out for processes(in case of multiple instances of proxies running) and send them rules.

Comment (Member):
I read "Simpler to configure and scale out SDP instances since the agent only needs a socket to be configured and does not need to explicitly reference the SDP instances". That's a pro 👍

* Reusing the same stream is efficient in terms of resources, as we are not creating/destroying a stream for every update.

##### Cons

* An overhead on the cilium agent to keep track of the streams of the connected SDP instances.
Comment (Member):
Brainstorm: This con could be mitigated by avoiding statefulness unless really necessary.

Comment:
Streams were stored so that we don't need to recreate a new stream with each policy updates and reuse the same streams to send the data. Let me know if I misunderstood.

Comment (@joestringer, Aug 13, 2024):
I must admit I did not read that the first time I looked through this part of the proposal.

I guess my feedback is more: This is not strictly required to satisfy the goals of the CFP, so is it required in the initial version or should it be deferred to a later milestone and get the rest in first. This can have API impacts so it's worth thinking about that part so we don't have to immediately redesign the API and how it interacts. At the same time, depending on the scale targets you're aiming for, this might be biting off more than needs to be chewed at once. It does add a bunch of complexity, so the "con" here can be mitigated by just ... not doing this yet. Build the initial implementation, get user feedback, start to scale test, find the bottlenecks and figure out what's necessary.

* Streams are not thread safe, so if we have multiple threads using the same stream we will need to handle the synchronization.
Comment (Member):
Question: This line I think is glossing over the underlying structural implications in the agent.

Sure we need to write this in a thread-safe manner. We're also talking about designing a system that takes information from potentially multiple threads and feeding the latest updates through to another component. Some of this is just inherent in what the proposal aims to achieve.

Is this a con of this design aspect or is this more like an implementation challenge that we just need to keep in mind? (I'm not sure whether we necessarily have to understand to this degree of detail in the CFP prior to merging as "Implementable" but if you do want further design discussion then it may be worth considering how to expand out and highlight the discussion points you would like to target on this point).

Comment:
I think you are right in terms of this being an implementation challenge. Just wanted to point out that we will need to keep this in mind.
Let me reiterate that.

* If the SDP later also handles policy updates, it will be tricky: the SDP needs to keep trying to sync with the CA, and the CA needs to keep track of the SDP instances to send updates to.
Comment (Member):
Discussion(upgrade): From a protocol perspective, I suspect that just having versioned protocols with expected semantics would provide the mechanism these components need in order to change the behaviour later. Unless you're looking to pull that much bigger design in-scope (which I would recommend against), the key here will be to just identify what we need to mitigate the risk and keep options open in future.

Comment:
Agreed. Don't want to pull this in.
The idea was: If in future, SDP may also handle the policy updates in case agent is down. But I can remove this for now as this is not needed can be taken up as separate thing.


![standalone-dns-proxy-up-scenario](./images/standalone-dns-proxy-up-scenario.png)
Comment (Member):
Fixup: These diagrams could do with a brief sentence providing context about what they're explaining. For a short while I thought I was looking at stable state and some of the arrows might be representing DNS packets, but then I realized that this is indicating the direction of the requests and flow of data, plus the expected behaviour during component outages.

Comment:
Make sense, let me update those.

![standalone-dns-proxy-down-scenario](./images/standalone-dns-proxy-down-scenario.png)

### Reading from file system on startup

The SDP can read from the file system (`ep_config.json`) and get the DNS rules on bootup. If it is able to connect to the cilium agent, the agent will send the current snapshot of the DNS rules to the SDP. If the cilium agent is down, the SDP can continue serving DNS requests based on the DNS rules retrieved from the filesystem on bootup.
Comment (Member):
Brainstorm: Is this a facility that already exists? Any comment on whether this should be part of ep_config.json or stored separately? Does this make it more complicated to reconcile with the state from gRPC (or, put more generally, how will SDP track changes over time)?

Comment:
Is this a facility that already exists?

We will need to update the state file with DNS rules as soon as the policy is applied. This PR is trying to address that: cilium/cilium#33412

The SDP always reconcile from the state first and then gets the data from cilium agent through grpc. This helps SDP to have enough information to resolve DNS request even if Cilium agent is not present.

Any comment on whether this should be part of ep_config.json or stored separately

We can think about storing the data needed by SDP in a separate file but that can be handled during implementation I suppose.

Does this make it more complicated to reconcile with the state from gRPC (or, put more generally, how will SDP track changes over time)?

Assuming Cilium agent will also be reading from the same state. It should not complicate things IMO to reconcile the state from grpc even if we store the rules in a separate file.

put more generally, how will SDP track changes over time)?

Not sure what this means. Cilium agent is the source of truth so once SDP is connected to the Cilium agent, the SDP will have the latest rules. In case, agent is not present, SDP still has the local cache to serve the request. Also if SDP restarts, the state file will be used a fallback until agent is back.

Comment (@tamilmani1989, Aug 13, 2024):
SDP reading from statefile/maps is only on bootup and then it connects to agent via grpc and update its in-mem state. If CA is down and SDP restarts, SDP can serve with whatever it read from statefile/maps and this allows existing pods dns requests to be served uninterrupted even when CA goes down

Comment (Member):
My gut feeling here is that all the filesystem stuff is not strictly necessary because having both agents down at the same time already broke HA. It would reduce the scope of the CFP and move it forward to just defer all the filesystem stuff to a subsequent milestone and re-evaluate how hard the requirement is for that aspect of the implementation. Just thinking from a simplicity perspective and how the implementation achieves the core goals. I can see some potential for more resilience with this, I just think that it's probably overthinking the problem at this stage if the basics are not yet upstream.
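
For reference, a small sketch of what the bootstrap read described in this section could look like on the SDP side. The file path and JSON shape here are assumptions for illustration only; the real format depends on how cilium/cilium#33412 lands and on whether the rules end up in `ep_config.json` or a separate file.

```go
package main

import (
	"encoding/json"
	"os"
)

// restoredRules is an illustrative on-disk shape for per-endpoint DNS rules;
// it is not the actual ep_config.json schema.
type restoredRules struct {
	EndpointID uint64              `json:"endpoint_id"`
	Port       uint32              `json:"port"`
	Selectors  map[string][]string `json:"selectors"` // selector string -> match names/patterns
}

// restoreFromDisk loads the last known DNS rules so the SDP can enforce them
// while the agent is unreachable; on any error the SDP starts empty and waits
// for the agent's snapshot over gRPC.
func restoreFromDisk(path string) ([]restoredRules, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var rules []restoredRules
	if err := json.Unmarshal(data, &rules); err != nil {
		return nil, err
	}
	return rules, nil
}
```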


### Q: Reason behind lazily updating the DNS rules in the ep_config.json file?

### Discovering endpoint metadata from SDP

In order to enforce L7 DNS policy, the SDP needs to translate IP addresses to endpoint IDs and identity IDs. The simplest option is to reuse the same gRPC stream and implement another method to get the mappings from the cilium agent. However, the actual source of truth is the bpf maps, and the SDP is expected to resolve DNS queries even when the agent is down, so we could read the information directly from the bpf maps. We prefer the bpf option since it does not rely on agent availability.

#### Option 1a: gRPC method

Get the endpoint ID and identity for a given IP address by making a gRPC call to cilium agent.

##### Pros

* Avoid interacting with low-level details. Simpler to maintain and stay up to date with any datapath changes.
* All data can be cached in memory; in the event the agent is unavailable, the SDP can look up from the in-memory cache.

##### Cons

* Reliance on gRPC call in the hot-path
* Does not cache the data for endpoints that never made a DNS query.
* In the event that the SDP restarts while the agent's gRPC service is unavailable, all cached state is lost and the SDP cannot translate IPs to endpoint IDs or identities.
Comment (Member):
nit: This is not a con, this is a fundamental truth of high availability: At least one component providing the service needs to be available at a point in time. If you take both down, you lose availability.

Comment:
Added as a con because reading from bpf maps can help us serve the DNS request even if both are restarted. The highly available part can be achieved by maxSurge on the daemonset for the DNS request. link


#### Option 1b: Listen to ipcache updates via grpc

Similar to envoy, the SDP can listen to ipcache updates via gRPC and maintain a local cache of IP to endpoint ID mappings. This changes the way the DNS proxy currently behaves, i.e. from a pull-based model to a push-based model.

##### Pros

* Simpler to maintain and stay up to date with any new ip<>identity changes.
* All data can be cached in memory; in the event the agent is unavailable, the SDP can look up from the in-memory cache.

##### Cons

* Need a mechanism for reconciling the cache in case of SDP restarts.
Comment (Member):
nit: Is this a con of this specific design? Regardless of the solution to this problem, there needs to be code that runs on startup of SDP to pull the state it needs to do its job.

Comment:
Yes, con of this specific design. The solution is to send the current state to the listener when it starts listening. But since we recommend reading from bpf maps, this won't be needed.
Also we kept the options we considered for getting the data(endpoint id and identity). Let us know if we need to remove those options and focus on what we want to propose.
SDP will be reading from the state files for reconciliation at the start.

* SDP will be aware of all the endpoints data, which might not be needed for DNS proxy.
Comment (Member):
Brainstorm: I'm not worried about this con.

Endpoints are all local anyway and the scale of data is not that high. DNS is the standard service discovery mechanism in k8s. If the user runs significant numbers of applications not using DNS, then they can't rely on DNS-based features in Cilium anyway and then they probably don't need DNS HA. Even then there may be ways to optimize through scheduling.

Comment (@tamilmani1989, Aug 13, 2024):
Agreed.. also we validated with 200+ pods on a node with reasonable amount of dns requests (1k per sec) and mem consumed by dnsproxy is around 50mb


#### Option 2: Read from bpf maps

Read mappings from bpf maps `cilium_lxc` for endpoint ID and `cilium_ipcache` for identity.
Comment (Member):
Discussion(upgrade): FYI we've floated the idea of deprecating cilium_lxc. This raises bpf API stability questions.

Comment:
interesting. The plan is to have abstraction layer and that fetches these mappings.. so SDP shouldn't care what changes underneath.
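
For context, a sketch of what reading the pinned maps could look like with the github.com/cilium/ebpf library. The pin paths assume the default bpffs layout, and the key/value encodings are deliberately left abstract because they depend on the agent version, which is exactly the compatibility concern raised in this thread; in practice this would sit behind the abstraction layer mentioned above.

```go
package main

import "github.com/cilium/ebpf"

// openDatapathMaps opens the pinned endpoint and ipcache maps that the SDP
// would consult to translate IPs to endpoint IDs and identities.
func openDatapathMaps() (lxc, ipcache *ebpf.Map, err error) {
	lxc, err = ebpf.LoadPinnedMap("/sys/fs/bpf/tc/globals/cilium_lxc", nil)
	if err != nil {
		return nil, nil, err
	}
	ipcache, err = ebpf.LoadPinnedMap("/sys/fs/bpf/tc/globals/cilium_ipcache", nil)
	if err != nil {
		lxc.Close()
		return nil, nil, err
	}
	return lxc, ipcache, nil
}

// lookup wraps a single map lookup. The key and value types must mirror the
// running agent's map definitions, so a version-aware abstraction is needed.
func lookup(m *ebpf.Map, key, value any) error {
	return m.Lookup(key, value)
}
```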


##### Pros

* Reduced resource utilization in the agent, since the agent doesn't need to process rpc calls in the hotpath
Comment:
Did we benchmark both the options to get an idea for performance ?

Comment (Member):
Is this true? A key feature of the DNS proxy is learning DNS -> IP mappings and informing the policy engine about them, so that the policy engine inside the cilium-agent can calculate policy impacts, allocate identities for those IPs, and populate BPF policymaps based on the new identities. If we do not ensure policy is plumbed before releasing DNS responses to the clients, then there is a high chance to automatically impose ~1s latency on subsequent TCP connections, because the first SYN of the connection will be dropped due to policy and the networking stack will delay a second before retrying.

(We could mitigate this in a number of ways, but I think that if we explore the use of inter-process RPC in hotpath holistically then it may not be the suboptimal path)

Comment (Member):
Maybe something that would help to explore this question would be if there's a more detailed breakdown of the expected lifecycle of policy computation and separately lifecycle of a DNS request+response path so we can analyze exactly what the dependencies are and how the handling of different events may impact the overall behaviour.

Comment:
iiuc, there is no change to policy computation by agent. dns->ip mappings will be updated to agent via grpc api and agent does policy enforcement and only after getting ack from agent, dns response is returned back to client. Reading bpf maps is mainly for dnsproxy to get reconcile with ip->identity mappings and dnsproxy to bootup without relying on agent. @hemanthmalla correct me if i'm wrong

Comment (Member):
I'm probably not reading the "agent doesn't need to process rpc calls in the hotpath" with the same set of assumptions then. @tamilmani1989 your description makes sense to me for the typical path when newly learned IP mappings are discovered, and for that specific scenario it sounds like the hotpath to me.

Comment (Member Author, @hemanthmalla, Apr 10, 2024):
@joestringer I agree with you WRT the RPC method to update DNS<>IP mappings. gRPC overhead for this method will likely be low relative to the computation needed to process the request.

But I intended this section to discuss how SDP would discover metadata to enforce L7 DNS policy and requests in this context are rpc calls to fetch identity and endpoint mappings. If we fetch those directly from bpf I was thinking we can skip a couple of rpc calls.

I'll update this to qualify the "agent doesn't need to process rpc" part.

Comment (Member):
Ah right. Yeah in terms of network handling latency the order of delay is likely to be local cache < syscall < rpc.

Comment (Member):
Discussion(use-case): I don't see a recent update related to this.

As I understand, Hemanth's point is that L7 policy could plausibly be more efficient (albeit still with bpf map lookup syscalls in the hotpath rather than direct userspace cache like option 1b). But DNS L7 policy is a much less common use case than ToFQDN policies. For ToFQDNs the point is moot because we want to ensure policy plumbing occurs before continuing, so on average the agent ends up in the hotpath anyway.

* IP to endpoint ID or identity translation does not rely on agent availability.
* DNS can be resolved even when agent is down.

##### Cons

* Low level interactions from components external to cilium agent.
Comment (Member):
Brainstorm: I think this con probably hints at a few things, I'm not sure it's inherently a bad thing to interact at a low level. I recognize there's the "external to cilium agent" aspect, ie an additional dependency for SDP to work. I guess it would be simpler if there is just one source of truth and synchronization rather than multiple.

Additionally there is a question about how stable the bpf map API interfaces are and the degree to which the Cilium community wants to provide those guarantees (my base opinion is: Avoid making such guarantees wherever possible unless we have really good reasons why we think those things will never change).

Comment (@tamilmani1989, Aug 13, 2024):
Thanks for letting know @joestringer. wouldn't this be same problem for envoy too? can't we abstract what changes under the hood and SDP-Cilium Agent contract remains undisturbed.

Comment (@joestringer, Aug 13, 2024):
Yes, I think it's a concern for Envoy and I'm not entirely sure that @cilium/envoy folks are thinking this through either - I think it's a byproduct of splitting the Envoy lifecycle out in a dedicated DaemonSet. This dependency was originally developed when Envoy and Cilium were guaranteed to be the same version due to shipping and running in the same container. As far as I'm aware we haven't formalized these expectations for Envoy as a DaemonSet either, and the risk is we are going to break users at some point (or datapath developers will have to do a lot of extra work to mitigate that).


### Flag to disable built-in proxy

Adding a flag in the cilium agent to disable the built-in DNS proxy allows de-coupling the SDP's lifecycle from the agent. The current implementation of the cilium agent depends on the built-in DNS proxy for restoring endpoint DNS rules; with the flag enabled we will need to refactor to remove that dependency. The cilium agent can retrieve the DNS rules from the in-memory policy representation (the agent already parses this information from ep_config.json on startup).

#### Pros

* Allows for de-coupling upgrade cycles of SDP and cilium agent.
Comment (Member):
Discussion(lifecycle): I don't understand why cilium-agent's configuration impacts SDP lifecycle. If SDP is a different DaemonSet, then you lifecycle it through k8s by managing that daemonset. The cilium-agent isn't in the picture.

Comment (@tamilmani1989, Aug 13, 2024):
The idea behind disabling agent dnsproxy and running only SDP is that we don't need to maintain two copies of same instance and if any vulnerability/issue find in dnsproxy component, we would able to upgrade SDP alone without taking down cilium agent..

Comment (Member):
Why does upgrading SDP alone without taking down cilium-agent matter? From a DNS proxy HA perspective, you would take one component down at a time and upgrade them to mitigate the issue. In order to reconfigure Cilium to disable the built-in proxy, you would need to restart it anyway, thereby introducing a similar level of risk into the environment. Then longer term after the security event you would want to migrate back onto having Cilium in the main path for better resource utilization / efficiency.

Comment:
why do I need to maintain 2 instances if I can achieve my goal with 1 instance? Also, I don't require to take cilium agent down at all as its critical system component if I just need to upgrade SDP. Taking cilium agent would affect pod creation/deletion, policy creation/deletion during that time window

In order to reconfigure Cilium to disable the built-in proxy, you would need to restart it anyway, thereby introducing a similar level of risk into the environment

This will happen only if customer upgrades. Also once its upgraded, later if there any patch or cve need to be fixed in dnsproxy, we are not going to restart agent but just SDP

Then longer term after the security event you would want to migrate back onto having Cilium in the main path for better resource utilization / efficiency.

There is definitely trade off between maintainability and performance. Thats why i'm not proposing this as default instead user can have opt-in to disable agent proxy based on their requirement.

Comment (Member):
I still don't understand the use case. Users will have to run both the agent and SDP anyway, and I don't see any difference to maintainability if one or both components are handling DNS proxy traffic. They're also both components that will need to be maintained, upgraded, etc.

The agent will need to have DNS proxying enabled in order to restart/upgrade the SDP without downtime, or otherwise every time you upgrade SDP you would have to also cycle the agent to enable/disable DNS proxying in the agent. The overall goal of the CFP is high availability, meaning that if one component goes down, the other component is still available to provide service. I don't understand how that squares with the idea of reducing the number of regular running instances down to 1. If you want one DNS proxy, you can do that today by just running cilium-agent.

Now, whether or not the traffic is actually directed towards both cilium-agent and SDP can be another question - I think there are probably ways to guide that traffic towards one or both active dnsproxy instances.

Comment:
The agent will need to have DNS proxying enabled in order to restart/upgrade the SDP without downtime,

This is not required if we do max surge upgrade for SDP which guarantees old SDP can run along with new SDP until new sdp ready for min seconds. https://kubernetes.io/blog/2022/09/15/app-rollout-features-reach-stable/#how-use-daemonset-maxsurge

Anyway, as you said in other comments that this is not core part of this CFP. Can create a separate ticket for this specifically and can discuss there. so that dont want to deviate away from core idea (which is bootstrapping SDP as separate component)

* Allows any delegated DNS proxy to be plugged in.
Comment (Member):
Discussion(lifecycle): I don't understand why cilium-agent's implementation details of its own proxy impacts the ability for other delegated DNS proxies to be plugged in.

Comment:
nope it should not. can remove this.

* Reduces the resource utilization in cilium agent.
Comment (Member):
nit: Sure, but the equivalent resource needs to be allocated in the external DNS proxy instead. I'm not sure this is a win for users.

Comment:
I shared the main motive in above comment.


#### Cons

* Higher latency in DNS resolution as the DNS queries are always handled by a separate process.

### Q: Restoring L7 DNS policy when agent is unavailable

In the event that the SDP restarts while the agent's gRPC service is unavailable, the SDP can read state from `ep_config.json` and restore policy.
Comment (Member):
Discussion(use-case): Worth noting that we're probably now thinking about up to three sources of truth for SDP - cilium-agent, files, and bpf maps. Do we really think that the benefits of the additional complexity outweigh the complexity?

Even if the answer is yes, the CFP would probably benefit from having a dedicated section to consider how the sources of truth interact and how SDP ultimately recovers and gets back in line with the proper expected configuration depending on the particular availability scenarios you want to address. Maybe one way to tackle this is to outline in a "testing" section exactly which cases you expect to work and how, then we can revisit aspects like this to confirm what difference they will make.

(EDIT: I see there is a table at the end that is kinda similar to what I'm asking for, but I didn't feel that it really answered this particular question)

Comment (Member):
Thinking this through a bit more, "Can I..."

  1. Stop cilium (NB: new mappings now break)
  2. Stop SDP (Endpoints with L7 DNS rules can no longer resolve DNS)
  3. Start SDP without starting Cilium (Now restore DNS)
  4. ??? Don't start Cilium

We've already broken the HA by losing both components. I agree that it would be more resilient if either component can be independently restarted at will, but if the user wants HA, then these need to happen at different times, not at the same time. I can see the argument that if Cilium has some serious unresolvable issue and then you need to restart SDP for some reason this would make the system more resilient. As a property of the system I like it, though the scenarios get a bit more obscure. Is this the problem we're trying to solve with this aspect of the design?

Maybe taking another angle on it, let's say this is the case - Would it be easier to rely on Cilium-agent and SDP coordinating on the appropriate file format etc to handle this case, or should SDP just handle its own state store/restore? The latter would be less entangled with Cilium and might make it easier for SDP to optimize for the problems it needs to solve vs. relying on the overloaded ep_config.json.

Comment (Member Author):
For DNS rules specifically though, we aren't relying on bpf maps right ? ( IIRC, this info isn't stored in any bpf maps right ? )
This just allows SDP to enforce the last known state instead of dropping all DNS requests when agent is down. This sounded to us like a low hanging fruit ( conditional to cilium/cilium#33412 making it in )

Comment (Member):
I just think these aspects are getting ahead of the core problem and trying to solve steps 2 and 3, more niche advanced ordering problems, and it would simplify everything to eliminate the filesystem aspects and focus on the core problems first.

I don't think it hurts the design or causes additional work to just ... not solve the "bootstrap SDP first" case to begin with. If we want HA, then at least one component must be on all the time. Once both components go down, we've lost HA. If only the agent is up, then SDP restarts, then the agent will still be up to serve the rules. If only SDP is up, then it will keep the rules in memory and doesn't need to restore from disk.

There's a lot of complexity being added by all the filesystem aspects of the CFP. Maybe that complexity is actually worthwhile and we end up agreeing down the line that it should be done this way. Right now though I just think it makes the whole design more complicated than it needs to be in order to solve something that isn't a core problem of the CFP.

Splitting the filesystem aspects out into their own CFP would also make it easier to evaluate just that aspect and for instance whether SDP itself can just store+restore that or whether we need a files-based API agreement between components. It may be low-hanging fruit, but to use the analogy we haven't planted the tree yet 😅 .


### Identity for upstream dns pods getting updated
Comment (Member):
Fixup: Please run a spellchecker (unfortunately I tend to get distracted by these during review quite easily, and there are several other typos).

Suggested change
### Indentity for upstream dns pods getting updated
### Identity for upstream dns pods getting updated


We send the set of identities for the upstream DNS pods, along with the DNS rules, to the SDP through the gRPC stream. If the upstream DNS pods' labels are updated, this triggers a policy recalculation and the new set of identities is sent to the SDP. This helps enforce the L7 DNS policy for the updated set of identities.
Comment (Member):
Brainstorm: 👍 it was good to call this out. I think the wording here is a bit loose and it doesn't entirely match my understanding of where you'll need to hook into the code in order to ensure that SDP gets the info it needs, but presumably you'll get to that during implementation; not sure we need to iterate further in the CFP on this aspect.

It is however worth noting that this is another critical case for testing. Failing to implement the policy correctly can result in a security issue.

Comment:
This is the PR that vipul referenced above; it will take care of this.


### Scenarios and Expected Behaviour

In an ideal scenario, both the cilium agent and the SDP should be interacting with each other through a bidirectional gRPC stream. However, when either of them restarts, upgrades, or downgrades, what should the expected behavior be?

#### Cilium Agent is down, SDP is running

* The SDP should be able to proxy requests and enforce L7 DNS policy based on the existing bpf maps and in-memory DNS rules. The datapath of already configured policies will work. Any new mappings will not be updated until the cilium agent is back up.
Comment (Member):
Question: What happens to the new mappings?

Comment:
Just to clarify the mappings refers to the new policy map updates.
I think it follows the same pattern as it is now in case of cilium agent not available. The policy won't be applied until cilium agent is up and running.

Comment (Member):
OK, how does the design ensure eventual consistency on that front once the agent comes back up?

Comment (@vipul-21, Aug 13, 2024):
The SDP retries the connection to the cilium agent if there is a connection failure. I think a controller (a separate goroutine) should be able to keep track of an active connection with the cilium agent and retrigger the connection request if it is broken. This is a constant retry mechanism, but we can discuss whether we want to make it exponential backoff as well.
Eventual consistency will be achieved either through the next dns request or through the constant retry updating the mappings once the agent is back.

Comment (Member):
Sounds good, I think it makes sense to integrate this into the text and just highlight that anything more is up to implementation details.

* In case the SDP restarts while the cilium agent is down, the SDP should be able to read the rules from the filesystem and restore them. Since the cilium agent writes the rules to the filesystem lazily, the SDP might read outdated DNS rules. This is a limitation of the current implementation and can be improved in the future.

Comment:
One requirement is to have a config option to disable Cilium Agent dnsproxy and run just SDP so that allows any delegated dns proxy to be plugged and also an option to reduce the memory/cpu of agent.

Comment (Member):
I think it'd be worth putting a test plan in place to evaluate the impact of that. For instance I can see potential memory/cpu of agent decrease but that work would be transferred equivalently to SDP. Also SDP-only would have higher latency for enforcing policies for newly learned names.

Comment (@tamilmani1989, Apr 19, 2024):
Atleast for our usecase, we don't require 2 instances of dnsproxy to be running since we make sure SDP is available all the time even during upgrade. Also this allows user to run delegated (custom) dnsproxy as long as it interop with cilium agent.

#### Cilium Agent is running, SDP is down/restarting

* In case the cilium agent has the DNS proxy running as well, DNS queries will be served. If the built-in proxy is disabled, the SDP can also be configured to run with multiple replicas.
* Once the SDP starts up, it will connect to the agent and get the latest policy updates.

#### SDP Upgrade (Given Cilium Agent is Running)

* The cilium agent can keep serving DNS queries through the built-in DNS proxy. If the built-in DNS proxy is disabled, the SDP needs an upgrade path. This can be achieved using a `maxSurge` upgrade.

| Agent | Builtin DNS Proxy | SDP | Datapath | DNS |
|-------|-------------------|-----|----------|-----|
| Running | Disabled | Running | Works | Works |
| Running | Running | Down | Works | Works |
| Running | Running | Running | Works | Works |
| Running | Down | Down | Works | Does not work |
Comment (Member):
nit: Curious if there's any insight on this line you're thinking about? Usually when the agent is up, the built-in dnsproxy is also up. Modulo bugs I'm not sure I understand what the considerations or implications are for this line.

Comment:
These were added to cover all the scenarios. You're right that built in dns proxy is coupled with agent status. But wanted to make sure we capture the scenarios and expected behavior where DNS proxy is not able to start for any reason(failure to bind to port, invalid configuration etc)

| Down | Down | Running | Datapath should work too if already configured. | Works for already configured policy |

### Permissions to access the BPF maps

Since the SDP needs to read from the bpf maps, it needs the necessary permissions to access them. The permissions needed are `CAP_SYS_ADMIN` and `CAP_BPF`. [Need to verify whether the required permissions depend on the Linux distribution.]

## Future Milestones

* Plumb toFQDN policy for new names when agent is unavailable
* The upgrade scenario of SDP.
Binary file added cilium/images/standalone-dns-proxy-overview.png