feat(F3): enriched DAG schema with prereqs, acceptance, audit hooks#240
feat(F3): enriched DAG schema with prereqs, acceptance, audit hooks#240KooshaPari wants to merge 1 commit into
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
CodeAnt AI is reviewing your PR. |
Thanks for using CodeAnt! 🎉We're free for open-source projects. if you're enjoying it, help us grow by sharing. Share on X · |
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
|
Warning Review the following alerts detected in dependencies. According to your organization's Security Policy, it is recommended to resolve "Warn" alerts. Learn more about Socket for GitHub.
|
|
Legacy Tooling Scan Report
No violations detected. This is a WARN-mode scan. Fix before strict enforcement begins. |
|
| self.children.entry(from.clone()).or_default().push(to.clone()); | ||
| self.parents.entry(to).or_default().push(from); |
There was a problem hiding this comment.
Suggestion: Duplicate edges are appended without any deduplication check, so repeated add_edge(from, to) calls create parallel duplicates in both adjacency lists. This inflates edge_count and causes contract inconsistencies with serialization (from_dag deduplicates edges via BTreeSet), leading to lossy round-trips and unexpected graph behavior. Reject or ignore already-existing edges before pushing. [logic error]
Severity Level: Major ⚠️
- ⚠️ Duplicate edges inflate `Dag::edge_count` unexpectedly.
- ⚠️ Serialization round-trip drops duplicates, losing original structure.Steps of Reproduction ✅
1. In a consumer or test, construct a DAG as in `crates/byteport-dag/src/dag.rs:168-173`:
create `let mut dag = Dag::new();`, add nodes `"a"` and `"b"` via
`dag.add_node("a").unwrap();` and `dag.add_node("b").unwrap();`, then call
`dag.add_edge("a", "b").unwrap();` twice; each call executes `add_edge` at `dag.rs:69-78`,
where lines 76-77 append `to` and `from` without any deduplication, resulting in
`children["a"] == ["b", "b"]` and `parents["b"] == ["a", "a"]`.
2. Calling `dag.edge_count()` at `crates/byteport-dag/src/dag.rs:95-98` on this DAG sums
the lengths of all `children` vectors, returning `2` for the single logical edge `"a" ->
"b"` because it is stored twice, inflating edge metrics compared to a unique-edge
interpretation.
3. Serializing this DAG with `DagSchema::from_dag(&dag, "1.0.0")` at
`crates/byteport-dag/src/serialize.rs:81-120` uses a `BTreeSet<(String, String)>`
(`edges_set` at lines 95-103) to collect edges; inserting `(node.clone(), child.clone())`
from `dag.iter_nodes()` and `dag.children_of(node)` deduplicates the two `"a" -> "b"`
entries, so `DagSchema.edges.len()` becomes `1` even though `dag.edge_count()` was `2`.
4. If a consumer then reconstructs a DAG via `schema.into_dag()` at
`serialize.rs:131-139`, it iterates over `self.edges` and calls
`dag.add_edge(edge.from.clone(), edge.to.clone())?;` once per schema edge, producing a new
DAG with `edge_count() == 1` despite the original DAG having `edge_count() == 2`,
demonstrating that duplicate adjacency entries are silently dropped by
`from_dag`/`into_dag` round-trip and leading to inconsistent edge counts and behavior
between the core `Dag` structure and the schema/serialization layer.(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** crates/byteport-dag/src/dag.rs
**Line:** 76:77
**Comment:**
*Logic Error: Duplicate edges are appended without any deduplication check, so repeated `add_edge(from, to)` calls create parallel duplicates in both adjacency lists. This inflates `edge_count` and causes contract inconsistencies with serialization (`from_dag` deduplicates edges via `BTreeSet`), leading to lossy round-trips and unexpected graph behavior. Reject or ignore already-existing edges before pushing.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| let nodes: Vec<SchemaNode> = dag | ||
| .iter_nodes() | ||
| .map(|id| SchemaNode { |
There was a problem hiding this comment.
Suggestion: Nodes are emitted in raw HashSet iteration order, which is non-deterministic across runs. This makes YAML/JSON output unstable for the same logical graph, causing flaky snapshots, noisy diffs, and non-reproducible artifacts; sort node IDs before building nodes. [logic error]
Severity Level: Critical 🚨
- ❌ YAML schema output unstable across identical DAG executions.
- ⚠️ Snapshot tests and config diffs become flaky, non-reproducible.Steps of Reproduction ✅
1. In consumer code, build a DAG using the public API in `crates/byteport-dag/src/dag.rs`:
create `let mut dag: Dag<String> = Dag::new();` (line 51) and add several nodes via
`Dag::add_node` (lines 55–63), as shown in the crate-level example in
`crates/byteport-dag/src/lib.rs:21–26`.
2. Serialize this DAG to a schema by calling `DagSchema::from_dag(&dag, "2.0.0")`
(definition at `crates/byteport-dag/src/serialize.rs:81–120`), which constructs `nodes:
Vec<SchemaNode>` using `dag.iter_nodes()` (lines 82–83) and `SchemaNode` mapping (line 84
onward); `iter_nodes()` is implemented in `Dag::iter_nodes` (dag.rs:115–117) as
`self.nodes.iter()` over a `HashSet<K>` (dag.rs:23–30).
3. Because `Dag.nodes` is a `HashSet<K>` and `HashSet::iter()` order is hash-dependent and
not guaranteed to be stable across processes, two separate runs of the same binary that
both call `DagSchema::from_dag` on logically identical DAGs can produce `nodes` vectors in
different orders (even if node insertion order was the same).
4. When callers serialize the schema via `DagSchema::to_yaml()` (serialize.rs:146–149) or
`to_json_pretty()` (lines 165–167), the YAML/JSON array for `nodes` will reflect this
non-deterministic iteration order, causing unstable serialized output that makes snapshot
tests, config diffs, or stored artifacts noisy and non-reproducible even though the
underlying graph is unchanged.(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** crates/byteport-dag/src/serialize.rs
**Line:** 82:84
**Comment:**
*Logic Error: Nodes are emitted in raw `HashSet` iteration order, which is non-deterministic across runs. This makes YAML/JSON output unstable for the same logical graph, causing flaky snapshots, noisy diffs, and non-reproducible artifacts; sort node IDs before building `nodes`.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| let mut edges_set: BTreeSet<(String, String)> = BTreeSet::new(); | ||
| for node in dag.iter_nodes() { | ||
| if let Some(children) = dag.children_of(node) { | ||
| for child in children { | ||
| edges_set.insert((node.clone(), child.clone())); |
There was a problem hiding this comment.
Suggestion: from_dag silently deduplicates edges by inserting them into a BTreeSet, but Dag currently allows duplicate edges. This makes serialization lossy: from_dag(...).into_dag() can return a graph with fewer edges than the input, breaking round-trip expectations for callers that rely on edge multiplicity or exact edge_count. Preserve edges as-is (or reject duplicates explicitly at DAG insertion time) so behavior is consistent. [api mismatch]
Severity Level: Major ⚠️
- ❌ DagSchema round-trip drops duplicate edges from input DAG.
- ⚠️ Edge-count based logic misbehaves after schema serialization.Steps of Reproduction ✅
1. Construct a DAG with duplicate edges using the public API in
`crates/byteport-dag/src/dag.rs`: create a `Dag<String>` via `Dag::new()` (line 51) and
call `Dag::add_node` (lines 55–63) for nodes `"a"` and `"b"`, then call
`Dag::add_edge("a".into(), "b".into())` twice (lines 66–79) so that `children` holds
`["b", "b"]` and `edge_count()` (lines 95–98) returns `2`.
2. Convert this DAG into a schema by calling `DagSchema::from_dag(&dag, "2.0.0")`
(constructor defined at `crates/byteport-dag/src/serialize.rs:81–120`), which builds
`nodes` from `dag.iter_nodes()` (lines 82–93) and collects edges into `edges_set:
BTreeSet<(String, String)>` at lines 95–101.
3. Observe in `from_dag` that for each parent/child pair the code inserts `(node.clone(),
child.clone())` into `edges_set` (line 100), so duplicate edges from the original `Dag`
are deduplicated when `edges_set.into_iter()` is mapped into `Vec<SchemaEdge>` (lines
104–112), yielding only one `SchemaEdge { from: "a", to: "b", .. }`.
4. Reconstruct a DAG from the schema with `DagSchema::into_dag()` (lines 128–140), which
calls `Dag::add_node` for each schema node (lines 133–135) and `Dag::add_edge` once per
`SchemaEdge` (lines 136–138), then compare `edge_count()` (dag.rs:95–98) between the
original `Dag` (2) and reconstructed `Dag` (1) to see that `from_dag(...).into_dag()` is
lossy whenever callers rely on edge multiplicity or exact edge counts.(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** crates/byteport-dag/src/serialize.rs
**Line:** 96:100
**Comment:**
*Api Mismatch: `from_dag` silently deduplicates edges by inserting them into a `BTreeSet`, but `Dag` currently allows duplicate edges. This makes serialization lossy: `from_dag(...).into_dag()` can return a graph with fewer edges than the input, breaking round-trip expectations for callers that rely on edge multiplicity or exact `edge_count`. Preserve edges as-is (or reject duplicates explicitly at DAG insertion time) so behavior is consistent.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix| format!("{:?}", child), | ||
| )); | ||
| } | ||
| DfsColor::White => visit(dag, child, color, order)?, |
There was a problem hiding this comment.
Suggestion: The DFS implementation is recursive, so very deep DAGs can overflow the call stack and abort the process at runtime. Use an explicit stack-based iterative DFS (or document/enforce a depth limit) to avoid stack-overflow crashes on large dependency chains. [possible bug]
Severity Level: Major ⚠️
- ❌ dfs_sort panics on deep DAGs via stack overflow.
- ⚠️ Long dependency chains unsafe unless using kahn_sort alternative.Steps of Reproduction ✅
1. In consumer code, construct a very deep DAG (e.g., a long linear chain) using the
public `Dag` API in `crates/byteport-dag/src/dag.rs`: create `Dag<K>` via `Dag::new()`
(line 51), repeatedly call `add_node` (lines 55–63) for nodes `n0..nN`, and call
`add_edge` (lines 66–79) for each consecutive pair `(ni, ni+1)` to build a chain thousands
of nodes deep.
2. Call the public topological sort function `topo::dfs_sort(&dag)` defined in
`crates/byteport-dag/src/topo.rs:81–97`, which initializes the `color` map and `order`
vector, then iterates `for start in dag.iter_nodes()` (line 88) and calls `visit(dag,
start, &mut color, &mut order)?` (line 90) for white nodes.
3. Inside `visit` (topo.rs:99–128), for each node it iterates through
`dag.children_of(node)` (lines 110–112), and on encountering a white child it recursively
calls `visit(dag, child, color, order)?` (line 119), causing the call stack depth to grow
in proportion to the longest path length in the DAG.
4. For sufficiently deep DAGs, these recursive `visit` calls form a call chain of length
N, and on typical systems a large N (e.g., tens of thousands) will exhaust the process
stack and trigger a stack overflow panic, aborting the process at runtime for callers that
rely on `dfs_sort` instead of the non-recursive `kahn_sort`.(Use Cmd/Ctrl + Click for best experience)
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** crates/byteport-dag/src/topo.rs
**Line:** 119:119
**Comment:**
*Possible Bug: The DFS implementation is recursive, so very deep DAGs can overflow the call stack and abort the process at runtime. Use an explicit stack-based iterative DFS (or document/enforce a depth limit) to avoid stack-overflow crashes on large dependency chains.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix|
CodeAnt AI finished reviewing your PR. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 46a18222a1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| self.children.entry(from.clone()).or_default().push(to.clone()); | ||
| self.parents.entry(to).or_default().push(from); |
There was a problem hiding this comment.
When a caller adds an edge that closes a path back to its source, such as a -> b -> c followed by c -> a, this method returns Ok(()) and mutates both adjacency lists, so the public Dag type can contain cycles despite being documented as acyclic and despite exposing CycleDetected. This also lets DagSchema::into_dag() accept cyclic schemas as valid Dags until a later scheduler/toposort call happens to catch them.
Useful? React with 👍 / 👎.



User description
Summary
Add enriched DAG schema types to the BytePort DAG foundation, supporting prerequisites, acceptance criteria, and audit hooks for Phenotype compute/infra automation.
Context
The DAG crate (byteport-dag) provides core data structures (Dag, topological sort, parallel-bucket scheduler) but lacks domain-specific schema enrichment needed for Phenotype compute/infra. This change adds the schema layer so each node can carry pre-execution gating conditions, post-execution validation rules, and lifecycle audit hooks — all serializable to YAML/JSON for portability.
This is unit F3 in the DAG foundation + automation epic (epic_F).
Changes
schemamodule (src/schema.rs): Enriched node/edge types with:Prerequisiteenum (ImageReady, SecretAvailable, ResourceExists, ApiHealthy, FileExists, EnvironmentVariable, CustomScript)AcceptanceCriterionenum (ExitCode, OutputContains, HttpOk, MetricThreshold, LogCheck, CustomCheck)AuditHookenum (Webhook, LogEntry, MetricEmit, Notify) withHookTiming(Pre, Post, OnSuccess, OnFailure)SchemaNodewith optionallabel,description,prerequisites,acceptance,audit_hooks, and arbitrarymetadataSchemaEdgewith optionallabelandconditionserializemodule (src/serialize.rs):DagSchemanow uses the shared enriched types fromschemamodule; expanded test coverage to 22 tests including round-trip and cross-format consistencyCargo.toml): Addedcrates/byteport-dagto workspace membersKey Implementation Details
#[serde(tag = "type")]internally-tagged enum representation for clean YAML/JSON outputPartialEqomitted from types containingf64fields (MetricThreshold,MetricEmit) sincef64does not implementEqUse Cases
Testing
Links
CodeAnt-AI Description
Add a reusable DAG library with scheduling and portable schema output
What Changed
Impact
✅ Clearer DAG validation✅ Faster parallel task planning✅ Portable workflow definitions💡 Usage Guide
Checking Your Pull Request
Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.
Talking to CodeAnt AI
Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:
This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.
Example
Preserve Org Learnings with CodeAnt
You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:
This helps CodeAnt AI learn and adapt to your team's coding style and standards.
Example
Retrigger review
Ask CodeAnt AI to review the PR again, by typing:
Check Your Repository Health
To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.