Simulator Hangs When Using STG-Generated Chakra Traces with Astra-sim + NS3 #18

@Leonard226

I am using symbolic_tensor_graph (STG) to generate synthetic Chakra traces in the .et format, with the following command:

python3 /opt/symbolic_tensor_graph/main.py \
	--output_dir output \
	--output_name workload.%d.et \
	--comm_group_file comm_group.json \
	--chakra_schema_version v0.0.4 \
	--dp 8 \
	--pp 1 \
	--tp 1 \
	--sp 1 \
	--weight_sharded 0 \
	--batch 32 \
	--din 2048 \
	--dout 2048 \
	--dmodel 2048 \
	--dff 8192 \
	--seq 1024 \
	--head 24 \
	--num_stacks 16

This successfully generates the following files:

comm_group.json  workload.1.et	workload.3.et  workload.5.et  workload.7.et
workload.0.et	 workload.2.et	workload.4.et  workload.6.et
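
As a sanity check on the generated set, here is a minimal sketch that lists the trace files one would expect from the command above and reports any that are missing. It assumes STG writes one trace per rank and that the rank count is the product of the parallelism degrees (dp, pp, tp, sp), which matches the eight files listed here. The helper names `expected_traces` and `missing_traces` are hypothetical and not part of STG.

```python
from math import prod
from pathlib import Path


def expected_traces(output_dir, output_name, degrees):
    """Build the trace paths expected for the given parallelism degrees.

    Assumption: one trace per rank, rank count = product of all degrees.
    """
    num_ranks = prod(degrees.values())
    return [Path(output_dir) / (output_name % rank) for rank in range(num_ranks)]


def missing_traces(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not p.is_file()]


# Values copied from the STG command above.
degrees = {"dp": 8, "pp": 1, "tp": 1, "sp": 1}
paths = expected_traces("output", "workload.%d.et", degrees)
print(len(paths), "traces expected;", len(missing_traces(paths)), "missing")
```

Run from the directory that contains `output`, this should report zero missing files if generation completed for every rank.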

Now I want to feed these traces into ASTRA-sim + NS3, passing the path to the workload.%d.et traces and the communicator group file comm_group.json, as shown here:

WORKLOAD=opt/synthetic_traces/ml/output/workload
SYSTEM=system.json
NETWORK=config.txt
MEMORY=remote_memory.json
LOGICAL_TOPOLOGY=logical_topology.json
COMM_GROUP=/opt/synthetic_traces/ml/output/comm_group.json

${NS3_BIN} \
	--workload-configuration=${WORKLOAD} \
	--system-configuration=${SYSTEM} \
	--network-configuration=${NETWORK} \
	--logical-topology-configuration=${LOGICAL_TOPOLOGY} \
	--remote-memory-configuration=${MEMORY} \
	--comm-group-configuration=${COMM_GROUP}
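
Before launching, it may help to confirm that every path handed to the simulator resolves from the current working directory. Note that WORKLOAD above is a relative path, while COMM_GROUP is absolute. The sketch below is a hypothetical pre-flight check, not part of ASTRA-sim. It assumes the simulator opens `<prefix>.<rank>.et` for each NPU, which is consistent with WORKLOAD ending in `workload` and the files being named `workload.N.et`.

```python
import os


def preflight(workload_prefix, num_npus, other_files):
    """Return a list of human-readable problems; an empty list means all paths resolve."""
    problems = []
    if not os.path.isabs(workload_prefix):
        # A relative prefix is resolved against the current working directory.
        problems.append(
            f"workload prefix is relative: {workload_prefix!r} "
            f"(resolves to {os.path.abspath(workload_prefix)!r})"
        )
    for rank in range(num_npus):
        # Assumed naming scheme: <prefix>.<rank>.et
        trace = f"{workload_prefix}.{rank}.et"
        if not os.path.isfile(trace):
            problems.append(f"missing trace: {trace}")
    for path in other_files:
        if not os.path.isfile(path):
            problems.append(f"missing file: {path}")
    return problems


# Values copied from the launch script above.
issues = preflight(
    "opt/synthetic_traces/ml/output/workload",
    8,
    ["system.json", "config.txt", "remote_memory.json", "logical_topology.json",
     "/opt/synthetic_traces/ml/output/comm_group.json"],
)
for line in issues:
    print(line)
```

An empty result rules out path problems. Any reported line points at a file the simulator would not find from this working directory.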

However, when I start the simulation, the simulator gets stuck here:

ASTRA-sim + NS3
There are 8 npus: 8,
[2025-07-08 11:34:43.558] [system::topology::RingTopology] [info] ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1 total nodes in ring: 8
[2025-07-08 11:34:43.558] [system::topology::RingTopology] [info] ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1 total nodes in ring: 8
[2025-07-08 11:34:43.558] [system::topology::RingTopology] [info] ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1 total nodes in ring: 8
[2025-07-08 11:34:43.558] [system::topology::RingTopology] [info] ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1 total nodes in ring: 8
QP is enabled 
maxRtt=161 maxBdp=9016
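
To tell a genuine hang from a slow run, one option is to wrap the launch in a wall-clock timeout and keep whatever output was produced before the cutoff. The sketch below uses only Python's standard `subprocess` module. The helper `run_with_timeout` is hypothetical, and the stand-in command merely sleeps; in practice the argument list would be the NS3 binary plus the flags shown earlier.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, output). The child is killed on timeout."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
        return True, done.stdout
    except subprocess.TimeoutExpired as exc:
        # Partial output may be None or bytes depending on the platform.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return False, out


# Stand-in for the simulator: a child process that sleeps past the timeout.
stand_in = [sys.executable, "-c", "import time; time.sleep(30)"]
finished, output = run_with_timeout(stand_in, timeout_s=1)
print("finished" if finished else "timed out", "| captured", len(output), "chars")
```

If the run times out with the last captured line still being the `maxRtt=... maxBdp=...` message, that matches the behaviour described here.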

I am confident that the issue is not the topology, system or memory configuration.
Additionally, I found this slide stating that this is a known issue that is currently being addressed. However, PR #167 has since been closed. Has a solution been found?

[Image: the slide referenced above]

Source: https://github.com/astra-sim/symbolic_tensor_graph/blob/main/docs/stg_demo_241006.pptx
