
About increasing training num_envs #19

@yitingHH-bit


Hello apirrone,

Actually, I would like to increase my training num_envs to 4096, but I have not yet found where to set it.

Thanks for your time.

Also, I ran into this weird issue when running:
uv run playground/open_duck_mini_v2/runner.py --task flat_terrain_backlash --num_timesteps 300000000

[Poly ref data] Processing ...
[Poly ref data] Done processing
/home/jack/NA3/Open_Duck_Playground/.venv/lib/python3.12/site-packages/jax/_src/abstract_arrays.py:111: RuntimeWarning: overflow encountered in cast
return literals.LiteralArray(np.asarray(x, dtype), weak_type=False)
Observation size: 101
PPO params: {'action_repeat': 1, 'batch_size': 256, 'clipping_epsilon': 0.2, 'discounting': 0.97, 'entropy_cost': 0.005, 'episode_length': 1000, 'learning_rate': 0.0003, 'max_grad_norm': 1.0, 'normalize_observations': True, 'num_envs': 8192, 'num_evals': 15, 'num_minibatches': 32, 'num_resets_per_eval': 1, 'num_timesteps': 300000000, 'num_updates_per_batch': 4, 'reward_scaling': 1.0, 'unroll_length': 20}

STEP: 0 reward: 13.509370803833008 reward_std: 9.080975532531738

Saving checkpoint (step: 0): /home/jack/NA3/Open_Duck_Playground/checkpoints/2025_09_18_183319_0
WARNING:absl:[process=0][thread=MainThread][operation_id=1] _SignalingThread.join() waiting for signals ([]) blocking the main thread will slow down blocking save times. This is likely due to main thread calling result() on a CommitFuture.
=== EXPORT ONNX ===
W0000 00:00:1758209599.751697 23297 gpu_device.cc:2430] TensorFlow was not built with CUDA kernel binaries compatible with compute capability 12.0. CUDA kernels will be jit-compiled from PTX, which could take 30 minutes or longer.
W0000 00:00:1758209599.754798 23297 gpu_device.cc:2430] TensorFlow was not built with CUDA kernel binaries compatible with compute capability 12.0. CUDA kernels will be jit-compiled from PTX, which could take 30 minutes or longer.
I0000 00:00:1758209599.757384 23297 gpu_device.cc:2019] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1513 MB memory: -> device: 0, name: NVIDIA GeForce RTX 5060 Ti, pci bus id: 0000:01:00.0, compute capability: 12.0
(101,) (101,)
2025-09-18 18:33:19.832019: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleLoadData(&module, data)' failed with 'CUDA_ERROR_INVALID_PTX'

2025-09-18 18:33:19.832037: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleGetFunction(&function, module, kernel_name)' failed with 'CUDA_ERROR_INVALID_HANDLE'

2025-09-18 18:33:19.832044: W tensorflow/core/framework/op_kernel.cc:1844] INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
(101,) (101,)
2025-09-18 18:33:19.835539: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleLoadData(&module, data)' failed with 'CUDA_ERROR_INVALID_PTX'

2025-09-18 18:33:19.835550: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleGetFunction(&function, module, kernel_name)' failed with 'CUDA_ERROR_INVALID_HANDLE'

2025-09-18 18:33:19.835555: W tensorflow/core/framework/op_kernel.cc:1844] INTERNAL: 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
Traceback (most recent call last):
  File "/home/jack/NA3/Open_Duck_Playground/playground/open_duck_mini_v2/runner.py", line 65, in <module>
    main()
  File "/home/jack/NA3/Open_Duck_Playground/playground/open_duck_mini_v2/runner.py", line 61, in main
    runner.train()
  File "/home/jack/NA3/Open_Duck_Playground/playground/common/runner.py", line 114, in train
    _, params, _ = train_fn(
                   ^^^^^^^^^
  File "/home/jack/NA3/Open_Duck_Playground/.venv/lib/python3.12/site-packages/brax/training/agents/ppo/train.py", line 697, in train
    policy_params_fn(current_step, make_policy, params)
  File "/home/jack/NA3/Open_Duck_Playground/playground/common/runner.py", line 78, in policy_params_fn
    export_onnx(
  File "/home/jack/NA3/Open_Duck_Playground/playground/common/export_onnx.py", line 105, in export_onnx
    example_output = tf_policy_network(example_input)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jack/NA3/Open_Duck_Playground/.venv/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/jack/NA3/Open_Duck_Playground/playground/common/export_onnx.py", line 69, in call
    inputs = (inputs - self.mean) / self.std
              ~~~~~~~^~~~~~~~~~~
tensorflow.python.framework.errors_impl.InternalError: Exception encountered when calling MLP.call().

{{function_node _wrapped__Sub_device/job:localhost/replica:0/task:0/device:GPU:0}} 'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE' [Op:Sub] name:

Arguments received by MLP.call():
• inputs=tf.Tensor(shape=(1, 101), dtype=float32)
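
For the num_envs question, here is a rough sketch of the sanity check I would expect to matter when changing it. It copies a subset of the values from the "PPO params" line in the log above. The helper names `with_num_envs` and `env_steps_per_training_step` are hypothetical, not part of this repository, and the divisibility rule reflects my understanding of the Brax PPO trainer rather than anything confirmed here. It does not show where the repository actually sets num_envs.

```python
# Hypothetical helpers for checking a num_envs override; not part of this repository.

def with_num_envs(ppo_params: dict, num_envs: int) -> dict:
    """Return a copy of ppo_params with num_envs replaced, after a sanity check.

    Assumption: the PPO trainer needs batch_size * num_minibatches
    to be divisible by num_envs.
    """
    updated = dict(ppo_params, num_envs=num_envs)
    rollout_size = updated["batch_size"] * updated["num_minibatches"]
    if rollout_size % num_envs != 0:
        raise ValueError(
            f"batch_size * num_minibatches = {rollout_size} "
            f"is not divisible by num_envs = {num_envs}"
        )
    return updated


def env_steps_per_training_step(ppo_params: dict) -> int:
    """Environment steps consumed by one training step (assumed formula)."""
    return (
        ppo_params["batch_size"]
        * ppo_params["unroll_length"]
        * ppo_params["num_minibatches"]
        * ppo_params["action_repeat"]
    )


# Subset of the values printed in the log above.
params = {
    "action_repeat": 1,
    "batch_size": 256,
    "num_envs": 8192,
    "num_minibatches": 32,
    "unroll_length": 20,
}

updated = with_num_envs(params, 4096)
print(updated["num_envs"])
print(env_steps_per_training_step(updated))
```

Under these assumptions, 4096 passes the check with the logged batch_size and num_minibatches, and the number of environment steps per training step does not depend on num_envs.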

