
Conversation

@hzy46
Contributor

@hzy46 hzy46 commented Dec 26, 2025

This PR provides a minimal reinforcement learning training example using the OpenAI Client.

It should be useful for those who want to train their agents without committing to a specific framework.

Related to:
#320

Copilot AI review requested due to automatic review settings December 26, 2025 09:36
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a minimal reinforcement learning training example that demonstrates how to train agents using the OpenAI Client without requiring a specific framework. The example uses the GSM8K dataset with the Qwen2.5-1.5B-Instruct model for a question-answering task with exact-match rewards.

Key changes:

  • Implements a reinforcement learning training script using VERL algorithm with GRPO advantage estimator
  • Creates an async agent function that queries an LLM endpoint via OpenAI client and emits rewards based on exact answer matching
  • Provides setup documentation for running the example on a single A100 80GB GPU with Ray cluster
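The exact-match reward summarized above can be sketched roughly as follows; the `####` marker pattern is copied from the diff snippets reviewed below, while `exact_match_reward` and its arguments are illustrative names, not the PR's actual function:

```python
import re

# GSM8K ground-truth answers end with "#### <number>", and the agent is
# prompted to emit the same marker. Pattern copied from the reviewed diff.
GSM8K_PATTERN = r"####\s*(.+)(\s*|$)"

def exact_match_reward(response: str, gt_answer_field: str) -> float:
    """Return 1.0 when the extracted answers match exactly, else 0.0."""
    pred = re.search(GSM8K_PATTERN, response)
    gt = re.search(GSM8K_PATTERN, gt_answer_field)
    if pred is None or gt is None:
        return 0.0
    return 1.0 if pred.group(1).strip() == gt.group(1).strip() else 0.0
```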

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| examples/openai_client/train.py | Implements the main training script with VERL configuration, GSM8K agent logic, dataset loading, and trainer setup |
| examples/openai_client/README.md | Provides quick start instructions for Ray cluster setup and training execution |


@@ -0,0 +1,25 @@
# OpenAI Client Example

This is a minimal example demonstrating how to use the OpenAI client to query an LLM endpoint and train a model with reinforcement learning using `verl`.

Copilot AI Dec 26, 2025


The reference to 'verl' in the description is lowercase, but it appears to be a proper noun referring to the VERL reinforcement learning framework. Consider using consistent capitalization (VERL) throughout the documentation.

Suggested change
This is a minimal example demonstrating how to use the OpenAI client to query an LLM endpoint and train a model with reinforcement learning using `verl`.
This is a minimal example demonstrating how to use the OpenAI client to query an LLM endpoint and train a model with reinforcement learning using `VERL`.

        answer = answer.group(1)
    else:
        answer = last_message
except Exception as e:

Copilot AI Dec 26, 2025


Using broad exception catching with 'except Exception as e' can mask unexpected errors and make debugging difficult. Consider catching more specific exceptions (e.g., OpenAI-specific exceptions, network errors) to handle different failure modes appropriately.
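A minimal sketch of the narrower pattern this comment suggests, using stdlib exceptions as stand-ins (in the real script, the `openai` package's `APIConnectionError`, `RateLimitError`, etc. would be the expected failure modes); `parse_reward` is an illustrative name, not code from the PR:

```python
def parse_reward(raw) -> float:
    """Convert a raw score to a float, defaulting to 0.0 on bad input."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        # Only the failures we anticipate; anything unexpected still propagates
        # loudly instead of being silently swallowed.
        return 0.0
```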

except Exception as e:
    print("Failure:", str(e))
    answer = "None"
gt_answer = re.search(regex_pattern, task["answer"]).group(1)

Copilot AI Dec 26, 2025


The regular expression search can return None if the pattern is not found in the ground truth answer, which would cause an AttributeError when calling .group(1). This should be handled similarly to how the answer extraction is handled on lines 118-122.

Suggested change
gt_answer = re.search(regex_pattern, task["answer"]).group(1)
gt_match = re.search(regex_pattern, task["answer"])
if gt_match:
    gt_answer = gt_match.group(1)
else:
    gt_answer = task["answer"]

# Log some responses for better clarity
if random.random() < 0.01:
    print(
        f"--------\nQuestion: {task['question']}\nResponse: {last_message}\nGround Truth: {gt_answer}\nReward: {reward}\n"

Copilot AI Dec 26, 2025


The variable 'last_message' may not be defined if an exception occurs before line 116. When the exception is caught on line 123, the code on line 140 will reference 'last_message' in the print statement, potentially causing an UnboundLocalError.
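A common fix for the issue flagged here is to give `last_message` a default before the `try`; a minimal sketch of the pattern (the function name and the `fetch` callable are illustrative, not the PR's code):

```python
def query_with_default(fetch) -> str:
    # Assigned before the try, so later logging of last_message can never
    # raise UnboundLocalError even if fetch() fails immediately.
    last_message = "None"
    try:
        last_message = fetch()
    except Exception as e:
        print("Failure:", str(e))
    return last_message
```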

    else:
        answer = last_message
except Exception as e:
    print("Failure:", str(e))

Copilot AI Dec 26, 2025


The error message only prints the exception string without logging which step failed or what input caused the failure. Consider improving the error message to include context such as the question being processed for better debugging.

Suggested change
print("Failure:", str(e))
print(
    f"Failure while processing question: {task['question']!r}. Error: {e}"
)
last_message = "None"

openai_base_url = llm.endpoint
temperature = llm.sampling_parameters.get("temperature", 1.0)

client = AsyncOpenAI(

Copilot AI Dec 26, 2025


Hardcoding 'dummy' as the API key could be misleading for users who might expect they need to provide a valid API key. Consider adding a comment explaining why this is acceptable in this context (because it's querying a local endpoint).

Suggested change
client = AsyncOpenAI(
client = AsyncOpenAI(
    # Using a dummy API key is fine here because we are querying a local LLM proxy
    # endpoint that does not require a real OpenAI API key.

    api_key="dummy",
    base_url=openai_base_url,
)
regex_pattern = r"####\s*(.+)(\s*|$)"

Copilot AI Dec 26, 2025


The regex pattern is compiled on every function call. For better performance, consider defining this pattern as a module-level constant since it's static and reused across all invocations.
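Hoisting the pattern to module scope might look like this; the pattern is copied from the diff, while `ANSWER_PATTERN` and `extract_answer` are illustrative names:

```python
import re

# Compiled once at import time rather than looked up on every call.
ANSWER_PATTERN = re.compile(r"####\s*(.+)(\s*|$)")

def extract_answer(text: str):
    """Return the text following '####', or None when the marker is absent."""
    match = ANSWER_PATTERN.search(text)
    return match.group(1).strip() if match else None
```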

Comment on lines +101 to +104
client = AsyncOpenAI(
    api_key="dummy",
    base_url=openai_base_url,
)

Copilot AI Dec 26, 2025


The AsyncOpenAI client is instantiated on every function call. Consider moving client creation outside the function or reusing a single client instance to avoid the overhead of creating new clients repeatedly.
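One way to reuse a single client is to memoize the factory. The sketch below uses `functools.lru_cache` with a stand-in class in place of `AsyncOpenAI` so that it is self-contained; all names here are illustrative:

```python
from functools import lru_cache

class FakeClient:
    """Stand-in for AsyncOpenAI, counting how often the constructor runs."""
    constructions = 0

    def __init__(self, base_url: str):
        FakeClient.constructions += 1
        self.base_url = base_url

@lru_cache(maxsize=None)
def get_client(base_url: str) -> FakeClient:
    # Constructed once per distinct base_url, then reused across calls. With
    # the real AsyncOpenAI this avoids rebuilding the HTTP connection pool
    # on every agent invocation.
    return FakeClient(base_url)
```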


```bash
ray stop
env WANDB_API_KEY=XXXXX RAY_DEBUG=legacy HYDRA_FULL_ERROR=1 VLLM_USE_V1=1 ray start --head --dashboard-host=0.0.0.0
```

Copilot AI Dec 26, 2025


The ray start command uses --dashboard-host=0.0.0.0, which exposes the Ray dashboard on all network interfaces and can allow anyone on the network to access and control the Ray cluster (including running arbitrary code) if additional protections are not in place. In environments where this command is copied as-is (e.g., shared clusters or cloud VMs), this creates a real risk of remote compromise. Consider binding the dashboard to 127.0.0.1 by default or explicitly documenting that 0.0.0.0 should only be used behind proper network access controls (e.g., firewall, SSH tunnel, or authenticated proxy).
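One way to follow this advice, binding the dashboard to loopback and reaching it over an SSH tunnel (host and user names are placeholders; Ray's default dashboard port is 8265):

```shell
ray stop
# Bind the Ray dashboard to loopback so it is not reachable from the network.
env WANDB_API_KEY=XXXXX RAY_DEBUG=legacy HYDRA_FULL_ERROR=1 VLLM_USE_V1=1 \
    ray start --head --dashboard-host=127.0.0.1
# From your workstation, open http://localhost:8265 after tunneling in:
#   ssh -L 8265:127.0.0.1:8265 user@training-host
```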
