Provide an OpenAI Client training example with reinforcement learning #435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Conversation
Pull request overview
This PR adds a minimal reinforcement learning training example that demonstrates how to train agents using the OpenAI Client without requiring a specific framework. The example uses the GSM8K dataset with the Qwen2.5-1.5B-Instruct model for a question-answering task with exact-match rewards.
Key changes:
- Implements a reinforcement learning training script using the VERL framework with the GRPO advantage estimator
- Creates an async agent function that queries an LLM endpoint via OpenAI client and emits rewards based on exact answer matching
- Provides setup documentation for running the example on a single A100 80GB GPU with Ray cluster
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| examples/openai_client/train.py | Implements the main training script with VERL configuration, GSM8K agent logic, dataset loading, and trainer setup |
| examples/openai_client/README.md | Provides quick start instructions for Ray cluster setup and training execution |
```
@@ -0,0 +1,25 @@
# OpenAI Client Example

This is a minimal example demonstrating how to use the OpenAI client to query an LLM endpoint and train a model with reinforcement learning using `verl`.
```
Copilot (AI) commented on Dec 26, 2025
The reference to 'verl' in the description is lowercase, but it appears to be a proper noun referring to the VERL reinforcement learning framework. Consider using consistent capitalization (VERL) throughout the documentation.
Suggested change:
```diff
-This is a minimal example demonstrating how to use the OpenAI client to query an LLM endpoint and train a model with reinforcement learning using `verl`.
+This is a minimal example demonstrating how to use the OpenAI client to query an LLM endpoint and train a model with reinforcement learning using `VERL`.
```
```python
            answer = answer.group(1)
        else:
            answer = last_message
    except Exception as e:
```
Copilot (AI), Dec 26, 2025
Using broad exception catching with 'except Exception as e' can mask unexpected errors and make debugging difficult. Consider catching more specific exceptions (e.g., OpenAI-specific exceptions, network errors) to handle different failure modes appropriately.
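A sketch of the narrower handling this comment suggests. The exception classes below are stand-ins defined locally so the example is self-contained; in the real script they would be the `openai` package's own types (e.g. `openai.APIConnectionError`, `openai.RateLimitError`), and the helper function is hypothetical.

```python
import re

# Stand-ins for the openai package's exception types; in the training script
# these would be imported from `openai` instead of defined here.
class APIConnectionError(Exception):
    pass

class RateLimitError(Exception):
    pass

ANSWER_PATTERN = r"####\s*(.+?)\s*$"

def query_and_extract(call_llm, pattern: str = ANSWER_PATTERN) -> str:
    """Query the LLM and extract the final answer, with a narrowed except clause."""
    try:
        last_message = call_llm()
    except (APIConnectionError, RateLimitError) as e:
        # Transient network / rate-limit failures: log and fall back.
        # Programming errors (KeyError, AttributeError, ...) still surface.
        print("LLM call failed:", e)
        return "None"
    match = re.search(pattern, last_message)
    return match.group(1) if match else last_message
```

Catching only the transport-level failures keeps genuine bugs in the agent logic visible instead of silently converting them into a zero-reward answer.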
```python
    except Exception as e:
        print("Failure:", str(e))
        answer = "None"
    gt_answer = re.search(regex_pattern, task["answer"]).group(1)
```
Copilot (AI), Dec 26, 2025
The regular expression search can return None if the pattern is not found in the ground truth answer, which would cause an AttributeError when calling .group(1). This should be handled similarly to how the answer extraction is handled on lines 118-122.
Suggested change:
```diff
-    gt_answer = re.search(regex_pattern, task["answer"]).group(1)
+    gt_match = re.search(regex_pattern, task["answer"])
+    if gt_match:
+        gt_answer = gt_match.group(1)
+    else:
+        gt_answer = task["answer"]
```
```python
    # Log some responses for better clarity
    if random.random() < 0.01:
        print(
            f"--------\nQuestion: {task['question']}\nResponse: {last_message}\nGround Truth: {gt_answer}\nReward: {reward}\n"
```
Copilot (AI), Dec 26, 2025
The variable 'last_message' may not be defined if an exception occurs before line 116. When the exception is caught on line 123, the code on line 140 will reference 'last_message' in the print statement, potentially causing an UnboundLocalError.
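One way to address this, sketched with a simplified, hypothetical stand-in for the agent function: initialize `last_message` and `answer` before the `try` block so every later reference (including the logging path) is always bound, even when the LLM call raises first.

```python
def run_one_task(call_llm) -> tuple[str, str]:
    # Initialize before the try block so references after the except clause
    # can never hit an UnboundLocalError, even if call_llm() raises before
    # the first assignment.
    last_message = "None"
    answer = "None"
    try:
        last_message = call_llm()
        answer = last_message.rsplit("####", 1)[-1].strip()
    except Exception as e:
        print("Failure:", e)
    # Safe: last_message is always bound here.
    print(f"Response: {last_message}")
    return last_message, answer
```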
```python
        else:
            answer = last_message
    except Exception as e:
        print("Failure:", str(e))
```
Copilot (AI), Dec 26, 2025
The error message only prints the exception string without logging which step failed or what input caused the failure. Consider improving the error message to include context such as the question being processed for better debugging.
Suggested change:
```diff
-        print("Failure:", str(e))
+        print(
+            f"Failure while processing question: {task['question']!r}. Error: {e}"
+        )
         last_message = "None"
```
```python
    openai_base_url = llm.endpoint
    temperature = llm.sampling_parameters.get("temperature", 1.0)

    client = AsyncOpenAI(
```
Copilot (AI), Dec 26, 2025
Hardcoding 'dummy' as the API key could be misleading for users who might expect they need to provide a valid API key. Consider adding a comment explaining why this is acceptable in this context (because it's querying a local endpoint).
Suggested change:
```diff
     client = AsyncOpenAI(
+        # Using a dummy API key is fine here because we are querying a local LLM
+        # proxy endpoint that does not require a real OpenAI API key.
```
```python
        api_key="dummy",
        base_url=openai_base_url,
    )
    regex_pattern = r"####\s*(.+)(\s*|$)"
```
Copilot (AI), Dec 26, 2025
The regex pattern is compiled on every function call. For better performance, consider defining this pattern as a module-level constant since it's static and reused across all invocations.
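A sketch of the module-level constant the comment proposes, using a slightly tidied variant of the PR's pattern (the name `GSM8K_ANSWER_RE` is illustrative):

```python
import re

# Compiled once at import time; reused by every agent invocation.
GSM8K_ANSWER_RE = re.compile(r"####\s*(.+?)\s*$")

def extract_answer(text: str, fallback: str = "None") -> str:
    match = GSM8K_ANSWER_RE.search(text)
    return match.group(1) if match else fallback
```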
```python
    client = AsyncOpenAI(
        api_key="dummy",
        base_url=openai_base_url,
    )
```
Copilot (AI), Dec 26, 2025
The AsyncOpenAI client is instantiated on every function call. Consider moving client creation outside the function or reusing a single client instance to avoid the overhead of creating new clients repeatedly.
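One possible shape for this, sketched with a local stand-in class so the snippet runs without the `openai` package: cache one client per endpoint with `functools.lru_cache` and reuse it across calls.

```python
from functools import lru_cache

class AsyncOpenAI:
    # Stand-in for openai.AsyncOpenAI; the real script would import it.
    def __init__(self, api_key: str, base_url: str):
        self.api_key = api_key
        self.base_url = base_url

@lru_cache(maxsize=None)
def get_client(base_url: str) -> AsyncOpenAI:
    # Constructed once per endpoint and reused on every subsequent call.
    return AsyncOpenAI(api_key="dummy", base_url=base_url)
```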
```bash
ray stop
env WANDB_API_KEY=XXXXX RAY_DEBUG=legacy HYDRA_FULL_ERROR=1 VLLM_USE_V1=1 ray start --head --dashboard-host=0.0.0.0
```
Copilot (AI), Dec 26, 2025
The ray start command uses --dashboard-host=0.0.0.0, which exposes the Ray dashboard on all network interfaces and can allow anyone on the network to access and control the Ray cluster (including running arbitrary code) if additional protections are not in place. In environments where this command is copied as-is (e.g., shared clusters or cloud VMs), this creates a real risk of remote compromise. Consider binding the dashboard to 127.0.0.1 by default or explicitly documenting that 0.0.0.0 should only be used behind proper network access controls (e.g., firewall, SSH tunnel, or authenticated proxy).
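A hedged alternative for the README along the lines the comment suggests: bind the dashboard to loopback by default, and forward the port over SSH when remote access is needed (the host name below is a placeholder).

```bash
ray stop
env WANDB_API_KEY=XXXXX RAY_DEBUG=legacy HYDRA_FULL_ERROR=1 VLLM_USE_V1=1 \
    ray start --head --dashboard-host=127.0.0.1

# To view the dashboard from another machine, tunnel it over SSH
# instead of exposing it on 0.0.0.0:
ssh -L 8265:127.0.0.1:8265 user@cluster-node
```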
This PR provides a minimal reinforcement learning training example using the OpenAI Client.
It should be useful for those who want to train their agents without being tied to a specific framework.
Related to:
#320