
Conversation

@hzy46
Contributor

@hzy46 hzy46 commented Dec 26, 2025

This PR provides a minimal reinforcement learning training example using the OpenAI Client.

It should be useful for those who want to train their agents without committing to a specific framework.

Related to:
#320

Copilot AI review requested due to automatic review settings December 26, 2025 09:36
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a minimal reinforcement learning training example that demonstrates how to train agents using the OpenAI Client without requiring a specific framework. The example uses the GSM8K dataset with the Qwen2.5-1.5B-Instruct model for a question-answering task with exact-match rewards.

Key changes:

  • Implements a reinforcement learning training script using VERL algorithm with GRPO advantage estimator
  • Creates an async agent function that queries an LLM endpoint via OpenAI client and emits rewards based on exact answer matching
  • Provides setup documentation for running the example on a single A100 80GB GPU with Ray cluster
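The exact-match reward summarized above can be sketched roughly as follows; the `####` marker pattern is copied from the diff snippets reviewed below, while `exact_match_reward` and its arguments are illustrative names, not the PR's actual function:

```python
import re

# GSM8K ground-truth answers end with "#### <number>", and the agent is
# prompted to emit the same marker. Pattern copied from the reviewed diff.
GSM8K_PATTERN = r"####\s*(.+)(\s*|$)"

def exact_match_reward(response: str, gt_answer_field: str) -> float:
    """Return 1.0 when the extracted answers match exactly, else 0.0."""
    pred = re.search(GSM8K_PATTERN, response)
    gt = re.search(GSM8K_PATTERN, gt_answer_field)
    if pred is None or gt is None:
        return 0.0
    return 1.0 if pred.group(1).strip() == gt.group(1).strip() else 0.0
```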

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| examples/openai_client/train.py | Implements the main training script with VERL configuration, GSM8K agent logic, dataset loading, and trainer setup |
| examples/openai_client/README.md | Provides quick start instructions for Ray cluster setup and training execution |


@@ -0,0 +1,25 @@
# OpenAI Client Example

This is a minimal example demonstrating how to use the OpenAI client to query an LLM endpoint and train a model with reinforcement learning using `verl`.

Copilot AI Dec 26, 2025


The reference to 'verl' in the description is lowercase, but it appears to be a proper noun referring to the VERL reinforcement learning framework. Consider using consistent capitalization (VERL) throughout the documentation.

Suggested change
This is a minimal example demonstrating how to use the OpenAI client to query an LLM endpoint and train a model with reinforcement learning using `verl`.
This is a minimal example demonstrating how to use the OpenAI client to query an LLM endpoint and train a model with reinforcement learning using `VERL`.

        answer = answer.group(1)
    else:
        answer = last_message
except Exception as e:

Copilot AI Dec 26, 2025


Using broad exception catching with 'except Exception as e' can mask unexpected errors and make debugging difficult. Consider catching more specific exceptions (e.g., OpenAI-specific exceptions, network errors) to handle different failure modes appropriately.
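A minimal sketch of the narrower pattern this comment suggests, using stdlib exceptions as stand-ins (in the real script, the `openai` package's `APIConnectionError`, `RateLimitError`, etc. would be the expected failure modes); `parse_reward` is an illustrative name, not code from the PR:

```python
def parse_reward(raw) -> float:
    """Convert a raw score to a float, defaulting to 0.0 on bad input."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        # Only the failures we anticipate; anything unexpected still propagates
        # loudly instead of being silently swallowed.
        return 0.0
```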

except Exception as e:
    print("Failure:", str(e))
    answer = "None"
gt_answer = re.search(regex_pattern, task["answer"]).group(1)

Copilot AI Dec 26, 2025


The regular expression search can return None if the pattern is not found in the ground truth answer, which would cause an AttributeError when calling .group(1). This should be handled similarly to how the answer extraction is handled on lines 118-122.

Suggested change
gt_answer = re.search(regex_pattern, task["answer"]).group(1)
gt_match = re.search(regex_pattern, task["answer"])
if gt_match:
    gt_answer = gt_match.group(1)
else:
    gt_answer = task["answer"]

# Log some responses for better clarity
if random.random() < 0.01:
    print(
        f"--------\nQuestion: {task['question']}\nResponse: {last_message}\nGround Truth: {gt_answer}\nReward: {reward}\n"

Copilot AI Dec 26, 2025


The variable 'last_message' may not be defined if an exception occurs before line 116. When the exception is caught on line 123, the code on line 140 will reference 'last_message' in the print statement, potentially causing an UnboundLocalError.
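A common fix for the issue flagged here is to give `last_message` a default before the `try`; a minimal sketch of the pattern (the function name and the `fetch` callable are illustrative, not the PR's code):

```python
def query_with_default(fetch) -> str:
    # Assigned before the try, so later logging of last_message can never
    # raise UnboundLocalError even if fetch() fails immediately.
    last_message = "None"
    try:
        last_message = fetch()
    except Exception as e:
        print("Failure:", str(e))
    return last_message
```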

    else:
        answer = last_message
except Exception as e:
    print("Failure:", str(e))

Copilot AI Dec 26, 2025


The error message only prints the exception string without logging which step failed or what input caused the failure. Consider improving the error message to include context such as the question being processed for better debugging.

Suggested change
print("Failure:", str(e))
print(
    f"Failure while processing question: {task['question']!r}. Error: {e}"
)
last_message = "None"

openai_base_url = llm.endpoint
temperature = llm.sampling_parameters.get("temperature", 1.0)

client = AsyncOpenAI(

Copilot AI Dec 26, 2025


Hardcoding 'dummy' as the API key could be misleading for users who might expect they need to provide a valid API key. Consider adding a comment explaining why this is acceptable in this context (because it's querying a local endpoint).

Suggested change
client = AsyncOpenAI(
client = AsyncOpenAI(
    # Using a dummy API key is fine here because we are querying a local LLM proxy
    # endpoint that does not require a real OpenAI API key.

    api_key="dummy",
    base_url=openai_base_url,
)
regex_pattern = r"####\s*(.+)(\s*|$)"

Copilot AI Dec 26, 2025


The regex pattern is compiled on every function call. For better performance, consider defining this pattern as a module-level constant since it's static and reused across all invocations.
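Hoisting the pattern to module scope might look like this; the pattern is copied from the diff, while `ANSWER_PATTERN` and `extract_answer` are illustrative names:

```python
import re

# Compiled once at import time rather than looked up on every call.
ANSWER_PATTERN = re.compile(r"####\s*(.+)(\s*|$)")

def extract_answer(text: str):
    """Return the text following '####', or None when the marker is absent."""
    match = ANSWER_PATTERN.search(text)
    return match.group(1).strip() if match else None
```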

Comment on lines +101 to +104
client = AsyncOpenAI(
    api_key="dummy",
    base_url=openai_base_url,
)

Copilot AI Dec 26, 2025


The AsyncOpenAI client is instantiated on every function call. Consider moving client creation outside the function or reusing a single client instance to avoid the overhead of creating new clients repeatedly.
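One way to reuse a single client is to memoize the factory. The sketch below uses `functools.lru_cache` with a stand-in class in place of `AsyncOpenAI` so that it is self-contained; all names here are illustrative:

```python
from functools import lru_cache

class FakeClient:
    """Stand-in for AsyncOpenAI, counting how often the constructor runs."""
    constructions = 0

    def __init__(self, base_url: str):
        FakeClient.constructions += 1
        self.base_url = base_url

@lru_cache(maxsize=None)
def get_client(base_url: str) -> FakeClient:
    # Constructed once per distinct base_url, then reused across calls. With
    # the real AsyncOpenAI this avoids rebuilding the HTTP connection pool
    # on every agent invocation.
    return FakeClient(base_url)
```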


```bash
ray stop
env WANDB_API_KEY=XXXXX RAY_DEBUG=legacy HYDRA_FULL_ERROR=1 VLLM_USE_V1=1 ray start --head --dashboard-host=0.0.0.0
```

Copilot AI Dec 26, 2025


The ray start command uses --dashboard-host=0.0.0.0, which exposes the Ray dashboard on all network interfaces and can allow anyone on the network to access and control the Ray cluster (including running arbitrary code) if additional protections are not in place. In environments where this command is copied as-is (e.g., shared clusters or cloud VMs), this creates a real risk of remote compromise. Consider binding the dashboard to 127.0.0.1 by default or explicitly documenting that 0.0.0.0 should only be used behind proper network access controls (e.g., firewall, SSH tunnel, or authenticated proxy).
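One way to follow this advice, binding the dashboard to loopback and reaching it over an SSH tunnel (host and user names are placeholders; Ray's default dashboard port is 8265):

```shell
ray stop
# Bind the Ray dashboard to loopback so it is not reachable from the network.
env WANDB_API_KEY=XXXXX RAY_DEBUG=legacy HYDRA_FULL_ERROR=1 VLLM_USE_V1=1 \
    ray start --head --dashboard-host=127.0.0.1
# From your workstation, open http://localhost:8265 after tunneling in:
#   ssh -L 8265:127.0.0.1:8265 user@training-host
```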
