
Fix deprecated loss reduction API and sync tutorial docs#3

Open
adny-code wants to merge 3 commits into jingedawang:main from adny-code:pr/upstream-code-doc-sync

Conversation

@adny-code

@adny-code adny-code commented May 7, 2026

Summary

  • replace the deprecated reduce argument in F.cross_entropy with reduction
  • fix the TransformerBlock typo in the implementation and tutorial snippet
  • align the alignment-data explanation in dataset.py with the actual sampling logic
  • add a focused regression test for the reduce_loss=False path used by DPO

Why

The code currently relies on reduce_loss=False when computing DPO rewards. Using the deprecated reduce argument still works, but it emits a PyTorch deprecation warning. This change keeps the behavior the same while making the code forward-compatible.
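For illustration, a minimal sketch of the API change (the shapes are hypothetical, chosen to match the 8×17 logits used in the regression test):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 17)              # (batch, vocab) -- hypothetical shapes
targets = torch.randint(0, 17, (8,))

# Deprecated spelling, still works but emits a deprecation warning:
#   loss = F.cross_entropy(logits, targets, reduce=False)
# Forward-compatible spelling with identical behavior:
loss = F.cross_entropy(logits, targets, reduction='none')
assert loss.shape == (8,)  # one unreduced loss per example, as the DPO path needs
```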

The tutorial chapter also needs to stay consistent with the implementation so readers do not copy an outdated class name.

Validation

  • executed the focused regression test for reduce_loss=False successfully in the v100 environment
  • confirmed the unreduced loss path no longer emits the PyTorch deprecation warning
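The warning check can be reproduced with the standard library alone. The sketch below shows the capture pattern; `old_style_loss` is a hypothetical stand-in that warns the way the deprecated `reduce=` call does, not project code:

```python
import warnings

def old_style_loss():
    # Stand-in for F.cross_entropy(..., reduce=False), which warns like this:
    warnings.warn("size_average and reduce args will be deprecated", UserWarning)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style_loss()

# The actual regression test asserts the opposite for the new reduction= path:
assert any("reduce args will be deprecated" in str(w.message) for w in caught)
```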

Copilot AI review requested due to automatic review settings May 7, 2026 10:43

Copilot AI left a comment


Pull request overview

This PR updates the TutorialLLM implementation and tutorial materials to remove use of PyTorch’s deprecated F.cross_entropy(..., reduce=...) API, fixes the TransformerBlock naming typo, and adds a focused regression test for the unreduced-loss path used by DPO.

Changes:

  • Replace deprecated reduce argument in F.cross_entropy with reduction while preserving reduced vs unreduced behavior.
  • Rename TranformerBlock to TransformerBlock in the implementation and partially in the tutorial chapter.
  • Update dataset documentation text and add a regression test for reduce_loss=False (unreduced loss) without the deprecation warning.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

  • tests/test_run.py: adds a regression test for the unreduced loss path (reduce_loss=False) and checks for absence of the deprecated-loss warning.
  • model.py: fixes the TransformerBlock typo usage and switches cross-entropy to reduction=; includes minor comment edits.
  • dataset.py: aligns documentation text with the actual alignment negative sampling and clarifies batch padding/masking docs.
  • book/di-10-jie-dai-ma-shi-xian-zhong-de-zhong-dian-xuan-du.md: updates the tutorial snippet to use TransformerBlock in one location (but still has remaining inconsistent/outdated snippets).


Comment thread model.py Outdated
Comment on lines 59 to 60
# Scale the attention weights to avoid the problem of vanishing gradients.
weights *= dim_embed ** -0.5
Comment thread model.py Outdated
# Get the token embedding and position embedding
token_embedding = self.token_embedding_table(token_ids) # (B, T) -> (B, T, dim_embed)
# The absolute position embedding is quite old fashioned but it's good enough for our tutorial
position_embedding = self.position_embedding_table(torch.arange(T, device=self.device)) # (T) -> (T, dim_embed)
Comment thread tests/test_run.py Outdated

assert logits.shape == (8, 17)
assert loss.shape == (8,)
assert not any('size_average and reduce args will be deprecated' in str(w.message) for w in caught)
Comment thread dataset.py Outdated
Comment on lines 99 to 100
# Note that we add a special character '\0' in the end, which is used as an end-of-text token(will be index 0 in the vocabulary).
# An end-of-text token is useful to let the model know when to stop generating text.
Comment thread dataset.py Outdated
Comment on lines +232 to +233
Emplace 0 to the positions that exceed the actual length of each item, and mask these positions in the label by setting them to -100.
This is necessary to let the model know where to stop(first 0 in label) and ignore the rest padding tokens in the loss calculation.
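For context on the -100 masking described above: PyTorch's `F.cross_entropy` skips label positions equal to its default `ignore_index` of -100, so padded positions drop out of the loss. A minimal sketch with hypothetical shapes:

```python
import torch
import torch.nn.functional as F

vocab = 10
logits = torch.randn(4, vocab)             # (T, vocab) -- hypothetical shapes
labels = torch.tensor([3, 7, -100, -100])  # padded positions masked with -100

# ignore_index defaults to -100, so the masked positions are excluded from the mean
loss = F.cross_entropy(logits, labels)
assert torch.isfinite(loss)
```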
Comment thread dataset.py Outdated

Returns:
A batch of input token id lists and label token ids. The label refer to the next character of each input sequence
A batch of input token id lists and label token ids. The label refer to the next character of each input sequence.
self.token_embedding_table = nn.Embedding(vocabulary_size, dim_embed)
self.position_embedding_table = nn.Embedding(max_length, dim_embed)
self.transformer_blocks = nn.Sequential(*[TranformerBlock(dim_embed, num_head, max_length) for _ in range(num_layer)])
self.transformer_blocks = nn.Sequential(*[TransformerBlock(dim_embed, num_head, max_length) for _ in range(num_layer)])

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

self.token_embedding_table = nn.Embedding(vocabulary_size, dim_embed)
self.position_embedding_table = nn.Embedding(max_length, dim_embed)
self.transformer_blocks = nn.Sequential(*[TranformerBlock(dim_embed, num_head, max_length) for _ in range(num_layer)])
self.transformer_blocks = nn.Sequential(*[TransformerBlock(dim_embed, num_head, max_length) for _ in range(num_layer)])
Comment thread tests/test_run.py Outdated

assert logits.shape == (8, 17)
assert loss.shape == (8,)
assert not any('size_average and reduce args will be deprecated' in str(w.message) for w in caught)
Comment thread model.py Outdated
@@ -89,7 +89,8 @@ def __init__(self, dim_embed: int, num_heads: int, head_size: int, max_length: i
self.heads = nn.ModuleList([AttentionHead(dim_embed, head_size, max_length) for _ in range(num_heads)])
# Create a linear layer to project the concatenated output of all heads to the original dimension.
# In our case, the concatenated output is happen to be the same as the original dimension, so we can skip
Comment thread dataset.py Outdated
Comment on lines 99 to 100
# Note that we add a special character '\0' in the end, which is used as an end-of-text token(will be index 0 in the vocabulary).
# An end-of-text token is useful to let the model know when to stop generating text.
Comment thread dataset.py
Comment on lines 231 to 234
Process a batch of token id lists.
Emplace 0 to the positions that exceed the actual length of each item, and mask these positions in the label by setting them to -100.
This is necessary to let the model know where to stop(first 0 in label) and ignore the rest padding tokens in the loss calculation.

Comment thread dataset.py Outdated

Returns:
A batch of input token id lists and label token ids. The label refer to the next character of each input sequence
A batch of input token id lists and label token ids. The label refer to the next character of each input sequence.
@adny-code
Author

adny-code commented May 7, 2026

Addressed the current review feedback in 23ed735.

  • fixed attention score scaling to use the projected head dimension
  • made position embeddings follow the input tensor device instead of the constructor string
  • made the focused regression test assert on the absence of warnings rather than warning text
  • cleaned up the dataset docstrings and synchronized the remaining tutorial snippets with the implementation
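The first two fixes can be sketched as follows (a minimal illustration under assumed shapes, not the project's actual forward pass):

```python
import torch

head_size = 16
T = 5
token_ids = torch.randint(0, 100, (2, T))   # hypothetical (B, T) input

# Scale attention scores by the projected head dimension, not the embedding dim:
weights = torch.randn(2, T, T)
weights = weights * head_size ** -0.5

# Positional indices follow the input tensor's device, not a constructor string:
positions = torch.arange(T, device=token_ids.device)
assert positions.device == token_ids.device
```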


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread tests/test_run.py Outdated
Test the unreduced loss path used by DPO without relying on dataset loading.
"""
torch.manual_seed(2024)
# Keep the constructor device intentionally stale to verify forward uses the input tensor device.
Comment thread tests/test_run.py Outdated

assert logits.shape == (8, 17)
assert loss.shape == (8,)
assert not caught
@adny-code
Author

adny-code commented May 8, 2026

Followed up on the latest automated review in b47f7b4.

  • narrowed the test comment so it only describes the positional-index device behavior
  • made the warning assertion fail only on the deprecated reduce/size_average warning
  • fixed the small MultiHeadAttention comment grammar issue


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

tests/test_run.py:40

  • Docstring typo: "overal" should be "overall".
    """
    Test the overal pipeline runs without error.
    """
