Fix deprecated loss reduction API and sync tutorial docs #3
Open
adny-code wants to merge 3 commits into jingedawang:main from
Conversation
Pull request overview
This PR updates the TutorialLLM implementation and tutorial materials to remove use of PyTorch’s deprecated F.cross_entropy(..., reduce=...) API, fixes the TransformerBlock naming typo, and adds a focused regression test for the unreduced-loss path used by DPO.
Changes:
- Replace the deprecated `reduce` argument in `F.cross_entropy` with `reduction` while preserving reduced vs. unreduced behavior (see the sketch below).
- Rename `TranformerBlock` to `TransformerBlock` in the implementation and partially in the tutorial chapter.
- Update dataset documentation text and add a regression test for `reduce_loss=False` (unreduced loss) without the deprecation warning.
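To make the first item concrete, here is a minimal sketch of the old and new calls. The `reduce_loss` flag and the (8, 17) shapes are taken from the PR description and tests; the helper name is hypothetical and the repository's actual forward pass may differ:

```python
import torch
import torch.nn.functional as F

def compute_loss(logits: torch.Tensor, targets: torch.Tensor, reduce_loss: bool = True) -> torch.Tensor:
    # Deprecated form (warns "size_average and reduce args will be deprecated"):
    #   F.cross_entropy(logits, targets, reduce=reduce_loss)
    # Forward-compatible form: map the boolean onto the `reduction` argument instead.
    reduction = 'mean' if reduce_loss else 'none'
    return F.cross_entropy(logits, targets, reduction=reduction)

logits = torch.randn(8, 17)            # (batch, num_classes)
targets = torch.randint(0, 17, (8,))   # (batch,)
print(compute_loss(logits, targets).shape)                     # torch.Size([])  - scalar mean loss
print(compute_loss(logits, targets, reduce_loss=False).shape)  # torch.Size([8]) - per-sample losses
```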
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/test_run.py | Adds a regression test for the unreduced loss path (reduce_loss=False) and checks for absence of the deprecated-loss warning. |
| model.py | Fixes TransformerBlock typo usage and switches cross-entropy to reduction=; includes minor comment edits. |
| dataset.py | Aligns documentation text with actual alignment negative sampling and clarifies batch padding/masking docs. |
| book/di-10-jie-dai-ma-shi-xian-zhong-de-zhong-dian-xuan-du.md | Updates tutorial snippet to use TransformerBlock in one location (but still has remaining inconsistent/outdated snippets). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines 59 to 60
```python
# Scale the attention weights to avoid the problem of vanishing gradients.
weights *= dim_embed ** -0.5
```
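For context, a minimal sketch of where such a scaling typically sits in an attention head; the tensor names and shapes below are assumptions for illustration, not the repository's exact code:

```python
import torch

B, T, dim_embed = 2, 5, 16                 # hypothetical batch size, sequence length, embedding size
q = torch.randn(B, T, dim_embed)           # queries
k = torch.randn(B, T, dim_embed)           # keys

weights = q @ k.transpose(-2, -1)          # (B, T, T) raw attention scores
weights *= dim_embed ** -0.5               # keep score variance small so softmax does not saturate
weights = torch.softmax(weights, dim=-1)   # attention distribution per query position
```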
```python
# Get the token embedding and position embedding
token_embedding = self.token_embedding_table(token_ids)  # (B, T) -> (B, T, dim_embed)
# The absolute position embedding is quite old fashioned but it's good enough for our tutorial
position_embedding = self.position_embedding_table(torch.arange(T, device=self.device))  # (T) -> (T, dim_embed)
```
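The two embeddings above are usually combined by broadcast addition; a one-line illustration (the actual forward in model.py may differ):

```python
x = token_embedding + position_embedding  # (B, T, dim_embed) + (T, dim_embed) broadcasts over the batch
```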
```python
assert logits.shape == (8, 17)
assert loss.shape == (8,)
assert not any('size_average and reduce args will be deprecated' in str(w.message) for w in caught)
```
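The `caught` list checked above is presumably populated by recording warnings around the forward pass; a hedged sketch of that pattern (the model's forward signature is an assumption):

```python
import warnings

def run_unreduced_forward(model, token_ids, labels):
    # Record every warning emitted during the forward pass so the test can
    # assert that the deprecation message never appears.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        logits, loss = model(token_ids, labels, reduce_loss=False)
    return logits, loss, caught
```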
Comment on lines 99 to 100
```python
# Note that we add a special character '\0' in the end, which is used as an end-of-text token(will be index 0 in the vocabulary).
# An end-of-text token is useful to let the model know when to stop generating text.
```
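A minimal sketch of why '\0' ends up at index 0 when building a character vocabulary; the file name and construction are assumptions, and dataset.py may build its vocabulary differently:

```python
text = open('data.txt', encoding='utf-8').read()        # hypothetical corpus file
vocabulary = sorted(set(text + '\0'))                   # '\0' has code point 0, so it sorts first
char_to_id = {ch: i for i, ch in enumerate(vocabulary)}
assert char_to_id['\0'] == 0                            # the end-of-text token gets index 0
```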
Comment on lines +232 to +233
```
Emplace 0 to the positions that exceed the actual length of each item, and mask these positions in the label by setting them to -100.
This is necessary to let the model know where to stop(first 0 in label) and ignore the rest padding tokens in the loss calculation.
```
```
Returns:
    A batch of input token id lists and label token ids. The label refer to the next character of each input sequence
    A batch of input token id lists and label token ids. The label refer to the next character of each input sequence.
```
```python
self.token_embedding_table = nn.Embedding(vocabulary_size, dim_embed)
self.position_embedding_table = nn.Embedding(max_length, dim_embed)
self.transformer_blocks = nn.Sequential(*[TranformerBlock(dim_embed, num_head, max_length) for _ in range(num_layer)])
self.transformer_blocks = nn.Sequential(*[TransformerBlock(dim_embed, num_head, max_length) for _ in range(num_layer)])
```
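For readers following the tutorial, a rough sketch of what a block like `TransformerBlock` typically contains. This is not the repository's implementation; in particular, the causal mask implied by `max_length` is omitted here:

```python
import torch
import torch.nn as nn

class TransformerBlockSketch(nn.Module):
    """Illustrative pre-norm block; the real TransformerBlock in model.py may differ."""
    def __init__(self, dim_embed: int, num_head: int, max_length: int):
        super().__init__()
        # max_length would normally size a causal attention mask; it is unused in this sketch.
        self.attention = nn.MultiheadAttention(dim_embed, num_head, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(dim_embed, 4 * dim_embed),
            nn.ReLU(),
            nn.Linear(4 * dim_embed, dim_embed),
        )
        self.norm1 = nn.LayerNorm(dim_embed)
        self.norm2 = nn.LayerNorm(dim_embed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        normed = self.norm1(x)
        attention_out, _ = self.attention(normed, normed, normed, need_weights=False)
        x = x + attention_out                      # residual connection around attention
        x = x + self.feed_forward(self.norm2(x))   # residual connection around the MLP
        return x
```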
```python
@@ -89,7 +89,8 @@ def __init__(self, dim_embed: int, num_heads: int, head_size: int, max_length: i
self.heads = nn.ModuleList([AttentionHead(dim_embed, head_size, max_length) for _ in range(num_heads)])
# Create a linear layer to project the concatenated output of all heads to the original dimension.
# In our case, the concatenated output is happen to be the same as the original dimension, so we can skip
```
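The projection mentioned in that comment concatenates the per-head outputs and maps them back to `dim_embed`; a hedged sketch, using plain linear layers as stand-ins for the repository's `AttentionHead`:

```python
import torch
import torch.nn as nn

class MultiHeadProjectionSketch(nn.Module):
    """Illustrative only; the real module uses AttentionHead instances from model.py."""
    def __init__(self, dim_embed: int, num_heads: int, head_size: int):
        super().__init__()
        # Stand-ins for AttentionHead; each maps dim_embed -> head_size.
        self.heads = nn.ModuleList([nn.Linear(dim_embed, head_size) for _ in range(num_heads)])
        # When num_heads * head_size == dim_embed, this projection maps back to the original dimension.
        self.projection = nn.Linear(num_heads * head_size, dim_embed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.cat([head(x) for head in self.heads], dim=-1)  # concatenate along the feature dimension
        return self.projection(out)
```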
Comment on lines 231 to 234
```
Process a batch of token id lists.
Emplace 0 to the positions that exceed the actual length of each item, and mask these positions in the label by setting them to -100.
This is necessary to let the model know where to stop(first 0 in label) and ignore the rest padding tokens in the loss calculation.
```
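A minimal sketch of the padding and masking described in this docstring; the function name and the exact handling of the end-of-text position are assumptions, not the code in dataset.py:

```python
import torch

def pad_batch(token_id_lists, max_length):
    batch_size = len(token_id_lists)
    inputs = torch.zeros(batch_size, max_length, dtype=torch.long)         # pad inputs with 0
    labels = torch.full((batch_size, max_length), -100, dtype=torch.long)  # -100 masks padded label positions
    for i, ids in enumerate(token_id_lists):
        length = min(len(ids), max_length)
        inputs[i, :length] = torch.tensor(ids[:length])
        labels[i, :length - 1] = torch.tensor(ids[1:length])  # label = next token at each position
        labels[i, length - 1] = 0                             # keep one end-of-text 0 so the model learns to stop
    return inputs, labels
```

`F.cross_entropy` ignores target positions equal to -100 by default (its `ignore_index` default), which is why masking labels with -100 drops the padding tokens from the loss.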
Author
Addressed the current review feedback in 23ed735.
```python
Test the unreduced loss path used by DPO without relying on dataset loading.
"""
torch.manual_seed(2024)
# Keep the constructor device intentionally stale to verify forward uses the input tensor device.
```
```python
assert logits.shape == (8, 17)
assert loss.shape == (8,)
assert not caught
```
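For context on why the unreduced path matters, a hedged sketch of how a per-sequence loss can feed a DPO-style reward; the forward signature and `beta` value are assumptions, not the repository's API:

```python
import torch

def sequence_log_prob(model, token_ids, labels):
    # With reduce_loss=False the loss is the per-sequence negative log-likelihood,
    # so negating it gives the log-probability of each sequence in the batch.
    _, loss = model(token_ids, labels, reduce_loss=False)  # loss shape: (batch,)
    return -loss

def dpo_reward(policy_log_prob, reference_log_prob, beta=0.1):
    # DPO-style implicit reward: scaled difference between policy and reference log-probs.
    return beta * (policy_log_prob - reference_log_prob)
```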
Author
Followed up on the latest automated review in b47f7b4.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
tests/test_run.py:40
- Docstring typo: "overal" should be "overall".
"""
Test the overal pipeline runs without error.
"""
Summary
- Replace the deprecated `reduce` argument in `F.cross_entropy` with `reduction`
- Fix the `TransformerBlock` typo in the implementation and tutorial snippet
- Sync the docstrings in `dataset.py` with the actual sampling logic
- Add a regression test for the `reduce_loss=False` path used by DPO

Why
The code currently relies on `reduce_loss=False` when computing DPO rewards. Using the deprecated `reduce` argument still works, but it emits a PyTorch deprecation warning. This change keeps the behavior the same while making the code forward-compatible.

The tutorial chapter also needs to stay consistent with the implementation so readers do not copy an outdated class name.
Validation
- Ran the regression test covering `reduce_loss=False` successfully in the `v100` environment