# diff-to-commit

## Problem Statement

Given a diff, we wish to identify other files that are relevant for modification. We can do this via an embedding model.

For example, consider a diff like this: https://github.com/pytorch/pytorch/commit/7716da9fb23f27a65b41f9f016a2afadf281c18f.diff

We might get as input a partial diff on one file.

```diff
diff --git a/torch/__init__.py b/torch/__init__.py
index ad32f8a054dc7..e6f9cfcb54472 100644
--- a/torch/__init__.py
+++ b/torch/__init__.py
@@ -320,7 +320,7 @@ def _preload_cuda_lib(lib_folder: str, lib_name: str, required: bool = True) ->
         ctypes.CDLL(lib_path)
 
 
-def _preload_cuda_deps(err: _Optional[OSError] = None) -> None:
+def _preload_cuda_deps(err: OSError | None = None) -> None:
     cuda_libs: list[tuple[str, str]] = [
         ("cublas", "libcublas.so.*[0-9]"),
         ("cudnn", "libcudnn.so.*[0-9]"),
```

and we wish to predict the next jump/edit location

```diff
@@ -1276,7 +1276,7 @@ def set_default_device(device: "Device") -> None:
     _GLOBAL_DEVICE_CONTEXT.device_context = device_context
 
 
-def set_default_tensor_type(t: _Union[type["torch.Tensor"], str], /) -> None:
+def set_default_tensor_type(t: type["torch.Tensor"] | str, /) -> None:
     r"""
     .. warning::
```

This is not quite trivial: the jump/edit location is only valid if the edit has not yet been made.

Therefore, whether the line is

```python
def set_default_tensor_type(t: _Union[type["torch.Tensor"], str], /) -> None:
```

or

```python
def set_default_tensor_type(t: type["torch.Tensor"] | str, /) -> None:
```

makes a real difference and should change the embedding.
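To make the validity condition concrete: a predicted jump/edit location remains valid only while the file still contains the pre-edit form of the line. A minimal sketch (the function name and interface here are illustrative, not part of any proposed API):

```python
def edit_still_pending(file_lines, before_line, after_line):
    """Return True if the edit from `before_line` to `after_line`
    has not yet been applied anywhere in `file_lines`."""
    if after_line in file_lines:
        return False  # the edit was already made; this location is stale
    return before_line in file_lines  # valid only if the old form survives

# Example, using the set_default_tensor_type lines above:
before = 'def set_default_tensor_type(t: _Union[type["torch.Tensor"], str], /) -> None:'
after = 'def set_default_tensor_type(t: type["torch.Tensor"] | str, /) -> None:'
```

An embedding that ignores this distinction would keep proposing the same location after the edit lands, which is exactly the failure mode described above.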

## Proposed Solution

Crucially, there are two main issues:

  1. How do we embed diffs? Contrastive learning does not seem to have worked here.
  2. How do we embed undiff'd code?

One approach is to train a full encoder/decoder model, using commit messages as the alignment signal instead of contrastive learning. In other words, we take something like https://github.com/microsoft/CodeBERT and train it with diffs as input and commit messages as output. Then, to get embeddings, we discard the decoder half of the model.
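As a toy illustration of the recipe (linear bag-of-words layers standing in for the real encoder/decoder; the corpus, dimensions, and names below are all made up): train the composition decoder∘encoder to predict commit-message tokens from diff tokens, then keep only the encoder as the embedding function.

```python
import numpy as np

def bow(text, vocab):
    """Bag-of-words vector over a fixed vocabulary."""
    v = np.zeros(len(vocab))
    for tok in text.split():
        if tok in vocab:
            v[vocab[tok]] += 1.0
    return v

# Tiny made-up corpus of (diff text, commit message) pairs.
pairs = [
    ("- _Optional [ OSError ] + OSError | None", "modernize optional annotations"),
    ("- _Union [ Tensor , str ] + Tensor | str", "modernize union annotations"),
    ("+ cudnn libcudnn.so", "preload the cudnn library"),
]
dvocab = {t: i for i, t in enumerate(sorted({t for d, _ in pairs for t in d.split()}))}
mvocab = {t: i for i, t in enumerate(sorted({t for _, m in pairs for t in m.split()}))}

rng = np.random.default_rng(0)
dim = 8
W_enc = rng.normal(0.0, 0.1, (dim, len(dvocab)))  # "encoder": kept for embeddings
W_dec = rng.normal(0.0, 0.1, (len(mvocab), dim))  # "decoder": discarded after training

def loss():
    return sum(
        float(np.sum((W_dec @ (W_enc @ bow(d, dvocab)) - bow(m, mvocab)) ** 2))
        for d, m in pairs
    )

initial_loss = loss()
lr = 0.02
for _ in range(2000):
    for diff, msg in pairs:
        x, y = bow(diff, dvocab), bow(msg, mvocab)
        z = W_enc @ x        # diff embedding
        err = W_dec @ z - y  # message-prediction error
        W_dec -= lr * np.outer(err, z)
        W_enc -= lr * np.outer(W_dec.T @ err, x)

def embed_diff(diff_text):
    """The deliverable: throw away W_dec, keep the encoder as the embedding."""
    return W_enc @ bow(diff_text, dvocab)
```

In practice the two matrices would be a pretrained transformer encoder and a generation head, but the shape of the recipe is the same: supervise diff → commit message, then discard the decoder.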

To align plain code as well, we could train on the code before the diff as a positive signal and the code after the diff as a negative signal. It's not clear whether this is sufficient, because the distinction may be too subtle to tease out; some experimentation will be needed.
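Once such embeddings exist for both diffs and code, predicting other relevant files reduces to nearest-neighbor search over per-file embeddings. A sketch, assuming some embedding function has already produced a mapping from file path to vector (the names here are hypothetical):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, guarded against zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def rank_candidate_files(diff_embedding, file_embeddings):
    """Rank candidate files by cosine similarity to the diff embedding.

    file_embeddings: dict mapping file path -> embedding vector,
    produced by whatever encoder the alignment training yields.
    """
    scored = sorted(
        ((cosine(diff_embedding, emb), path) for path, emb in file_embeddings.items()),
        reverse=True,
    )
    return [path for _, path in scored]
```

The hard part of the project is the embedding itself; the retrieval step on top of it is standard.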
