# diff-to-commit

## Problem Statement

Given a diff, we wish to identify other files that are relevant for modification. We can do this via an embedding model.

For example, consider a diff like this: https://github.com/pytorch/pytorch/commit/7716da9fb23f27a65b41f9f016a2afadf281c18f.diff

We might get as input a partial diff on one file.

```diff
diff --git a/torch/__init__.py b/torch/__init__.py
index ad32f8a054dc7..e6f9cfcb54472 100644
--- a/torch/__init__.py
+++ b/torch/__init__.py
@@ -320,7 +320,7 @@ def _preload_cuda_lib(lib_folder: str, lib_name: str, required: bool = True) ->
         ctypes.CDLL(lib_path)
 
 
-def _preload_cuda_deps(err: _Optional[OSError] = None) -> None:
+def _preload_cuda_deps(err: OSError | None = None) -> None:
     cuda_libs: list[tuple[str, str]] = [
         ("cublas", "libcublas.so.*[0-9]"),
         ("cudnn", "libcudnn.so.*[0-9]"),
```

and we wish to predict the next jump/edit location

```diff
@@ -1276,7 +1276,7 @@ def set_default_device(device: "Device") -> None:
     _GLOBAL_DEVICE_CONTEXT.device_context = device_context
 
 
-def set_default_tensor_type(t: _Union[type["torch.Tensor"], str], /) -> None:
+def set_default_tensor_type(t: type["torch.Tensor"] | str, /) -> None:
     r"""
     .. warning::
```

This is not quite trivial: the jump/edit location is only valid if the edit has not yet been made.

Therefore, whether the line is

```python
def set_default_tensor_type(t: _Union[type["torch.Tensor"], str], /) -> None:
```

or

```python
def set_default_tensor_type(t: type["torch.Tensor"] | str, /) -> None:
```

makes a real difference and should change the embedding.
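To make the validity condition concrete: a predicted jump/edit location remains valid only while the file still contains the pre-edit form of the line. A minimal sketch (the function name and interface here are illustrative, not part of any proposed API):

```python
def edit_still_pending(file_lines, before_line, after_line):
    """Return True if the edit from `before_line` to `after_line`
    has not yet been applied anywhere in `file_lines`."""
    if after_line in file_lines:
        return False  # the edit was already made; this location is stale
    return before_line in file_lines  # valid only if the old form survives

# Example, using the set_default_tensor_type lines above:
before = 'def set_default_tensor_type(t: _Union[type["torch.Tensor"], str], /) -> None:'
after = 'def set_default_tensor_type(t: type["torch.Tensor"] | str, /) -> None:'
```

An embedding that ignores this distinction would keep proposing the same location after the edit lands, which is exactly the failure mode described above.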

## Proposed Solution

Crucially, there are two main issues:

  1. How do we embed diffs? Contrastive learning does not seem to have worked here.
  2. How do we embed undiff'd code?

One approach is to train a full encoder/decoder model, using commit messages as the alignment signal instead of contrastive learning. In other words, we take something like https://github.com/microsoft/CodeBERT and train it with diffs as input and commit messages as output. Then, to get embeddings, we discard the decoder half of the model.
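As a toy illustration of the recipe (linear bag-of-words layers standing in for the real encoder/decoder; the corpus, dimensions, and names below are all made up): train the composition decoder∘encoder to predict commit-message tokens from diff tokens, then keep only the encoder as the embedding function.

```python
import numpy as np

def bow(text, vocab):
    """Bag-of-words vector over a fixed vocabulary."""
    v = np.zeros(len(vocab))
    for tok in text.split():
        if tok in vocab:
            v[vocab[tok]] += 1.0
    return v

# Tiny made-up corpus of (diff text, commit message) pairs.
pairs = [
    ("- _Optional [ OSError ] + OSError | None", "modernize optional annotations"),
    ("- _Union [ Tensor , str ] + Tensor | str", "modernize union annotations"),
    ("+ cudnn libcudnn.so", "preload the cudnn library"),
]
dvocab = {t: i for i, t in enumerate(sorted({t for d, _ in pairs for t in d.split()}))}
mvocab = {t: i for i, t in enumerate(sorted({t for _, m in pairs for t in m.split()}))}

rng = np.random.default_rng(0)
dim = 8
W_enc = rng.normal(0.0, 0.1, (dim, len(dvocab)))  # "encoder": kept for embeddings
W_dec = rng.normal(0.0, 0.1, (len(mvocab), dim))  # "decoder": discarded after training

def loss():
    return sum(
        float(np.sum((W_dec @ (W_enc @ bow(d, dvocab)) - bow(m, mvocab)) ** 2))
        for d, m in pairs
    )

initial_loss = loss()
lr = 0.02
for _ in range(2000):
    for diff, msg in pairs:
        x, y = bow(diff, dvocab), bow(msg, mvocab)
        z = W_enc @ x        # diff embedding
        err = W_dec @ z - y  # message-prediction error
        W_dec -= lr * np.outer(err, z)
        W_enc -= lr * np.outer(W_dec.T @ err, x)

def embed_diff(diff_text):
    """The deliverable: throw away W_dec, keep the encoder as the embedding."""
    return W_enc @ bow(diff_text, dvocab)
```

In practice the two matrices would be a pretrained transformer encoder and a generation head, but the shape of the recipe is the same: supervise diff → commit message, then discard the decoder.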

To align plain code as well, we could train on the code before the diff as a positive signal and the code after the diff as a negative signal. It's not clear whether this is sufficient, because the distinction may be too subtle to tease out; some experimentation will be needed.
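Once such embeddings exist for both diffs and code, predicting other relevant files reduces to nearest-neighbor search over per-file embeddings. A sketch, assuming some embedding function has already produced a mapping from file path to vector (the names here are hypothetical):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, guarded against zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def rank_candidate_files(diff_embedding, file_embeddings):
    """Rank candidate files by cosine similarity to the diff embedding.

    file_embeddings: dict mapping file path -> embedding vector,
    produced by whatever encoder the alignment training yields.
    """
    scored = sorted(
        ((cosine(diff_embedding, emb), path) for path, emb in file_embeddings.items()),
        reverse=True,
    )
    return [path for _, path in scored]
```

The hard part of the project is the embedding itself; the retrieval step on top of it is standard.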
