Bisect-Based Copy Tracing
Copy tracing is a technique used to efficiently account for file copies and
renames when comparing histories. It is used for diff
commands and merge-related
operations, such as rebase
, graft
, and merge
. It simplifies resolving merge conflicts,
especially when large refactors, like directory renames, occur in frequently updated
repositories. Below is an example of a rebase involving file renames:
$ sl
│ update b
│
│ mv a -> b
│
│
├─╯ update a
│
add a
Without copy tracing, sl
previously had to ask about renamed or copied files
during rebases:
$ sl rebase -s b0d1b083d -d d78558192
rebasing b0d1b083d791 "update a"
other [source (being rebased)] changed a which local [dest (rebasing onto)] is missing
hint: if this is due to a renamed file, you can manually input the renamed path
use (c)hanged version, leave (d)eleted, or leave (u)nresolved, or input (r)enamed path?
With copy tracing, the merge just works, despite there being copied or renamed files:
$ sl rebase -s b0d1b083d -d d78558192
rebasing b0d1b083d791 "update a"
merging b and a to b
b0d1b083d791 -> 92444cbb366b "update a"
Background
Historically, Sapling has used two copy-tracing solutions. However, as our monorepos grow (tens of millions of files, tens of millions of commits), the previous solutions have become too slow for production use:
- Full copy tracing finds all the new files (M) that were added from
merge base up to the top commit and for each file it checks if this file
was copied from another file (N). For each pair of files, Sapling was walking
through the file history (H) to check the copy-from the relationship.
- The time complexity of this algorithm is
O(M * N * H)
, where N and H are huge.- Typically M (source) << N (destination)
- Sapling records rename information directly in file headers, eliminating the need to compute file content similarity, which is different from Git's approach.
- The time complexity of this algorithm is
- Heuristics copy tracing assumes that moves or renames fall into one of two
categories: (1) Within the same directory (same directory name but different
file names); (2) Move from one directory to another (same file names but
different directory names)
- This approach reduces N to K (K is a configured constant value), resulting
in a time complexity of
O(M * H)
, where H remains large and there is a large constant factor for reducing N to K. - Another issue is that if the renames do not match the heuristics, they cannot be found.
- This approach reduces N to K (K is a configured constant value), resulting
in a time complexity of
Before we explore bisect-based copy tracing, let's first examine how Git's rename detection works. Git's rename detection is similar to the heuristics copy tracing mentioned earlier, but it includes additional heuristics and strategies to enhance performance, such as "Remembering previous work", "Exact renames".
- The time complexity is
O(M * S)
, where M is the same as above, S is the complexity of file content similarity computation. - It also shares the disadvantage of Sapling heuristics copy tracing
when renames do not match heuristics. Otherwise, the time complexity
will be
O(M * N * S)
.
Bisect-based copy tracing is built to achieve the following desired properties:
- Scalability:
O(M * log H)
time complexity, it bisects the file history rather than scanning commits sequentially. - Flexibility: Not restricted by heuristics like 'Move from one directory to another'.
- Abstracted: Support both Sapling and Git backend repositories.
- Efficiency: Provides fast content similarity checks for cases where renames are not recorded in Sapling or when working with Git repositories.
- User-Friendly: Informative message when renames cannot be found, such as delete/modified conflicts.
How?
Scalability
The problem that copy tracing solves is: given two commits, C1 and C2, and a path P1 in C1, we need to find the renamed path P2 in C2.
This problem required a new algorithmic design to scale efficiently. The basic idea is to break the problem into two steps:
- Bisect a commit
C3
that deletesP1
in theC1
toC2
range. - Examine
C3
, find what pathP1
was renamed to. If that path exists inC2
, then we’re done. Otherwise recursively trace renames in theC3
toC2
range.
The efficient bisect is based on the Segmented Changelog we developed for lazy commit graph downloading and improving DAG operations, please check this blog post to learn more about Segmented Changelog.
Flexibility
Since Sapling can efficiently trace rename commits by bisecting the history, and then find the renames in a rename commit, we don't need heuristics to reduce the large number N on destination side. This approach allows Sapling to detect renames that would otherwise be missed by heuristics-based methods.
Abstracted
We made the rename detection inside a commit abstracted. Whether it’s Sapling’s tracked rename, or Git’s implicit content similar rename, or a combination of them, they fit in the same abstraction and can be flexibly configured.
Efficiency
Typical content similarity libraries often degrade to O(N^2)
in the worst case,
where N is the line count (O(N^2)
is the worst case for the Myers diff algorithm).
Our approach, xdiff::edit_cost
, imposes a max cost
limit, reducing the
complexity to O(N)
.
User-Friendly
When renames cannot be found, for example, file a.txt
was renamed to a.md
and then deleted on the destination branch, the new copy tracing can identify
both the commit that renamed the file and also the commit that eventually
deleted it. This allows us to provide additional context to help resolve the
conflict:
$ sl rebase -s 108b59d42 -d a1fcdc96b
...
other [source (being rebased)] changed a.txt which local [dest (rebasing onto)] is missing
hint: the missing file was probably deleted by commit 7f48dc97d540 with name 'a.md' in the branch rebasing onto
use (c)hanged version, leave (d)eleted, or leave (u)nresolved, or input (r)enamed path?