16 Commits

Author SHA1 Message Date
Martin von Zweigbergk
934564bf8d diff: also sort base ranges by end point
I wanted to replace the BTreeMap by a Vec and noticed that we actually
sometimes end up having a `0..n` range followed by a `0..0` after
refinement. We currently compare those two as equal because I had not
thought that we could end up attempting to add two ranges with the
same start point. When trying to insert the second range (`0..0`), the
BTreeMap will keep the existing key (`0..n`) and replace the
value. That's probably works, but it's clearly not what I
intended. Let's fix by sorting by the end point if the start point is
equal. This actually improves some benchmarks by a few percent (maybe
because the subsequent compaction can then remove the `0..0` range).
2022-03-10 00:33:17 -08:00
Waleed Khan
9202aae8b1 build: conditionally use map_first_last feature if available 2022-02-20 22:21:14 -08:00
Waleed Khan
261cd1a1c4 build: add shims for nightly feature map_first_last 2022-02-20 22:16:07 -08:00
Martin von Zweigbergk
e52c902d3a diff: compact adjacent unchanged regions also when using for_tokenizer()
I noticed while working on support for unified diffs (#33) that
`Diff::for_tokenizer(..., &find_line_ranges)` would return a
`DiffHunk::Matching` for each matching line instead of a single
`DiffHunk::Matching` for all the matching lines. That's different from
what you get from `Diff::default_refinement()` and seems less
convenient to work with.
2021-10-10 00:00:06 -07:00
Martin von Zweigbergk
81b51de300 diff: rewrite diff() using new multi-way diff 2021-06-26 23:49:58 -07:00
Martin von Zweigbergk
987aecc749 diff: add a type for diffing arbitrary number of inputs
I have been trying to figure out how to generalize diffs and merges
for arbitrary number of inputs. For example, I want to have an
internal representation of an octopus merge adding 5 inputs (file
states/contents) and removing 4 inputs. I also want to be to represent
a diff from a regular 3-way-conflict state to a resolved state. Such a
diff would be from a state adding two inputs and removing one, to a
state adding just one input.

I finally realized last week that the problem is simple if you don't
care about adds vs removes. Instead, you line up the matching and
differing parts of all the inputs. It's then up to the caller to use
it in an appropriate way for its use case. For example, a regular diff
would pass in two inputs and would get back a list of matching and
dffering hunks. It might then present the first element of differing
hunks in red and the second element in green. Similarly, a 3-way merge
would pass in three inputs with the base first. It would then compare
the sides and decide on a resolution (or leave it unresolved if all
three sides are different).

This change adds a type representing this kind of multi-way
diff. Coming changes will update existing code to use it. In addition
to making the existing code simpler and more consistent, having this
in place should also:

 * Make it much easier to present merge conflicts involving more than
   3 parts.

 * Experiment with different ways of displaying diffs from/to conflict
   states.

 * Experiment with sub-line-level merging.
2021-06-26 23:49:58 -07:00
Martin von Zweigbergk
4c416dd864 cleanup: let Clippy fix a bunch of warnings 2021-06-14 00:27:31 -07:00
Martin von Zweigbergk
b50ef1410d styler: rename Styler to more standard Formatter 2021-06-05 08:38:28 -07:00
Martin von Zweigbergk
5b18e89a4d diff: fix LCS when a line/word/byte has been moved later 2021-04-28 23:33:18 -07:00
Martin von Zweigbergk
102f7a0416 diff: also recurse into final region after after unchanged regions
See test case for details.


Before:
test bench_diff_10k_lines_reversed  ... bench:  36,249,659 ns/iter (+/- 174,455)
test bench_diff_10k_modified_lines  ... bench:  37,258,890 ns/iter (+/- 803,963)
test bench_diff_10k_unchanged_lines ... bench:       4,252 ns/iter (+/- 69)
test bench_diff_1k_lines_reversed   ... bench:     982,834 ns/iter (+/- 6,467)
test bench_diff_1k_modified_lines   ... bench:   3,343,469 ns/iter (+/- 23,243)
test bench_diff_1k_unchanged_lines  ... bench:         231 ns/iter (+/- 2)
test bench_diff_git_git_read_tree_c ... bench:      95,559 ns/iter (+/- 816)


After:
test bench_diff_10k_lines_reversed  ... bench:  36,186,715 ns/iter (+/- 196,903)
test bench_diff_10k_modified_lines  ... bench:  37,511,000 ns/iter (+/- 1,370,476)
test bench_diff_10k_unchanged_lines ... bench:       3,099 ns/iter (+/- 8)
test bench_diff_1k_lines_reversed   ... bench:     986,010 ns/iter (+/- 11,565)
test bench_diff_1k_modified_lines   ... bench:   3,370,938 ns/iter (+/- 17,041)
test bench_diff_1k_unchanged_lines  ... bench:         230 ns/iter (+/- 2)
test bench_diff_git_git_read_tree_c ... bench:     102,189 ns/iter (+/- 1,052)


So this patch makes diffing even slower (but still easily fast enough
for all cases I've run into in real life). There's probably a lot that
can be done to make things faster, but the first priority is that the
diffs are correct and easy to read.
2021-04-08 23:54:54 -07:00
Martin von Zweigbergk
5c10c93e64 diff: fix tests broken by the previous commit
Sorry, I forgot to run the automated tests again :(
2021-04-07 11:00:04 -07:00
Martin von Zweigbergk
0dd000d236 diff: do final refinement at byte-level for non-word bytes
This results in significantly more readable diffs on commits like
659393bec219 in this repo.


Before:
test bench_diff_10k_lines_reversed  ... bench:  38,122,998 ns/iter (+/- 557,688)
test bench_diff_10k_modified_lines  ... bench:  32,556,563 ns/iter (+/- 548,114)
test bench_diff_10k_unchanged_lines ... bench:       4,231 ns/iter (+/- 15)
test bench_diff_1k_lines_reversed   ... bench:     958,296 ns/iter (+/- 46,963)
test bench_diff_1k_modified_lines   ... bench:   3,014,723 ns/iter (+/- 15,830)
test bench_diff_1k_unchanged_lines  ... bench:         249 ns/iter (+/- 2)
test bench_diff_git_git_read_tree_c ... bench:      78,599 ns/iter (+/- 1,079)

After:
test bench_diff_10k_lines_reversed  ... bench:  38,289,493 ns/iter (+/- 413,712)
test bench_diff_10k_modified_lines  ... bench:  37,352,516 ns/iter (+/- 1,293,950)
test bench_diff_10k_unchanged_lines ... bench:       4,238 ns/iter (+/- 13)
test bench_diff_1k_lines_reversed   ... bench:     967,253 ns/iter (+/- 8,506)
test bench_diff_1k_modified_lines   ... bench:   3,358,028 ns/iter (+/- 37,154)
test bench_diff_1k_unchanged_lines  ... bench:         233 ns/iter (+/- 1)
test bench_diff_git_git_read_tree_c ... bench:      95,787 ns/iter (+/- 740)


So the biggest slowdown is when there are modified lines.
2021-04-07 10:27:17 -07:00
Martin von Zweigbergk
d7395cc34a diff: add copyright header 2021-04-06 21:26:37 -07:00
Martin von Zweigbergk
7e4e43f358 diff: first diff lines, then refine to words, producing better diffs
The new diff algorithm produces pretty bad diffs in some cases, such
as cc4b1e923091 in this repo (the parent of this commit). I think the
problem there is that many words are repeated over and over. Diffing
first at the line level and then refining the diff of the changed
ranges at the word level gives much better results. That's what this
patch does. After this patch, `jj diff -r cc4b1e923091` looks pretty
similar to the diff in GitHub's UI.

I hope to get around to doing the same for the merge code soon.

Impact on benchmarks:

Before:
test bench_diff_10k_lines_reversed  ... bench:  42,647,532 ns/iter (+/- 765,347)
test bench_diff_10k_modified_lines  ... bench:  21,407,980 ns/iter (+/- 126,366)
test bench_diff_10k_unchanged_lines ... bench:       4,235 ns/iter (+/- 16)
test bench_diff_1k_lines_reversed   ... bench:   1,190,483 ns/iter (+/- 7,192)
test bench_diff_1k_modified_lines   ... bench:   1,919,766 ns/iter (+/- 9,665)
test bench_diff_1k_unchanged_lines  ... bench:         231 ns/iter (+/- 1)
test bench_diff_git_git_read_tree_c ... bench:     174,702 ns/iter (+/- 1,199)

After:
test bench_diff_10k_lines_reversed  ... bench:  38,289,509 ns/iter (+/- 129,004)
test bench_diff_10k_modified_lines  ... bench:  33,140,659 ns/iter (+/- 3,989,339)
test bench_diff_10k_unchanged_lines ... bench:       3,099 ns/iter (+/- 14)
test bench_diff_1k_lines_reversed   ... bench:     973,551 ns/iter (+/- 94,895)
test bench_diff_1k_modified_lines   ... bench:   3,033,818 ns/iter (+/- 29,513)
test bench_diff_1k_unchanged_lines  ... bench:         230 ns/iter (+/- 1)
test bench_diff_git_git_read_tree_c ... bench:      79,100 ns/iter (+/- 963)


So most of them get slower, as expected. The last one, taken from a
real diff in the git.git repo, get faster, however (which is also what
I would have expected).
2021-04-04 21:50:31 -07:00
Martin von Zweigbergk
3c35dbace6 merge: use new diff algorithm for finding sync regions
With the histogram diff code from the previous patch, we can now start
using that for finding the "sync regions" in 3-way merge. That helps a
lot with the slow merging we had before this patch. `jj diff -r
9d540e9726` in the git.git repo drops from 22 s to 0.15 s with this
patch. (That commit is a rather arbitrary merge commit from aroun 5
years ago.)

With the new diff algorithm, the output of `jj diff -r 9d540e9726` in
git.git looks better if we find unchanged sync regions based on lines
than on words, so that's what I'm using in this patch. That's a change
compared the the LCS-based diff we used before this patch. I suspect
the reason that finding sync regions based on words works worse now is
not because of the change from LCS to histogram but because of the
change in how we define a word. My goal right now is mostly to make it
faster; I'll get back to refining the diff result later.
2021-03-31 22:16:19 -07:00
Martin von Zweigbergk
1e657c5331 diff: add a histogram(-like?) diff algorithm
The current diff algorithm does a full LCS on the words of the texts,
which is really slow. Diffing the working copy when e.g.
`src/commands.py` has changes far apart takes seconds. This patch adds
an implementation inspired by JGit's Histogram diff. I say "inspired"
because I just didn't quite understand it :P In particular, I didn't
understand what it does when it finds non-unique elements. I decided
to line up the leading common elements on both sides of the merge. I
don't know if that usually gives good enough results in practice.

I'm sure this can still be optimized a lot, but this seems good enough
as a start. There is also many things to improve about the quality of
the diffs.
2021-03-31 22:15:36 -07:00