For tree-level conflicts, we're eventually not going to have
`ConflictId`. We'd want to make `write_conflict_to_store()` take a
`Merge<Option<TreeValue>>` and return an updated such value. That
would leave very little logic in the function, so let's just inline it
instead.
`update_from_content()` already writes file content for each term of
an unresolved merge, so it seems consistent for it to also write the
file content for resolved merges. I think this should simplify further
refactoring for tree-level conflicts and for preserving the executable
bit.
Since `update_from_contents()` only works with file contents and not
the executable or other kinds of paths, I think it makes more sense
for it to deal with `FileId`s instead of `TreeValue`s.
I think I moved way too many functions onto `Merge<Option<TreeValue>>`
in 82883e648da4. This effectively reverts almost all of that
commit. The `Merge<T>` type is simple container and it seems like it
should be at fairly low level in the dependency graph. By moving
functions off of it, we can get rid of the back-depdencies from the
`merge` module to the `conflict` module that I introduced when I moved
`Merge` to the `merge` module. I'm thinking the `conflict` module can
focus on materialized conflicts.
Perhaps the most important invariant in `.jj/working_copy/tree_state`
is that its set of files in it matches the files in its tree. In
particular, if a file that exists in the tree doesn't exist in the
file state and doesn't exist on disk either, we won't notice that it's
gone, and we will therefore not delete it from the tree on future
rounds of snapshotting either.
Now that we process the outputs from the file system traversal by
reading from channels, we can separate the processing from the file
system traversal. When the working copy is unchanged, processing tree
entries and deleted files takes practically no time, but processing
file states and present files takes significant time.
This improves `jj status` time by a factor of ~2x on my machine (M1 Macbook Pro 2021 16-inch, uses an SSD):
```sh
$ hyperfine --parameter-list hash before,after --parameter-list repo nixpkgs,gecko-dev --setup 'git checkout {hash} && cargo build --profile release-with-debug' --warmup 3 './target/release-with-debug/jj -R ../{repo} st'
Benchmark 1: ./target/release-with-debug/jj -R ../nixpkgs st (hash = before)
Time (mean ± σ): 1.640 s ± 0.019 s [User: 0.580 s, System: 1.044 s]
Range (min … max): 1.621 s … 1.673 s 10 runs
Benchmark 2: ./target/release-with-debug/jj -R ../nixpkgs st (hash = after)
Time (mean ± σ): 760.0 ms ± 5.4 ms [User: 812.9 ms, System: 2214.6 ms]
Range (min … max): 751.4 ms … 768.7 ms 10 runs
Benchmark 3: ./target/release-with-debug/jj -R ../gecko-dev st (hash = before)
Time (mean ± σ): 11.403 s ± 0.648 s [User: 4.546 s, System: 5.932 s]
Range (min … max): 10.553 s … 12.718 s 10 runs
Benchmark 4: ./target/release-with-debug/jj -R ../gecko-dev st (hash = after)
Time (mean ± σ): 5.974 s ± 0.028 s [User: 5.387 s, System: 11.959 s]
Range (min … max): 5.937 s … 6.024 s 10 runs
$ hyperfine --parameter-list repo nixpkgs,gecko-dev --warmup 3 'git -C ../{repo} status'
Benchmark 1: git -C ../nixpkgs status
Time (mean ± σ): 865.4 ms ± 8.4 ms [User: 119.4 ms, System: 1401.2 ms]
Range (min … max): 852.8 ms … 879.1 ms 10 runs
Benchmark 2: git -C ../gecko-dev status
Time (mean ± σ): 2.892 s ± 0.029 s [User: 0.458 s, System: 14.244 s]
Range (min … max): 2.837 s … 2.934 s 10 runs
```
Conclusions:
- ~2x improvement from previous `jj status` time.
- Slightly faster than Git on nixpkgs.
- Still 2x slower than Git on gecko-dev, not sure why.
For reference, Git's default number of threads is defined in the `online_cpus` function: ee48e70a82/thread-utils.c (L21-L66). We are using whatever the Rayon default is.
In preparation of traversing the filesystem in parallel, send updates via `channel`.
An alternative is to modify shared mutable state, e.g. put `self.file_states` behind a mutex or use a concurrent hash-map. This risks leaving the `TreeState` in an invalid state if an error occurs, and makes invariants harder to reason about.
Using a channel introduces a small performance regression. (I didn't try out the concurrent hash-map approach.)
```sh
$ hyperfine --parameter-list hash before,after --setup 'git checkout {hash} && cargo build --profile release-with-debug' --warmup 3 './target/release-with-debug/jj -R ../nixpkgs st'
Benchmark 1: ./target/release-with-debug/jj -R ../nixpkgs st (hash = before)
Time (mean ± σ): 1.533 s ± 0.013 s [User: 0.587 s, System: 0.926 s]
Range (min … max): 1.510 s … 1.559 s 10 runs
Benchmark 2: ./target/release-with-debug/jj -R ../nixpkgs st (hash = after)
Time (mean ± σ): 1.563 s ± 0.021 s [User: 0.607 s, System: 0.936 s]
Range (min … max): 1.518 s … 1.595 s 10 runs
Summary
./target/release-with-debug/jj -R ../nixpkgs st (hash = before) ran
1.02 ± 0.02 times faster than ./target/release-with-debug/jj -R ../nixpkgs st (hash = after)
```
`.gitignores` in ignored directories should be ignored. Before this
commit, we would visit ignored directories like any others if there
were any ignored paths in them.
I've done a lot of preparation for this commit, but There's still a
bit of duplication between the new code and the existing code. I don't
mind improving it if anyone has suggestions. Otherwise I might end up
doing that when I get back to working on snapshotting tree-level
conflicts soon.
This fixes#1785.
The `sub_path` is created by joining `dir` to a basename. I think
calling it just `path` is clear, especially since its the main path
involved in each iteration of the loop.
The `FileStateUpdate` enum now looks very similar to `Option`, so
let's just use that. I also renamed `get_updated_file_state()` to
`get_updated_tree_value()` since it returns a `TreeValue`.
We currently remove the file state for deleted files after walking the
working copy and noticing that the file is not there. However, in the
case of files that have been replaced by special files like Unix
sockets, we delete the file state inside the loop. Let's simplify a
tiny bit by not doing that.
If we don't have a recorded state for a file, we assume that it's new,
so we add it to the tree as the type it appears on disk. That means we
won't check if it exists as a conflict in the current tree. As another
step towards making the file state just a cache, let's instead treat
this case as a dirty file, so we look up the current value from the
tree. That means that adding files will be a tiny bit slower, but I
doubt it will be noticeable (we need to read the file from disk and
write it to the backend anyway).
I want to replace the `DirEntry` argument to
`get_updated_file_state()` by a `PathBuf` and a `Metadata`. To avoid
always reading the metadata, we need to check for ignored files
outside of `get_updated_file_state()`. I also think that gives the
call site a nice symmetry in how we use the `git_ignore` for
directories (`.matches_all_files_in()`)) and files
(`.matches_file()`).
This removes another little bit (literally) of dependency on the
cached file state by reading the old executable bit from the current
tree instead. That helps make it possible to discard the file states
without affecting the resulting snapshot, as we may want to do with
Watchman.
With this change, `write_path_to_store()` contains all the logic for
reading a file from disk and writing it to a `TreeBuilder`, making the
code for added and modified files more similar.
It's faster to add only files matched by the Watchman matcher to the
set of deleted files than to add all files and then removed files not
matched. This speeds up `jj diff` with Watchman in the Linux repo from
~530 ms to ~460 ms.
The intention in this and some future commits is to have `update_file_state` accept `&self` instead of `&mut self` to make clear what data is updated by `update_file_state` and to ensure transactional safety of the `TreeState` contents.
There's a comment saying that mutating the file's current state
simplifies later logic, but I don't think that's true. It might have
been true in the past, when we had `FileType::Conflict`.
When snapshotting a conflict, we read the contents and parse conflict
markers. If we determine that it's no longer a conflict, then we write
a normal file entry to the tree instead. When we do that, we re-read
the file from the working copy. Let's instead write the file contents
we already read.
When a file's mtime has changed on disk, we update our record of that
mtime, but we did so in a separate place for conflicts compared to
non-conflicts. Let's reuse it.
The working copy's current tree tracks whether a file is a
conflict. We also track that in the `TreeState` object. That allows us
to not read the trees object to decide if we should try to parse a
file as a conflict.
One disadvantage is that it's redundant information that needs to be
kept in sync with the tree object. Also, for Watchman, we would like
to completely ignore the persisted `FileState`.
This commit removes the `FileType::Conflict` variant and instead
checks in the tree object whether a given path was a conflict. This is
the change I mentioned in dc8a20773760. We still skip the check
completely if the file's mtime etc. is unchanged, so it shouldn't have
much effect in the common case of a mostly unchanged working copy. I
measured a slowdown on `jj diff` by ~3% in the Linux repo with a clean
working copy with all mtimes bumped. I think the simpler code and
reduced risk of subtle bugs is worth the performance hit.