Summary:
The flush method will write buffered data to disk.
A mistake in Root entry serialization is fixed - it needs to translate dirty
offsets to non-dirty ones.
Reviewed By: DurhamG
Differential Revision: D7223729
fbshipit-source-id: baeaab27627d6cfb7c5798d3a39be4d2b8811e5f
Summary:
Add the main `Index` structure and its constructor.
The structure focuses on the index logic itself. It does not have the checksum
part yet.
Some notes about choices made:
- The use of mmap: mmap is good for random I/O, and has the benefit of
sharing buffers between processes reading the same file. We may be able to
do good user-space caching for the random I/O part instead, but then it's
harder to share the buffers between processes.
- The "read_only" auto decision. The common "open" pattern requires the caller
to pass whether they want to read or write. The index makes the decision
for the caller for convenience (ex. running "hg log" on somebody else's
repo).
- The "load root entry from the end of the file" feature. It's just for
convenience for users wanting to use the Index in a standalone way. We
probably
Reviewed By: DurhamG
Differential Revision: D7208358
fbshipit-source-id: 14b74d7e32ef28bd5bc3483fd560c489d36bf8e5
Summary:
A simple utility that does paths <-> local bytes conversion. It's needed
since Mercurial stores paths using local encoding in manifests.
For POSIX, the code is zero-cost - no real conversion or error can happen.
This is in theory cheaper than what treedirstate does.
For Windows, the "local_encoding" crate is selected as Yuya suggested the
`MultiByteToWideChar` Win32 API [1] and "local_encoding" uses it. It does
the right thing given my experiment with GBK (Chinese, simplified) encoding.
```
....
C:\Users\quark\enc>hg debugshell --config extensions.debugshell=
>>> repo[0].manifest().text()
'\xc4\xbf\xc2\xbc1/\xce\xc4\xbc\xfe1\x00b80de5d138758541c5f05265ad144ab9fa86d1db\n'
>>> repo[0].files()
['\xc4\xbf\xc2\xbc1/\xce\xc4\xbc\xfe1']
extern crate local_encoding;
use std::path::PathBuf;
use local_encoding::{Encoder, Encoding};
const mpath: &[u8] = b"\xc4\xbf\xc2\xbc1/\xce\xc4\xbc\xfe1";
fn main() {
let p = PathBuf::from(Encoding::OEM.to_string(mpath).unwrap());
println!("exists: {}", p.exists());
println!("mpath len: {}, osstr len: {}", mpath.len(), p.as_path().as_os_str().len());
}
exists: true
mpath len: 11, osstr len: 15
```
In the future, we might normalize the paths to UTF-8 before storing them in
manifest to avoid issues.
Differential Revision: D7319604
fbshipit-source-id: a7ed5284be116c4176598b4c742e8228abcc3b02
Summary:
xdiff has a `xdl_trim_ends` step that removes common lines and unmatchable
lines. That is good in theory, but it happens too late: after splitting,
hashing, and adjusting the hash values so they are unique. Those splitting,
hashing and hash-adjusting steps can have noticeable overhead.
For not uncommon cases like diffing two large files with minor differences,
the raw performance of those preparation steps matters a lot. Even
allocating an O(N) array and storing line offsets to it is expensive.
Therefore my previous attempts [1] [2] cannot be good enough since they do
not remove the O(N) array assignment.
This patch adds a preprocessing step, `xdl_trim_files`, that runs before
the other preprocessing steps. It measures the common prefix and suffix and
counts the lines in them (needed for displaying line numbers), without doing
anything else.
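The trimming idea can be sketched in a few lines of Python. This is only an illustration, not the C code in `xdl_trim_files`: the function name `trim_common` and its return shape are made up here, and the sketch is deliberately conservative (it only trims whole lines and may trim less than it could).

```python
def trim_common(a: bytes, b: bytes):
    """Return (prefix_len, prefix_lines, suffix_len, suffix_lines).

    Hypothetical helper for illustration. Both lengths are cut back to
    line boundaries so only whole common lines are trimmed.
    """
    limit = min(len(a), len(b))

    # Scan the common prefix, then back up to just after its last newline.
    p = 0
    while p < limit and a[p] == b[p]:
        p += 1
    p = a.rfind(b"\n", 0, p) + 1  # 0 when the prefix contains no newline
    prefix_lines = a.count(b"\n", 0, p)

    # Scan the common suffix without overlapping the trimmed prefix.
    s = 0
    while s < limit - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
        s += 1
    # Move forward to just after the first newline inside the suffix so
    # the trimmed suffix starts at a line boundary in both inputs.
    nl = a.find(b"\n", len(a) - s)
    s = 0 if nl == -1 else len(a) - (nl + 1)
    suffix_lines = a.count(b"\n", len(a) - s)
    return p, prefix_lines, s, suffix_lines
```

The diff algorithm then only needs to look at `a[prefix_len:len(a) - suffix_len]` and the matching slice of `b`, and adds `prefix_lines` to the line numbers it reports.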
Testing with a crafted large (169MB) file, with a minor change:
```
open('a','w').write(''.join('%s\n' % (i % 100000) for i in xrange(30000000) if i != 6000000))
open('b','w').write(''.join('%s\n' % (i % 100000) for i in xrange(30000000) if i != 6003000))
```
Running xdiff with a simple binary [3], this patch improves the xdiff perf by
more than 10x for the above case:
```
# xdiff before this patch
2.41s user 1.13s system 98% cpu 3.592 total
# xdiff after this patch
0.14s user 0.16s system 98% cpu 0.309 total
# gnu diffutils
0.12s user 0.15s system 98% cpu 0.272 total
# (best of 20 runs)
```
It's still slightly slower than GNU diffutils. But it's pretty close now.
Testing with real repo data:
For the whole repo, this patch makes xdiff 25% faster:
```
# hg perfbdiff --count 100 --alldata -c d334afc585e2 --blocks [--xdiff]
# xdiff, after
! wall 0.058861 comb 0.050000 user 0.050000 sys 0.000000 (best of 100)
# xdiff, before
! wall 0.077816 comb 0.080000 user 0.080000 sys 0.000000 (best of 91)
# bdiff
! wall 0.117473 comb 0.120000 user 0.120000 sys 0.000000 (best of 67)
```
For files that are long (ex. commands.py), the speedup is more than 3x, very
significant:
```
# hg perfbdiff --count 3000 --blocks commands.py.i 1 [--xdiff]
# xdiff, after
! wall 0.690583 comb 0.690000 user 0.690000 sys 0.000000 (best of 12)
# xdiff, before
! wall 2.240361 comb 2.210000 user 2.210000 sys 0.000000 (best of 4)
# bdiff
! wall 2.469852 comb 2.440000 user 2.440000 sys 0.000000 (best of 4)
```
The improvement is also seen for the `json` test case mentioned in D7124455.
xdiff's time improves from 0.3s to 0.04s, similar to GNU diffutils.
This patch is also sent as https://phab.mercurial-scm.org/D2686.
[1]: https://phab.mercurial-scm.org/D2631
[2]: https://phab.mercurial-scm.org/D2634
[3]:
```
// Code to run xdiff from command line. No proper error handling.
mmfile_t readfile(const char *path) {
struct stat st; int fd = open(path, O_RDONLY);
fstat(fd, &st); mmfile_t f = { malloc(st.st_size), st.st_size };
ensure(read(fd, f.ptr, st.st_size) == st.st_size); close(fd); return f; }
static int xdiff_outf(void *priv_, mmbuffer_t *mb, int nbuf) { int i;
for (i = 0; i < nbuf; i++) { write(STDOUT_FILENO, mb[i].ptr, mb[i].size); }
return 0; }
int main(int argc, char const *argv[]) {
mmfile_t a = readfile(argv[1]), b = readfile(argv[2]);
xpparam_t xpp = { XDF_INDENT_HEURISTIC, 0 };
xdemitconf_t xecfg = { 3, 0 }; xdemitcb_t ecb = { 0, &xdiff_outf };
xdl_diff(&a, &b, &xpp, &xecfg, &ecb); return 0; }
```
Reviewed By: ryanmce
Differential Revision: D7151582
fbshipit-source-id: 3f2dd43b74da118bd827af4fc5e1bf65be191ad2
Summary:
`mmap_readonly` will be reused in `index.rs` so let's move it to a shared
utils module.
Reviewed By: DurhamG
Differential Revision: D7208359
fbshipit-source-id: d98779e4e21765ce0e185281c9560245b59b174c
Summary:
Add ScopedFileLock. This is similar to Python's contextmanager.
It's easier to use than the fs2 raw API, since it guarantees the file is
unlocked.
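For readers more familiar with the Python side, the contextmanager analogy looks roughly like this. It is a sketch of the pattern only, not the Rust `ScopedFileLock` API: the name `scoped_file_lock` is hypothetical, and it uses POSIX `fcntl.flock` rather than fs2.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def scoped_file_lock(path, exclusive=True):
    """Hold an advisory lock on `path` for the duration of a `with` block.

    Hypothetical helper: the lock is always released on exit, even if the
    body raises.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        try:
            yield fd
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

The caller cannot forget to unlock, which is the property the raw lock/unlock API does not give.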
Reviewed By: jsgf
Differential Revision: D7203684
fbshipit-source-id: 5d7beed99ff992466ab7bf1fbea0353de4dfe4f9
Summary: They are simpler than radix entry and similar.
Reviewed By: DurhamG
Differential Revision: D7191652
fbshipit-source-id: b516663567267a2e354748396b44c2ac8ebb691f
Summary: These are Rust structures that map to the file format.
Reviewed By: DurhamG
Differential Revision: D7191366
fbshipit-source-id: 23a4431383be9713e955b74306cd68108eb80536
Summary: Document the format. Actual implementation in later diffs.
Reviewed By: DurhamG
Differential Revision: D7190575
fbshipit-source-id: 243992fd052ca7a9688d54d20694e65daebb9660
Summary:
The append-only index is too different so it's cleaner to cherry-pick code
from radixbuf, instead of modifying radixbuf which would break code
depending on it.
Started by picking the base16 iterator part.
`rustc-test` does not work with buck, and seems to be in an unmaintained
state, so benchmark tests are migrated to criterion.
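As a rough picture of what a base16 iterator does, the Python sketch below walks a byte key one 4-bit digit at a time. The name `base16_digits` and the high-nibble-first order are assumptions for illustration, not taken from the Rust code.

```python
def base16_digits(key: bytes):
    """Yield the base16 digits (0..15) of `key`, two per byte.

    Assumption: the high nibble of each byte comes first.
    """
    for byte in key:
        yield byte >> 4
        yield byte & 0x0F
```

A radix tree with 16-way nodes can consume these digits directly as child indexes.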
Reviewed By: DurhamG
Differential Revision: D7189143
fbshipit-source-id: 459a79b4cf16f35d2ff86f11a5980ba1fc627951
Summary:
Filesystems are hard. Append-only sounds like a safe way to write files, but it
only really helps with process crashes. If the OS crashes, it's possible that
other parts of the file get corrupted. For source control, data integrity checks
are important, so bytes not logically touched by appending also need to be
checked.
Implement a `ChecksumTable` which adds integrity checking to append-only
files. It's intended to be used by future append-only indexes.
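One way to picture such a table is a list of per-chunk checksums that is extended as the file grows. The sketch below is an assumption-heavy illustration, not the actual `ChecksumTable` design: the class name `ChunkChecksums`, the tiny chunk size, and the use of CRC32 are all made up here.

```python
import zlib


class ChunkChecksums:
    """Per-chunk checksums for an append-only byte buffer (illustrative)."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size  # assumed; a real table would use a larger size
        self.sums = []   # one CRC32 per chunk; the last may cover a partial chunk
        self.length = 0  # number of bytes covered by `sums`

    def update(self, data: bytes):
        """Extend the table to cover `data`, which only grew by appending."""
        # The last covered chunk may have been partial, so restart from it.
        first = self.length // self.chunk_size
        del self.sums[first:]
        for start in range(first * self.chunk_size, len(data), self.chunk_size):
            self.sums.append(zlib.crc32(data[start:start + self.chunk_size]))
        self.length = len(data)

    def check_range(self, data: bytes, offset: int, length: int) -> bool:
        """Verify the chunks overlapping data[offset:offset + length]."""
        if length == 0:
            return True
        if offset + length > self.length:
            return False  # range is not covered by the table
        cs = self.chunk_size
        for i in range(offset // cs, (offset + length - 1) // cs + 1):
            end = min((i + 1) * cs, self.length)
            if zlib.crc32(data[i * cs:end]) != self.sums[i]:
                return False
        return True
```

Because old chunks are re-verified on read, corruption in bytes that the last append never touched is still detected.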
Reviewed By: DurhamG
Differential Revision: D7108433
fbshipit-source-id: 16daf6b8d04bba464f1ee9221716beba69c1d47b
Summary:
First step of a storage-related building block that is in Rust. The goal is
to use it to replace revlog, obsstore and packfiles.
Extern crates that are likely to be useful are added now to reduce future churn.
Reviewed By: DurhamG
Differential Revision: D7108434
fbshipit-source-id: 97ebd9ba69547d876dcecc05e604acdf9088877e
Summary:
1. Variable Length Arrays are not supported by MSVC, but since this is C++ code, we can just use heap allocation
2. Replacing `inet` with the portability version
Depends on D7196403
Reviewed By: quark-zju
Differential Revision: D7196605
fbshipit-source-id: a0d88b6e06f255ef648c0b35a99b42ba3bee538a
Summary:
Add a "boring" threshold to limit the search range of the indention heuristic,
so the performance of the diff algorithm is mostly unaffected by turning on
indention heuristic.
Reviewed By: ryanmce
Differential Revision: D7145002
fbshipit-source-id: 024ec685f96aa617fb7da141f38fa4e12c4c0fc9
Summary:
xdiff generates hunks for the differences (ex. the question marks in the
`@@ -?,? +?,? @@` part of `diff --git` output). However, bdiff generates
matched hunks instead.
This patch adds a `XDL_EMIT_BDIFFHUNK` flag used by the output function
`xdl_call_hunk_func`. Once set, xdiff will generate bdiff-like hunks
instead. That makes it easier to use xdiff as a drop-in replacement of bdiff.
Note that since `bdiff('', '')` returns `[(0, 0, 0, 0)]`, the shortcut path
`if (xscr)` is removed. I have checked functions called with `xscr` argument
(`xdl_mark_ignorable`, `xdl_call_hunk_func`, `xdl_emit_diff`,
`xdl_free_script`) work just fine with `xscr = NULL`.
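The relationship between the two hunk styles can be sketched in Python. This is not the xdiff implementation: `matched_blocks` is a hypothetical helper, the `(a1, a2, b1, b2)` half-open tuples follow the bdiff result quoted above, and the trailing zero-length block is inferred from `bdiff('', '')` returning `[(0, 0, 0, 0)]`.

```python
def matched_blocks(diff_hunks, len_a, len_b):
    """Convert difference hunks into bdiff-style matching blocks.

    diff_hunks: sorted (a1, a2, b1, b2) half-open line ranges that differ.
    Returns (a1, a2, b1, b2) ranges that match, followed by a zero-length
    block at the end of both inputs.
    """
    blocks = []
    pa = pb = 0  # end of the previous difference in a and b
    for a1, a2, b1, b2 in diff_hunks:
        if a1 > pa:
            # Lines between two differences are identical on both sides.
            blocks.append((pa, a1, pb, b1))
        pa, pb = a2, b2
    if len_a > pa:
        blocks.append((pa, len_a, pb, len_b))
    blocks.append((len_a, len_a, len_b, len_b))
    return blocks
```

With no differences and empty inputs this returns `[(0, 0, 0, 0)]`, which is why the empty-script case cannot simply be skipped.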
Reviewed By: ryanmce
Differential Revision: D7135207
fbshipit-source-id: cfb8c363e586841c06c94af283c7f014ba65fcc0
Summary:
Patience diff is the normal diff algorithm, plus some greediness that
unconditionally matches common unique lines. That means it is easy to
construct cases that make it generate suboptimal results, like:
```
open('a', 'w').write('\n'.join(list('a' + 'x' * 300 + 'u' + 'x' * 700 + 'a\n')))
open('b', 'w').write('\n'.join(list('b' + 'x' * 700 + 'u' + 'x' * 300 + 'b\n')))
```
Patience diff has been advertised as being able to generate better results for
some C code changes. However, the more scientific way to do that is the
indention heuristic [1].
Since patience diff can generate suboptimal results more easily and its
"better" diff feature can be replaced by the new indention heuristic, let's
just remove it and its variant histogram diff to simplify the code.
[1]: 433860f3d0
Reviewed By: ryanmce
Differential Revision: D7124711
fbshipit-source-id: 127e8de6c75d0262687a1b60814813e660aae3da
Summary:
Vendor git's xdiff library from git commit
d7c6c2369ac21141b7c6cceaebc6414ec3da14ad using GPL2+ license.
There is another recent user report that hg diff generates suboptimal
results. It seems the fix to issue4074 isn't good enough. I crafted some
other interesting cases, and hg diff barely has any advantage compared with
GNU diffutils or git diff.
| testcase | gnu diffutils | hg diff       | git diff      |
|          |   lines  time |   lines  time |   lines  time |
| patience |       6  0.00 |     602  0.08 |       6  0.00 |
| random   |   91772  0.90 |  109462  0.70 |   91772  0.24 |
| json     |       2  0.03 | 1264814  1.81 |       2  0.29 |
"lines" means the size of the output, i.e. the count of "+/-" lines. "time"
means seconds needed to do the calculation. Both are the smaller the better.
"hg diff" counts Python startup overhead.
Git and GNU diffutils generate optimal results. For the "json" case, git could
add an optimization that scans for a common prefix and suffix first,
and matches them if their length is greater than half of the text. See
https://neil.fraser.name/news/2006/03/12/. That would make git the fastest
for all of the above cases.
About testcases:
patience:
Aims at the weakness of the greedy "patience diff" algorithm. Using
git's patience diff option would also give a suboptimal result. Generated using
the Python script:
```
open('a', 'w').write('\n'.join(list('a' + 'x' * 300 + 'u' + 'x' * 700 + 'a\n')))
open('b', 'w').write('\n'.join(list('b' + 'x' * 700 + 'u' + 'x' * 300 + 'b\n')))
```
random:
Generated using the script in `test-issue4074.t`. It practically makes the
algorithm suffer. Impressively, git wins in both performance and diff
quality.
json:
The recent user reported case. It's a single line movement near the end of a
very large (800K lines) JSON file.
Reviewed By: ryanmce
Differential Revision: D7124455
fbshipit-source-id: 832651115da770f9d2ed5fdff2e200453c0013f8
Summary:
This allows us to decode VLQ integers at a given offset for anything that
implements `AsRef<[u8]>`, instead of having to couple with a `&mut Read`
interface. The main benefit is getting rid of `mut`: the old `VLQDecode`
interface has to use `&mut Read` since reading has the side effect of changing
the internal position counter.
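The offset-based style can be illustrated with a short Python sketch. It assumes a LEB128-style layout (7 payload bits per byte, least significant group first, high bit set on every byte except the last); the function name `read_vlq_at` and the `(value, size)` return shape are hypothetical and not taken from the Rust trait.

```python
def read_vlq_at(buf, offset):
    """Decode one VLQ integer from `buf` starting at `offset`.

    Returns (value, bytes_consumed). The buffer is never mutated and no
    position state is kept, so the caller decides where to read next.
    """
    value = 0
    shift = 0
    pos = offset
    while True:
        if pos >= len(buf):
            raise ValueError("truncated VLQ integer")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos - offset
        shift += 7
```

Since the function takes an immutable buffer plus an offset, several readers can decode from the same buffer (for example an mmap) without sharing a cursor.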
Reviewed By: markbt
Differential Revision: D7093998
fbshipit-source-id: 20cb14e38c828462c34f32245d0f0f512028b647
Summary:
I'm going to add more ways to do VLQ parsing (ex. reading from a `&[u8]`
instead of a `Read` which has to be mutable). So let's add a benchmark to
compare the `&[u8]` version with the `Read` version.
Reviewed By: DurhamG
Differential Revision: D7092960
fbshipit-source-id: e1189de10396516c732dc73b45b7690a1718f1c0
Summary:
`test::Bencher` is an unstable feature, which is enabled by the 3rd-party crate
`rustc-test`. However, `rustc-test` does not work with buck build. So let's
work around that by allowing all usage of `test::Bencher` to be disabled by a
feature, and turn on that feature in buck build. Cargo build will remain
unchanged.
Reviewed By: singhsrb
Differential Revision: D7011703
fbshipit-source-id: e08ba9516bf7fadb6edb52ab107e0172df0aaf5b
Summary:
On the other two platforms we return the result of `madvise`, so let's return -1,
as this is the error return value of `madvise` on POSIX.
Reviewed By: quark-zju
Differential Revision: D6979093
fbshipit-source-id: 7c715eb459aaad6c21fae6e346e8650211649182
Summary: The current location of these defines is really odd and does not work with the current version of `PACKEDSTRUCT` macro expansion (it expands everything on the same line, so the `#define`s end up inline, which fails to compile).
Reviewed By: quark-zju
Differential Revision: D6970926
fbshipit-source-id: ed01042760fa729004e159b492cf67a4afd25923
Summary:
Let's create a new portability header, which can be used on both Windows and
POSIX.
Reviewed By: quark-zju
Differential Revision: D6970928
fbshipit-source-id: a3970c50260f52bfc0a9420a4ff11d93ace304b0
Summary: This is needed to make our C code compile on Windows.
Reviewed By: quark-zju
Differential Revision: D6970929
fbshipit-source-id: 2cfe46e0718fe75916912d0e59c5400038e03a12
Summary:
Adds some basic building blocks to build hg using buck.
Header files are cleaned up, so they are relative to the project root.
Some minor changes to C code are made to remove clang build
warnings.
Rust dependencies, fb-hgext C/Python dependencies (ex. cstore,
mysql-connector), and 3rd-party dependencies like python-lz4
are not built yet. But the built hg binary should be able to run
most tests just fine.
Reviewed By: wez
Differential Revision: D6814686
fbshipit-source-id: 59eefd5a3ad86db2ad1c821ed824c9f1878c93e4
Summary: Based on feedback to D6687860.
Test Plan: n/a
Reviewers: durham, #mercurial
Reviewed By: durham
Differential Revision: https://phabricator.intern.facebook.com/D6714211
Signature: 6714211:1515788399:386b8f7330f343349234d1f317e5ac0a594142cf
Summary:
Moves ctreemanifest into hgext/extlib/. D6679698 was committed to scratch branch
by mistake.
Test Plan: make local && cd tests && ./run-tests.py
Reviewers: durham, #mercurial, #sourcecontrol
Reviewed By: durham
Differential Revision: https://phabricator.intern.facebook.com/D6684623
Signature: 6684623:1515522634:9bec363d00990d9ff7d5f655e30ab8cae636155c
Summary:
This moves the cdatapack code to the new lib/ directory and adds it to the main
setup.py.
Test Plan: hg purge --all && make local && cd tests && ./run-tests.py -S -j 48
Reviewers: #mercurial
Differential Revision: https://phabricator.intern.facebook.com/D6677491
Summary:
I didn't notice the test failure because clang-format was not installed.
Might be a good idea to make it a hard error.
Test Plan: Run test-check-clang-format.t
Reviewers: phillco, #mercurial
Reviewed By: phillco
Subscribers: mathieubaudet
Differential Revision: https://phabricator.intern.facebook.com/D6679576
Signature: 6679576:1515457526:6b1935858da284b896244b0d99e2fef03ead97b8
Summary:
The `lib/linelog` directory contains pure C code that is unrelated to
either Mercurial or Python. The `mercurial/cyext` directory contains Cython
extension code (although in linelog's case, the Cython extension is unrelated
to Mercurial).
Cython is now a hard dependency to simplify the code.
Test Plan: `make local` and check `from mercurial.cyext import linelog` works.
Reviewers: durham, #mercurial
Reviewed By: durham
Subscribers: durham, fried
Differential Revision: https://phabricator.intern.facebook.com/D6678541
Signature: 6678541:1515455512:967266dc69c702dbff95fdea05671e11c32ebf28
Summary:
Move the rust libraries and extensions to their new locations, and integrate
them with the hg-crew setup.py.
Test Plan: Run `python setup.py build` and verify rust extensions are built.
Reviewers: durham, #mercurial
Reviewed By: durham
Subscribers: fried, jsgf, mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D6677251
Tasks: T24908724
Signature: 6677251:1515450235:920faf40babbce9b09e3283ff9ca328d1c5c51e6
Summary:
cdatapack depends on clib, so let's move it to lib/ outside of fb-hgext.
None of the consumers of these files were changed. They will be changed as they
are moved into the main part of the repo.
Test Plan: hg purge --all && make local && cd tests && ./run-tests.py -S -j 48
Reviewers: mitrandir, #mercurial
Reviewed By: mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D6677197
Signature: 6677197:1515447873:399fb3e7beb5cc1ad8db18f42b359ffbfbeb21f2
Summary:
cdatapack depends on sha1detectcoll, so let's add the library to setup.py before
we add cdatapack.
Test Plan:
hg purge --all && make local && cd tests/ && ./run-tests.py -S -j 48
Verified sha1dc was in the build output and the tests passed.
Reviewers: quark, #mercurial
Reviewed By: quark
Differential Revision: https://phabricator.intern.facebook.com/D6676405
Signature: 6676405:1515444508:2da65c6c3a18267a1d3c151c8e9acf60b674ffc2