Commit Graph

436 Commits

Author SHA1 Message Date
Arun Kulshreshtha
eb86dabbc1 mononokeapi: migrate to Rust 2018
Summary: Migrate this crate to Rust 2018 edition.

Reviewed By: phillco

Differential Revision: D13742720

fbshipit-source-id: 0a2f6a713cff43cf2814cf41df4ac910b9901e5c
2019-01-18 19:29:41 -08:00
Arun Kulshreshtha
c067536fae mononokeapi: use url::Url instead of http::Uri
Summary: The canonical URL type in Rust, `http::Uri`, does not support manipulating URLs easily. (e.g., concatenating path components, etc.) As such, switch to using the `Url` type from the `url` crate, which does support URL manipulation, and convert to `http::Uri` before passing the resulting URL to Hyper.

Reviewed By: phillco

Differential Revision: D13738139

fbshipit-source-id: c7de67f1596ebc1bdde89d3fe87086f49c32b5db
2019-01-18 15:47:17 -08:00
Xavier Deguillard
33688947c6 revisionstore: sort pack files in list_packs
Summary:
Directory listing is different in every OS, and due to the current repack
implementation, this directly affects the order in which the packfiles are added
to the new one. Since the resulting packfile name depends on the hash of its
content, the name was influenced by the directory order.

By sorting the files in list_packs, the packfile name will be independent of
the directory listing and thus be the same for all the OSes.
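
As an illustration of the idea (not the actual revisionstore code), a directory listing can be made order-independent by collecting the matching paths and sorting them before use. The function name `list_packs_sorted` and its `extension` parameter are assumptions made for this sketch:

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: return every file in `dir` with the given extension,
/// sorted by path so the result does not depend on the OS listing order.
fn list_packs_sorted(dir: &Path, extension: &str) -> io::Result<Vec<PathBuf>> {
    let mut packs = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        // Keep only files whose extension matches (e.g. "datapack").
        if path.extension().and_then(|ext| ext.to_str()) == Some(extension) {
            packs.push(path);
        }
    }
    // `read_dir` makes no ordering guarantee, so sort explicitly.
    packs.sort();
    Ok(packs)
}
```

Any consumer that hashes the packs in this order then sees the same sequence on every platform.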

Reviewed By: singhsrb

Differential Revision: D13700935

fbshipit-source-id: 01e055a0c1bcf7fb2dc4faf614dfb20cd4499017
2019-01-16 15:18:24 -08:00
Xavier Deguillard
87cf0f533b revisionstore: Add a basic rust incremental repack.
Summary: For now, combine all files smaller than 100MB that accumulate to less than 4GB.
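
A minimal sketch of such a selection rule, assuming the sizes are already known. The constant names, the binary interpretation of MB/GB, and the choice to stop at the first pack that would exceed the total are all assumptions of this example, not details taken from the diff:

```rust
/// Thresholds from the summary above; the names are hypothetical and the
/// units are assumed to be binary (MiB/GiB).
const MAX_PACK_SIZE: u64 = 100 * 1024 * 1024;
const MAX_TOTAL_SIZE: u64 = 4 * 1024 * 1024 * 1024;

/// Pick the packs to combine: skip any pack that is too large on its own,
/// and stop once adding another pack would reach the total budget.
fn select_packs(packs: &[(String, u64)]) -> Vec<String> {
    let mut total = 0u64;
    let mut selected = Vec::new();
    for (name, size) in packs {
        if *size >= MAX_PACK_SIZE {
            continue;
        }
        if total + *size >= MAX_TOTAL_SIZE {
            break;
        }
        total += *size;
        selected.push(name.clone());
    }
    selected
}
```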

Reviewed By: DurhamG

Differential Revision: D13603760

fbshipit-source-id: 3fa74f1ced3d3ccd463af8f187ef5e0254e1820b
2019-01-16 09:47:09 -08:00
Xavier Deguillard
2525a6e9ee revisionstore: Use PackWriter to write to {data,history}packs.
Summary: Use the newly introduced PackWriter to write the {data,history}packs.

Reviewed By: markbt

Differential Revision: D13603759

fbshipit-source-id: 528a6af7c4ac3321aeec0559805de12114224cfd
2019-01-16 09:47:09 -08:00
Xavier Deguillard
e6a60b68f3 revisionstore: Add an efficient pack writer.
Summary:
The packfiles are currently being written via an unbuffered file. This is
inefficient as every write to the file results in a write(2) syscall.
By buffering these writes we can reduce the number of syscalls and thus
increase the throughput of pack writing operations.
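
The effect can be sketched with the standard library's `BufWriter`. This is not the `PackWriter` from the diff; `write_entries` is a hypothetical helper that only shows how many small writes are coalesced into fewer syscalls:

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Hypothetical helper: write every entry through a buffer and return the
/// number of bytes written.
fn write_entries(path: &Path, entries: &[&[u8]]) -> io::Result<u64> {
    // Each `write_all` below lands in the in-memory buffer; the underlying
    // file only sees a write when the buffer fills up or is flushed.
    let mut writer = BufWriter::new(File::create(path)?);
    let mut written = 0u64;
    for entry in entries {
        writer.write_all(entry)?;
        written += entry.len() as u64;
    }
    // Flush explicitly so that I/O errors are reported instead of being
    // silently dropped when the writer goes out of scope.
    writer.flush()?;
    Ok(written)
}
```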

Reviewed By: markbt

Differential Revision: D13603758

fbshipit-source-id: 649186a852d427a1473695b1d32cc9cd87a74a75
2019-01-16 09:47:09 -08:00
Mark Thomas
c6c99b4777 configparser: update pest to 2.1.0
Summary:
Update pest to 2.1.0.

This version has a new behaviour for parser error messages: the line feed at
the end of the line is shown in the error output.

Reviewed By: wez

Differential Revision: D13671099

fbshipit-source-id: b8d1142a44a56a0b21b3b72cf027f3f8a30f421e
2019-01-16 03:52:09 -08:00
Arun Kulshreshtha
28e20c5997 Reexport public types from public submodules
Summary:
The revisionstore crate currently consists of several public submodules,
each exposing several public types. The APIs exposed by each of the modules
require using types from the other modules. As such, users of this crate are
forced to have complex nested imports to use any of its functionality.

This diff helps ease this problem by reexporting the public types exposed from
each of the public submodules at the top level, thereby allowing crate users to
`use` all of the required types without needing nested imports.
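
The pattern looks roughly like the following. The module and type names (`revisionstore_sketch`, `key`, `datastore`, `Key`, `Delta`) are placeholders chosen for this sketch and are not claimed to match the crate's real layout:

```rust
mod revisionstore_sketch {
    pub mod key {
        #[derive(Debug, PartialEq)]
        pub struct Key {
            pub name: String,
        }
    }

    pub mod datastore {
        use super::key::Key;

        pub struct Delta {
            pub key: Key,
            pub data: Vec<u8>,
        }
    }

    // Reexport the public types at the top level of the crate/module.
    pub use self::datastore::Delta;
    pub use self::key::Key;
}

// Users can now import everything from one place instead of writing
// `revisionstore_sketch::datastore::Delta` and `revisionstore_sketch::key::Key`.
use revisionstore_sketch::{Delta, Key};

fn make_delta(name: &str, data: &[u8]) -> Delta {
    Delta {
        key: Key { name: name.to_string() },
        data: data.to_vec(),
    }
}
```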

Reviewed By: singhsrb

Differential Revision: D13686913

fbshipit-source-id: 9fb3cce8783787aa5f3f974c7168afada5952712
2019-01-15 21:20:03 -08:00
Xavier Deguillard
e6135fa88e revisionstore: Use get_missing instead of get_delta in repack.
Summary:
The latter tries to read from the disk, while the former is purely in memory and
thus more efficient.

Reviewed By: DurhamG, markbt

Differential Revision: D13603757

fbshipit-source-id: 5fd120ba4065d6a65cb2982db9ab81db3ea26524
2019-01-15 17:02:38 -08:00
Mark Thomas
3b9eb801e1 types: use Fallible
Summary:
Use the `Fallible` type alias provided by `failure` rather than defining our
own.

Differential Revision: D13657313

fbshipit-source-id: ae249bc15037cc2be019ce7ce8a440c153aa31cc
2019-01-15 03:50:47 -08:00
Mark Thomas
3570402d79 watchman_client: use Fallible
Summary:
Use the `Fallible` type alias provided by `failure` rather than defining our
own.

Differential Revision: D13657312

fbshipit-source-id: 55134ee93f1f3aaaeefe5644a4a1f2285603bc1c
2019-01-15 03:50:47 -08:00
Mark Thomas
7f1258f091 commitcloudsubscriber: use Fallible
Summary:
Use the `Fallible` type alias provided by `failure` rather than defining our
own.

Differential Revision: D13657314

fbshipit-source-id: f1a379089972f7f0066c49ddedf606d36b7ac260
2019-01-15 03:50:47 -08:00
Mark Thomas
d3709fde5b mononokeapi: use Fallible
Summary:
Use the `Fallible` type alias provided by `failure` rather than defining our
own.

Differential Revision: D13657310

fbshipit-source-id: cae73fc239a6ad30bb6ef56a664d1ef5a2a19b5f
2019-01-15 03:50:47 -08:00
Xavier Deguillard
f170cceea2 revisionstore: Repackable::delete now takes the ownership of self.
Summary:
On some platforms, removing a file can fail if it's still mapped or opened. In
mercurial, this can happen during repack as the datapacks are removed while
still being mapped.
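
A sketch of why taking `self` by value helps: the method can drop the open handle (or mapping) before unlinking the file. The `Repackable::delete` name comes from the commit title; the `Pack` struct and its fields are hypothetical:

```rust
use std::fs::{self, File};
use std::io;
use std::path::PathBuf;

trait Repackable {
    /// Consumes the pack, so no handle or mapping can outlive the deletion.
    fn delete(self) -> io::Result<()>;
}

/// Hypothetical pack type holding an open handle to its backing file.
struct Pack {
    path: PathBuf,
    file: File,
}

impl Repackable for Pack {
    fn delete(self) -> io::Result<()> {
        let Pack { path, file } = self;
        // Close the handle first; on some platforms removing a file that is
        // still open or mapped fails.
        drop(file);
        fs::remove_file(path)
    }
}
```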

Reviewed By: DurhamG

Differential Revision: D13615938

fbshipit-source-id: fdc1ff9370e2767e52ee1828552f4598105f784f
2019-01-14 21:14:13 -08:00
Xavier Deguillard
da3dd2319f revisionstore: remove repacked pack files
Summary:
After repacking the data/history packs, we need to cleanup the
repacked files. This was an omission from D13363853.

Reviewed By: markbt

Differential Revision: D13577592

fbshipit-source-id: 36e7d5b8e86affe47cdd10d33a769969f02b8a62
2019-01-11 16:54:15 -08:00
Xavier Deguillard
ce16778656 remotefilelog: set proper file permissions on closed mutable packs.
Summary:
The python version of the mutable packs set the permission to read-only after
writing them, while the rust version keeps them writeable. Let's make the rust
one more consistent.
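
With the standard library alone, marking a finished file read-only can be sketched as below. `make_read_only` is a hypothetical name, and the real change may set explicit mode bits instead:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: mark an already written file as read-only.
fn make_read_only(path: &Path) -> io::Result<()> {
    let mut perms = fs::metadata(path)?.permissions();
    // Clears the write bits on Unix and sets the read-only attribute on Windows.
    perms.set_readonly(true);
    fs::set_permissions(path, perms)
}
```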

Reviewed By: markbt

Differential Revision: D13573572

fbshipit-source-id: 61256994562aa09058a88a7935c16dfd7ddf9d18
2019-01-11 16:54:15 -08:00
Mark Thomas
98417b1ffb configparser: fix warning about unused Result
Summary:
Use of `write!` requires checking for errors, however in this case, there is no
need to use `write!`, as we just want the error as a string.
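
The difference can be sketched as follows; both function names are hypothetical and the example is not the configparser code itself:

```rust
use std::fmt::{self, Display, Write};

/// Using `write!` produces a `Result` that must be handled, otherwise rustc
/// warns about an unused `Result`.
fn error_to_string_checked(err: &dyn Display) -> Result<String, fmt::Error> {
    let mut out = String::new();
    write!(out, "{}", err)?;
    Ok(out)
}

/// When only the string is wanted, `to_string` avoids the `Result` entirely.
fn error_to_string(err: &dyn Display) -> String {
    err.to_string()
}
```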

Reviewed By: ikostia

Differential Revision: D13596497

fbshipit-source-id: 5892025344936936188cf3a8ca227e71eff57d55
2019-01-08 06:19:55 -08:00
Jun Wu
f6158659f8 configparser: use hardcoded system config path on Windows
Summary:
When I was debugging an eden importer issue with Puneet, we saw errors caused
by important extensions (ex. remotefilelog, lz4revlog) not being loaded.  It
turned out that configparser was checking the "exe dir" to decide where to
load "system configs". For example, if we run:

  C:\open\fbsource\fbcode\scm\hg\build\pythonMSVC2015\python.exe eden_import_helper.py

The "exe dir" is "C:\open\fbsource\fbcode\scm\hg\build", and system config is
not there.

Instead of copying "mercurial.ini" to every possible "exe dir", this diff just
switches to a hard-coded system config path. It's now consistent with what we
do on POSIX systems.

The logic to copy "mercurial.ini" to "C:\open\fbsource\fbcode\scm\hg" or
"C:\tools\hg" becomes unnecessary and is removed.

Reviewed By: singhsrb

Differential Revision: D13542939

fbshipit-source-id: 5fb50d8e42d36ec6da28af29de89966628fe5549
2018-12-22 01:53:03 -08:00
Saurabh Singh
b193e23dd2 test-check-fix-code: unbreak test by fixing copyrights
Summary:
`test-check-fix-code.t` was failing due to copyright header missing
from certain files. This commit fixes the files by running

```
contrib/fix-code.py FILE
```

as suggested in the failure message.

Reviewed By: DurhamG

Differential Revision: D13538506

fbshipit-source-id: d8063c9a0e665377a9976abeccb68fbef6781950
2018-12-21 10:03:26 -08:00
Jun Wu
22e9000fc9 lz4-pyframe: add compresshc
Summary:
Unfortunately required symbols are not exposed by lz4-sys. So we just declare
them ourselves.

Make sure it compresses better:

  In [1]: c=open('/bin/bash').read();
  In [2]: from mercurial.rust import lz4
  In [3]: len(lz4.compress(c))
  Out[3]: 762906
  In [4]: len(lz4.compresshc(c))
  Out[4]: 626970

While it's much slower for larger data (and compresshc is slower than pylz4):

  Benchmarking (easy to compress data, 20MB)...
            pylz4.compress: 10328.03 MB/s
       rustlz4.compress_py:  9373.84 MB/s
          pylz4.compressHC:  1666.80 MB/s
     rustlz4.compresshc_py:  8298.57 MB/s
          pylz4.decompress:  3953.03 MB/s
     rustlz4.decompress_py:  3935.57 MB/s
  Benchmarking (hard to compress data, 0.2MB)...
            pylz4.compress:  4357.88 MB/s
       rustlz4.compress_py:  4193.34 MB/s
          pylz4.compressHC:  3740.40 MB/s
     rustlz4.compresshc_py:  2730.71 MB/s
          pylz4.decompress:  5600.94 MB/s
     rustlz4.decompress_py:  5362.96 MB/s
  Benchmarking (hard to compress data, 20MB)...
            pylz4.compress:  5156.72 MB/s
       rustlz4.compress_py:  5447.00 MB/s
          pylz4.compressHC:    33.70 MB/s
     rustlz4.compresshc_py:    22.25 MB/s
          pylz4.decompress:  2375.42 MB/s
     rustlz4.decompress_py:  5755.46 MB/s

Note python-lz4 was using an ancient version of lz4. So there could be differences.

Reviewed By: DurhamG

Differential Revision: D13528200

fbshipit-source-id: 6be1c1dd71f57d40dcffcc8d212d40a853583254
2018-12-20 17:54:22 -08:00
Jun Wu
4f24bffdde cpython-ext: move pybuf to cpython-ext
Summary:
The `pybuf` provides a way to read `bytes`, `bytearray`, some `buffer` types in
a zero-copy way. The main benefit is to use same code to support different
input types. It's copied to a couple of places. Let's move it to `cpython-ext`.

Reviewed By: DurhamG

Differential Revision: D13516206

fbshipit-source-id: f58881c4bfe651a6fdb84cf317a74c3c8d7a4961
2018-12-20 17:54:22 -08:00
Jun Wu
f23c6bc7e3 cpython-ext: add a way to pre-allocate PyBytes
Summary: Make it possible to write content directly into a PyBytes buffer.

Reviewed By: DurhamG

Differential Revision: D13528202

fbshipit-source-id: 8c0a4ed030439a8dc40cdfbd72b1f6734a8b2036
2018-12-20 17:54:22 -08:00
Jun Wu
6e88ac4794 lz4-pyframe: provide decompress_into API
Summary:
This allows decompressing into a pre-allocated buffer. After some experiments,
it seems `bytearray` will just break too many things, ex:

- bytearray is not hashable
- bytearray[index] returns an int
- a = bytearray('x'); b = a; b += '3' # will mutate 'a'
- ''.join([bytearray('')]) will raise TypeError

Therefore we have to use zero-copy `bytes` instead, which is less elegant. But
this API change is a step forward.

Reviewed By: DurhamG

Differential Revision: D13528201

fbshipit-source-id: 1cfaf5d55efdc0d6c0df85df9960fe9682028b08
2018-12-20 17:54:22 -08:00
Jun Wu
7831e2a4ce cpython-ext: add ways to zero-copy Vec<u8> into a Python object
Summary:
I need to convert `Vec<u8>` to a Python object in a zero-copy way for rustlz4
performance.

Assuming Python and Rust use the same memory allocator, it's possible to transfer
the control of a malloc-ed pointer from Rust to Python. Use this to implement
zero-copy. PyByteArrayObject is chosen because its struct contains such a pointer.
PyBytes cannot be used as it embeds the bytes, without using a pointer.

Sadly there are no CPython APIs to do this job. So we have to write to the raw
structures. That means the code will crash if python is replaced by
python-debug (due to Python object header change). However, that seems less of
an issue given the performance wins. If python-debug does become a problem, we can
try vendoring libpython directly.

I didn't implement a feature-rich `PyByteArray` Rust object. It's not easy to
do so outside the cpython crate. Most helper macros to declare types cannot be
reused, because they refer to `::python`, which is not available in the current
crate.

Reviewed By: DurhamG

Differential Revision: D13516209

fbshipit-source-id: 9aa089b309beb71d4d21f6c63fcb97dbc798b5f8
2018-12-20 17:54:22 -08:00
Jun Wu
35c85018cd lz4-pyframe: add a benchmark
Summary:
This gives some sense about how fast it is.

Background: I was trying to get rid of python-lz4, by exposing this to Python.
However, I noticed it's 10x slower than python-lz4. Therefore I added some
benchmark here to test if it's the wrapper or the Rust lz4 code.

It does not seem to be this crate:

```
  # Pure Rust
  compress (100M)                77.170 ms
  decompress (~100M)             67.043 ms

  # python-lz4
  In [1]: import lz4, os
  In [2]: b=os.urandom(100000000);
  In [3]: %timeit lz4.compress(b)
  10 loops, best of 3: 87.4 ms per loop
```

Reviewed By: DurhamG

Differential Revision: D13516205

fbshipit-source-id: f55f94bbecc3b49667ed12174f7000b1aa29e7c4
2018-12-20 17:54:21 -08:00
Jun Wu
b3893b3d3c indexedlog: add methods on Log to do prefix lookups
Summary:
This exposes the underlying lookup functions from `Index`.

Alternatively we can allow access to `Index` and provide an `iter_started_from`
method on `Log` which takes a raw offset. I have been trying to avoid exposing
raw offsets in public interfaces, as they would change after `flush()` and cause
problems.

Reviewed By: markbt

Differential Revision: D13498303

fbshipit-source-id: 8b00a2a36a9383e3edb6fd7495a005bc985fd461
2018-12-20 15:50:55 -08:00
Jun Wu
3237b77e4c indexedlog: add APIs to lookup by prefix
Summary:
This is the missing API before `indexedlog::Index` can fit in the
`changelog.partialmatch` case. It's actually more flexible as it can provide
some example commit hashes while the existing revlog.c or radixbuf
implementation just error out saying "ambiguous prefix".

It can also be "abused" for the semantics of sorted "sub-keys": by replacing
"key" with "key + subkey" when inserting into the index, looking up using "key"
would return a lazy result list (`PrefixIter`) sorted by "subkey". Note:
the radix tree is NOT efficient (both in time and space) when there are common
prefixes. So this use-case needs to be careful.
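
The "key + subkey" idea can be sketched with an ordered map standing in for the index. This swaps the radix tree for a `BTreeMap` purely for illustration; `lookup_prefix` is a hypothetical function and not the `indexedlog` API:

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a prefix lookup: lazily yield every entry whose
/// key starts with `prefix`, in sorted key order.
fn lookup_prefix<'a>(
    index: &'a BTreeMap<Vec<u8>, u64>,
    prefix: &'a [u8],
) -> impl Iterator<Item = (&'a Vec<u8>, &'a u64)> + 'a {
    index
        // Start at the first key >= prefix ...
        .range(prefix.to_vec()..)
        // ... and stop as soon as a key no longer shares the prefix.
        .take_while(move |(key, _)| key.starts_with(prefix))
}
```

Inserting entries under `key + subkey` and then looking up `key` yields the values ordered by subkey, which is the behaviour described above.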

Reviewed By: markbt

Differential Revision: D13498301

fbshipit-source-id: 637856ebd761734d68b20c15866424b1d4518ad6
2018-12-20 15:50:55 -08:00
Jun Wu
562b7a1704 indexedlog: add a function to convert base16 to base256
Summary: This will be used in prefix lookups.
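
A minimal sketch of such a conversion, assuming an even number of hex digits. The signature and the `nibble` helper are assumptions of this example; a real prefix lookup also has to deal with odd-length prefixes, which this sketch simply rejects:

```rust
/// Decode one hex digit into its 4-bit value.
fn nibble(c: u8) -> Option<u8> {
    match c {
        b'0'..=b'9' => Some(c - b'0'),
        b'a'..=b'f' => Some(c - b'a' + 10),
        b'A'..=b'F' => Some(c - b'A' + 10),
        _ => None,
    }
}

/// Hypothetical converter: pack pairs of base16 digits into bytes (base256).
/// Returns `None` for odd-length input or non-hex characters.
fn base16_to_base256(hex: &[u8]) -> Option<Vec<u8>> {
    if hex.len() % 2 != 0 {
        return None;
    }
    hex.chunks(2)
        .map(|pair| Some((nibble(pair[0])? << 4) | nibble(pair[1])?))
        .collect()
}
```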

Reviewed By: markbt

Differential Revision: D13498300

fbshipit-source-id: 3db7a21d6f35a18699d9dc3a0eca71a5410e0e61
2018-12-20 15:50:55 -08:00
Jun Wu
443a8f33b3 indexedlog: move binary indexedlog_dump out
Summary:
It makes testing duplicated - now `cargo test` would try running tests on 2 entry points:
lib.rs and indexedlog_dump.rs.  Move it to a separate crate to solve the issue.

Reviewed By: markbt

Differential Revision: D13498266

fbshipit-source-id: 8abf07c1272dfa825ec7701fd8ea9e0d1310ec5f
2018-12-18 08:17:21 -08:00
Jun Wu
61b1a5f475 indexedlog: fix rustc warnings
Summary: `write!` result needs to be used.

Reviewed By: markbt

Differential Revision: D13471967

fbshipit-source-id: d48752bcac05dd33b112679d7faf990eb8ddd651
2018-12-17 12:10:52 -08:00
Xavier Deguillard
79164e920c revisionstore: replace rand::chacha with rand_chacha
Summary: The former is deprecated and thus compiling revisionstore shows many warnings.

Reviewed By: markbt

Differential Revision: D13379278

fbshipit-source-id: d4b4662a1ad00997de4c46274deaf22f48487328
2018-12-17 12:07:22 -08:00
Mark Thomas
ca135cd33f cpython-failure: Integrate cpython PyResult with the failure crate
Summary:
Adds a new crate `cpython-result`, which provides a `ResultExt` trait, which
extends the failure `Result` type to allow conversion to `PyResult` by
converting the error to an appropriate Python Exception.

Reviewed By: quark-zju

Differential Revision: D12980782

fbshipit-source-id: 44a63d31f9ecf2f77efa3b37c68f9a99eaf6d6fa
2018-12-14 06:43:40 -08:00
Mark Thomas
cf4b52c19c mutationstore: add mutationstore
Summary:
The mutationstore is a new store for recording records of commit mutations for
commits that are not in the local repository.

It uses an indexedlog to store the data.  Each mutation entry corresponds to
the information about the mutation that led to the creation of a particular
commit, which is recorded as the successor in the entry.

Entries can come from three possible places:

* `Commit` metadata for a commit not available locally
* `Obsmarkers` for repos that have been migrated from evolution tracking
* `Synthetic` for entries created synthetically, e.g. by a pullcreatemarkers
  implementation.

The other commits referred to in an entry must predate the successor commit.
For entries that originated from commits, this is ensured, as the successor
commit hash includes the other commit hashes.  For other entry types, it is
an error to refer to later commits, and any entry that causes a cycle will
be ignored.

Reviewed By: quark-zju

Differential Revision: D12980773

fbshipit-source-id: 040d3f7369a113e710ed8c9f61fabec6c5ec9258
2018-12-14 06:43:40 -08:00
Mark Thomas
1346ff92c4 types: implement Debug for Node
Summary:
The derived debug for Node prints out each byte as a decimal number.  Instead,
make the Debug output for nodes look like `Node("hexstring")`.
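
A hand-written `Debug` along these lines might look as follows. The 20-byte tuple-struct layout of `Node` is an assumption of this sketch:

```rust
use std::fmt;

/// Assumed layout: a node is a 20-byte hash.
struct Node([u8; 20]);

impl fmt::Debug for Node {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Print each byte as two lowercase hex digits instead of a decimal list.
        write!(f, "Node(\"")?;
        for byte in &self.0 {
            write!(f, "{:02x}", byte)?;
        }
        write!(f, "\")")
    }
}
```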

Reviewed By: DurhamG

Differential Revision: D12980775

fbshipit-source-id: 042cbf6eade8403759684969e1f69f7f4e335582
2018-12-14 06:43:40 -08:00
Mark Thomas
88ab626e9a types: Add Nodes::random_distinct to randomly generate sets of nodes
Summary:
Add a utility function for tests to generate a vector of random nodes.  This
will be used in future tests.

Reviewed By: DurhamG

Differential Revision: D12980784

fbshipit-source-id: 73fc8643503e11a46a845671df94c912a5e49d23
2018-12-14 06:43:40 -08:00
Mark Thomas
d0c03f6aaf types: Add WriteNodeExt and ReadNodeExt
Summary:
Add traits that extend `std::io::Read` and `std::io::Write` to implement new
`read_node` and `write_node` methods, allowing simple reading and writing of
binary nodes from and to streams.
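
The extension-trait pattern can be sketched as below. The trait and method names come from the summary; the 20-byte `Node` layout and the blanket implementations are assumptions of this example:

```rust
use std::io::{self, Read, Write};

/// Assumed layout: a node is a 20-byte hash.
#[derive(Debug, PartialEq)]
struct Node([u8; 20]);

trait ReadNodeExt: Read {
    /// Read exactly 20 bytes from the stream and wrap them in a `Node`.
    fn read_node(&mut self) -> io::Result<Node> {
        let mut buf = [0u8; 20];
        self.read_exact(&mut buf)?;
        Ok(Node(buf))
    }
}

// Every reader gets `read_node` for free.
impl<R: Read + ?Sized> ReadNodeExt for R {}

trait WriteNodeExt: Write {
    /// Write the raw 20 bytes of the node to the stream.
    fn write_node(&mut self, node: &Node) -> io::Result<()> {
        self.write_all(&node.0)
    }
}

// Every writer gets `write_node` for free.
impl<W: Write + ?Sized> WriteNodeExt for W {}
```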

Reviewed By: DurhamG

Differential Revision: D12980778

fbshipit-source-id: fc6751cd43a1693a5a5a3ac93aea74aec5fda4fe
2018-12-14 06:43:40 -08:00
Xavier Deguillard
5307fd8867 revisionstore: implement basic repack in rust
Summary:
The future of mercurial is rust, and one of the missing pieces is repacking of data/history packs. For now, let's implement a very basic packing strategy that just pulls all the packs into one, with one small optimization that puts all the delta chains close together in the output file.

At first, it's expected that this code will be driven by the existing python code, but more and more will be done in rust as time goes on.

Reviewed By: DurhamG

Differential Revision: D13363853

fbshipit-source-id: ad1ac2039e1732f7141d99abf7f01804a9bde097
2018-12-12 12:44:03 -08:00
Jun Wu
421c7b3f45 indexedlog: add a tool to dump indexedlog content
Summary: The tool can dump indexedlog content. Useful for manually investigating issues.

Reviewed By: DurhamG

Differential Revision: D13051387

fbshipit-source-id: 8687a1aa9dfb54776e80f184208c49da2492c34d
2018-12-06 14:57:52 -08:00
Jun Wu
54dc931140 indexedlog: use inlined leaf entries to further reduce index size
Summary:
Add a new entry type - INLINE_LEAF, which embeds the EXT_KEY and LINK entries
to save space.

The index size for referred keys is significantly reduced with little overhead:

  index insertion (owned key)     3.732 ms
  index insertion (referred key)  3.604 ms
  index flush                    11.868 ms
  index lookup (memory)           1.159 ms
  index lookup (disk, no verify)  2.175 ms
  index lookup (disk, verified)   4.303 ms
  index size (5M owned keys)     216626039
  index size (5M referred keys)   96616431
    11.87s user 2.96s system 98% cpu 15.107 total

The breakdown of the "5M referred keys" size is:

  type          count     bytes
  radixes       1729472   33835772
  inline_leafs  5000000   62780651

There are no other kinds of entries stored.

Previously, the index size of referred keys was:

  index size (5M referred keys)  136245815 bytes

So it's 136MB -> 96MB, a 29% decrease (about 40MB).

Reviewed By: DurhamG

Differential Revision: D13036801

fbshipit-source-id: 27e68e4b6c332c1dc419abc6aba69271952e4b3d
2018-12-06 14:57:52 -08:00
Jun Wu
a4958163ee indexedlog: optimize size of radix entries (BC)
Summary:
Replace the 20-byte "jump table" with 3-byte "flag + bitmap". This saves space
for indexes less than 4GB. There are some reserved bits in the "flag" so if we
run into space issues when indexes are larger than 4GB, we can try adding
6-byte integer, or VLQ back without breaking backwards-compatibility.
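
One way to picture a bitmap-based radix entry is sketched below: a 16-bit bitmap marks which of the 16 possible base16 children exist, and only the present children are stored. The struct, its fields, and the popcount-based lookup are assumptions made for illustration and are not taken from the actual on-disk format:

```rust
/// Hypothetical in-memory view of a base16 radix entry.
struct RadixEntry {
    /// Bit `i` is set when child `i` exists.
    bitmap: u16,
    /// Offsets of the existing children, in nibble order.
    children: Vec<u64>,
}

impl RadixEntry {
    fn from_children(children: &[Option<u64>; 16]) -> Self {
        let mut bitmap = 0u16;
        let mut present = Vec::new();
        for (i, child) in children.iter().enumerate() {
            if let Some(offset) = child {
                bitmap |= 1 << i;
                present.push(*offset);
            }
        }
        RadixEntry { bitmap, children: present }
    }

    /// Look up the child for a nibble (0..=15).
    fn child(&self, nibble: u8) -> Option<u64> {
        let bit = 1u16 << (nibble & 0x0f);
        if self.bitmap & bit == 0 {
            return None;
        }
        // The number of set bits below `bit` is the index into `children`.
        let index = (self.bitmap & (bit - 1)).count_ones() as usize;
        Some(self.children[index])
    }
}
```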

It seems to hurt flush performance a bit, because we have to scan the child
array twice. However, lookup (the most important performance) does not change
much. And the index is more compact.

After:

  index flush                    19.644 ms
  index lookup (disk, no verify)  2.220 ms
  index lookup (disk, verified)   4.067 ms
  index size (5M owned keys)     216626039 bytes
  index size (5M referred keys)  136245815 bytes

Before:

  index flush                    16.764 ms
  index lookup (disk, no verify)  2.205 ms
  index lookup (disk, verified)   4.030 ms
  index size (5M owned keys)     240838647 bytes
  index size (5M referred keys)  160458423 bytes

For the "referred key" case, it's 160->136MB, a 15% decrease.

A detailed breakdown of the index components is:

After:

  type       count     bytes (using owned keys)
  radixes    1729472   33835772
  links      5000000   27886336
  leafs      5000000   44629384
  keys       5000000  110000000

  type       count     bytes (using referred keys)
  radixes    1729472   33835772
  links      5000000   27886336
  leafs      5000000   44629384
  ext_keys   5000000   29894315

Before:

  type       count     bytes (using owned keys)
  radixes    1729472   58048380
  links      5000000   27886336
  leafs      5000000   44903923
  keys       5000000  110000000

  type       count     bytes (using referred keys)
  radixes    1729472   58048380
  links      5000000   27886336
  leafs      5000000   44629384
  ext_keys   5000000   29894315

Leaf nodes are taking too much space. It seems the next big optimization might
be inlining ext_keys into leafs.

Reviewed By: DurhamG, markbt

Differential Revision: D13028196

fbshipit-source-id: 6043b16fd67a497eb52d20a17e153fcba5cb3e81
2018-12-06 14:57:52 -08:00
Jun Wu
d8117b3b04 indexedlog: increase key count for size test
Summary:
Since the size test only runs once, we can use a larger number of keys. This is
closer to some production use-cases.

`cargo bench size` shows:

  index size (5M owned keys)     240838647
  index size (5M referred keys)  160458423

It currently uses 32 bytes per key for 5M referred keys.

Reviewed By: markbt

Differential Revision: D13027880

fbshipit-source-id: 726f5fb2da056e77ab93d82fda9f1afa500d0a8d
2018-12-06 14:57:52 -08:00
Jun Wu
55b6331aa4 indexedlog: add more benchmarks
Summary:
Add benchmarks about index sizes, and a benchmark of insertion using key
references.

An example `cargo bench` result running on my devserver looks like:

  index insertion (owned key)     3.551 ms
  index insertion (referred key)  3.713 ms
  index flush                    20.648 ms
  index lookup (memory)           1.087 ms
  index lookup (disk, no verify)  2.041 ms
  index lookup (disk, verified)   4.347 ms
  index size (owned key)            886010
  index size (referred key)         534298

Reviewed By: markbt

Differential Revision: D13027879

fbshipit-source-id: 70644c504026ffee2122d857d5035f5b7eea4f42
2018-12-06 14:57:52 -08:00
Jun Wu
d7129256d4 indexedlog: switch checksum table to little endian (BC)
Summary:
For checksum values like xxhash, there is no benefit using big endian. Switch
to little endian so it's slightly faster on the major platforms we
care about.

This is a breaking change. However, the format is not used in production yet.
So there is no migration code.

Reviewed By: markbt

Differential Revision: D13015465

fbshipit-source-id: ca83d19b3328370d089b03a33e848e64b728ef2a
2018-12-06 14:57:52 -08:00
Jun Wu
75b4f92c44 indexedlog: support different checksum functions for Log entries (BC)
Summary:
Previously, the format of a Log entry was hard-coded - length, xxhash, and
content. The xxhash always takes 8 bytes.

For small (ex. 40-byte) entries, xxhash32 is actually faster and takes less
disk space.

Introduce the "entry flags" concept so we can store some metadata about what
checksum function to use. The concept could be potentially used to support
other new format changes at per entry level in the future.

As we're here, also support data without checksums. That can be useful for
content with its own checksum, like a blob store with its own SHA1 integrity
check.

Performance-wise, log insertion is slower (but the majority of the insertion
overhead would be on the index part), while iteration is a little bit faster,
perhaps because the log can use less data.

Before:

  log insertion                  15.874 ms
  log iteration (memory)          6.778 ms
  log iteration (disk)            6.830 ms

After:

  log insertion                  18.114 ms
  log iteration (memory)          6.403 ms
  log iteration (disk)            6.307 ms

Reviewed By: DurhamG, markbt

Differential Revision: D13051386

fbshipit-source-id: 629c251633ecf85058ee7c3ce7a9f576dfac7bdf
2018-12-06 14:57:52 -08:00
Jun Wu
049cd99f05 indexedlog: use non-VLQ encoding for xxhash (BC)
Summary:
Xxhash result won't usually have leading zeros. So VLQ encoding is not an
efficient choice. Use non-VLQ encoding instead.
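
The size argument can be checked with a small sketch. `encode_vlq` is a generic 7-bits-per-byte VLQ encoder written for this example, and the little-endian byte order in `encode_fixed` is an assumption; neither is claimed to match the indexedlog implementation:

```rust
/// Generic VLQ: 7 payload bits per byte, high bit set on all but the last byte.
fn encode_vlq(mut value: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return out;
        }
        out.push(byte | 0x80);
    }
}

/// Fixed-width encoding: always 8 bytes for a 64-bit hash.
fn encode_fixed(value: u64) -> [u8; 8] {
    value.to_le_bytes()
}
```

A 64-bit hash with its high bits set needs 10 bytes as a VLQ but only 8 bytes in fixed-width form, and the fixed-width form avoids the per-byte loop.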

Performance-wise, this is noticeably faster than before:

  log insertion                  14.161 ms
  log insertion with index      102.724 ms
  log flush                      11.336 ms
  log iteration (memory)          6.351 ms
  log iteration (disk)            7.922 ms
    10.18s user 3.66s system 97% cpu 14.218 total
  log insertion                  13.377 ms
  log insertion with index       97.422 ms
  log flush                      11.792 ms
  log iteration (memory)          6.890 ms
  log iteration (disk)            7.139 ms
    10.20s user 3.56s system 97% cpu 14.117 total
  log insertion                  14.573 ms
  log insertion with index       94.216 ms
  log flush                      18.993 ms
  log iteration (memory)          7.867 ms
  log iteration (disk)            7.567 ms
    9.85s user 3.73s system 96% cpu 14.073 total
  log insertion                  15.526 ms
  log insertion with index       98.868 ms
  log flush                      19.600 ms
  log iteration (memory)          7.533 ms
  log iteration (disk)            7.150 ms
    10.13s user 4.02s system 96% cpu 14.647 total
  log insertion                  14.629 ms
  log insertion with index      100.449 ms
  log flush                      20.997 ms
  log iteration (memory)          7.299 ms
  log iteration (disk)            7.518 ms
    10.14s user 3.65s system 96% cpu 14.274 total

This is a format-breaking change. Fortunately we haven't really used the old
format in production yet.

Reviewed By: DurhamG, markbt

Differential Revision: D13015463

fbshipit-source-id: 6e7e4f7a845ea8dbf0904b3902740b65cc7467d5
2018-12-06 14:57:52 -08:00
Jun Wu
42c3ef6eb6 indexedlog: add benchmark for "log"
Summary:
Some simple benchmarks for "log". The initial result running from my devserver
looks like:

  log insertion                  33.146 ms
  log insertion with index      106.449 ms
  log flush                       9.623 ms
  log iteration (memory)         10.644 ms
  log iteration (disk)           11.517 ms
    13.75s user 3.61s system 97% cpu 17.778 total
  log insertion                  27.906 ms
  log insertion with index      107.683 ms
  log flush                      19.204 ms
  log iteration (memory)         10.239 ms
  log iteration (disk)           11.118 ms
    12.89s user 3.55s system 97% cpu 16.924 total
  log insertion                  31.645 ms
  log insertion with index      109.403 ms
  log flush                       9.416 ms
  log iteration (memory)         10.226 ms
  log iteration (disk)           10.757 ms
    13.07s user 3.02s system 97% cpu 16.423 total
  log insertion                  31.848 ms
  log insertion with index      109.332 ms
  log flush                      18.345 ms
  log iteration (memory)         10.709 ms
  log iteration (disk)           11.346 ms
    13.12s user 3.70s system 97% cpu 17.276 total
  log insertion                  29.665 ms
  log insertion with index      106.041 ms
  log flush                      16.159 ms
  log iteration (memory)         10.367 ms
  log iteration (disk)           11.110 ms
    12.99s user 3.27s system 97% cpu 16.717 total

Reviewed By: markbt

Differential Revision: D13015464

fbshipit-source-id: 035fee6c8b6d0bea4cfe194eed3d58ba4b5ebcb8
2018-12-06 14:57:52 -08:00
Durham Goode
1a3a0bcd72 nodemap: add key iteration
Summary:
An upcoming diff will need the ability to iterate over all the keys in
the store. So let's expose that functionality.

Reviewed By: quark-zju

Differential Revision: D13062575

fbshipit-source-id: a173fcdbbf44e2d3f09f7229266cca6f3e67944b
2018-12-06 11:47:41 -08:00
Durham Goode
60b3bebaff nodemap: python bindings for rust nodemap
Summary: Simple python bindings for the new nodemap rust structure

Reviewed By: quark-zju

Differential Revision: D13062572

fbshipit-source-id: d60407b87bfc19b496de09273a9c8d6b59af0b8b
2018-12-06 11:47:41 -08:00
Durham Goode
e9b755198c nodemap: introduce rust bidirectional node map
Summary:
Introduces a nodemap structure that stores the mapping between two
nodes with bidirectional indexes.
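
Conceptually, a bidirectional map keeps one index per direction. The sketch below uses two in-memory `HashMap`s in place of whatever storage the real nodemap uses; the type alias, struct, and method names are all hypothetical:

```rust
use std::collections::HashMap;

/// Assumed layout: a node is a 20-byte hash.
type Node = [u8; 20];

/// Hypothetical in-memory bidirectional map between two kinds of nodes.
#[derive(Default)]
struct NodeMap {
    first_to_second: HashMap<Node, Node>,
    second_to_first: HashMap<Node, Node>,
}

impl NodeMap {
    /// Record the pair in both indexes so either side can be looked up.
    fn add(&mut self, first: Node, second: Node) {
        self.first_to_second.insert(first, second);
        self.second_to_first.insert(second, first);
    }

    fn lookup_by_first(&self, first: &Node) -> Option<&Node> {
        self.first_to_second.get(first)
    }

    fn lookup_by_second(&self, second: &Node) -> Option<&Node> {
        self.second_to_first.get(second)
    }
}
```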

Reviewed By: quark-zju

Differential Revision: D13047698

fbshipit-source-id: 967bf4b26a4b57e4fa2421a342edb21d3a5adbf6
2018-12-06 11:47:41 -08:00
Durham Goode
668ba5165c indexedlog: add an iterator function for iterating over keys
Summary:
You can currently iterate over indexedlog entries, but there's no way to
iterate over the keys without keeping a copy of the index function with you.
Let's add a key iterator function.

Reviewed By: quark-zju

Differential Revision: D13010744

fbshipit-source-id: 1fcaf959ae82417e5cbafae7c1927c3ae8f8e76a
2018-12-06 11:47:41 -08:00