Summary:
Change fields in IndexDef to private. Provide a public constructor method and
switch users to use that instead. This makes it possible to change the IndexDef
struct in the future (ex. having extra optional fields about whether the index
is backed by radix tree or something different).
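A minimal sketch of the pattern, for illustration only. The field names, the `Backend` enum, and the `with_backend` setter are assumptions and may not match the actual `IndexDef` definition:

```rust
/// Hypothetical backend choice; the summary only hints that such an
/// optional field might be added later.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Backend {
    RadixTree,
    Other,
}

/// Sketch of an index definition with private fields.
pub struct IndexDef {
    name: &'static str,
    func: fn(&[u8]) -> Vec<Vec<u8>>,
    backend: Backend,
}

impl IndexDef {
    /// Public constructor. Callers never name the fields directly, so
    /// new optional fields can be added without breaking them.
    pub fn new(name: &'static str, func: fn(&[u8]) -> Vec<Vec<u8>>) -> Self {
        IndexDef { name, func, backend: Backend::RadixTree }
    }

    /// Builder-style setter for an optional field.
    pub fn with_backend(mut self, backend: Backend) -> Self {
        self.backend = backend;
        self
    }

    pub fn name(&self) -> &str {
        self.name
    }

    pub fn backend(&self) -> Backend {
        self.backend
    }

    /// Run the index function on an entry to get its index keys.
    pub fn keys(&self, data: &[u8]) -> Vec<Vec<u8>> {
        (self.func)(data)
    }
}
```

With this shape, adding another optional field later only needs a new default in `new` and a new setter.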
Differential Revision: D14608955
fbshipit-source-id: 62a413268d97ba96b2c4efd2ce67cd4fa0ff4293
Summary:
Windows disallows rewriting or truncating mmapped files. Fix the tests by
either dropping the mmap or skipping the test.
Reviewed By: sfilipco
Differential Revision: D14572119
fbshipit-source-id: dccafdc66db3830c2919232d899ba31365120066
Summary:
The `load_or_create_meta` function is subject to filesystem races. Solve it by
always taking a lock.
This hurts performance a little bit. But `open()` should not be in a hot loop.
So it should probably be fine.
Reviewed By: sfilipco
Differential Revision: D14568122
fbshipit-source-id: d9b28555ab94252da4717de709b780b361e1dda7
Summary:
On Windows it's impossible to open(2) a directory. Therefore, add a utility
function that creates a `lock` file automatically on Windows and opens that file
instead.
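A rough sketch of such a helper, assuming the name `open_dir_for_lock` and a `lock` file placed directly inside the directory (both are illustrative, not necessarily what the diff does):

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;

/// Open a file handle for `dir` that can later be used for locking.
/// Hypothetical helper; the name and exact behavior are assumptions.
pub fn open_dir_for_lock(dir: &Path) -> io::Result<File> {
    if cfg!(windows) {
        // A directory cannot be opened as a file here, so use a `lock`
        // file inside the directory, creating it if it does not exist.
        OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .open(dir.join("lock"))
    } else {
        // On Unix-like systems the directory itself can be opened read-only.
        File::open(dir)
    }
}
```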
Reviewed By: sfilipco
Differential Revision: D14568117
fbshipit-source-id: bc7ae7046be654560c38fbd98ec4dd58c071b1dc
Summary:
Previously, `load_or_create_meta` could return without actually creating the
meta file. That leads to problems when `load_or_create_meta` is called a
second time via `flush()`: it rewrites the primary file incorrectly. On Windows,
the rewrite of the primary file fails.
Fix it by actually writing a meta file before returning.
Reviewed By: sfilipco
Differential Revision: D14568118
fbshipit-source-id: da3ad42bf48a923d732b1719839ca1953bd2b06c
Summary:
This exposes the underlying lookup functions from `Index`.
Alternatively we can allow access to `Index` and provide an `iter_started_from`
method on `Log` which takes a raw offset. I have been trying to avoid exposing
raw offsets in public interfaces, as they would change after `flush()` and cause
problems.
Reviewed By: markbt
Differential Revision: D13498303
fbshipit-source-id: 8b00a2a36a9383e3edb6fd7495a005bc985fd461
Summary:
This is the missing API before `indexedlog::Index` can fit in the
`changelog.partialmatch` case. It's actually more flexible as it can provide
some example commit hashes, while the existing revlog.c or radixbuf
implementations just error out saying "ambiguous prefix".
It can also be "abused" for the semantics of sorted "sub-keys": by replacing
"key" with "key + subkey" when inserting into the index, looking up using "key"
returns a lazy result list (`PrefixIter`) sorted by "subkey". Note:
the radix tree is NOT efficient (in both time and space) when there are common
prefixes, so this use-case needs care.
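The "key + subkey" idea can be sketched as follows. A `BTreeMap` stands in for the radix tree purely for illustration, and `SubkeyIndex` and `scan_prefix` are made-up names, not the crate's API. The sketch assumes fixed-length keys, so that one key is never a prefix of another key:

```rust
use std::collections::BTreeMap;

/// Stand-in for the index: an ordered map from full key to a value.
/// The real index is a radix tree; only the ordering idea is shown here.
pub struct SubkeyIndex {
    map: BTreeMap<Vec<u8>, u64>,
}

impl SubkeyIndex {
    pub fn new() -> Self {
        SubkeyIndex { map: BTreeMap::new() }
    }

    /// Insert under "key + subkey" rather than "key" alone.
    pub fn insert(&mut self, key: &[u8], subkey: &[u8], value: u64) {
        let mut full = key.to_vec();
        full.extend_from_slice(subkey);
        self.map.insert(full, value);
    }

    /// Lazily iterate entries whose full key starts with `prefix`.
    /// Results come out in byte order, i.e. sorted by subkey.
    pub fn scan_prefix<'a>(
        &'a self,
        prefix: &'a [u8],
    ) -> impl Iterator<Item = (&'a [u8], u64)> + 'a {
        self.map
            .range(prefix.to_vec()..)
            .take_while(move |(k, _)| k.starts_with(prefix))
            .map(|(k, v)| (k.as_slice(), *v))
    }
}
```

Looking up `key` then yields every `key + subkey` entry in subkey order, without materializing the whole list up front.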
Reviewed By: markbt
Differential Revision: D13498301
fbshipit-source-id: 637856ebd761734d68b20c15866424b1d4518ad6
Summary: This will be used in prefix lookups.
Reviewed By: markbt
Differential Revision: D13498300
fbshipit-source-id: 3db7a21d6f35a18699d9dc3a0eca71a5410e0e61
Summary:
It duplicates testing: `cargo test` now tries to run tests on 2 entry points,
lib.rs and indexedlog_dump.rs. Move it to a separate crate to solve the issue.
Reviewed By: markbt
Differential Revision: D13498266
fbshipit-source-id: 8abf07c1272dfa825ec7701fd8ea9e0d1310ec5f
Summary: `write!` result needs to be used.
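For illustration, a small sketch of the pattern (the `describe` function is made up and not part of the crate):

```rust
use std::fmt::Write;

/// Format a short description into a String.
pub fn describe(offset: u64, len: usize) -> Result<String, std::fmt::Error> {
    let mut out = String::new();
    // `write!` returns a Result; propagate it with `?` instead of
    // silently dropping it (which also triggers an unused-result warning).
    write!(out, "offset={} len={}", offset, len)?;
    Ok(out)
}
```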
Reviewed By: markbt
Differential Revision: D13471967
fbshipit-source-id: d48752bcac05dd33b112679d7faf990eb8ddd651
Summary:
Add a new entry type - INLINE_LEAF, which embeds the EXT_KEY and LINK entries
to save space.
The index size for referred keys is significantly reduced with little overhead:
  index insertion (owned key)     3.732 ms
  index insertion (referred key)  3.604 ms
  index flush                     11.868 ms
  index lookup (memory)           1.159 ms
  index lookup (disk, no verify)  2.175 ms
  index lookup (disk, verified)   4.303 ms
  index size (5M owned keys)      216626039
  index size (5M referred keys)   96616431
  11.87s user 2.96s system 98% cpu 15.107 total
The breakdown of the "5M referred keys" size is:
  type          count     bytes
  radixes       1729472   33835772
  inline_leafs  5000000   62780651
There are no other kinds of entries stored.
Previously, the index size for referred keys was:
  index size (5M referred keys)   136245815 bytes
So it's 136MB -> 96MB, a 40MB (29%) decrease.
Reviewed By: DurhamG
Differential Revision: D13036801
fbshipit-source-id: 27e68e4b6c332c1dc419abc6aba69271952e4b3d
Summary:
Replace the 20-byte "jump table" with a 3-byte "flag + bitmap". This saves space
for indexes smaller than 4GB. There are some reserved bits in the "flag", so if we
run into space issues when indexes are larger than 4GB, we can try adding a
6-byte integer or VLQ back without breaking backwards compatibility.
It seems to hurt flush performance a bit, because we have to scan the child
array twice. However, lookup performance (the most important) does not change
much, and the index is more compact.
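As a rough illustration of how a bitmap can replace a per-child table, here is a sketch that assumes 16 children per radix node (one per 4-bit nibble) and a 16-bit presence bitmap. The struct and its layout are assumptions, not the actual on-disk format:

```rust
/// Sketch of a radix node with up to 16 children.
pub struct RadixNode {
    /// Bit `i` is set when child `i` exists.
    bitmap: u16,
    /// Offsets of present children only, in ascending child order.
    children: Vec<u64>,
}

impl RadixNode {
    /// Build from a full 16-slot table where 0 means "no child".
    pub fn from_table(table: &[u64; 16]) -> Self {
        let mut bitmap = 0u16;
        let mut children = Vec::new();
        for (i, &offset) in table.iter().enumerate() {
            if offset != 0 {
                bitmap |= 1 << i;
                children.push(offset);
            }
        }
        RadixNode { bitmap, children }
    }

    /// Look up child `nibble` (must be in 0..16) by counting the set
    /// bits below it to find its position in the compact array.
    pub fn child(&self, nibble: u8) -> Option<u64> {
        let bit = 1u16 << nibble;
        if self.bitmap & bit == 0 {
            return None;
        }
        let index = (self.bitmap & (bit - 1)).count_ones() as usize;
        Some(self.children[index])
    }
}
```

Absent children cost one bit each instead of a table slot, which is where the space saving would come from in a layout like this.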
After:
  index flush                     19.644 ms
  index lookup (disk, no verify)  2.220 ms
  index lookup (disk, verified)   4.067 ms
  index size (5M owned keys)      216626039 bytes
  index size (5M referred keys)   136245815 bytes
Before:
  index flush                     16.764 ms
  index lookup (disk, no verify)  2.205 ms
  index lookup (disk, verified)   4.030 ms
  index size (5M owned keys)      240838647 bytes
  index size (5M referred keys)   160458423 bytes
For the "referred key" case, it's 160->136MB, a 15% decrease.
A detailed breakdown of the index components:
After:
  type          count     bytes (using owned keys)
  radixes       1729472   33835772
  links         5000000   27886336
  leafs         5000000   44629384
  keys          5000000   110000000
  type          count     bytes (using referred keys)
  radixes       1729472   33835772
  links         5000000   27886336
  leafs         5000000   44629384
  ext_keys      5000000   29894315
Before:
  type          count     bytes (using owned keys)
  radixes       1729472   58048380
  links         5000000   27886336
  leafs         5000000   44903923
  keys          5000000   110000000
  type          count     bytes (using referred keys)
  radixes       1729472   58048380
  links         5000000   27886336
  leafs         5000000   44629384
  ext_keys      5000000   29894315
Leaf nodes are taking too much space. It seems the next big optimization might
be inlining ext_keys into leafs.
Reviewed By: DurhamG, markbt
Differential Revision: D13028196
fbshipit-source-id: 6043b16fd67a497eb52d20a17e153fcba5cb3e81
Summary:
Since the size test only runs once, we can use a larger number of keys. This is
closer to some production use-cases.
`cargo bench size` shows:
  index size (5M owned keys)      240838647
  index size (5M referred keys)   160458423
It currently uses 32 bytes per key for 5M referred keys.
Reviewed By: markbt
Differential Revision: D13027880
fbshipit-source-id: 726f5fb2da056e77ab93d82fda9f1afa500d0a8d
Summary:
Add benchmarks about index sizes, and a benchmark of insertion using key
references.
An example `cargo bench` result running on my devserver looks like:
  index insertion (owned key)     3.551 ms
  index insertion (referred key)  3.713 ms
  index flush                     20.648 ms
  index lookup (memory)           1.087 ms
  index lookup (disk, no verify)  2.041 ms
  index lookup (disk, verified)   4.347 ms
  index size (owned key)          886010
  index size (referred key)       534298
Reviewed By: markbt
Differential Revision: D13027879
fbshipit-source-id: 70644c504026ffee2122d857d5035f5b7eea4f42
Summary:
For checksum values like xxhash, there is no benefit to using big endian. Switch
to little endian so it's slightly faster on the major platforms we
care about.
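A minimal sketch of little-endian checksum serialization using the standard library. The helper names are illustrative and not taken from the crate:

```rust
/// Serialize a 64-bit checksum as 8 little-endian bytes.
pub fn write_checksum(buf: &mut Vec<u8>, checksum: u64) {
    buf.extend_from_slice(&checksum.to_le_bytes());
}

/// Read a checksum back; returns None if fewer than 8 bytes are available.
pub fn read_checksum(buf: &[u8]) -> Option<u64> {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(buf.get(..8)?);
    Some(u64::from_le_bytes(bytes))
}
```

On little-endian CPUs this conversion is a plain copy, with no byte swap needed.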
This is a breaking change. However, the format is not used in production yet.
So there is no migration code.
Reviewed By: markbt
Differential Revision: D13015465
fbshipit-source-id: ca83d19b3328370d089b03a33e848e64b728ef2a
Summary:
Previously, the format of a Log entry was hard-coded: length, xxhash, and
content. The xxhash always took 8 bytes.
For small (ex. 40-byte) entries, xxhash32 is actually faster and takes less
disk space.
Introduce the "entry flags" concept so we can store some metadata about what
checksum function to use. The concept could be potentially used to support
other new format changes at per entry level in the future.
As we're here, also support data without checksums. That can be useful for
content with its own checksum, like a blob store with its own SHA1 integrity
check.
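The "entry flags" idea could look roughly like this. The flag values and type names below are hypothetical, since the summary does not spell out the actual constants:

```rust
/// Hypothetical flag bits; the real values are not given in this summary.
const ENTRY_FLAG_HAS_XXHASH64: u32 = 1;
const ENTRY_FLAG_HAS_XXHASH32: u32 = 2;

/// Which checksum, if any, protects an entry.
#[derive(Debug, PartialEq)]
pub enum Checksum {
    None,
    Xxhash32,
    Xxhash64,
}

impl Checksum {
    /// Decode the checksum kind from per-entry flags.
    pub fn from_flags(flags: u32) -> Option<Self> {
        let has64 = flags & ENTRY_FLAG_HAS_XXHASH64 != 0;
        let has32 = flags & ENTRY_FLAG_HAS_XXHASH32 != 0;
        match (has64, has32) {
            (false, false) => Some(Checksum::None),
            (false, true) => Some(Checksum::Xxhash32),
            (true, false) => Some(Checksum::Xxhash64),
            // Both set: treated as invalid in this sketch.
            (true, true) => None,
        }
    }

    /// Number of bytes the checksum occupies in the entry.
    pub fn size(&self) -> usize {
        match self {
            Checksum::None => 0,
            Checksum::Xxhash32 => 4,
            Checksum::Xxhash64 => 8,
        }
    }
}
```

Unused flag bits leave room for other per-entry format changes later.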
Performance-wise, log insertion is slower (but the majority of insertion overhead
would be on the index part), and iteration is a little bit faster, perhaps because
the log uses less data.
Before:
  log insertion             15.874 ms
  log iteration (memory)    6.778 ms
  log iteration (disk)      6.830 ms
After:
  log insertion             18.114 ms
  log iteration (memory)    6.403 ms
  log iteration (disk)      6.307 ms
Reviewed By: DurhamG, markbt
Differential Revision: D13051386
fbshipit-source-id: 629c251633ecf85058ee7c3ce7a9f576dfac7bdf
Summary:
Xxhash result won't usually have leading zeros. So VLQ encoding is not an
efficient choice. Use non-VLQ encoding instead.
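To see why, here is a small sketch of a VLQ encoder with 7 payload bits per byte. The byte order used here (low bits first) is an assumption and may differ from the project's VLQ:

```rust
/// Encode `value` as a VLQ: 7 payload bits per byte, with the high bit
/// set on every byte except the last to signal that more bytes follow.
pub fn encode_vlq(mut value: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return out;
        }
        out.push(byte | 0x80);
    }
}
```

A hash value with its top bits set needs 9 or 10 VLQ bytes, while a fixed-width `u64` always takes 8, so VLQ only pays off for small integers.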
Performance-wise, this is noticeably faster than before:
  log insertion             14.161 ms
  log insertion with index  102.724 ms
  log flush                 11.336 ms
  log iteration (memory)    6.351 ms
  log iteration (disk)      7.922 ms
  10.18s user 3.66s system 97% cpu 14.218 total
  log insertion             13.377 ms
  log insertion with index  97.422 ms
  log flush                 11.792 ms
  log iteration (memory)    6.890 ms
  log iteration (disk)      7.139 ms
  10.20s user 3.56s system 97% cpu 14.117 total
  log insertion             14.573 ms
  log insertion with index  94.216 ms
  log flush                 18.993 ms
  log iteration (memory)    7.867 ms
  log iteration (disk)      7.567 ms
  9.85s user 3.73s system 96% cpu 14.073 total
  log insertion             15.526 ms
  log insertion with index  98.868 ms
  log flush                 19.600 ms
  log iteration (memory)    7.533 ms
  log iteration (disk)      7.150 ms
  10.13s user 4.02s system 96% cpu 14.647 total
  log insertion             14.629 ms
  log insertion with index  100.449 ms
  log flush                 20.997 ms
  log iteration (memory)    7.299 ms
  log iteration (disk)      7.518 ms
  10.14s user 3.65s system 96% cpu 14.274 total
This is a format-breaking change. Fortunately we haven't really used the old
format in production yet.
Reviewed By: DurhamG, markbt
Differential Revision: D13015463
fbshipit-source-id: 6e7e4f7a845ea8dbf0904b3902740b65cc7467d5
Summary:
Add some simple benchmarks for "log". The initial results from running on my devserver
look like:
  log insertion             33.146 ms
  log insertion with index  106.449 ms
  log flush                 9.623 ms
  log iteration (memory)    10.644 ms
  log iteration (disk)      11.517 ms
  13.75s user 3.61s system 97% cpu 17.778 total
  log insertion             27.906 ms
  log insertion with index  107.683 ms
  log flush                 19.204 ms
  log iteration (memory)    10.239 ms
  log iteration (disk)      11.118 ms
  12.89s user 3.55s system 97% cpu 16.924 total
  log insertion             31.645 ms
  log insertion with index  109.403 ms
  log flush                 9.416 ms
  log iteration (memory)    10.226 ms
  log iteration (disk)      10.757 ms
  13.07s user 3.02s system 97% cpu 16.423 total
  log insertion             31.848 ms
  log insertion with index  109.332 ms
  log flush                 18.345 ms
  log iteration (memory)    10.709 ms
  log iteration (disk)      11.346 ms
  13.12s user 3.70s system 97% cpu 17.276 total
  log insertion             29.665 ms
  log insertion with index  106.041 ms
  log flush                 16.159 ms
  log iteration (memory)    10.367 ms
  log iteration (disk)      11.110 ms
  12.99s user 3.27s system 97% cpu 16.717 total
Reviewed By: markbt
Differential Revision: D13015464
fbshipit-source-id: 035fee6c8b6d0bea4cfe194eed3d58ba4b5ebcb8
Summary:
An upcoming diff will need the ability to iterate over all the keys in
the store. So let's expose that functionality.
Reviewed By: quark-zju
Differential Revision: D13062575
fbshipit-source-id: a173fcdbbf44e2d3f09f7229266cca6f3e67944b
Summary:
You can currently iterate over indexlog entries, but there's no way to
iterate over the keys without keeping a copy of the index function with you.
Let's add a key iterator function.
Reviewed By: quark-zju
Differential Revision: D13010744
fbshipit-source-id: 1fcaf959ae82417e5cbafae7c1927c3ae8f8e76a
Summary:
Make the Rust implementation of BookmarkStore backed by indexed-log.
Note that this no longer matches the on-disk representation of the existing
mercurial bookmark store.
Reviewed By: DurhamG
Differential Revision: D13133605
fbshipit-source-id: 2e0a27738bcec607892b0edab6f759116929c8e1
Summary:
This is done by running `fix-code.py`. Note that those strings are
semvers, so they do not pin down the exact version. An API-compatible upgrade
is still possible.
Reviewed By: ikostia
Differential Revision: D10213073
fbshipit-source-id: 82f90766fb7e02cdeb6615ae3cb7212d928ed48d
Summary:
The "misc" benchmark requires the base16 module to be public. It was made
private in a previous change. Let's make it public again so the benchmark can
run.
Reviewed By: singhsrb
Differential Revision: D13015031
fbshipit-source-id: 0dc1542803aae290de26651e367898eebfc95e83
Summary: It needs to be Send to be used in cpython.
Reviewed By: ikostia
Differential Revision: D10250289
fbshipit-source-id: ea57e356a0752764e50db9b6872b5cc4a456303f
Summary:
Make the documentation of public APIs more detailed. Hide overly detailed
information (file format).
Reviewed By: DurhamG
Differential Revision: D10250140
fbshipit-source-id: d9d9af9d67984b80f07db13e69bbffdf77e6a30e
Summary:
The log module is the "entry point" of other features. Update it so things are
more detailed. I tried to make it more friendly for people without knowledge
about the implementation details.
This could probably be further improved by adding some examples. For now, I'm
focusing on the plain English parts.
To reviewers: Let me know how it reads assuming no prior knowledge
of the implementation. Ways to make sentences shorter and more natural to native
speakers without losing important information are also very welcome.
Reviewed By: DurhamG
Differential Revision: D10250141
fbshipit-source-id: 35258c7197c1ce0a1d3d0554fab2f2d2866e123c
Summary:
Make important modules public. Make internal utility (base16) private. Add
some text to the crate-level document. It just refers to important structures.
Will revise document of those structures.
Reviewed By: DurhamG, kulshrax
Differential Revision: D10250143
fbshipit-source-id: c79859ee7d3d9cc4ee9a093ef5d12ec6599f2a42
Summary: This is just the result of running `./contrib/fix-code.py $(hg files .)`
Reviewed By: ikostia
Differential Revision: D10213075
fbshipit-source-id: 88577c9b9588a5b44fcf1fe6f0082815dfeb363a
Summary:
The code block is not a valid Rust program. Mark it as "plain".
This fixes `cargo doc`.
Reviewed By: markbt
Differential Revision: D10137806
fbshipit-source-id: 1197d3a2ebc1450a0738686fa6cfa7c7b79dcb0d
Summary:
The primary log and indexes could be out of sync when mutating the indexes
errors out. In that case, mark the indexes as "corrupted" and refuse to
perform index read (lookup) operations, for correctness.
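A simplified sketch of the idea. The in-memory `Log` below and its `simulate_index_error` parameter are stand-ins for illustration, not the real structure:

```rust
use std::collections::HashMap;
use std::io;

/// Toy log: entries plus one index from entry content to position.
pub struct Log {
    entries: Vec<Vec<u8>>,
    index: HashMap<Vec<u8>, usize>,
    index_corrupted: bool,
}

impl Log {
    pub fn new() -> Self {
        Log { entries: Vec::new(), index: HashMap::new(), index_corrupted: false }
    }

    /// Append to the primary log, then update the index.
    /// `simulate_index_error` stands in for a real index write failure.
    pub fn append(&mut self, data: &[u8], simulate_index_error: bool) -> io::Result<()> {
        self.entries.push(data.to_vec());
        if simulate_index_error {
            // The log has the entry but the index does not: out of sync.
            self.index_corrupted = true;
            return Err(io::Error::new(io::ErrorKind::Other, "index update failed"));
        }
        self.index.insert(data.to_vec(), self.entries.len() - 1);
        Ok(())
    }

    /// Refuse lookups once the index may be out of sync with the log.
    pub fn lookup(&self, key: &[u8]) -> io::Result<Option<&[u8]>> {
        if self.index_corrupted {
            return Err(io::Error::new(io::ErrorKind::Other, "index is corrupted"));
        }
        Ok(self.index.get(key).map(|&i| self.entries[i].as_slice()))
    }
}
```

Returning an error is safer than returning a possibly incomplete lookup result.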
Reviewed By: DurhamG
Differential Revision: D8337689
fbshipit-source-id: 3db9006ea03cfcaba52391f189aa697944b616e5
Summary:
This demonstrates that index definitions can be reordered: as long
as their names do not change, things still work.
Reviewed By: DurhamG
Differential Revision: D8337688
fbshipit-source-id: 2fbbdf711d8edc10fc6d3314532390ea712aca6c
Summary:
This allows us to store arbitrary metadata in the root node. It will be used
by the `Log` structure to store how many bytes the index covers.
Reviewed By: DurhamG
Differential Revision: D8337687
fbshipit-source-id: 159a89d66765fc251a486fd62c1ffd01f625b503
Summary: Implement the dependencies of the "open" public API.
Reviewed By: DurhamG
Differential Revision: D8156518
fbshipit-source-id: 9fed441f520a3b74cbef5bfb815c82943c615fdf
Summary:
The read_entry function takes care of reading an entry from a given offset,
and returns internal stats like the real data offset (skipping the length and
checksum metadata) and the next entry offset.
It does integrity checks and handles offsets for both in-memory and on-disk
buffers. The offsets to in-memory entries are fairly simple - they start
from "meta.primary_len" instead of a fixed reserved value. This makes the
"next_offset" work seamlessly.
The public API won't have "offset" exposed, so the function is private.
Reviewed By: DurhamG
Differential Revision: D8156513
fbshipit-source-id: 8661f2f2757de6f3f94defc64f4a8dd5261973b2
Summary:
Partially implement the open, append, flush, and lookup APIs. This shows how things
work in general, such as how locking works and what's in-memory versus on-disk.
Reviewed By: DurhamG
Differential Revision: D8156514
fbshipit-source-id: 2de23dcde2f63895f3f3e4f67057aa9520fdfa34
Summary: Implemented as the file format specification added by the previous diff.
Reviewed By: DurhamG
Differential Revision: D8156516
fbshipit-source-id: 7153932b9442b3ab5bdb81490f88c40346128afc
Summary: The public interface and its dependencies.
Reviewed By: DurhamG
Differential Revision: D8156509
fbshipit-source-id: c6f3e4b88851683a5d8804b80f689282e3f582d4
Summary:
Without this change, code doing `index.get(...).values().collect()` might
end up with an infinite loop.
Reviewed By: DurhamG
Differential Revision: D8156510
fbshipit-source-id: 5497aa354de7d49cfc4308a025856608ce981a1e
Summary:
Previously, the index API optionally took a root offset. This is
inconvenient for the caller since they probably need to record both the
valid file length and the root offset. Since root nodes are always at
the end of the index, simplify the API to take a logical
file length instead of a root offset.
Reviewed By: DurhamG
Differential Revision: D8156512
fbshipit-source-id: 7029272a61c9990e6484bca7ebbff64e2233c6cd
Summary:
Previously, `mmap_readonly` always reads file length, and uses that for mmap
length. In many cases we do know the desired file length and it's cleaner to
not `mmap` unused bytes. So let's add a parameter to do that.
Note: The `stat` call is still needed, since `mmap` wouldn't return an error
if the requested length is greater than the file length.
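The length check can be sketched as below. `checked_mmap_len` is a hypothetical helper that only validates the length; the actual `mmap` call is left out:

```rust
use std::fs::File;
use std::io;

/// Decide how many bytes to map. `None` means "the whole file".
/// Hypothetical helper, not the real `mmap_readonly` signature.
pub fn checked_mmap_len(file: &File, len: Option<u64>) -> io::Result<u64> {
    // The stat call: needed even when the caller supplies a length.
    let actual = file.metadata()?.len();
    match len {
        None => Ok(actual),
        Some(wanted) if wanted <= actual => Ok(wanted),
        Some(wanted) => Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            format!("requested {} bytes but file has {}", wanted, actual),
        )),
    }
}
```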
Reviewed By: DurhamG
Differential Revision: D8156523
fbshipit-source-id: 991aa28f3542eaff24387dcc6a7302122fb6962f
Summary: The function will be reused in another module.
Reviewed By: DurhamG
Differential Revision: D8156522
fbshipit-source-id: 2aff6f2e4b8fc9b5d2c000e12ac2d940f7fab407