sapling

mirror of https://github.com/facebook/sapling.git synced 2024-10-12 17:58:27 +03:00

Author	SHA1	Message	Date
Jun Wu	991a9343b9	indexedlog: log: partially implement main APIs Summary: Partially implement open, append, flush, lookup APIs. This shows how things work in general, like how locking works. What's in-memory and what's on-disk etc. Reviewed By: DurhamG Differential Revision: D8156514 fbshipit-source-id: 2de23dcde2f63895f3f3e4f67057aa9520fdfa34	2018-06-11 19:36:15 -07:00
Jun Wu	529c79bd33	indexedlog: log: implement serialization for the meta file Summary: Implemented as the file format specification added by the previous diff. Reviewed By: DurhamG Differential Revision: D8156516 fbshipit-source-id: 7153932b9442b3ab5bdb81490f88c40346128afc	2018-06-11 19:36:15 -07:00
Jun Wu	97281caabf	indexedlog: log: define public facing interface Summary: The public interface and its dependencies. Reviewed By: DurhamG Differential Revision: D8156509 fbshipit-source-id: c6f3e4b88851683a5d8804b80f689282e3f582d4	2018-06-11 19:36:15 -07:00
Jun Wu	8ad9276975	indexedlog: log: add comments about the file format Summary: Start implementing the "Log" object. Let's define the file formats first. Reviewed By: DurhamG Differential Revision: D8156515 fbshipit-source-id: 037f7454452959f82583a4d97d3f38dfa60aa741	2018-06-11 19:36:14 -07:00
Jun Wu	c65612acc9	indexedlog: index: stop iteration if an error is encountered Summary: Without this change, code doing `index.get(...).values().collect()` might end up with an infinite loop. Reviewed By: DurhamG Differential Revision: D8156510 fbshipit-source-id: 5497aa354de7d49cfc4308a025856608ce981a1e	2018-06-05 00:12:29 -07:00
Jun Wu	798e55d53d	indexedlog: index: change APIs to take file lengths instead of root offsets Summary: Previously, the index API optionally took a root offset. This is inconvenient for the caller since they probably need to record both valid file length and root offsets. Since root nodes are always at the end of the index, let's just simplify the API to take a logical file length instead of a root offset. Reviewed By: DurhamG Differential Revision: D8156512 fbshipit-source-id: 7029272a61c9990e6484bca7ebbff64e2233c6cd	2018-06-05 00:12:29 -07:00
Jun Wu	68660cc443	indexedlog: utils: make `mmap_readonly` optionally take file length Summary: Previously, `mmap_readonly` always read the file length and used that as the mmap length. In many cases we do know the desired file length and it's cleaner to not `mmap` unused bytes. So let's add a parameter to do that. Note: The `stat` call is still needed, since `mmap` wouldn't return an error if the requested length is greater than the file length. Reviewed By: DurhamG Differential Revision: D8156523 fbshipit-source-id: 991aa28f3542eaff24387dcc6a7302122fb6962f	2018-06-05 00:12:29 -07:00
Jun Wu	c43312ad9c	indexedlog: utils: move xxhash to utils Summary: The function will be reused in another module. Reviewed By: DurhamG Differential Revision: D8156522 fbshipit-source-id: 2aff6f2e4b8fc9b5d2c000e12ac2d940f7fab407	2018-06-05 00:12:29 -07:00
Jun Wu	7b9867ac12	crates: pin rand to 0.4 version Summary: `rand` 0.5 has so many breaking changes that the code is not ready to migrate yet. So let's pin rand to 0.4. Ideally all dependencies in Cargo.toml should avoid using "*". But for now `rand` is the only troublemaker. Note `rand 0.4` is a dependency of `quickcheck 0.6.2` so it's available. Reviewed By: phillco, singhsrb Differential Revision: D8158406 fbshipit-source-id: 417ae6807a2efc650acb8d82370964fab6531fdb	2018-05-25 09:51:19 -07:00
Jun Wu	40a88364be	indexedlog: replace `div` with `shr` to make checksum faster Summary: Spot `div` slowness using Linux's `perf` tool. \| Disassembly of section .text: \| \| 0000000000018990 <indexedlog::checksum_table::ChecksumTable::check_range>: \| _ZN10indexedlog14checksum_table13ChecksumTable11check_range17h2303c96b1e035e20E(): 1.36 \| push %rax 0.18 \| mov %rdx,%r8 0.54 \| mov $0x1,%cl \| test %r8,%r8 \| je 60 0.54 \| add %rsi,%r8 0.72 \| cmp 0x30(%rdi),%r8 \| ja 64 0.27 \| mov 0x28(%rdi),%r9 0.27 \| test %r9,%r9 \| je 6a 0.36 \| add $0xffffffffffffffff,%r8 0.18 \| xor %edx,%edx 0.45 \| mov %rsi,%rax 0.36 \| div %r9 43.72 \| mov %rax,%rsi \| xor %edx,%edx \| mov %r8,%rax 0.18 \| div %r9 42.82 \| add $0x1,%rax 0.09 \| cmp %rax,%rsi \| jae 60 2.17 \| cmpq $0x0,0x60(%rdi) \| je 78 \| mov 0x50(%rdi),%rcx \| cmpb $0x0,(%rcx) 1.63 \| sete %cl 0.18 \| xchg %ax,%ax \|50: test $0x1,%cl \| je 64 0.45 \| add $0x1,%rsi 0.81 \| mov $0x1,%cl 0.09 \| cmp %rax,%rsi \| jb 50 \|60: mov %ecx,%eax \| pop %rcx 2.62 \| retq \|64: xor %ecx,%ecx \| mov %ecx,%eax \| pop %rcx \| retq \|6a: lea panic_loc.a.llvm.9800112514578621117,%rdi \| callq core::panicking::panic \| ud2 \|78: lea panic_bounds_check_loc.7.llvm.9800112514578621117,%rdi \| xor %esi,%esi \| xor %edx,%edx \| callq core::panicking::panic_bounds_check \| ud2 Change `chunk_size` to `chunk_size_log`. Replace `div` with `shr` to make it significantly faster: Before: index lookup (memory) 1.118 ms index lookup (disk, no verify) 2.078 ms index lookup (disk, verified) 7.687 ms After: index lookup (memory) 1.066 ms index lookup (disk, no verify) 1.992 ms index lookup (disk, verified) 3.591 ms Reviewed By: DurhamG, markbt Differential Revision: D7554992 fbshipit-source-id: c24189ced722d880af6ca0d64967eb762363d9e3	2018-04-17 18:54:39 -07:00
Jun Wu	f25c152d01	indexedlog: add a test about checksum Summary: Add a test that bitflips the index content, and make sure reading the index would trigger an error. Due to run-time performance difference, the release version tests 2-byte key while the debug version only tests 1-byte key. The header byte was not verified. Now it is verified. Reviewed By: DurhamG Differential Revision: D7517134 fbshipit-source-id: b3d8665ff4ac08c1a70db8d21122ba241913a2ed	2018-04-17 18:54:39 -07:00
Jun Wu	9ce455769c	indexedlog: avoid writing unused entries due to leaf split Summary: In "split_leaf" "Example 3" case, the old leaf entry (and its key) becomes unused. Writing them to disk is unnecessary. This patch adds "unused" marker so they could be marked and skipped inside flush(). No visible performance change: index insertion 3.710 ms index flush 3.717 ms index lookup (memory) 1.128 ms index lookup (disk, no verify) 1.993 ms index lookup (disk, verified) 7.866 ms Reviewed By: DurhamG Differential Revision: D7517139 fbshipit-source-id: 253c878bc4b3762382c424777dfa779b3868e851	2018-04-17 18:54:38 -07:00
Jun Wu	ac52e4a6fb	indexedlog: add a test against std hashmap for multi-values Summary: Since we now have the ability to store multiple values, add a test. Reviewed By: DurhamG Differential Revision: D7472880 fbshipit-source-id: 85b1c69245ac7f0c4702daf22a02f5e5072f0924	2018-04-13 21:51:46 -07:00
Jun Wu	de74642bc7	indexedlog: implement value iterator Summary: The value type is a linked list of u64 integers. Add an API to expose that. Using iterator framework has benefits about flexibility - the caller can take the first value, or convert it to a vector, or count the values, etc. easily. Reviewed By: DurhamG Differential Revision: D7472881 fbshipit-source-id: d31e81770e069734b54fa08729c0cd45a699aae2	2018-04-13 21:51:46 -07:00
Jun Wu	cc4193ba29	indexedlog: handle radix null child correctly Summary: This is caught by a later test. Looking up a non-existent child (jumptable value is 0) returns InvalidData error, while it should return Offset(0). The added if condition does not seem to have noticeable performance impact: index insertion 3.840 ms index flush 3.740 ms index lookup (memory) 1.085 ms index lookup (disk, no verify) 1.972 ms index lookup (disk, verified) 7.752 ms Reviewed By: DurhamG Differential Revision: D7472882 fbshipit-source-id: 1cc51e9afa248e123cca9c561d7bb2128fd898b1	2018-04-13 21:51:46 -07:00
Jun Wu	b82b0daab5	indexedlog: make LinkOffset also return next link offset Summary: Previously, the code was focusing on getting the hardest (index) part right, but less about the value part. There is no way to get all values in the linked list, as designed, yet. This diff starts the work. Similar to `KeyOffset::key_and_link_offset`, change the internal API of LinkOffset to return both value and the next link offset. Reviewed By: DurhamG Differential Revision: D7472879 fbshipit-source-id: 4a4512d7c63abbb667146de582e0f8cd04c9c04a	2018-04-13 21:51:46 -07:00
Jun Wu	b9b1f1e907	indexedlog: use OpenOptions Summary: `Index::open` now takes too many parameters, which is not very convenient to use. Inspired by `fs::OpenOptions`, use a dedicated struct for specifying open options. Motivation: To test checksum ability more confidently, I'd like to write something that randomly mutates 1 byte from a sane index. To make sure the checksum coverage is "correct", checksum chunk size is another parameter. Reviewed By: DurhamG Differential Revision: D7464182 fbshipit-source-id: 469ce7d1cfa5de3946028418567a9f3e2bc303fa	2018-04-13 21:51:46 -07:00
Jun Wu	6cb2b1dd23	indexedlog: make OffsetMap::get have no assumption about offset Summary: Address DurhamG's review comment on D7422832. Previously, `OffsetMap::get` expects a dirty offset. That's because it was changed from `HashMap` and we don't control `HashMap::get`. It's cleaner to let `OffsetMap` do the `is_dirty` check. Reviewed By: DurhamG Differential Revision: D7461707 fbshipit-source-id: 9f2abdf6c6f993d98d9443f16bafcc6154ee0dbb	2018-04-13 21:51:46 -07:00
Jun Wu	9787cfc15b	indexedlog: add more tests about leaf split Summary: The new test covers the `else` branch inside `LeafOffset::set_link` previously not covered. Coverage was checked by the following script: ``` from __future__ import absolute_import import glob import os import shutil os.system('cargo rustc --lib --profile test -- -Ccodegen-units=1 -Clink-dead-code -Zno-landing-pads') path = max((os.stat(path).st_mtime, path) for path in glob.glob('./target/debug/*-????????????????'))[1] shutil.rmtree('target/kcov') os.system('kcov --include-path $PWD/src --verify target/kcov %s' % path) ``` Reviewed By: DurhamG Differential Revision: D7446902 fbshipit-source-id: 293da2ff53b83c8f11534f0f8e5e7fd102216a01	2018-04-13 21:51:46 -07:00
Jun Wu	5209e8360b	indexedlog: support external keys Summary: Change `insert_advanced` to accept an enum that could be either a key, or an (offset, len) that refers to the external key buffer. Insertion becomes slower due to new flexibility overhead. For some reason, "index lookup (no verify)" becomes faster (restores pre-D7440248 performance): index insertion 6.434 ms index flush 3.757 ms index lookup (memory) 1.068 ms index lookup (disk, no verify) 1.969 ms index lookup (disk, verified) 7.805 ms With 2M 20-byte keys, the non-external key version generates a 105MB index: seconds operation 1.247 insert 0.622 flush 1.859 flush done 0.702 lookup (without checksum) 1.395 lookup (with checksum) Using external keys, the index is 70MB, and time for each operation: seconds operation 1.086 insert 0.702 flush 0.665 lookup (without checksums) 1.602 lookup (with checksums) The external key will have more space wins for longer keys, ex. file path. `Index` module was made public so `InsertKey` type is usable. Reviewed By: DurhamG Differential Revision: D7444907 fbshipit-source-id: b89d95246845799c2c55fb73ad203a7e6724b85e	2018-04-13 21:51:46 -07:00
Jun Wu	36dfda984c	indexedlog: relax leaf entry's key offset type Summary: Previously, a leaf entry can only have a `KeyOffset`. This diff makes it possible to be either `KeyOffset`, or `ExtKeyOffset`. The API didn't change much since `LeafOffset::key_and_link_offset` handles the difference transparently. Latest benchmark result: index insertion 4.879 ms index flush 3.620 ms index lookup (memory) 1.827 ms index lookup (disk, no verify) 3.508 ms index lookup (disk, verified) 7.861 ms Reviewed By: DurhamG Differential Revision: D7444909 fbshipit-source-id: 5441e1ae187d42931377d7213dcb77156b2af714	2018-04-13 21:51:46 -07:00
Jun Wu	44a0998bc6	indexedlog: let leaf entry return key content Summary: The leaf entry has a `key_and_link_offset` method. Previously it returned a `KeyOffset`. Since we now have `ExtKeyOffset`, it's friendlier to handle the key entry type difference at the leaf entry level, instead of requiring the caller to handle it. Reviewed By: DurhamG Differential Revision: D7444905 fbshipit-source-id: 56d87641a2a5a50ddca8b1e4c74c9aaa3891b542	2018-04-13 21:51:46 -07:00
Jun Wu	1294c1b471	indexedlog: add an "external key" entry type Summary: Previously, I thought there is only one index that will use "commit hash" as keys, that is the nodemap, and other indexes (like childmap) would just use shorter integer keys (ex. revision number, or offsets). So the space overhead of storing full keys only applies to one index and seems acceptable. But that implies strict topo order for the source of truth data (ex. to use integers as keys in childmap, you have to know how to translate parent revisions from hashes to integers at the time writing the revision). Thinking about it again, it seems the topo-order requirement would make a lot of things less flexible. It's much easier to just use hashes as keys in the index. Then it's worthwhile to address the space efficiency problem by introducing an "external key buffer" concept. That's actually what `radixbuf` does. This is the start. It adds the type to the struct. The feature is not completed yet. Reviewed By: DurhamG Differential Revision: D7444904 fbshipit-source-id: 60a83c9e6e8b0734450f0c5827928a7c5bd111d5	2018-04-13 21:51:45 -07:00
Jun Wu	5e828307f4	indexedlog: verify checksum for all reads Summary: It further slows down lookups, even when checksum is disabled, since even an `is_none()` check is not free: index insertion 4.697 ms index flush 3.764 ms index lookup (memory) 2.878 ms index lookup (disk, no verify) 3.564 ms index lookup (disk, verified) 7.788 ms The "verified" version basically needs 2x time due to more memory lookups. Unfortunately this means eventual lookup performance will be slower than gdbm, but insertion is still much faster. And the index still has better locking properties (lock-free read) that gdbm does not have. With correct time complexity (no O(len(changelog)) index-only operations for example), I'd expect it's rare for the overall performance to be bounded by index performance. Data integrity is more important. With a larger number of nodes, ex. 2M 20-byte strings: inserting to memory takes 1.4 seconds, flushing to disk takes 0.9 seconds, looking up without checksum takes 0.9 seconds, looking up with checksum takes 1.7 seconds. Reviewed By: DurhamG Differential Revision: D7440248 fbshipit-source-id: 020e5204606f9f0a4f68843a491009a6a6f75751	2018-04-13 21:51:42 -07:00
Jun Wu	ca8f60eb0a	indexedlog: verify checksum for type bytes Summary: This is in the critical path for lookup, and has very visible performance penalty: index insertion 3.923 ms index flush 3.921 ms index lookup (memory) 1.070 ms index lookup (disk, no verify) 1.980 ms index lookup (disk, verified) 5.206 ms Reviewed By: DurhamG Differential Revision: D7440252 fbshipit-source-id: 49540f974faff1cdd0603a72328f141ccd054ee2	2018-04-13 21:51:42 -07:00
Jun Wu	55fc90dfea	indexedlog: verify checksum for `Mem` structs Summary: Previously checksum is only for `MemRoot`, now it's for all `Mem` structs. Since `Mem` structs are not frequently used in the normal lookup code path, there is no visible performance change. Reviewed By: DurhamG Differential Revision: D7440253 fbshipit-source-id: 945f5a8c38d228f59190a487b0cf6dbc5daac4f7	2018-04-13 21:51:42 -07:00
Jun Wu	a7e3e7884d	indexedlog: add a type alias for `Option<ChecksumTable>` Summary: The type will be used all over the place and may make `rustfmt` wrap lines. Use a shorter type to make it slightly cleaner. Reviewed By: DurhamG Differential Revision: D7436338 fbshipit-source-id: ecaada23916a22658f65669b748632a077e60df2	2018-04-13 21:51:42 -07:00
Jun Wu	bfd8e33370	indexedlog: verify checksum for root entry Summary: This only affects `Index::open` right now. So it's a one time check and does not affect performance. Reviewed By: DurhamG Differential Revision: D7436341 fbshipit-source-id: 30313064bf2ea50320ac744fc18c03bff4b12c89	2018-04-13 21:51:42 -07:00
Jun Wu	a0cec9853c	indexedlog: add checksum table to index struct Summary: Add `ChecksumTable` to the `Index` struct. But it's not functional yet. The checksum will mainly affect "index lookup (disk)" case. Add another benchmark for showing the difference with checksum on and off. They do not have much difference right now: index insertion 3.756 ms index flush 3.469 ms index lookup (memory) 0.990 ms index lookup (disk, no verify) 1.768 ms index lookup (disk, verified) 1.766 ms Reviewed By: DurhamG Differential Revision: D7436339 fbshipit-source-id: 60a6554a2c96067a53ce9e1753cd51d0d61c0bea	2018-04-13 21:51:42 -07:00
Jun Wu	8d7d4de8ee	indexedlog: separate benchmarks Summary: The minibench framework does not provide benchmark filtering. So let's separate benchmarks using different entry points. Reviewed By: DurhamG Differential Revision: D7440250 fbshipit-source-id: 11e7790a5074ebf4c08e33c312a490a66a921926	2018-04-13 21:51:42 -07:00
Jun Wu	d86adc417e	indexedlog: remove "index clone" benchmarks Summary: The "clone" benchmarks were added to be subtracted from "lookup" to workaround the test framework limitation. The new minibench framework makes it easier to exclude preparation cost. Therefore the clone benchmarks are no longer needed. index insertion 3.881 ms index flush 3.286 ms index lookup (memory) 0.928 ms index lookup (disk) 1.685 ms "index lookup (memory)" is basically "index lookup (memory)" minus "index clone (memory)" in previous benchmarks. Reviewed By: DurhamG Differential Revision: D7440251 fbshipit-source-id: 0e6a1fb7ee64f9a393ee9ada4db6e6eb052e20bf	2018-04-13 21:51:42 -07:00
Jun Wu	9b9dd289e4	indexedlog: use minibench to do benchmark Summary: See the previous minibench diff for the motivation. "failure" was removed from build dependencies since it's not used yet. Run benchmark a few times. It seems the first several items are less stable due to possibly warming up issues. Otherwise the result looks good enough. The test also compiles and runs much faster. ``` base16 iterating 1M bytes 0.921 ms index insertion 4.804 ms index flush 5.104 ms index lookup (memory) 2.929 ms index lookup (disk) 1.767 ms index clone (memory) 2.036 ms index clone (disk) 0.010 ms base16 iterating 1M bytes 0.853 ms index insertion 4.512 ms index flush 4.717 ms index lookup (memory) 2.907 ms index lookup (disk) 1.755 ms index clone (memory) 1.856 ms index clone (disk) 0.010 ms base16 iterating 1M bytes 1.525 ms index insertion 4.577 ms index flush 4.901 ms index lookup (memory) 2.800 ms index lookup (disk) 1.790 ms index clone (memory) 1.794 ms index clone (disk) 0.010 ms base16 iterating 1M bytes 0.768 ms index insertion 4.486 ms index flush 4.918 ms index lookup (memory) 2.658 ms index lookup (disk) 1.721 ms index clone (memory) 1.763 ms index clone (disk) 0.010 ms base16 iterating 1M bytes 0.732 ms index insertion 4.489 ms index flush 4.792 ms index lookup (memory) 2.689 ms index lookup (disk) 1.739 ms index clone (memory) 1.850 ms index clone (disk) 0.009 ms base16 iterating 1M bytes 1.124 ms index insertion 7.188 ms index flush 4.888 ms index lookup (memory) 2.829 ms index lookup (disk) 1.609 ms index clone (memory) 2.642 ms index clone (disk) 0.010 ms base16 iterating 1M bytes 1.055 ms index insertion 4.683 ms index flush 4.996 ms index lookup (memory) 2.782 ms index lookup (disk) 1.710 ms index clone (memory) 1.802 ms index clone (disk) 0.009 ms ``` Reviewed By: DurhamG Differential Revision: D7440249 fbshipit-source-id: 0f946ab184455acd40c5a38cf46ff94d9e3755c8	2018-04-13 21:51:42 -07:00
Jun Wu	8bcff92cab	indexedlog: use a dedicated map type for offset translation Summary: The dirty -> non-dirty offset mapping can be optimized using a dedicated "map" type that is backed by `vec`s, because dirty offsets are continuous per type. This makes "flush" significantly faster: ``` index flush time: [5.8808 ms 6.1800 ms 6.4813 ms] change: [-62.250% -59.481% -56.325%] (p = 0.00 < 0.05) Performance has improved. ``` Reviewed By: DurhamG Differential Revision: D7422832 fbshipit-source-id: 9ab8a70d1663155941dae5b4f02f7452f5e3cadf	2018-04-13 21:51:42 -07:00
Jun Wu	00503a6d94	indexedlog: avoid a memory allocation Summary: It seems to improve the performance a bit: ``` index insertion time: [5.4643 ms 5.6818 ms 5.9188 ms] change: [-24.526% -17.384% -10.315%] (p = 0.00 < 0.05) Performance has improved. ``` Reviewed By: DurhamG Differential Revision: D7422831 fbshipit-source-id: fc1c72f402258db7e189cd8724583757d48affb7	2018-04-13 21:51:42 -07:00
Jun Wu	4cb2cc1abb	indexedlog: use Box<[u8]> instead of Vec<u8> Summary: For key entries, the key is immutable once stored. So just use `Box<[u8]>`. It saves a `usize` per entry. On 64-bit platform, that's a lot. Performance is slightly improved and it catches up with D7404532 before typed offset refactoring now: index insertion time: [6.1852 ms 6.6598 ms 7.2433 ms] index flush time: [15.814 ms 16.538 ms 17.235 ms] index lookup (memory) time: [3.7636 ms 3.9403 ms 4.1424 ms] index lookup (disk) time: [1.9413 ms 2.0366 ms 2.1325 ms] index clone (memory) time: [2.6952 ms 2.9221 ms 3.0968 ms] index clone (disk) time: [5.0296 us 5.2862 us 5.5629 us] Reviewed By: DurhamG Differential Revision: D7422837 fbshipit-source-id: 4aabfdc028aefb8e796803e103f0b2e4965f84e6	2018-04-13 21:51:42 -07:00
Jun Wu	36793b7c14	indexedlog: simplify `insert_advanced` API Summary: Previously, both `value` and `link` are optional in `insert_advanced`. This diff makes `value` required. `maybe_create_link_entry` becomes unused and removed. No visible performance change. Reviewed By: DurhamG Differential Revision: D7422838 fbshipit-source-id: 8d7d3cc1cc325f6fea7e8ce996d0a43d3ee49839	2018-04-13 21:51:41 -07:00
Jun Wu	892fcd6dfd	indexedlog: use typed offsets Summary: This is a large refactoring that replaces `u64` offsets with strongly typed ones. Tests about serialization are removed since they generate illegal data that cannot pass type check. It seems to slow down the code a bit, compared with D7404532. But there is still room to improve. index insertion time: [6.9395 ms 7.3863 ms 7.7620 ms] index flush time: [15.949 ms 17.965 ms 20.246 ms] index lookup (memory) time: [3.6212 ms 3.8855 ms 4.1923 ms] index lookup (disk) time: [2.2496 ms 2.4649 ms 2.8090 ms] index clone (memory) time: [2.7292 ms 2.9399 ms 3.2055 ms] index clone (disk) time: [4.9239 us 5.5928 us 6.3167 us] Reviewed By: DurhamG Differential Revision: D7422833 fbshipit-source-id: 7357cb0f4f573f620e829c5e300cd423619dbd62	2018-04-13 21:51:41 -07:00
Jun Wu	a87fea077c	indexedlog: prefix in-memory entries with `Mem` Summary: This makes it clear the code has different code paths for on-disk entries. Reviewed By: DurhamG Differential Revision: D7422836 fbshipit-source-id: 018fa0e2c20682d4e1beba99f3307550e1f40388	2018-04-13 21:51:40 -07:00
Jun Wu	3332522d43	indexedlog: add some benchmarks Summary: Add benchmarks inserting / looking up 20K entries. Benchmark results on my laptop are: index insertion time: [6.5339 ms 6.8174 ms 7.1805 ms] index flush time: [15.651 ms 16.103 ms 16.537 ms] index lookup (memory) time: [3.6995 ms 4.0252 ms 4.3046 ms] index lookup (disk) time: [1.9986 ms 2.1224 ms 2.2464 ms] index clone (memory) time: [2.5943 ms 2.6866 ms 2.7749 ms] index clone (disk) time: [5.2302 us 5.5477 us 5.9518 us] Comparing with highly optimized radixbuf: index insertion time: [991.89 us 1.1708 ms 1.3844 ms] index lookup time: [863.83 us 945.69 us 1.0304 ms] Insertion takes 6x time. Lookup from memory takes 1.4x time, from disk takes 2.2x time. Flushing is the slowest - it needs 16x radixbuf insertion time. Note: need to subtract "clone" time from "lookup" to get meaningful values about "lookup". This cannot be done automatically due to the limitation of the benchmark framework. Although it's slower than radixbuf, the index is still faster than gdbm and rocksdb. Note: the index does less than gdbm/rocksdb since it does not return a `[u8]`-ish which requires extra lookups. So it's not a very fair comparison. gdbm insertion time: [69.607 ms 75.102 ms 79.334 ms] gdbm lookup time: [9.0855 ms 9.8480 ms 10.637 ms] gdbm prepare time: [110.35 us 120.40 us 135.63 us] rocksdb insertion time: [117.96 ms 123.42 ms 127.85 ms] rocksdb lookup time: [24.413 ms 26.147 ms 28.153 ms] rocksdb prepare time: [3.8316 ms 4.1776 ms 4.5039 ms] Note: Subtract "prepare" from "insertion" to get meaningful values. 
Code to benchmark rocksdb and gdbm: ``` extern crate criterion; extern crate gnudbm; extern crate rand; extern crate rocksdb; extern crate tempdir; use criterion::Criterion; use gnudbm::GdbmOpener; use rand::{ChaChaRng, Rng}; use rocksdb::DB; use tempdir::TempDir; const N: usize = 20480; /// Generate random buffer fn gen_buf(size: usize) -> Vec<u8> { let mut buf = vec![0u8; size]; ChaChaRng::new_unseeded().fill_bytes(buf.as_mut()); buf } fn criterion_benchmark(c: &mut Criterion) { c.bench_function("rocksdb prepare", \|b\| { b.iter(move \|\| { let dir = TempDir::new("index").expect("TempDir::new"); let _db = DB::open_default(dir.path().join("a")).unwrap(); }); }); c.bench_function("rocksdb insertion", \|b\| { let buf = gen_buf(N * 20); b.iter(move \|\| { let dir = TempDir::new("index").expect("TempDir::new"); let db = DB::open_default(dir.path().join("a")).unwrap(); for i in 0..N { db.put(&&buf[20 * i..20 * (i + 1)], b"v").unwrap(); } }); }); c.bench_function("rocksdb lookup", \|b\| { let dir = TempDir::new("index").expect("TempDir::new"); let db = DB::open_default(dir.path().join("a")).unwrap(); let buf = gen_buf(N * 20); for i in 0..N { db.put(&&buf[20 * i..20 * (i + 1)], b"v").unwrap(); } b.iter(move \|\| { for i in 0..N { db.get(&&buf[20 * i..20 * (i + 1)]).unwrap(); } }); }); c.bench_function("gdbm prepare", \|b\| { let buf = gen_buf(N * 20); b.iter(move \|\| { let dir = TempDir::new("index").expect("TempDir::new"); let _db = GdbmOpener::new().create(true).readwrite(dir.path().join("a")).unwrap(); }); }); c.bench_function("gdbm insertion", \|b\| { let buf = gen_buf(N * 20); b.iter(move \|\| { let dir = TempDir::new("index").expect("TempDir::new"); let mut db = GdbmOpener::new().create(true).readwrite(dir.path().join("a")).unwrap(); for i in 0..N { db.store(&&buf[20 * i..20 * (i + 1)], b"v").unwrap(); } }); }); c.bench_function("gdbm lookup", \|b\| { let dir = TempDir::new("index").expect("TempDir::new"); let mut db = 
GdbmOpener::new().create(true).readwrite(dir.path().join("a")).unwrap(); let buf = gen_buf(N * 20); for i in 0..N { db.store(&&buf[20 * i..20 * (i + 1)], b"v").unwrap(); } b.iter(move \|\| { for i in 0..N { db.fetch(&&buf[20 * i..20 * (i + 1)]).unwrap(); } }); }); } criterion_group!{ name=benches; config=Criterion::default().sample_size(20); targets=criterion_benchmark } criterion_main!(benches); ``` Reviewed By: DurhamG Differential Revision: D7404532 fbshipit-source-id: ff39f520b78ad1b71eb36970506b313bb2ff426b	2018-04-13 21:51:40 -07:00
Jun Wu	5576402ea9	indexedlog: add ability to clone a `Index` object Summary: This will be useful for benchmarks - prepare an index as a template, and clone it in the tests. Reviewed By: DurhamG Differential Revision: D7422835 fbshipit-source-id: 190bbdee7cb7c1526274b4d4dab07af4984b5df6	2018-04-13 21:51:40 -07:00
Jun Wu	2f30189748	indexedlog: reorder "use"s Summary: The latest rustfmt disagrees about the order of `std::io` imports. Move the troublesome line to a separate group so both the old and new rustfmt agree on the format. Reviewed By: DurhamG Differential Revision: D7422834 fbshipit-source-id: 9f5289ef2af1a691559fe691e121190f6d845162	2018-04-13 21:51:40 -07:00
Jun Wu	9672c45582	indexedlog: add a test comparing with std HashMap Reviewed By: DurhamG Differential Revision: D7404529 fbshipit-source-id: a52da9aa9661b48eefc015ce351886677f842d66	2018-04-13 21:51:40 -07:00
Jun Wu	9077cbb5a7	indexedlog: reverse the writing order of radix entries Summary: Radix entries need to be written in the reverse of the order in which they were added to the vector. Reviewed By: DurhamG Differential Revision: D7404530 fbshipit-source-id: 403189b5c0fa6f21183e62eea04ce4ce7c4e1129	2018-04-13 21:51:40 -07:00
Jun Wu	2075ad87c2	indexedlog: implement leaf splitting Summary: Complete the insertion interface. Reviewed By: DurhamG Differential Revision: D7377210 fbshipit-source-id: 96645ac03a3fd65f22d9a9a54d8479715f49e67d	2018-04-13 21:51:39 -07:00
Jun Wu	a436d0554d	indexedlog: add more helper methods Summary: Those little read and write helpers are used in the next diff. Reviewed By: DurhamG Differential Revision: D7377214 fbshipit-source-id: c6e2d240334c11a0b08b15cd7d5c114b6f4d8ace	2018-04-13 21:51:39 -07:00
Jun Wu	61bf1f3854	indexedlog: add a helper function to get key content Summary: Add a helper function `peek_key_entry_content` that checks the key type and returns the key content. Reviewed By: DurhamG Differential Revision: D7377211 fbshipit-source-id: 0ce509aba30309373a709cf5fbcb909dd80471dc	2018-04-13 21:51:39 -07:00
Jun Wu	bf55572f78	indexedlog: partially implement insertion Summary: Implement insertion when there is no need to split a leaf entry. The API may be subject to change if we want other value types. For now, it's better to get something working and can be benchmarked so we have data about performance impact with new format changes. Reviewed By: DurhamG Differential Revision: D7343423 fbshipit-source-id: 9761f72168046dbafcb00883634aa7ad513a522b	2018-04-13 21:51:39 -07:00
Jun Wu	2389fd95c0	indexedlog: add helper methods about writing data Summary: Like the `peek_` family of helper methods, these methods handle writing data for both dirty (in-memory) and non-dirty (on-disk) cases. They will be used in the next diff. Reviewed By: DurhamG Differential Revision: D7377208 fbshipit-source-id: f458a20da4bb7808f37daeed3077be2f7e90a9df	2018-04-13 21:51:39 -07:00
Jun Wu	cb58628046	indexedlog: add debug formatter Summary: Add code to print out Index's on-disk and in-memory entries in human-friendly form. This is useful for explaining its internal state, so it could be used in tests. Reviewed By: DurhamG Differential Revision: D7343427 fbshipit-source-id: 706a35404ea42c413657b389166729f8dd1315a3	2018-04-13 21:51:39 -07:00
Jun Wu	a3f7ec3f9b	indexedlog: fix root entry serialization Summary: Offset stored in it needs to be translated, as done in other types of entries. I forgot it. Reviewed By: DurhamG Differential Revision: D7404528 fbshipit-source-id: fb09a9c3052ddfe8f8016440290062084d5d8b03	2018-04-13 21:51:39 -07:00
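
Note on 40a88364be (replace `div` with `shr`): the idea behind storing `chunk_size_log` instead of `chunk_size` can be sketched roughly as below. This is only an illustration under assumptions, not the crate's actual code: the function names `chunk_range_div` and `chunk_range_shr` are hypothetical, the chunk size is assumed to be a power of two, and overflow of `offset + len` is not handled.

```rust
use std::ops::Range;

/// Hypothetical helper (not the real indexedlog API): indexes of the
/// checksum chunks touched by the byte range [offset, offset + len),
/// computed with integer division. `chunk_size` must be non-zero.
fn chunk_range_div(offset: u64, len: u64, chunk_size: u64) -> Range<u64> {
    if len == 0 {
        return 0..0; // empty byte range touches no chunk
    }
    let start = offset / chunk_size;
    let end = (offset + len - 1) / chunk_size + 1;
    start..end
}

/// Same computation, assuming the chunk size is a power of two stored as
/// its base-2 logarithm, so each division becomes a right shift.
fn chunk_range_shr(offset: u64, len: u64, chunk_size_log: u32) -> Range<u64> {
    if len == 0 {
        return 0..0; // empty byte range touches no chunk
    }
    let start = offset >> chunk_size_log;
    let end = ((offset + len - 1) >> chunk_size_log) + 1;
    start..end
}
```

For a power-of-two chunk size the two functions agree, for example `chunk_range_div(15, 2, 16)` and `chunk_range_shr(15, 2, 4)` both cover chunks 0 and 1. The shift form avoids the two `div` instructions that dominate the `perf` profile quoted in that commit.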

67 Commits