Summary:
This is a large refactoring that replaces `u64` offsets with strongly typed
ones.
Tests about serialization are removed since they generate illegal data that
cannot pass the type check.
It seems to slow down the code a bit compared with D7404532, but there is
still room to improve.
index insertion time: [6.9395 ms 7.3863 ms 7.7620 ms]
index flush time: [15.949 ms 17.965 ms 20.246 ms]
index lookup (memory) time: [3.6212 ms 3.8855 ms 4.1923 ms]
index lookup (disk) time: [2.2496 ms 2.4649 ms 2.8090 ms]
index clone (memory) time: [2.7292 ms 2.9399 ms 3.2055 ms]
index clone (disk) time: [4.9239 us 5.5928 us 6.3167 us]
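The strongly typed offsets follow the Rust newtype pattern; a minimal sketch of the idea (`RadixOffset`, `LeafOffset`, and `follow_radix` are hypothetical names, not the actual types in this diff):

```rust
// Newtype wrappers give each offset kind its own type, so mixing
// them up becomes a compile-time error instead of corrupt data.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct RadixOffset(u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct LeafOffset(u64);

impl RadixOffset {
    fn to_u64(self) -> u64 {
        self.0
    }
}

impl LeafOffset {
    fn to_u64(self) -> u64 {
        self.0
    }
}

/// Only accepts a RadixOffset; passing a LeafOffset (or a raw u64)
/// does not compile, which is the point of the refactoring.
fn follow_radix(offset: RadixOffset) -> u64 {
    offset.to_u64()
}

fn main() {
    let radix = RadixOffset(16);
    let leaf = LeafOffset(32);
    assert_eq!(follow_radix(radix), 16);
    // follow_radix(leaf); // compile error: expected RadixOffset
    assert_eq!(leaf.to_u64(), 32);
    println!("ok");
}
```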
Reviewed By: DurhamG
Differential Revision: D7422833
fbshipit-source-id: 7357cb0f4f573f620e829c5e300cd423619dbd62
Summary: This makes it clear the code has different code paths for on-disk entries.
Reviewed By: DurhamG
Differential Revision: D7422836
fbshipit-source-id: 018fa0e2c20682d4e1beba99f3307550e1f40388
Summary:
Add benchmarks inserting / looking up 20K entries.
Benchmark results on my laptop are:
index insertion time: [6.5339 ms 6.8174 ms 7.1805 ms]
index flush time: [15.651 ms 16.103 ms 16.537 ms]
index lookup (memory) time: [3.6995 ms 4.0252 ms 4.3046 ms]
index lookup (disk) time: [1.9986 ms 2.1224 ms 2.2464 ms]
index clone (memory) time: [2.5943 ms 2.6866 ms 2.7749 ms]
index clone (disk) time: [5.2302 us 5.5477 us 5.9518 us]
Compared with the highly optimized radixbuf:
index insertion time: [991.89 us 1.1708 ms 1.3844 ms]
index lookup time: [863.83 us 945.69 us 1.0304 ms]
Insertion takes 6x the time. Lookup takes 1.4x from memory and 2.2x from
disk. Flushing is the slowest: it takes 16x the radixbuf insertion time.
Note: subtract the "clone" time from "lookup" to get meaningful "lookup"
values. This cannot be done automatically due to a limitation of the
benchmark framework.
Although it's slower than radixbuf, the index is still faster than gdbm and
rocksdb. Note: the index does less work than gdbm/rocksdb since it does not
return a `[u8]`-like value, which would require extra lookups, so it's not an
entirely fair comparison.
gdbm insertion time: [69.607 ms 75.102 ms 79.334 ms]
gdbm lookup time: [9.0855 ms 9.8480 ms 10.637 ms]
gdbm prepare time: [110.35 us 120.40 us 135.63 us]
rocksdb insertion time: [117.96 ms 123.42 ms 127.85 ms]
rocksdb lookup time: [24.413 ms 26.147 ms 28.153 ms]
rocksdb prepare time: [3.8316 ms 4.1776 ms 4.5039 ms]
Note: Subtract "prepare" from "insertion" to get meaningful values.
Code to benchmark rocksdb and gdbm:
```
#[macro_use]
extern crate criterion;
extern crate gnudbm;
extern crate rand;
extern crate rocksdb;
extern crate tempdir;

use criterion::Criterion;
use gnudbm::GdbmOpener;
use rand::{ChaChaRng, Rng};
use rocksdb::DB;
use tempdir::TempDir;

const N: usize = 20480;

/// Generate a random buffer
fn gen_buf(size: usize) -> Vec<u8> {
    let mut buf = vec![0u8; size];
    ChaChaRng::new_unseeded().fill_bytes(buf.as_mut());
    buf
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("rocksdb prepare", |b| {
        b.iter(move || {
            let dir = TempDir::new("index").expect("TempDir::new");
            let _db = DB::open_default(dir.path().join("a")).unwrap();
        });
    });
    c.bench_function("rocksdb insertion", |b| {
        let buf = gen_buf(N * 20);
        b.iter(move || {
            let dir = TempDir::new("index").expect("TempDir::new");
            let db = DB::open_default(dir.path().join("a")).unwrap();
            for i in 0..N {
                db.put(&buf[20 * i..20 * (i + 1)], b"v").unwrap();
            }
        });
    });
    c.bench_function("rocksdb lookup", |b| {
        let dir = TempDir::new("index").expect("TempDir::new");
        let db = DB::open_default(dir.path().join("a")).unwrap();
        let buf = gen_buf(N * 20);
        for i in 0..N {
            db.put(&buf[20 * i..20 * (i + 1)], b"v").unwrap();
        }
        b.iter(move || {
            for i in 0..N {
                db.get(&buf[20 * i..20 * (i + 1)]).unwrap();
            }
        });
    });
    c.bench_function("gdbm prepare", |b| {
        b.iter(move || {
            let dir = TempDir::new("index").expect("TempDir::new");
            let _db = GdbmOpener::new()
                .create(true)
                .readwrite(dir.path().join("a"))
                .unwrap();
        });
    });
    c.bench_function("gdbm insertion", |b| {
        let buf = gen_buf(N * 20);
        b.iter(move || {
            let dir = TempDir::new("index").expect("TempDir::new");
            let mut db = GdbmOpener::new()
                .create(true)
                .readwrite(dir.path().join("a"))
                .unwrap();
            for i in 0..N {
                db.store(&buf[20 * i..20 * (i + 1)], b"v").unwrap();
            }
        });
    });
    c.bench_function("gdbm lookup", |b| {
        let dir = TempDir::new("index").expect("TempDir::new");
        let mut db = GdbmOpener::new()
            .create(true)
            .readwrite(dir.path().join("a"))
            .unwrap();
        let buf = gen_buf(N * 20);
        for i in 0..N {
            db.store(&buf[20 * i..20 * (i + 1)], b"v").unwrap();
        }
        b.iter(move || {
            for i in 0..N {
                db.fetch(&buf[20 * i..20 * (i + 1)]).unwrap();
            }
        });
    });
}

criterion_group! {
    name = benches;
    config = Criterion::default().sample_size(20);
    targets = criterion_benchmark
}
criterion_main!(benches);
```
Reviewed By: DurhamG
Differential Revision: D7404532
fbshipit-source-id: ff39f520b78ad1b71eb36970506b313bb2ff426b
Summary:
This will be useful for benchmarks - prepare an index as a template, and
clone it in the tests.
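The pattern is to do the expensive setup once outside the timed loop and clone the result inside it; a generic sketch (using a plain `HashMap` as a stand-in for the index, with hypothetical helper names):

```rust
use std::collections::HashMap;

/// Build the expensive "template" once, outside the timed loop.
fn prepare_template() -> HashMap<u64, u64> {
    (0..1000u64).map(|i| (i, i * 2)).collect()
}

fn main() {
    let template = prepare_template();
    // Each "benchmark iteration" starts from a cheap clone of the
    // template instead of rebuilding it from scratch.
    for _ in 0..3 {
        let mut index = template.clone();
        index.insert(9999, 0); // mutate the clone freely
        assert_eq!(index.len(), 1001);
    }
    // The template itself stays pristine across iterations.
    assert_eq!(template.len(), 1000);
    println!("ok");
}
```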
Reviewed By: DurhamG
Differential Revision: D7422835
fbshipit-source-id: 190bbdee7cb7c1526274b4d4dab07af4984b5df6
Summary:
The latest rustfmt disagrees about the order of `std::io` imports. Move the
troublesome line to a separate group so both the old and new rustfmt agree
on the format.
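This relies on rustfmt sorting imports only within a contiguous group; a blank line starts a new group, so both versions leave the layout alone. The exact import lines below are illustrative, not the ones in this diff:

```
use std::io;

use std::io::{Read, Write};
```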
Reviewed By: DurhamG
Differential Revision: D7422834
fbshipit-source-id: 9f5289ef2af1a691559fe691e121190f6d845162
Summary:
Radix entries need to be written in reverse order relative to the order they
are added to the vector.
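The reason, roughly: an entry can reference entries added after it, so serializing the vector back-to-front means a referenced entry's final file offset is already known when the referencing entry is written. A simplified sketch with a hypothetical, toy entry shape:

```rust
#[derive(Debug)]
struct Entry {
    child: Option<usize>, // vector index of a later-added entry
    payload: u8,
}

/// Serialize back-to-front: by the time entry `i` is written, every
/// entry it can reference (added after it, so with a larger index)
/// already has a final offset. Offsets are u8 here purely for brevity.
fn flush(entries: &[Entry]) -> Vec<u8> {
    let mut out: Vec<u8> = Vec::new();
    let mut offsets = vec![u8::MAX; entries.len()];
    for i in (0..entries.len()).rev() {
        offsets[i] = out.len() as u8;
        let child_offset = entries[i].child.map(|c| offsets[c]).unwrap_or(u8::MAX);
        out.push(entries[i].payload);
        out.push(child_offset); // already resolved, never a placeholder
    }
    out
}

fn main() {
    // Entry 0 references entry 1, which was added later.
    let entries = vec![
        Entry { child: Some(1), payload: b'a' },
        Entry { child: None, payload: b'b' },
    ];
    let buf = flush(&entries);
    // Entry 1 is written first at offset 0; entry 0 follows at
    // offset 2 and references entry 1 by its final offset 0.
    assert_eq!(buf, vec![b'b', 255, b'a', 0]);
    println!("ok");
}
```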
Reviewed By: DurhamG
Differential Revision: D7404530
fbshipit-source-id: 403189b5c0fa6f21183e62eea04ce4ce7c4e1129
Summary: Those little read and write helpers are used in the next diff.
Reviewed By: DurhamG
Differential Revision: D7377214
fbshipit-source-id: c6e2d240334c11a0b08b15cd7d5c114b6f4d8ace
Summary:
Add a helper function `peek_key_entry_content` that checks the key type and
returns the key content.
Reviewed By: DurhamG
Differential Revision: D7377211
fbshipit-source-id: 0ce509aba30309373a709cf5fbcb909dd80471dc
Summary:
Implement insertion when there is no need to split a leaf entry.
The API may be subject to change if we want other value types. For now, it's
better to get something working that can be benchmarked, so we have data
about the performance impact of new format changes.
Reviewed By: DurhamG
Differential Revision: D7343423
fbshipit-source-id: 9761f72168046dbafcb00883634aa7ad513a522b
Summary:
Like the `peek_` family of helper methods, these methods handle writing
data for both dirty (in-memory) and non-dirty (on-disk) cases. They will
be used in the next diff.
Reviewed By: DurhamG
Differential Revision: D7377208
fbshipit-source-id: f458a20da4bb7808f37daeed3077be2f7e90a9df
Summary:
Add code to print out Index's on-disk and in-memory entries in
human-friendly form. This is useful for explaining its internal state, so it
could be used in tests.
Reviewed By: DurhamG
Differential Revision: D7343427
fbshipit-source-id: 706a35404ea42c413657b389166729f8dd1315a3
Summary:
The offset stored in it needs to be translated, as is done for other types
of entries. I forgot it.
Reviewed By: DurhamG
Differential Revision: D7404528
fbshipit-source-id: fb09a9c3052ddfe8f8016440290062084d5d8b03
Summary:
This is a low-level API that follows the base16 sequence of a key and
returns the potentially matched `LinkOffset`.
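Following a base16 sequence amounts to walking a radix tree with 16 children per node, one step per nibble; a minimal sketch of the walk (node layout and names are hypothetical, and a plain `u64` stands in for `LinkOffset`):

```rust
struct RadixNode {
    children: [Option<usize>; 16], // indices into the `nodes` slice
    link: Option<u64>,             // stand-in for LinkOffset
}

/// Follow the key's nibbles (high nibble first) from the root.
/// Returns the link at the final node, or None if the path is absent.
fn lookup(nodes: &[RadixNode], key: &[u8]) -> Option<u64> {
    let mut cur = 0usize; // root is node 0
    for &byte in key {
        for nibble in [byte >> 4, byte & 0xf] {
            cur = nodes[cur].children[nibble as usize]?;
        }
    }
    nodes[cur].link
}

fn main() {
    let mut nodes = vec![
        RadixNode { children: [None; 16], link: None }, // root
        RadixNode { children: [None; 16], link: None },
        RadixNode { children: [None; 16], link: Some(42) },
    ];
    // Wire up the path for key [0xab]: root --a--> 1 --b--> 2.
    nodes[0].children[0xa] = Some(1);
    nodes[1].children[0xb] = Some(2);
    assert_eq!(lookup(&nodes, &[0xab]), Some(42));
    assert_eq!(lookup(&nodes, &[0xac]), None);
    println!("ok");
}
```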
Reviewed By: DurhamG
Differential Revision: D7343424
fbshipit-source-id: 38f260064d1a23695a28dda6f7dc921f88c7fccc
Summary:
Add a bunch of helper methods to "peek" data inside all kinds of entries.
They will be used in the next diff.
The benefit of these helper methods is that they handle both dirty offsets
and non-dirty offsets transparently. Previously I tried always parsing
on-disk entries into in-memory ones and storing them in a hashmap cache,
but that turned out to have too much overhead, so always reading from disk
is preferable. It seemed to provide at least a 2x perf improvement in my
previous quick test.
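One common way to make such helpers transparent is to tag offsets so a single "peek" can dispatch to either the in-memory buffer or the on-disk one; a sketch of the idea only, not the actual encoding used in this diff:

```rust
// A high bit distinguishes dirty (in-memory) offsets from non-dirty
// (on-disk) ones. The tag scheme and field names are illustrative.
const DIRTY_BIT: u64 = 1 << 63;

struct Index {
    disk: Vec<u8>,  // stand-in for the mmap'ed file content
    dirty: Vec<u8>, // in-memory entries not yet flushed
}

impl Index {
    /// Read one byte from whichever buffer the offset points into.
    fn peek_byte(&self, offset: u64) -> u8 {
        if offset & DIRTY_BIT != 0 {
            self.dirty[(offset & !DIRTY_BIT) as usize]
        } else {
            self.disk[offset as usize]
        }
    }
}

fn main() {
    let index = Index { disk: vec![10, 20], dirty: vec![30] };
    assert_eq!(index.peek_byte(1), 20);         // on-disk read
    assert_eq!(index.peek_byte(DIRTY_BIT), 30); // in-memory read
    println!("ok");
}
```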
Reviewed By: DurhamG
Differential Revision: D7377207
fbshipit-source-id: 1b393f1fe64c1d54b986ba7c3b03c790adb694d4
Summary:
The `non_dirty` helper method enforces that an offset is non-dirty.
It will be used frequently for checking offsets read from the disk, since
on-disk offsets shouldn't have any reference to dirty (in-memory)
entries.
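Assuming dirty offsets are distinguishable by a tag bit (a hypothetical encoding, chosen here just for illustration), the check could be as simple as:

```rust
const DIRTY_BIT: u64 = 1 << 63;

/// Return the offset if it is non-dirty (on-disk); error otherwise.
/// An on-disk entry referencing an in-memory one indicates corruption.
fn non_dirty(offset: u64) -> Result<u64, String> {
    if offset & DIRTY_BIT == 0 {
        Ok(offset)
    } else {
        Err(format!("corrupted: dirty offset {:#x} found on disk", offset))
    }
}

fn main() {
    assert_eq!(non_dirty(100), Ok(100));
    assert!(non_dirty(DIRTY_BIT | 5).is_err());
    println!("ok");
}
```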
Reviewed By: DurhamG
Differential Revision: D7377209
fbshipit-source-id: c6c381c065d3ba8aaa65698224e4778b86edbc4a
Summary:
The flush method will write buffered data to disk.
A mistake in Root entry serialization is fixed - it needs to translate dirty
offsets to non-dirty ones.
Reviewed By: DurhamG
Differential Revision: D7223729
fbshipit-source-id: baeaab27627d6cfb7c5798d3a39be4d2b8811e5f
Summary:
Add the main `Index` structure and its constructor.
The structure focuses on the index logic itself. It does not have the
checksum part yet.
Some notes about choices made:
- The use of mmap: mmap is good for random I/O, and has the benefit of
sharing buffers between processes reading the same file. We may be able to
do good user-space caching for the random I/O part. But it's harder to
share the buffers between processes.
- The "read_only" auto decision. The common "open" pattern requires the
caller to state whether they want to read or write. The index makes the
decision for the caller for convenience (e.g. running "hg log" on somebody
else's repo).
- The "load root entry from the end of the file" feature. It's just for
convenience for users wanting to use the Index in a standalone way. We
probably
Reviewed By: DurhamG
Differential Revision: D7208358
fbshipit-source-id: 14b74d7e32ef28bd5bc3483fd560c489d36bf8e5
Summary:
`mmap_readonly` will be reused in `index.rs`, so let's move it to a shared
utils module.
Reviewed By: DurhamG
Differential Revision: D7208359
fbshipit-source-id: d98779e4e21765ce0e185281c9560245b59b174c
Summary:
Add ScopedFileLock. It is similar to Python's context managers.
It's easier to use than the raw fs2 API, since it guarantees the file is
unlocked.
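The "guaranteed unlock" comes from Rust's RAII: the guard releases the lock in `Drop`, so every exit path, including early returns and panics, unlocks. A std-only sketch of the pattern (a `Cell<bool>` stands in for the real fs2 file lock, and the names are hypothetical):

```rust
use std::cell::Cell;

struct ScopedLock<'a> {
    locked: &'a Cell<bool>,
}

impl<'a> ScopedLock<'a> {
    /// Acquire on construction.
    fn new(locked: &'a Cell<bool>) -> ScopedLock<'a> {
        locked.set(true);
        ScopedLock { locked }
    }
}

impl<'a> Drop for ScopedLock<'a> {
    /// Release when the guard goes out of scope, no matter how.
    fn drop(&mut self) {
        self.locked.set(false);
    }
}

fn main() {
    let lock_state = Cell::new(false);
    {
        let _guard = ScopedLock::new(&lock_state);
        assert!(lock_state.get()); // held inside the scope
    } // _guard dropped here
    assert!(!lock_state.get()); // released automatically
    println!("ok");
}
```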
Reviewed By: jsgf
Differential Revision: D7203684
fbshipit-source-id: 5d7beed99ff992466ab7bf1fbea0353de4dfe4f9
Summary: They are simpler than, but similar to, the radix entry.
Reviewed By: DurhamG
Differential Revision: D7191652
fbshipit-source-id: b516663567267a2e354748396b44c2ac8ebb691f
Summary: These are Rust structures that map to the file format.
Reviewed By: DurhamG
Differential Revision: D7191366
fbshipit-source-id: 23a4431383be9713e955b74306cd68108eb80536
Summary: Document the format. Actual implementation in later diffs.
Reviewed By: DurhamG
Differential Revision: D7190575
fbshipit-source-id: 243992fd052ca7a9688d54d20694e65daebb9660
Summary:
The append-only index is too different, so it's cleaner to cherry-pick code
from radixbuf instead of modifying radixbuf, which would break code
depending on it.
This starts with the base16 iterator part.
`rustc-test` does not work with Buck and seems to be unmaintained, so the
benchmark tests are migrated to criterion.
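A base16 iterator splits each byte into two nibbles, high nibble first; a minimal sketch (the real iterator in radixbuf differs in its details):

```rust
/// Yield the base16 (nibble) sequence of a byte slice, high nibble first.
fn base16(bytes: &[u8]) -> impl Iterator<Item = u8> + '_ {
    bytes.iter().flat_map(|&b| [b >> 4, b & 0xf])
}

fn main() {
    let nibbles: Vec<u8> = base16(&[0x12, 0xef]).collect();
    // 0x12 -> [1, 2], 0xef -> [14, 15]
    assert_eq!(nibbles, vec![1, 2, 14, 15]);
    println!("ok");
}
```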
Reviewed By: DurhamG
Differential Revision: D7189143
fbshipit-source-id: 459a79b4cf16f35d2ff86f11a5980ba1fc627951
Summary:
Filesystems are hard. Append-only sounds like a safe way to write files, but
it only really helps with process crashes. If the OS crashes, other parts of
the file may get corrupted. For source control, data integrity is important,
so bytes not logically touched by appending also need to be checked.
Implement a `ChecksumTable` which adds integrity check ability to append-only
files. It's intended to be used by future append-only indexes.
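The idea can be sketched as checksumming fixed-size chunks: an append only adds (or recomputes the last partial) chunk checksum, while reads verify whichever chunk they touch. A toy version using std's default hasher (the real table presumably uses a proper hash and its own on-disk format; the chunk size and names here are illustrative):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const CHUNK: usize = 4; // toy chunk size; a real table uses larger chunks

fn checksum(chunk: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    chunk.hash(&mut h);
    h.finish()
}

/// Checksum of every fixed-size chunk of `data`.
fn build_table(data: &[u8]) -> Vec<u64> {
    data.chunks(CHUNK).map(checksum).collect()
}

/// Verify the chunk covering byte `pos` before trusting a read there.
fn verify_read(data: &[u8], table: &[u64], pos: usize) -> bool {
    let i = pos / CHUNK;
    let chunk = &data[i * CHUNK..data.len().min((i + 1) * CHUNK)];
    table.get(i).map_or(false, |&sum| sum == checksum(chunk))
}

fn main() {
    let mut data = b"hello world".to_vec();
    let table = build_table(&data);
    assert!(verify_read(&data, &table, 6));

    // Corruption in a previously written region is detected, even
    // though appends never logically touch those bytes.
    data[1] ^= 0xff;
    assert!(!verify_read(&data, &table, 1));
    println!("ok");
}
```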
Reviewed By: DurhamG
Differential Revision: D7108433
fbshipit-source-id: 16daf6b8d04bba464f1ee9221716beba69c1d47b
Summary:
First step of a storage-related building block written in Rust. The goal is
to use it to replace revlog, obsstore and packfiles.
Extern crates that are likely useful are added to reduce future churn.
Reviewed By: DurhamG
Differential Revision: D7108434
fbshipit-source-id: 97ebd9ba69547d876dcecc05e604acdf9088877e