Summary:
Implemented initial version of a garbage collector for repack. Garbage
collection is only performed if gcrepack flag is set in config. Currently
data that is old and not in keepset is garabage collected. The age of data at
which point it will be garbage collected can be specified in config as well.
Test Plan: Added a test case
Reviewers: durham
Reviewed By: durham
Subscribers: rmcelroy, medson, mjpieters, #mercurial
Differential Revision: https://phabricator.intern.facebook.com/D5548194
Tasks: 19727521
Signature: t1:5548194:1501882489:cc775ec95aa6fafc462a46f99d790f231139ce27
Summary:
Previously looking up a particular node in a histpack required a bisect to find
the file section, then a linear scan to find the particular node. If you needed
to look up the latest 3000 nodes one by one, that involved 3000 linear scans,
many of which traversed the same nodes over and over.
This patch adds additional index at the end of the current histidx file. In a
future patch, we will change getnodeinfo() to use this index instead of the
linear scan logic.
Test Plan:
Ran the tests. I haven't actually verified that the data in these
indexes is correct. My next patch will add logic that reads these indexes and
will add tests around it. I won't land this until I've confirmed it's correct.
Reviewers: quark, #mercurial, rmcelroy
Reviewed By: rmcelroy
Subscribers: rmcelroy, mjpieters
Differential Revision: https://phabricator.intern.facebook.com/D4983690
Signature: t1:4983690:1493798877:ae0802b4896b54bf066df9f684d94554855fd35a
Summary:
Previously, we used the length of the index file to determine the upper bounds
of the bisect. In a future patch we'll want to add more data to the end of the
index file, so we need to record how long the index portion of the index is.
This patch adds that information.
Test Plan: Ran the tests.
Reviewers: #mercurial, quark
Reviewed By: quark
Subscribers: mjpieters
Differential Revision: https://phabricator.intern.facebook.com/D4983682
Signature: t1:4983682:1493693255:57ab9af2030847fedff05b6755113ba8ce0c933b
Summary:
The issue was introduced by the getmeta refactoring. It didn't get caught by
tests because tests didn't trigger freememory.
Test Plan: The fix was verified working on fbsource
Reviewers: #mercurial, durham
Reviewed By: durham
Subscribers: mjpieters
Differential Revision: https://phabricator.intern.facebook.com/D4978835
Signature: t1:4978835:1493689136:296d425cf5d08b807b898c0e8cd881c9207c6359
Summary:
This diff makes 2 changes to v1 packfile metadata:
1. Move `key` in a metadata entry to before `size`.
```
old: [entry-size: 2 byte] [key: 1 byte] [data: var length]
new: [key: 1 byte] [data-size: 2 byte] [data: var length]
```
Previously `entry-size == 0` does not make sense.
2. Use binary to represent sizes, instead of ASCII.
Related utility methods are cleaned up a bit so it's harder to make mistakes.
Test Plan: Updated existing tests
Reviewers: #mercurial, durham
Reviewed By: durham
Subscribers: durham, mjpieters
Differential Revision: https://phabricator.intern.facebook.com/D4983189
Signature: t1:4983189:1493689852:22d544d73ed63fac83f849786de035af304161ce
Summary:
This diff implements getmeta in C and enables related tests.
Now all content stores support `getmeta`.
Test Plan: Run existing tests.
Reviewers: #mercurial, durham
Reviewed By: durham
Subscribers: mjpieters
Differential Revision: https://phabricator.intern.facebook.com/D4960926
Signature: t1:4960926:1493611048:55095c32927fac74e698f21d47173cb8a7523fb6
Summary:
This diffs add a `getmeta` method to all content stores. The cdatapack code is
modified to pass the tests, it needs further change to support `getmeta`.
The datapack format is bumped to v1 from v0. For v1, we append a `metadata`
dict at the end of each revision. The dict is currently used to store revlog
flags and rawsize of raw revlog fulltext. In the future we can put more data
like a second hash etc, without changing API or format again.
This diff focuses on correctness. A datapack caching layer to speed up
`getmeta` will be added later.
Tests are updated since we write new v1 packfile now and the format change
leads to different content and packfile names.
`Makefile`, `ls-l.py` are added to make tests easier to maintain.
Test Plan: Updated existing tests.
Reviewers: #mercurial, rmcelroy, durham
Reviewed By: durham
Subscribers: rmcelroy, mjpieters
Differential Revision: https://phabricator.intern.facebook.com/D4903917
Signature: t1:4903917:1493255844:7ef5d487096cd2f78f2aaae672a68d49f33632ee
Summary:
To be able to bump version and change formats, the related constants need to
be moved to individual classes. So a class (ex. datapack) can be subclassed to
handle different formats.
Test Plan: `arc unit`
Reviewers: #mercurial, durham
Reviewed By: durham
Subscribers: mjpieters
Differential Revision: https://phabricator.intern.facebook.com/D4927284
Signature: t1:4927284:1493152641:e3274dd735d50baf193b7615dd314f4e6cf161f0
Summary:
As part of unifying our storage layer into a single library, let's move
py-cdatapack into the new cstore directory. Future patches will move
ctreemanifest and the upcoming datapackstore into here as well.
py-cdatapack.h required some reordering since it seems forward declarations work
a little differently between C and C++. There were no code changes though,
except one int->size_t fix.
Test Plan: Ran the tests
Reviewers: #mercurial, simonfar
Reviewed By: simonfar
Subscribers: mjpieters
Differential Revision: https://phabricator.intern.facebook.com/D4581320
Signature: t1:4581320:1487788968:e8a34c7a03a16db282214c7dd476b749b92a1bfa
Summary:
We have run into some cases where users ended up with empty pack file in their
packs directory. Just log a warning in this case and skip this pack file,
rather than letting the exception propagate up and crashing the command.
Test Plan:
Created empty 0000000000000000000000000000000000000000.histpack and
0000000000000000000000000000000000000000.histidx files in my repository's
hgcache directory, and confirmed that "hg log" now simply warns about them
instead of crashing.
I didn't really test the perftest.py or treemanifest_correctness.py extensions
much. They seem to throw exceptions, and look like they have maybe gotten a
bit stale. I fixed one minor typo, but didn't dig into the other exceptions
too much.
Reviewers: durham
Reviewed By: durham
Subscribers: net-systems-diffs@, yogeshwer, mjpieters
Differential Revision: https://phabricator.intern.facebook.com/D4402516
Tasks: 15428659
Signature: t1:4402516:1484155254:96d2980efcec2d735257b08910e1ca437ef1dad6
Summary:
This fixes all the pyflaks and module errors for the main remotefilelog
code base.
Test Plan: ./run-tests.py test-check* test-remotefilelog*
Reviewers: #mercurial, quark
Reviewed By: quark
Subscribers: mjpieters
Differential Revision: https://phabricator.intern.facebook.com/D4055537
Signature: t1:4055537:1477049663:ee904d311d17d3659e055e2c109c68c9023cfd1f
Summary:
fastdatapack is the same as datapack. add selector in datapackstore to determine which datapack to create.
test-datapack-fast.t is the same as tset-datapack.t, except it enables fastdatapack
Test Plan: pass test-datapack.t test-datapack-fast.t
Reviewers: #fastmanifest, durham
Reviewed By: durham
Subscribers: mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D3666932
Signature: t1:3666932:1470426499:45292064e2868caab152d9a5b788840c5f63e4e4
Summary:
When retrieving a delta chain, datapack.py uncompresses the delta chain data. However, when iterating over the datapack, we get the compressed length. THis is not desirable as the output is no longer consistent. This diff peeks into the lz4 header to get the uncompressed length when iterating.
Depends on D3627119
Test Plan:
```
[andromeda]:~/work/mercurial/facebook-hg-rpms/remotefilelog:e2ef218> hg debugdatapack d864669a5651d04505ec6e5e9dba1319cde71f7b --node f2e53f83c5dc806aa2eda87bb15fe0367baf3a7e
source/zippydb/tier_spec/tier_settings/zippydb.wildcard.tmpfs.zippydb_settings.cconf
Node Delta Base Delta SHA1 Delta Length
f2e53f83c5dc806aa2eda87bb15fe0367baf3a7e 0000000000000000000000000000000000000000 f32b366a6c44430df6526133f82f9638426ba9c5 37769
[andromeda]:~/work/mercurial/facebook-hg-rpms/remotefilelog:e2ef218> hg debugdatapack d864669a5651d04505ec6e5e9dba1319cde71f7b | tail -n 4
source/zippydb/tier_spec/tier_settings/zippydb.wildcard.tmpfs.zippydb_settings.cconf
Node Delta Base Delta Length
f2e53f83c5dc 000000000000 37769
[andromeda]:~/work/mercurial/facebook-hg-rpms/remotefilelog:e2ef218>
```
Reviewers: durham
Reviewed By: durham
Subscribers: mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D3636945
Signature: t1:3636945:1469811243:b21d90d9599244ed4600c5336818b9a18eacf3ff
Summary:
It should include the filelen and the deltalen fields, which are
2 and 8 bytes.
Test Plan: visual.
Reviewers: durham
Reviewed By: durham
Subscribers: mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D3627108
Signature: t1:3627108:1469742083:ffb59768906d9e5463065eec92e1c80cc8482884
Summary:
When running large repack operations, the resident size of the process
could become quite large, since we're scanning in entire pack files. Linux/OSX
have api calls for telling the kernel it's ok to release some of that memory,
but those apis are not exposed to python.
So instead, let's unmap and remap the mmap's once a certain amount of data has
been read. I also tried changing the mmap accessors to use the file oriented api
(mmap.read(), mmap.seek(), etc) so we could switch to actual file handles during
repack, but it had a drastic affect on normal performance (repack took 1 hour
instead of a few minutes).
Long term we should move all of this logic to c++ so we can use the more
powerful APIs.
Test Plan:
Did a full repack on a laptop and verified memory capped out at 2GB
instead of exceeding 5GB.
Reviewers: #sourcecontrol, ttung
Differential Revision: https://phabricator.intern.facebook.com/D3545171
Summary:
There was a race condition where if a repack is running and another hg process
launches, the new process will only see the original packs, and not any of the
new packs (even though the source blobs are being deleted from disk by the
repack).
The fix is to allow our pack store to refresh it's list of packs every so often.
In this particular implementation we do it at most every 100ms. A more robust
strategy would be to group key misses and only check for new packs at the end
once we have a list of all the misses, but this would require significant
refactoring to make everything grouped. This case should only ever happen during
repacks, so it should almost never occur more than once during a command, so the
100ms version is probably good enough.
Test Plan:
Ran `hg up && hg pull && sleep 0.2 && hg up master` in a loop with a
break point in the refresh code and caught it executing in a situation where the
background repack had removed the original sources and put them in a new pack.
Verified that it loaded the data from the new pack correctly.
Reviewers: #mercurial, ttung, lcharignon
Reviewed By: lcharignon
Subscribers: lcharignon
Differential Revision: https://phabricator.intern.facebook.com/D3524314
Signature: t1:3524314:1467907680:85be07ad953811000c468852eb0626f4d8b53a13
We've unified on KeyError being the error thrown when the pack is missing the
desired filename+filehash, but there were a few old places still using
LookupError. This patch changes them to also be KeyError.
This fixes an issue where a repack could throw a LookupError when it only had a
partial history of a file. Now that we throw a KeyError, the exception is caught
and handled appropriately.
Summary:
This moves the common logic from datapack and historypack into a common
basepack. At the moment the only common logic is the constructor, which handles
version checking, fanout initialization, and mmap stuff.
Test Plan: Ran the tests
Reviewers: mitrandir, #mercurial, ttung, mjpieters
Reviewed By: mjpieters
Differential Revision: https://phabricator.intern.facebook.com/D3306558
Signature: t1:3306558:1463474571:35d3d2e71849b8111e5455da2dd4810725a35523
Summary:
This takes the duplicate logic from datapackstore and historypackstore and moves
it into a common subclass.
Test Plan: Ran the tests
Reviewers: ttung, mitrandir, #mercurial, mjpieters
Reviewed By: mjpieters
Differential Revision: https://phabricator.intern.facebook.com/D3306551
Signature: t1:3306551:1463474958:ddd36a43c2a3cbb34254c2b0153c0f3c5d431edb
Summary:
mutabledatapack and mutablehistorypack share a lot of common code, especially
around the index and fanout table. Let's move much of the code to a common
mutablebasepack class and out of mutabledatapack. In the next patch I will make
mutablehistorypack a subclass of mutablebasepack and delete all the duplicate
logic.
Test Plan: Ran the tests
Reviewers: mitrandir, #mercurial, ttung, rmcelroy
Reviewed By: rmcelroy
Subscribers: quark, rmcelroy
Differential Revision: https://phabricator.intern.facebook.com/D3306542
Signature: t1:3306542:1463611860:16bc68416c9bbed87748a50f55a3bac7c618fdf1
Summary:
Previously creating a mutable pack then not adding anything to it, would result
in a 1 byte pack file. This patch fixes it so it doesn't produce any pack file.
A future patch found this bug and includes a test that executes this code path.
Test Plan: Future patch adds a test
Reviewers: mitrandir, #mercurial, ttung, rmcelroy
Reviewed By: rmcelroy
Differential Revision: https://phabricator.intern.facebook.com/D3306520
Signature: t1:3306520:1463602354:f47e9478ea686b0b86525b5979dec9ce56301b2a
Summary:
Previously all fanout tables were 2^16 in length. For small packs, this resulted
in very sparse fanout tables that had to be linearly scanned to find the end of
the search bounds, which was slow.
With this patch, the data index now dynamically chooses what fanout table size
to use. If the pack has over 2^16 / 10 entries, we use 2^16, otherwise we use
2^8. The reasoning is in the code.
The patch is a bit large because we had to take a bunch of constants and
duplicate them, and change their accessors to access them through member
variables.
Test Plan: Added a test
Reviewers: #mercurial, ttung, mitrandir
Reviewed By: mitrandir
Subscribers: mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D3279000
Signature: t1:3279000:1463170357:9746d19dde14743bac9a8c40cafbc618504c420f
Summary:
In a future patch we'll add debug commmands to view the contents of packs. Let's
repurpose iterkeys to just be __iter__ and have it return most of the data about
each entry.
Test Plan: Ran the tests
Reviewers: #mercurial, ttung, mitrandir
Reviewed By: mitrandir
Subscribers: mitrandir, quark
Differential Revision: https://phabricator.intern.facebook.com/D3277225
Signature: t1:3277225:1463170015:c1acbc7f435f1cf8d07fea4f32bf742a22de5716
Summary:
Previously, we assumed every delta in a pack had a complete chain in that pack,
so the index had no way to indicate a deltabase offset that didn't exist in the
pack.
Let's add a new marker to indicate that the delta base doesn't exist in this
pack.
Test Plan: Tested in my large repo repack scenario. Added a test
Reviewers: #mercurial, ttung, mitrandir
Reviewed By: mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D3277222
Signature: t1:3277222:1463084346:1687cfa174e98c2cf3022de9e9c3808881f689cd
Summary:
Since recent packs are likely to contain more recent data, let's put them at the
front of the pack list so they are checked first.
In a future diff I'll come back and refact the common code between datapack and
historypack into one base class.
Test Plan:
Ran the tests. Used the debugger to verify that the sort order was
correct.
Reviewers: #mercurial, ttung, mjpieters
Reviewed By: mjpieters
Subscribers: mjpieters
Differential Revision: https://phabricator.intern.facebook.com/D3259823
Signature: t1:3259823:1462388395:8ee48a7b02c6abc079878e53c5b336675249cb91
Summary:
Previously, if you ran repack twice in a row, it would actually delete your
packs, because the repack produced files with the same name as before, and the
cleanup then deleted them.
The fix is to have the stores record what files they produced in the ledger,
then future clean up work can avoid deleting anything that was created during
the repack.
Test Plan: Added a test
Reviewers: #mercurial, ttung, mitrandir
Reviewed By: mitrandir
Subscribers: mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D3255819
Signature: t1:3255819:1462393814:d32155b12535990f72fbe48de045eddbb6f7fab6
Summary:
Since pack files should never change after they are created, let's create them
with read-only permissions. It turns out that the Mercurial vfs doesn't apply
the correct permissions to files created by mkstemp (and we have to use mkstemp
since we don't know the name of the file until after we've written all the data
to it), so we have to manually call the permission fixing code.
We also need to fix our mmap calls to be readonly now, otherwise we get a
runtime permission denied exception.
Test Plan: Added a test
Reviewers: #mercurial, ttung, mitrandir
Reviewed By: mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D3255816
Signature: t1:3255816:1462321201:dff4fb4c9301d67a77043ecc1d96262bb5d6a54a
Summary:
Instead of passing in a path and performing joins ourselves, let's use an
opener. This will help handle all the file permission edge cases.
Test Plan: Ran the tests
Reviewers: #mercurial, ttung, mitrandir
Reviewed By: mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D3255165
Signature: t1:3255165:1462393836:38a28c850a0dc06838d9c17672d3dffd9903bbd7
Summary:
A future patch will add a version number to the index file. Let's move the
version size, fanout start, and index start to constants so we can more easily
change them without changing the code.
Test Plan: Ran the tests
Reviewers: lcharignon, rmcelroy, ttung, mitrandir, quark
Reviewed By: mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D3254876
Signature: t1:3254876:1462315858:63fe56e8cfcdbb0209861898ce0c45c7d7b33e35
Summary:
Implementing these two functions will allow datapack's to be repacked (either
into other formats, or by combining multiple packs into one).
A future patch will add a test.
Test Plan: Added a test in a future patch
Reviewers: lcharignon, ttung, rmcelroy, mitrandir, quark
Reviewed By: mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D3251539
Signature: t1:3251539:1462393256:7caa09677fbcaaf57a47d7a833684883483c5b3a
Summary:
Previously, if a repack failed, it would leave temporary pack files laying
around. By adding enter/exit functions to mutable packs, we can guarantee
cleanup happens.
Test Plan: Ran repack, verified that a failure did not leave tmp files
Reviewers: rmcelroy, quark, ttung, lcharignon, mitrandir
Reviewed By: mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D3250201
Signature: t1:3250201:1462234552:7f20260a193ed1dd858bf6e9f489ac902d859218
Summary:
We had a naive repack implementation in datapack.py. Let's move it to the repack
module and do the minor adjustments to use the new repackerledger apis.
Test Plan: Ran it in conjunction with future diffs that make use of this api.
Reviewers: rmcelroy, ttung, lcharignon, quark, mitrandir
Reviewed By: mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D3249585
Signature: t1:3249585:1462232504:a00aa65afca9562a2c1456cc4ab48c50d1ba5b68
Summary:
This implements the new markledger and cleanup apis on the existing
remotefilelog stores. These apis are used to tell the repacker what each store
has, and allows each store to cleanup if its data has been repacked.
Test Plan:
Ran repack in conjunction with the future diffs that make use of
these apis.
Reviewers: rmcelroy, ttung, lcharignon, quark, mitrandir
Reviewed By: mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D3249584
Signature: t1:3249584:1462226133:1e8faffc9f6bf8f7c94e6e79aee8865e3c41648c
Summary:
Using range() allocates a full list, which is 2**16 entries in the fanout case.
Let's use xrange instead. This is a notable performance win when checking many
keys.
Also removed an unused variable and use index instead of self._index since this
is a hotpath.
Test Plan: Ran hg repack
Reviewers: rmcelroy, ttung, lcharignon, quark, mitrandir
Reviewed By: mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D3249563
Signature: t1:3249563:1462240834:c19d6cbf0b6237f15ca8d81e8da856752df0ec59
Summary: Adds the same check code test that upstream Mercurial uses.
Test Plan:
Ran it, and fixed all the failures. I won't land this commit until
all the failure fixes are landed.
Reviewers: #sourcecontrol, ttung, rmcelroy, wez
Reviewed By: wez
Subscribers: quark, rmcelroy, wez
Differential Revision: https://phabricator.intern.facebook.com/D3221380
Signature: t1:3221380:1461802769:19f5bdc209c05edb442faa70ae572ce31e2fbc95
Summary: Fix check code for various store related files
Test Plan: Ran the tests
Reviewers: #sourcecontrol, mitrandir, ttung
Reviewed By: mitrandir
Differential Revision: https://phabricator.intern.facebook.com/D3222465
Signature: t1:3222465:1461701300:34560288be4dc921f0252d4ad8fdc9c8d9357e23
Summary:
This forces the revisions in the datapack to be added in alphabetical order.
This makes the algorithm more deterministic, but otherwise has little effect.
Test Plan: Ran the tests, ran repack
Reviewers: #sourcecontrol, rmcelroy, ttung
Reviewed By: rmcelroy
Subscribers: rmcelroy
Differential Revision: https://phabricator.intern.facebook.com/D3219760
Signature: t1:3219760:1461687720:7be5fdc1419f8214c8c83074494b33214b3684ae
Summary:
Now that we can read and write datapack files, let's add a store implementation that
can serve packed content. With this patch, it's technically possible for someone
to prefetch and repack large portions of history for long term storage with
remotefilelog.
My next set of commits (which haven't been written yet) will:
- add tests for all of this
- add an indexpack format for packing ancestor metadata (the datapack only packs
revision content)
Test Plan:
Ran the tests. Also repacked a repo, deleted the old cache files, ran
hg up null && hg up master, and verified it checked out master with the
right files and without fetching blobs from the server.
Reviewers: #sourcecontrol, ttung, rmcelroy
Reviewed By: rmcelroy
Subscribers: rmcelroy
Differential Revision: https://phabricator.intern.facebook.com/D3205351
Signature: t1:3205351:1461751649:45a56b57d962a282aeef9478500a3b23495a0eb7
Summary:
The previous patch added logic to repack store contents and write it to a
datapack file. This patch adds a new store implementation that knows how to read
datapacks.
It's just a simple implementation without any parallelism. So there's room for
improvement.
Test Plan:
Ran the tests. Also tested this in conjunction with the next patch
which actually reads from the data structure.
Reviewers: #sourcecontrol, ttung, rmcelroy
Reviewed By: rmcelroy
Subscribers: rmcelroy
Differential Revision: https://phabricator.intern.facebook.com/D3205342
Signature: t1:3205342:1461750967:84377517cb1f285d37694a3f503d60ae85bacb66
Summary:
This is an initial implementation of a repack algorithm that can read data from
an arbitrary store (in this case the remotefilelog content store), and repack it
into a datapack.
A datapack is two files:
- a .datapack file consisting of a series of deltas (a delta may be a full text if the delta base is the nullid)
- a .dataidx file consisting of delta information and an index into the deltas
See the code for documentation of the exact format.
Test Plan:
ran the tests
Ran `hg repack` in a large repo. Verified that a datapack and a dataidx file
were created in .hg/store/packs. The datapack used 148MB instead of the 439MB the
old remotefilelog storage used.
Reviewers: #sourcecontrol, ttung, rmcelroy
Reviewed By: rmcelroy
Subscribers: rmcelroy
Differential Revision: https://phabricator.intern.facebook.com/D3205334
Signature: t1:3205334:1461751366:ee4bf6a580ffb667071a8046fda6f0858b7f25ae