Optimize costs of importing in the cloud. #326

Every time I run an import, 10 GCE workers download about 20GB of
data/input. The S3 outbound charges are uncomfortably high.
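
A rough cost sketch, assuming each of the 10 workers pulls the full ~20 GB input set and standard S3 internet egress pricing of roughly $0.09/GB: that's about 200 GB, on the order of $18, of S3 egress per run, versus paying that egress once for a single ~20 GB S3->GCS sync.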

Instead, manually run GCP's S3->GCS transfer tool (Storage Transfer Service)
before each run, and have the GCE VMs read from the GCS copy instead.
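
A minimal CLI sketch of kicking off that sync, assuming a gcloud version that
ships the Storage Transfer commands and an AWS credentials file named
aws-creds.json; the bucket names come from the scripts below, everything else
is an assumption (the job can also be created from the console page linked in
the script):

    # Sketch: one-off S3 -> GCS transfer of the import inputs (flags assumed).
    gcloud transfer jobs create s3://abstreet gs://abstreet-importer \
        --include-prefixes=dev/data/input \
        --source-creds-file=aws-creds.json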

I haven't tested these changes yet, but will soon with the next import.
Dustin Carlino 2021-05-27 08:09:03 -07:00
parent 53430319b1
commit 83bc768e28
2 changed files with 11 additions and 3 deletions


@@ -16,6 +16,11 @@ if [ "$EXPERIMENT_TAG" == "" ]; then
 	exit 1;
 fi
+if [ "$2" != "gcs_sync_done" ]; then
+	echo First go sync dev/data/input from S3 to GCS. https://console.cloud.google.com/transfer/cloud/jobs
+	exit 1;
+fi
 NUM_WORKERS=10
 ZONE=us-east1-b
 # See other options: https://cloud.google.com/compute/docs/machine-types
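
A usage sketch of the new guard (the launch script's path isn't shown in this
hunk, so the name below is a placeholder); the literal second argument is just
a manual confirmation that the S3->GCS sync already finished:

    # Hypothetical invocation; script name is a placeholder.
    ./launch_import.sh my_experiment_tag gcs_sync_done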


@@ -25,9 +25,12 @@ cd worker_payload
 mv .aws ~/
 # If we import without raw files, we'd wind up downloading fresh OSM data!
-# Reuse what's in S3. We could use the updater, but probably aws sync is
-# faster.
-aws s3 sync s3://abstreet/dev/data/input data/input/
+# Reuse what's in S3. But having a bunch of GCE VMs grab from S3 is expensive,
+# so instead, sync from the GCS mirror that I manually update before each job.
+gsutil -m cp -r gs://abstreet-importer/ .
+mv abstreet-importer/dev/data/input data/input
+rmdir abstreet-importer/dev
+rmdir abstreet-importer
 find data/input -name '*.gz' -print -exec gunzip '{}' ';'
 # Set up Docker, for the elevation data
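
A hedged alternative to the cp/mv/rmdir sequence above, not what this commit
does: gsutil rsync can pull the prefix straight into data/input:

    # Alternative sketch (assumption, untested): sync the prefix directly into place.
    mkdir -p data/input
    gsutil -m rsync -r gs://abstreet-importer/dev/data/input data/input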