Optimize costs of importing in the cloud. #326

Every time I run an import, 10 GCE workers download about 20GB of
data/input. The S3 outbound charges are uncomfortably high.
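
A rough cost sketch, assuming each of the 10 workers pulls the full ~20 GB input set and standard S3 internet egress pricing of roughly $0.09/GB: that's about 200 GB, on the order of $18, of S3 egress per run, versus paying that egress once for a single ~20 GB S3->GCS sync.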

Instead, manually run GCP's S3->GCS transfer tool (Storage Transfer Service)
before each run, and have the GCE VMs read from the GCS copy instead.
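
A minimal CLI sketch of kicking off that sync, assuming a gcloud version that
ships the Storage Transfer commands and an AWS credentials file named
aws-creds.json; the bucket names come from the scripts below, everything else
is an assumption (the job can also be created from the console page linked in
the script):

    # Sketch: one-off S3 -> GCS transfer of the import inputs (flags assumed).
    gcloud transfer jobs create s3://abstreet gs://abstreet-importer \
        --include-prefixes=dev/data/input \
        --source-creds-file=aws-creds.json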

I haven't tested these changes yet, but will soon with the next import.
Dustin Carlino 2021-05-27 08:09:03 -07:00
parent 53430319b1
commit 83bc768e28
2 changed files with 11 additions and 3 deletions


@@ -16,6 +16,11 @@ if [ "$EXPERIMENT_TAG" == "" ]; then
 	exit 1;
 fi
+if [ "$2" != "gcs_sync_done" ]; then
+	echo First go sync dev/data/input from S3 to GCS. https://console.cloud.google.com/transfer/cloud/jobs
+	exit 1;
+fi
 NUM_WORKERS=10
 ZONE=us-east1-b
 # See other options: https://cloud.google.com/compute/docs/machine-types
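
A usage sketch of the new guard (the launch script's path isn't shown in this
hunk, so the name below is a placeholder); the literal second argument is just
a manual confirmation that the S3->GCS sync already finished:

    # Hypothetical invocation; script name is a placeholder.
    ./launch_import.sh my_experiment_tag gcs_sync_done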


@@ -25,9 +25,12 @@ cd worker_payload
 mv .aws ~/
 # If we import without raw files, we'd wind up downloading fresh OSM data!
-# Reuse what's in S3. We could use the updater, but probably aws sync is
-# faster.
-aws s3 sync s3://abstreet/dev/data/input data/input/
+# Reuse what's in S3. But having a bunch of GCE VMs grab from S3 is expensive,
+# so instead, sync from the GCS mirror that I manually update before each job.
+gsutil -m cp -r gs://abstreet-importer/ .
+mv abstreet-importer/dev/data/input data/input
+rmdir abstreet-importer/dev
+rmdir abstreet-importer
 find data/input -name '*.gz' -print -exec gunzip '{}' ';'
 # Set up Docker, for the elevation data
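
A hedged alternative to the cp/mv/rmdir sequence above, not what this commit
does: gsutil rsync can pull the prefix straight into data/input:

    # Alternative sketch (assumption, untested): sync the prefix directly into place.
    mkdir -p data/input
    gsutil -m rsync -r gs://abstreet-importer/dev/data/input data/input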