Accessing external data from Google Colab notebooks

Introduction

Google Colab notebooks provide a quick and easy way to prototype different machine learning models. On the standard plan, we can get up to two machines, each with 12 GB of RAM, ~40 GB of disk space, and a reasonably good GPU, and we can use those machines for up to 12 hours at a time. If we implement proper model checkpointing, those resources are sufficient to train many state-of-the-art models to good accuracy.
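As a concrete illustration, a minimal checkpointing sketch (assuming PyTorch, with hypothetical model, optimizer, and epoch variables) saves the training state somewhere persistent so that a fresh runtime can pick up where the previous one left off:

import torch

# Save the training state to a persistent location, e.g. a mounted Drive or GCS folder.
torch.save(
    {
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "checkpoints/latest.pt",
)

# In a fresh runtime, restore the state and continue training.
checkpoint = torch.load("checkpoints/latest.pt")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])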

The primary limitation we have faced when using Colab notebooks to train machine learning models is the relatively small amount of disk space available on Colab VMs. Deep neural networks often require terabytes of training data to achieve state-of-the-art performance, while Colab VMs have only ~40 GB of local storage available to the user.

In this blog post, we review different options for accessing large datasets from Colab notebooks. We find that Google Cloud Storage buckets hosted in the same region as the Colab VM offer the best performance at the lowest cost.

Accessing data from Google Drive

Mounting Google Drive using google.colab.drive

Instructions

from google.colab import drive

# Mount Google Drive at ./drive; this will prompt for authorization.
drive.mount('drive')
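Once the drive is mounted, its contents appear as ordinary paths under the mount point, for example:

import os

# The mounted drive behaves like a regular directory.
print(os.listdir("drive"))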

Results

I found that this approach cached too much data to local disk, leading to “out of space” errors after only a small portion of the training data had been processed.

Mounting Google Drive using google-drive-ocamlfuse

Instructions

Note: These instructions are largely borrowed from a post by Daniel Leonard Robinson.

Install google-drive-ocamlfuse
sudo add-apt-repository -y ppa:alessandro-strada/ppa
sudo apt-get -y -q update
sudo apt-get -y -q install google-drive-ocamlfuse
Authenticate
import getpass
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
creds = GoogleCredentials.get_application_default()

!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL

vcode = getpass.getpass()

!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}
Mount Google Drive to a “drive” folder
mkdir -p drive
google-drive-ocamlfuse drive
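With Google Drive mounted, the dataset files can be opened like local files. As a sketch (the dataset folder name is hypothetical), creating one pyarrow reader per Arrow file might look like this:

from pathlib import Path

import pyarrow as pa

# Create one reader per Arrow file in the mounted dataset folder (hypothetical path).
readers = [
    pa.RecordBatchFileReader(str(path))
    for path in sorted(Path("drive/datasets").glob("*.arrow"))
]
print(f"Opened {len(readers)} files.")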

Results

It took more than half an hour just to instantiate the 300 pa.RecordBatchFileReader instances needed to read the files in my dataset, so I gave up 🤷.

Accessing data from Google Cloud Storage (GCS)

Introduction

Checking the location of our Colab notebook

We can estimate the Google Cloud region in which the Colab notebook is running by using gcping, which measures the latency to an endpoint in each region and ranks the regions by latency.

$ curl https://storage.googleapis.com/gcping-release/gcping_linux_amd64_0.0.3 > gcping && chmod +x gcping
$ ./gcping
 1.  [us-east1]                 1.26886ms
 2.  [global]                   4.434823ms  (10 errors)
 3.  [us-east4]                 25.503394ms
 4.  [northamerica-northeast1]  51.841746ms
 5.  [us-central1]              69.766231ms
 6.  [us-west2]                 119.139421ms
 7.  [us-west1]                 133.971136ms
 8.  [europe-west2]             174.459547ms
 9.  [europe-west1]             184.693043ms
10.  [europe-west4]             189.980326ms
11.  [europe-west3]             196.545677ms
12.  [europe-west6]             209.663205ms
13.  [southamerica-east1]       232.722674ms
14.  [europe-north1]            250.294457ms
15.  [asia-northeast1]          313.532715ms
16.  [asia-northeast2]          328.806103ms
17.  [asia-east1]               370.405143ms
18.  [australia-southeast1]     391.343945ms
19.  [asia-east2]               394.748545ms
20.  [asia-southeast1]          435.614842ms
21.  [asia-south1]              495.650972ms

To avoid GCS network egress charges, it may be a good idea to “Factory reset” the Colab runtime until we land in a region on the same continent as our GCS bucket.
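If we want to automate that check, a small sketch along these lines could verify the closest region before starting an expensive job (the expected "us-" prefix is an assumption; change it to match the bucket's location):

import subprocess

# Run gcping (downloaded above) and extract the closest region from its output.
output = subprocess.run(["./gcping"], capture_output=True, text=True).stdout
closest = output.splitlines()[0].split("[")[1].split("]")[0]
if not closest.startswith("us-"):
    print(f"Colab VM appears closest to {closest}; consider a factory reset.")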

Mounting a GCS bucket using gcsfuse

Instructions

Install gcsfuse
echo "deb http://packages.cloud.google.com/apt gcsfuse-`lsb_release -c -s` main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get -y -q update
sudo apt-get -y -q install gcsfuse
Authenticate
from google.colab import auth
auth.authenticate_user()
Mount our GCS bucket to a folder
mkdir -p data
gcsfuse --implicit-dirs --limit-bytes-per-sec -1 --limit-ops-per-sec -1 {bucket_name} data
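Once the bucket is mounted, its objects can be read with ordinary file APIs. For example, a minimal sketch of opening one Arrow file from the mount (the path is hypothetical):

import pyarrow as pa

# Files in the bucket appear under the ./data mount point (hypothetical path).
reader = pa.RecordBatchFileReader("data/path/to/file.arrow")
print(reader.num_record_batches)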

Results

I can read ~35 items per second from a GCS bucket mounted on Google Colab, but only ~15 items per second from a GCS bucket mounted on a GCE VM. Each item corresponds to ~8 MB of data, so at ~15 items per second the GCE VM is close to saturating a gigabit network link.
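Per-item throughput like this can be measured with a simple timing loop; a minimal sketch, where read_item is a hypothetical function that loads one ~8 MB item from the mounted bucket:

import time

# Time how long it takes to read a fixed number of items and report items per second.
num_items = 100
start = time.perf_counter()
for i in range(num_items):
    read_item(i)  # hypothetical: reads one ~8 MB item
elapsed = time.perf_counter() - start
print(f"{num_items / elapsed:.1f} items per second")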

Persistent disks show much more consistent speeds when mounted on a GCE VM.

[Figure: timing graph]

Just for fun, I also measured the reading speed from NVMe SSDs local to the GCE VM. They are close to an order of magnitude faster, so this is the way to go when we need ultimate performance :).

[Figure: timing graph with SSD]

Reading from a GCS bucket using gcsfs

Instructions

import gcsfs
import google.auth
import pyarrow as pa
from google.colab import auth

auth.authenticate_user()

# Reuse the Colab credentials for gcsfs.
credentials, project_id = google.auth.default()
fs = gcsfs.GCSFileSystem(project="my-project", token=credentials)

# Open the Arrow file directly from the bucket, without copying it to local disk.
reader = pa.RecordBatchFileReader(fs.open("my-bucket/path/to/file.arrow"))
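Continuing from the reader created above, individual record batches can then be pulled out on demand, for example:

# Load the first record batch and convert it to a pandas DataFrame for inspection.
batch = reader.get_batch(0)
df = batch.to_pandas()
print(df.head())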

Results

I found that using gcsfs resulted in about the same throughput as gcsfuse, but with higher memory utilization.

Conclusion

If we are using Google Colab to train machine learning models and we run out of local disk space, Google Cloud Storage is a good option for hosting our data. However, we should make sure that the Colab notebook is running on the same continent as our Cloud Storage bucket, or we will incur network egress charges!

If we are training machine learning models on Google Compute Engine VMs, local NVMe SSD disks offer the ultimate performance, while standard persistent disks offer reasonable performance and allow us to shut down our VMs when they are not in use.

Alexey Strokach
Graduate Student

If you found something wrong, let me know!