Hyperparameter tuning in SURFsara HPC Cloud

Hyperparameter tuning is difficult, not because it’s terribly complicated, but because obtaining enough resources is often not easy. I’m lucky enough to work at Vrije Universiteit and can therefore access the SURFsara HPC Cloud without too much effort. Compared to Amazon EC2 (the only other cloud solution I have tried), the functionality is rather basic, but I think it suits the needs of many researchers. Using the web interface or the OpenNebula API, you can easily customize an image, attach a hard drive, launch 10 instances, and access any of them with a public key. What else do you need to run your experiments?

Oftentimes, you don’t have one machine for each experiment (an experiment = one way of setting all the hyperparameters, and there are many such settings). We must somehow schedule the work given the restrictions of the cluster/grid/cloud. A generic solution is a database storing the configuration of every experiment, with multiple workers fetching experiments to execute. This way, the idle time of the few available workers is minimal while the implementation remains relatively easy.

I implemented such a setup during the practical session of my HPC Cloud class and decided to put it here for you to use. There are various libraries for grid search, but this code is unique in that it is minimal and flexible. I intentionally avoid command-line parameters since I find a few lines of Python more transparent and easier to understand than 10 CLI options. Python also gives you the freedom to explore disjoint hyperparameter regions, separate points, or whatever else you can think of. The downside is that you need to understand the code and know how to fix things when they break.
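For example, instead of typing out every configuration by hand as in the listing below, the full Cartesian grid over a few hyperparameter values can be generated with `itertools.product`. This is only a sketch; the parameter names and values simply mirror the example configuration in this post:

```python
# Sketch: generate a grid of experiment configurations with itertools.product.
# The parameter names mirror the example configuration used later in this post.
from itertools import product

ada_eps_values = (1e-6, 1e-8, 1e-10)
rho_values = (0.1, 0.9)

experiments = tuple(
    {'adaEps': eps, 'rho': rho}
    for rho, eps in product(rho_values, ada_eps_values)
)

print(len(experiments))  # 6 configurations
```

Nothing stops you from concatenating several such grids, or appending hand-picked points, since `experiments` is just a plain Python tuple of dicts.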

This code runs on the “master” machine only. Each worker has to fetch experiments and call some program to execute them. I will write that part too when I have spare time.

### configurations
db_name = 'experiments'
coll_name = 'revision0'   # put the revision of your code before running
experiments = (
    {'adaEps': 1e-6, 'rho': 0.1},
    {'adaEps': 1e-8, 'rho': 0.1},
    {'adaEps': 1e-10, 'rho': 0.1},
    {'adaEps': 1e-6, 'rho': 0.9},
    {'adaEps': 1e-8, 'rho': 0.9},
    {'adaEps': 1e-10, 'rho': 0.9},
)

ONE_ENDPOINT = 'http://ui.hpccloud.surfsara.nl:2633/RPC2'
ONE_USER = 'your_username'          # replace this with your HPC Cloud UI username
ONE_PASSWORD = 'your_password'      # replace this with your HPC Cloud UI password

num_workers = 2

### setup (assumes a Debian/Ubuntu image with sudo and pip available)
import os
os.system('sudo apt-get install -y mongodb')
os.system('pip install pymongo oca')   # oca: Python bindings for the OpenNebula API

### populate database
from pymongo import MongoClient
client = MongoClient()
db = client[db_name]
coll = db[coll_name]
ret = coll.insert_many(experiments)
print('Successfully inserted %d experiments' % len(ret.inserted_ids))

### start workers
import oca
c = oca.Client('%s:%s' % (ONE_USER, ONE_PASSWORD), ONE_ENDPOINT)
temp_pool = oca.VmTemplatePool(c)
temp_pool.info()   # fetch the template list from OpenNebula
temp = temp_pool.get_by_name('worker')
for i in range(num_workers):
    temp.instantiate('worker-%02d' %i)
print('Attempted to start %d workers' %num_workers)

### wait for workers to terminate
import time
last_remaining = len(experiments)
while True:
    cursor = client[db_name][coll_name].find()
    remaining = sum(1 for e in cursor if not e.get('terminated'))
    if remaining < last_remaining:
        print('Remaining experiments to run: %d' % remaining)
    if remaining <= 0:
        break
    last_remaining = remaining
    time.sleep(30)   # poll periodically instead of hammering the database

### shut down virtual machines
vm_pool = oca.VirtualMachinePool(c)
vm_pool.info()   # fetch the VM list from OpenNebula
for i in range(num_workers):
    vm_pool.get_by_name('worker-%02d' % i).shutdown()
print('Shut down %d workers' % num_workers)
