autogluon.scheduler

Example

Define a toy training function with searchable spaces:

>>> import numpy as np
>>> import autogluon as ag
>>> @ag.args(
...     lr=ag.space.Real(1e-3, 1e-2, log=True),
...     wd=ag.space.Real(1e-3, 1e-2))
>>> def train_fn(args, reporter):
...     print('lr: {}, wd: {}'.format(args.lr, args.wd))
...     for e in range(10):
...         dummy_accuracy = 1 - np.power(1.8, -np.random.uniform(e, 2*e))
...         reporter(epoch=e, accuracy=dummy_accuracy, lr=args.lr, wd=args.wd)

Create a scheduler and use it to run training jobs:

>>> scheduler = ag.scheduler.HyperbandScheduler(train_fn,
...                                             resource={'num_cpus': 2, 'num_gpus': 0},
...                                             num_trials=10,
...                                             reward_attr='accuracy',
...                                             time_attr='epoch',
...                                             grace_period=1)
>>> scheduler.run()
>>> scheduler.join_jobs()

Visualize the results:

>>> scheduler.get_training_curves(plot=True)
https://raw.githubusercontent.com/zhanghang1989/AutoGluonWebdata/master/doc/api/autogluon.1.png

Schedulers

FIFOScheduler

Simple scheduler that just runs trials in submission order.

HyperbandScheduler

Implements different variants of asynchronous Hyperband

RLScheduler

Scheduler that uses reinforcement learning with an LSTM controller built from the provided search spaces.

FIFOScheduler

class autogluon.scheduler.FIFOScheduler(train_fn, args=None, resource=None, searcher=None, search_options=None, checkpoint='./exp/checkpoint.ag', resume=False, num_trials=None, time_out=None, max_reward=1.0, time_attr='epoch', reward_attr='accuracy', visualizer='none', dist_ip_addrs=None)

Simple scheduler that just runs trials in submission order.

Parameters
train_fn : callable

A task launch function for training. Note: please add the @ag.args decorator to the original function.

args : object, optional

Default arguments for launching train_fn.

resource : dict

Computation resources. For example, {'num_cpus': 2, 'num_gpus': 1}

searcher : str or object

AutoGluon searcher. For example, autogluon.searcher.RandomSearcher

time_attr : str

A training result attribute used for comparing time. Note that you can pass in something non-temporal, such as training_epoch, as a measure of progress; the only requirement is that the attribute increases monotonically.

reward_attr : str

The training result objective value attribute. As with time_attr, this may refer to any objective value. Stopping procedures will use this attribute.

dist_ip_addrs : list of str

IP addresses of remote machines.

Examples

>>> import numpy as np
>>> import autogluon as ag
>>> @ag.args(
...     lr=ag.space.Real(1e-3, 1e-2, log=True),
...     wd=ag.space.Real(1e-3, 1e-2))
>>> def train_fn(args, reporter):
...     print('lr: {}, wd: {}'.format(args.lr, args.wd))
...     for e in range(10):
...         dummy_accuracy = 1 - np.power(1.8, -np.random.uniform(e, 2*e))
...         reporter(epoch=e, accuracy=dummy_accuracy, lr=args.lr, wd=args.wd)
>>> scheduler = ag.scheduler.FIFOScheduler(train_fn,
...                                        resource={'num_cpus': 2, 'num_gpus': 0},
...                                        num_trials=20,
...                                        reward_attr='accuracy',
...                                        time_attr='epoch')
>>> scheduler.run()
>>> scheduler.join_jobs()
>>> scheduler.get_training_curves(plot=True)
Attributes
num_finished_tasks

Methods

add_job(self, task, **kwargs)

Add a training task to the scheduler.

add_remote(self, ip_addrs)

Add remote nodes to the scheduler's computation resources.

add_task(self, task, **kwargs)

add_task() is now deprecated in favor of add_job().

get_best_config(self)

Get the best configuration from the finished jobs.

get_best_reward(self)

Get the best reward from the finished jobs.

get_training_curves(self[, filename, plot, …])

Get training curves.

join_jobs(self[, timeout])

Wait for all scheduled jobs to finish.

load_state_dict(self, state_dict)

Load from the saved state dict.

run(self, **kwargs)

Run multiple trials.

run_job(self, task)

Run a training task synchronously.

run_with_config(self, config)

Run with a fixed config for the final fit.

save(self[, checkpoint])

Save a checkpoint.

schedule_next(self)

Schedule the next task suggested by the searcher.

shutdown(self)

shutdown() is now deprecated in favor of autogluon.done().

state_dict(self[, destination])

Returns a dictionary containing the whole state of the scheduler.

upload_files(files, **kwargs)

Upload files to remote machines so that they are accessible by import or load.

join_tasks

add_job(self, task, **kwargs)

Add a training task to the scheduler.

Args:

task (autogluon.scheduler.Task): a new training task

Relevant entries in kwargs:
  • bracket: HB bracket to be used. Has been sampled in _promote_config

  • new_config: If True, task starts new config eval, otherwise it promotes a config (only if type == ‘promotion’)

Only if new_config == False:
  • config_key: Internal key for config

  • resume_from: config promoted from this milestone

  • milestone: config promoted to this milestone (next from resume_from)

add_remote(self, ip_addrs)

Add remote nodes to the scheduler's computation resources.

add_task(self, task, **kwargs)

add_task() is now deprecated in favor of add_job().

get_best_config(self)

Get the best configuration from the finished jobs.

get_best_reward(self)

Get the best reward from the finished jobs.

get_training_curves(self, filename=None, plot=False, use_legend=True)

Get training curves.

Parameters
filename : str

plot : bool

use_legend : bool

Examples

>>> scheduler.run()
>>> scheduler.join_jobs()
>>> scheduler.get_training_curves(plot=True)
https://github.com/zhanghang1989/AutoGluonWebdata/blob/master/doc/api/autogluon.1.png?raw=true
join_jobs(self, timeout=None)

Wait for all scheduled jobs to finish.

load_state_dict(self, state_dict)

Load from the saved state dict.

Examples

>>> scheduler.load_state_dict(ag.load('checkpoint.ag'))
run(self, **kwargs)

Run multiple trials.

run_job(self, task)

Run a training task synchronously.

run_with_config(self, config)

Run with a fixed config for the final fit. It launches a single training trial with fixed values for all hyperparameters. For example, after HPO has identified the best hyperparameter values based on a hold-out dataset, one can use this function to retrain a model with the same hyperparameters on all the available labeled data (including the hold-out set). It can also return other objects or states.

save(self, checkpoint=None)

Save a checkpoint.

schedule_next(self)

Schedule the next task suggested by the searcher.

shutdown(self)

shutdown() is now deprecated in favor of autogluon.done().

state_dict(self, destination=None)

Returns a dictionary containing the whole state of the scheduler.

Examples

>>> ag.save(scheduler.state_dict(), 'checkpoint.ag')
classmethod upload_files(files, **kwargs)

Upload files to remote machines, so that they are accessible by import or load.

HyperbandScheduler

class autogluon.scheduler.HyperbandScheduler(train_fn, args=None, resource=None, searcher=None, search_options=None, checkpoint='./exp/checkpoint.ag', resume=False, num_trials=None, time_out=None, max_reward=1.0, time_attr='epoch', reward_attr='accuracy', max_t=100, grace_period=10, reduction_factor=4, brackets=1, visualizer='none', type='stopping', dist_ip_addrs=None, keep_size_ratios=False, maxt_pending=False)

Implements different variants of asynchronous Hyperband

See 'type' for the different variants. One implementation detail: when multiple brackets are used, each task is assigned to a bracket at random, with probabilities given by a softmax.
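The softmax bracket allocation mentioned above can be sketched in a few lines. The per-bracket scores here are an illustrative assumption, not the distribution AutoGluon actually uses internally:

```python
import math
import random

def sample_bracket(bracket_scores, rng=random):
    """Pick a bracket index with probability softmax(bracket_scores)."""
    m = max(bracket_scores)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in bracket_scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, cum = rng.random(), 0.0
    for i, p in enumerate(probs):                  # inverse-CDF sampling
        cum += p
        if r < cum:
            return i, probs
    return len(probs) - 1, probs

# Three brackets with illustrative preference scores:
idx, probs = sample_bracket([0.0, 1.0, 2.0])
```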

Parameters
train_fn : callable

A task launch function for training.

args : object, optional

Default arguments for launching train_fn.

resource : dict

Computation resources. For example, {'num_cpus': 2, 'num_gpus': 1}

searcher : object, optional

AutoGluon searcher. For example, autogluon.searcher.RandomSearcher

time_attr : str

A training result attribute used for comparing time. Note that you can pass in something non-temporal, such as training_epoch, as a measure of progress; the only requirement is that the attribute increases monotonically.

reward_attr : str

The training result objective value attribute. As with time_attr, this may refer to any objective value. Stopping procedures will use this attribute.

max_t : float

Maximum time units per task. Trials will be stopped after max_t time units (determined by time_attr) have passed.

grace_period : float

Only stop tasks that are at least this old in time (also called min_t). The units are the same as the attribute named by time_attr.

reduction_factor : float

Used to set the halving rate and amount. This is simply a unit-less scalar.

brackets : int

Number of brackets. Each bracket has a different grace period; all share max_t and reduction_factor. If brackets == 1, we just run successive halving; for brackets > 1, we run Hyperband.

type : str
Type of Hyperband scheduler:
stopping:

See HyperbandStopping_Manager. Tasks and config evaluations are tightly coupled. A task is stopped at a milestone if it is worse than most others; otherwise it continues. As implemented in Ray/Tune: https://ray.readthedocs.io/en/latest/tune-schedulers.html#asynchronous-hyperband

promotion:

See HyperbandPromotion_Manager. A config evaluation may be associated with multiple tasks over its lifetime. It is never terminated, but may be paused. Whenever a task becomes available, it may promote a config to the next milestone, if that config is better than most others. If no config can be promoted, a new one is chosen. This variant may benefit from pause & resume, which is not directly supported here. As proposed in this paper (termed ASHA): https://arxiv.org/abs/1810.05934

keep_size_ratios : bool

Implemented for type 'promotion' only. If True, promotions are done only if the (current estimate of the) size ratio between a rung and the next rung is 1 / reduction_factor or better. This prevents higher rungs from becoming more populated than they would be in synchronous Hyperband. A drawback is that promotions to higher rungs take longer.

maxt_pending : bool

Relevant only if a model-based searcher is used. If True, a pending config is registered at level max_t whenever a new evaluation is started. This has a direct effect on the acquisition function (for the model-based variant), which operates at level max_t; it decreases the variance of the latent process there.

dist_ip_addrs : list of str

IP addresses of remote machines.
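Together, grace_period, reduction_factor, and max_t determine the rung levels (milestones) at which configurations are compared. The helper below is hypothetical (not part of the AutoGluon API) but sketches how those levels are derived:

```python
def rung_levels(grace_period, reduction_factor, max_t):
    """Milestones grace_period * reduction_factor**k strictly below max_t."""
    rungs, level = [], grace_period
    while level < max_t:
        rungs.append(level)
        level *= reduction_factor
    return rungs

# With grace_period=1, reduction_factor=4, max_t=100:
rung_levels(1, 4, 100)   # → [1, 4, 16, 64]
```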

Examples

>>> import numpy as np
>>> import autogluon as ag
>>> 
>>> @ag.args(
...     lr=ag.space.Real(1e-3, 1e-2, log=True),
...     wd=ag.space.Real(1e-3, 1e-2))
>>> def train_fn(args, reporter):
...     print('lr: {}, wd: {}'.format(args.lr, args.wd))
...     for e in range(10):
...         dummy_accuracy = 1 - np.power(1.8, -np.random.uniform(e, 2*e))
...         reporter(epoch=e, accuracy=dummy_accuracy, lr=args.lr, wd=args.wd)
>>> scheduler = ag.scheduler.HyperbandScheduler(train_fn,
...                                             resource={'num_cpus': 2, 'num_gpus': 0},
...                                             num_trials=20,
...                                             reward_attr='accuracy',
...                                             time_attr='epoch',
...                                             grace_period=1)
>>> scheduler.run()
>>> scheduler.join_jobs()
>>> scheduler.get_training_curves(plot=True)
Attributes
num_finished_tasks

Methods

add_job(self, task, **kwargs)

Add a training task to the scheduler.

add_remote(self, ip_addrs)

Add remote nodes to the scheduler's computation resources.

add_task(self, task, **kwargs)

add_task() is now deprecated in favor of add_job().

get_best_config(self)

Get the best configuration from the finished jobs.

get_best_reward(self)

Get the best reward from the finished jobs.

get_training_curves(self[, filename, plot, …])

Get training curves.

join_jobs(self[, timeout])

Wait for all scheduled jobs to finish.

load_state_dict(self, state_dict)

Load from the saved state dict.

run(self, **kwargs)

Run multiple trials.

run_job(self, task)

Run a training task synchronously.

run_with_config(self, config)

Run with a fixed config for the final fit.

save(self[, checkpoint])

Save a checkpoint.

schedule_next(self)

Schedule the next task suggested by the searcher.

shutdown(self)

shutdown() is now deprecated in favor of autogluon.done().

state_dict(self[, destination])

Returns a dictionary containing the whole state of the scheduler.

upload_files(files, **kwargs)

Upload files to remote machines so that they are accessible by import or load.

join_tasks

add_job(self, task, **kwargs)

Add a training task to the scheduler.

Args:

task (autogluon.scheduler.Task): a new training task

Relevant entries in kwargs:
  • bracket: HB bracket to be used. Has been sampled in _promote_config

  • new_config: If True, task starts a new config eval; otherwise it promotes a config (only if type == 'promotion')

Only if new_config == False:
  • config_key: Internal key for config

  • resume_from: config promoted from this milestone

  • milestone: config promoted to this milestone (next from resume_from)

add_remote(self, ip_addrs)

Add remote nodes to the scheduler's computation resources.

add_task(self, task, **kwargs)

add_task() is now deprecated in favor of add_job().

get_best_config(self)

Get the best configuration from the finished jobs.

get_best_reward(self)

Get the best reward from the finished jobs.

get_training_curves(self, filename=None, plot=False, use_legend=True)

Get training curves.

Parameters
filename : str

plot : bool

use_legend : bool

Examples

>>> scheduler.run()
>>> scheduler.join_jobs()
>>> scheduler.get_training_curves(plot=True)
https://github.com/zhanghang1989/AutoGluonWebdata/blob/master/doc/api/autogluon.1.png?raw=true
join_jobs(self, timeout=None)

Wait for all scheduled jobs to finish.

load_state_dict(self, state_dict)

Load from the saved state dict.

Examples

>>> scheduler.load_state_dict(ag.load('checkpoint.ag'))
run(self, **kwargs)

Run multiple trials.

run_job(self, task)

Run a training task synchronously.

run_with_config(self, config)

Run with a fixed config for the final fit. It launches a single training trial with fixed values for all hyperparameters. For example, after HPO has identified the best hyperparameter values based on a hold-out dataset, one can use this function to retrain a model with the same hyperparameters on all the available labeled data (including the hold-out set). It can also return other objects or states.

save(self, checkpoint=None)

Save a checkpoint.

schedule_next(self)

Schedule the next task suggested by the searcher.

shutdown(self)

shutdown() is now deprecated in favor of autogluon.done().

state_dict(self, destination=None)

Returns a dictionary containing the whole state of the scheduler.

Examples

>>> ag.save(scheduler.state_dict(), 'checkpoint.ag')
classmethod upload_files(files, **kwargs)

Upload files to remote machines, so that they are accessible by import or load.

RLScheduler

class autogluon.scheduler.RLScheduler(train_fn, args=None, resource=None, checkpoint='./exp/checkpoint.ag', resume=False, num_trials=None, time_attr='epoch', reward_attr='accuracy', visualizer='none', controller_lr=0.001, ema_baseline_decay=0.95, controller_resource={'num_cpus': 0, 'num_gpus': 0}, controller_batch_size=1, dist_ip_addrs=[], sync=True, **kwargs)

Scheduler that uses reinforcement learning with an LSTM controller built from the provided search spaces.

Parameters
train_fn : callable

A task launch function for training. Note: please add the @ag.args decorator to the original function.

args : object, optional

Default arguments for launching train_fn.

resource : dict

Computation resources. For example, {'num_cpus': 2, 'num_gpus': 1}

searcher : object, optional

AutoGluon searcher. For example, autogluon.searcher.RandomSearcher

time_attr : str

A training result attribute used for comparing time. Note that you can pass in something non-temporal, such as training_epoch, as a measure of progress; the only requirement is that the attribute increases monotonically.

reward_attr : str

The training result objective value attribute. As with time_attr, this may refer to any objective value. Stopping procedures will use this attribute.

controller_batch_size : int

Batch size for training the controller.

dist_ip_addrs : list of str

IP addresses of remote machines.
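The controller_lr and ema_baseline_decay parameters suggest a REINFORCE-style controller in which an exponential moving average of past rewards serves as the baseline. A hypothetical sketch of that baseline update (not the actual controller code):

```python
def update_baseline(baseline, reward, decay=0.95):
    """EMA baseline: new = decay * old + (1 - decay) * reward."""
    if baseline is None:                  # the first reward initializes the baseline
        return reward
    return decay * baseline + (1 - decay) * reward

baseline = None
for reward in [0.5, 0.7, 0.9]:
    baseline = update_baseline(baseline, reward)
    advantage = reward - baseline         # would scale the policy-gradient update
```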

Examples

>>> import numpy as np
>>> import autogluon as ag
>>> 
>>> @ag.args(
...     lr=ag.space.Real(1e-3, 1e-2, log=True),
...     wd=ag.space.Real(1e-3, 1e-2))
>>> def train_fn(args, reporter):
...     print('lr: {}, wd: {}'.format(args.lr, args.wd))
...     for e in range(10):
...         dummy_accuracy = 1 - np.power(1.8, -np.random.uniform(e, 2*e))
...         reporter(epoch=e, accuracy=dummy_accuracy, lr=args.lr, wd=args.wd)
... 
>>> scheduler = ag.scheduler.RLScheduler(train_fn,
...                                      resource={'num_cpus': 2, 'num_gpus': 0},
...                                      num_trials=20,
...                                      reward_attr='accuracy',
...                                      time_attr='epoch')
>>> scheduler.run()
>>> scheduler.join_jobs()
>>> scheduler.get_training_curves(plot=True)
Attributes
num_finished_tasks

Methods

add_job(self, task, **kwargs)

Add a training task to the scheduler.

add_remote(self, ip_addrs)

Add remote nodes to the scheduler's computation resources.

add_task(self, task, **kwargs)

add_task() is now deprecated in favor of add_job().

get_best_config(self)

Get the best configuration from the finished jobs.

get_best_reward(self)

Get the best reward from the finished jobs.

get_training_curves(self[, filename, plot, …])

Get training curves.

join_jobs(self[, timeout])

Wait for all scheduled jobs to finish.

load_state_dict(self, state_dict)

Load from the saved state dict.

run(self, **kwargs)

Run multiple trials.

run_job(self, task)

Run a training task synchronously.

run_with_config(self, config)

Run with a fixed config for the final fit.

save(self[, checkpoint])

Save a checkpoint.

schedule_next(self)

Schedule the next task suggested by the searcher.

shutdown(self)

shutdown() is now deprecated in favor of autogluon.done().

state_dict(self[, destination])

Returns a dictionary containing the whole state of the scheduler.

upload_files(files, **kwargs)

Upload files to remote machines so that they are accessible by import or load.

join_tasks

sync_schedule_tasks

add_job(self, task, **kwargs)

Add a training task to the scheduler.

Args:

task (autogluon.scheduler.Task): a new training task

add_remote(self, ip_addrs)

Add remote nodes to the scheduler's computation resources.

add_task(self, task, **kwargs)

add_task() is now deprecated in favor of add_job().

get_best_config(self)

Get the best configuration from the finished jobs.

get_best_reward(self)

Get the best reward from the finished jobs.

get_training_curves(self, filename=None, plot=False, use_legend=True)

Get training curves.

Parameters
filename : str

plot : bool

use_legend : bool

Examples

>>> scheduler.run()
>>> scheduler.join_jobs()
>>> scheduler.get_training_curves(plot=True)
https://github.com/zhanghang1989/AutoGluonWebdata/blob/master/doc/api/autogluon.1.png?raw=true
join_jobs(self, timeout=None)

Wait for all scheduled jobs to finish.

load_state_dict(self, state_dict)

Load from the saved state dict.

Examples

>>> scheduler.load_state_dict(ag.load('checkpoint.ag'))
run(self, **kwargs)

Run multiple trials.

run_job(self, task)

Run a training task synchronously.

run_with_config(self, config)

Run with a fixed config for the final fit. It launches a single training trial with fixed values for all hyperparameters. For example, after HPO has identified the best hyperparameter values based on a hold-out dataset, one can use this function to retrain a model with the same hyperparameters on all the available labeled data (including the hold-out set). It can also return other objects or states.

save(self, checkpoint=None)

Save a checkpoint.

schedule_next(self)

Schedule the next task suggested by the searcher.

shutdown(self)

shutdown() is now deprecated in favor of autogluon.done().

state_dict(self, destination=None)

Returns a dictionary containing the whole state of the scheduler.

Examples

>>> ag.save(scheduler.state_dict(), 'checkpoint.ag')
classmethod upload_files(files, **kwargs)

Upload files to remote machines, so that they are accessible by import or load.

Early Stopping Managers

HyperbandStopping_Manager

Hyperband Manager

HyperbandPromotion_Manager

Hyperband Manager

HyperbandStopping_Manager

class autogluon.scheduler.HyperbandStopping_Manager(time_attr, reward_attr, max_t, grace_period, reduction_factor, brackets)

Hyperband Manager

Implements stopping rule which uses the brackets and rung levels defined in Hyperband. The overall algorithm is NOT what is published as ASHA (see HyperbandPromotion_Manager), but rather something resembling the median rule.

Args:

time_attr (str): A training result attribute used for comparing time. Note that you can pass in something non-temporal, such as training_epoch, as a measure of progress; the only requirement is that the attribute increases monotonically.

reward_attr (str): The training result objective value attribute. As with time_attr, this may refer to any objective value. Stopping procedures will use this attribute.

max_t (float): Maximum time units per task. Trials will be stopped after max_t time units (determined by time_attr) have passed.

grace_period (float): Only stop tasks that are at least this old in time. The units are the same as the attribute named by time_attr.

reduction_factor (float): Used to set the halving rate and amount. This is simply a unit-less scalar.

brackets (int): Number of brackets. Each bracket has a different halving rate, specified by the reduction factor.

Methods

on_task_add(self, task, \*\*kwargs)

Since the bracket has already been sampled in on_task_schedule, not much is done here.

on_task_report(self, task, result)

Decides whether task can continue or is to be stopped, and also whether the searcher should be updated (iff milestone is reached).

on_task_complete

on_task_remove

on_task_schedule

on_task_add(self, task, **kwargs)

Since the bracket has already been sampled in on_task_schedule, not much is done here. We return the list of milestones for this bracket in reverse (decreasing) order. The first entry is max_t, even if it is not a milestone in the bracket. This list contains the resource levels the task would reach if it ran to max_t without being stopped.

Parameters

task – Only task.task_id is used

Returns

See above

on_task_report(self, task, result)

Decides whether the task can continue or is to be stopped, and also whether the searcher should be updated (iff a milestone is reached). If update_searcher = True and action = True, next_milestone is the next milestone for the task (or None if there is none).

Parameters
  • task – Only task.task_id is used

  • result – Current reported results from task

Returns

action, update_searcher, next_milestone
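A minimal sketch of a median-style stopping decision consistent with the description above (illustrative only; the real HyperbandStopping_Manager tracks per-bracket rungs and more state):

```python
import statistics

def stopping_decision(rewards_at_milestone, reward):
    """Decide whether a task reaching a milestone may continue.

    rewards_at_milestone: rewards other tasks reported at this milestone.
    Returns (action, update_searcher): action=True means the task continues;
    update_searcher is True because a milestone was reached.
    """
    if not rewards_at_milestone:          # nothing to compare against yet
        return True, True
    action = reward >= statistics.median(rewards_at_milestone)
    return action, True
```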

HyperbandPromotion_Manager

class autogluon.scheduler.HyperbandPromotion_Manager(time_attr, reward_attr, max_t, grace_period, reduction_factor, brackets, keep_size_ratios)

Hyperband Manager

Implements both the promotion and stopping logic for an asynchronous variant of Hyperband, known as ASHA: https://arxiv.org/abs/1810.05934

In ASHA, configs sit paused at milestones (rung levels) in their bracket, until they get promoted, which means that a free task picks up their evaluation until the next milestone.

We do not directly support pause & resume here, so in general the evaluation for a promoted config is started from scratch. However, see Hyperband_Scheduler.add_task, task.args['resume_from']: the evaluation function receives information about the promotion, so pause & resume can be implemented there.

Args:

time_attr (str): A training result attribute used for comparing time. Note that you can pass in something non-temporal, such as training_epoch, as a measure of progress; the only requirement is that the attribute increases monotonically.

reward_attr (str): The training result objective value attribute. As with time_attr, this may refer to any objective value. Stopping procedures will use this attribute.

max_t (float): Maximum time units per task. Trials will be stopped after max_t time units (determined by time_attr) have passed.

grace_period (float): Only stop tasks that are at least this old in time. The units are the same as the attribute named by time_attr.

reduction_factor (float): Used to set the halving rate and amount. This is simply a unit-less scalar.

brackets (int): Number of brackets. Each bracket has a different grace period; all share max_t and reduction_factor. If brackets == 1, we just run successive halving; for brackets > 1, we run Hyperband.

keep_size_ratios (bool): If True, promotions are done only if the (current estimate of the) size ratio between a rung and the next rung is 1 / reduction_factor or better. This prevents higher rungs from becoming more populated than they would be in synchronous Hyperband. A drawback is that promotions to higher rungs take longer.
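An illustrative ASHA-style promotion check consistent with the description above (not the actual HyperbandPromotion_Manager implementation):

```python
def can_promote(rung_rewards, candidate_reward, next_rung_size,
                reduction_factor=4, keep_size_ratios=False):
    """Promote a paused config if its reward is in the top 1/reduction_factor
    of rewards recorded at its rung; with keep_size_ratios, also require the
    next rung to hold at most len(rung_rewards) / reduction_factor configs.
    """
    if keep_size_ratios and next_rung_size * reduction_factor > len(rung_rewards):
        return False                      # next rung already too populated
    k = max(1, len(rung_rewards) // reduction_factor)
    threshold = sorted(rung_rewards, reverse=True)[k - 1]
    return candidate_reward >= threshold
```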

Methods

on_task_add

on_task_complete

on_task_remove

on_task_report

on_task_schedule