Demo

Last updated: May 2nd, 2020

Quick Note

The four sections are not strictly dependent on one another, but I recommend reading them in order: together they show not only how to train a model but also how to build your own custom training procedure.

Building Your Model

1. Import

Importing batch_norm also recursively imports all of its dependencies, which reduces the need for many import statements.

from batch_norm import *

2. Layers

Building a model is incredibly simple. The comments in the snippet give the output dimensions of each layer for clarity. This model is used again in the last two sections of this demo.

model = Sequential(Reshape((1, 28, 28)),
                   Conv(c_in=1, c_out=4, k_s=5, stride=2, pad=1), # 4, 13, 13
                   AvgPool(k_s=2, pad=0), # 4, 12, 12
                   BatchNorm(4),
                   Conv(c_in=4, c_out=16, stride=2, leak=1.), # 16, 5, 5
                   BatchNorm(16),
                   Flatten(),
                   Linear(400, 64), # 16 * 5 * 5 -> 400
                   ReLU(),
                   Linear(64, 10, True))
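
The dimension comments above follow from the usual sliding-window size formula, floor((n + 2*pad - k) / stride) + 1. A minimal sketch of that arithmetic is shown here. `out_size` is a hypothetical helper, not part of the library, and the stride of 1 for AvgPool and the kernel size of 3 with no padding for the second Conv are assumptions read off the model display, not confirmed defaults.

```python
def out_size(n, k, stride=1, pad=0):
    """Spatial output size of a sliding-window layer (convolution or pooling)."""
    return (n + 2 * pad - k) // stride + 1

n = 28                                      # MNIST images are 28 x 28
conv1 = out_size(n, k=5, stride=2, pad=1)   # first Conv
pool = out_size(conv1, k=2, stride=1)       # AvgPool (stride 1 is an assumption)
conv2 = out_size(pool, k=3, stride=2)       # second Conv (k=3, pad=0 are assumptions)
flat = 16 * conv2 * conv2                   # input size of the first Linear layer
print(conv1, pool, conv2, flat)
```

Under those assumptions the sizes come out as 13, 12, 5 and 400, matching the comments in the snippet.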

3. Display Model

Custom __repr__ methods let classes be displayed neatly. This also works for nested models, as shown in this notebook.

model

(Model)
    Reshape(1, 28, 28)
    Conv(1, 4, 5, 2)
    AvgPool(2, 1)
    BatchNorm()
    Conv(4, 16, 3, 2)
    BatchNorm()
    Flatten()
    Linear(400, 64)
    ReLU()
    Linear(64, 10)
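
One way such a display could be produced is sketched below. `Layer` and `Stack` are hypothetical stand-ins written for this example, not the library's classes. The point is only that a container's __repr__ can indent the repr of each child, which makes nesting work without extra effort.

```python
import textwrap

class Layer:
    """Hypothetical leaf layer that prints as Name(arg1, arg2, ...)."""
    def __init__(self, name, *args):
        self.name, self.args = name, args

    def __repr__(self):
        return f"{self.name}({', '.join(map(str, self.args))})"

class Stack:
    """Hypothetical container that indents each child's repr by four spaces."""
    def __init__(self, *layers):
        self.layers = layers

    def __repr__(self):
        body = "\n".join(repr(layer) for layer in self.layers)
        return "(Model)\n" + textwrap.indent(body, "    ")

inner = Stack(Layer("Linear", 400, 64), Layer("ReLU"))
outer = Stack(Layer("Flatten"), inner, Layer("Linear", 64, 10))
text = repr(outer)
print(text)
```

Because `textwrap.indent` is applied to the already-formatted text of each child, an inner `Stack` ends up indented one level deeper than its siblings.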

4. Display Parameters

Printing parameter information is also useful for debugging. Here is how to print all model parameters.

for p in model.parameters():
    print(p)

shape: (4, 1, 5, 5), grad: True
shape: (4,), grad: True
shape: (1, 4, 1, 1), grad: True
shape: (1, 4, 1, 1), grad: True
shape: (16, 4, 3, 3), grad: True
shape: (16,), grad: True
shape: (1, 16, 1, 1), grad: True
shape: (1, 16, 1, 1), grad: True
shape: (400, 64), grad: True
shape: (64,), grad: True
shape: (64, 10), grad: True
shape: (10,), grad: True
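
The shapes printed above are enough to count the trainable values in the model. The sketch below multiplies out each shape with the standard library. The list is copied from the output, nothing here calls the library itself, and the per-layer grouping in the comments is an assumption based on the order of the output.

```python
from math import prod

shapes = [
    (4, 1, 5, 5), (4,),            # first Conv (assumed: weights, bias)
    (1, 4, 1, 1), (1, 4, 1, 1),    # first BatchNorm (two per-channel parameters)
    (16, 4, 3, 3), (16,),          # second Conv (assumed: weights, bias)
    (1, 16, 1, 1), (1, 16, 1, 1),  # second BatchNorm
    (400, 64), (64,),              # first Linear (assumed: weights, bias)
    (64, 10), (10,),               # second Linear (assumed: weights, bias)
]
total = sum(prod(shape) for shape in shapes)
print(total)  # 27050
```

Almost all of these values sit in the (400, 64) matrix, which alone accounts for 25,600 of them.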

To inspect only a selected layer, index into model.layers:

print(f'layer2: {model.layers[1]}\n')
for p in model.layers[1].parameters():
    print(p)

layer2:     Conv(1, 4, 5, 2)

shape: (4, 1, 5, 5), grad: True
shape: (4,), grad: True

Making a Callback

1. Import

from callback import *

2. Making the Callback

I use LearningRateSearch as the example because it is needed in the next section of this demo. Before each training batch, the callback sets a new candidate learning rate. After each step, it records the candidate with the lowest loss so far, and it stops training once the loss exceeds 10x the best loss or enough candidates have been tried.

class LearningRateSearch(Callback):
    def __init__(self, max_iter=1000, min_lr=1e-4, max_lr=1):
        self.max_iter = max_iter # max number of candidate learning rates to try
        self.min_lr = min_lr  # lowest/starting candidate learning rate
        self.max_lr = max_lr  # highest candidate learning rate
        self.cur_lr = min_lr  # current candidate learning rate holder 
        self.best_lr = min_lr # recorded learning rate with the lowest loss
        self.best_loss = float('inf') # lowest loss so far
        
    def before_batch(self): 
        # only search while the model is in training mode
        if not self.model.training: return
        # calculate new candidate learning rate
        position = self.iters_count / self.iters
        self.cur_lr = self.min_lr * (self.max_lr/self.min_lr)**position
        # set learning rate in optimizer
        self.optimizer.hyper_params['learning_rate'] = self.cur_lr
        
    def after_step(self):
        # stop when either tried enough times or loss starts increasing
        if self.iters_count >= self.max_iter or self.loss > self.best_loss*10:
            raise CancelTrainException()
        # update best loss and best learning rate
        if self.loss < self.best_loss:
            self.best_loss = self.loss
            self.best_lr = self.cur_lr
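
The interpolation in before_batch is geometric: the candidate moves from min_lr to max_lr by equal ratios, not by equal steps. Here is a small stand-alone sketch of just that formula. `candidate_lr` is a hypothetical helper name, and the bounds 1e-5 and 1e-2 are the ones used in the next section.

```python
import math

def candidate_lr(position, min_lr, max_lr):
    """Geometric interpolation between min_lr (position 0) and max_lr (position 1)."""
    return min_lr * (max_lr / min_lr) ** position

start = candidate_lr(0.0, 1e-5, 1e-2)
middle = candidate_lr(0.5, 1e-5, 1e-2)
end = candidate_lr(1.0, 1e-5, 1e-2)
print(start, middle, end)
```

The midpoint is the geometric mean of the two bounds, about 3.16e-4, so half of the search is spent below that value. That is useful when good learning rates span several orders of magnitude.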

Learning Rate Search

1. Import

from stateful_optim import *

2. Data Bunch, Loss Function

Let's retrieve the data bunch and loss function each using just one line. Since we are training on MNIST handwritten digits, cross entropy is an appropriate loss function to use.

data_bunch = get_data_bunch(*get_mnist_data(), batch_size=64)
loss_fn = CrossEntropy()
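
For reference, cross entropy on a single example is the negative log of the softmax probability assigned to the correct class. The sketch below is a plain-Python illustration of that definition, not the library's CrossEntropy class. `cross_entropy` is a hypothetical helper.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class for one example."""
    peak = max(logits)  # subtract the max before exponentiating, for stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

uniform = cross_entropy([0.0] * 10, target=3)             # no preference among 10 digits
confident = cross_entropy([0.0] * 9 + [20.0], target=9)   # strongly favors the right digit
print(uniform, confident)
```

With ten classes and equal logits the loss is ln(10), about 2.30, which is a handy sanity check for an untrained digit classifier when the loss is reported per example.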

3. Model, Adam, Callbacks, Learner

model: grab the model from the first section of the demo.
optimizer: use the adam_opt util function to grab an Adam optimizer.
callbacks: use the LearningRateSearch callback built in the last section, plus Recorder to record the history of hyperparameter values.

model = get_conv_final_model(data_bunch)
optimizer = adam_opt(model, learning_rate=1e-3, weight_decay=1e-4)
callbacks = [LearningRateSearch(min_lr=1e-5, max_lr=1e-2), Recorder()]

For model training, the best practice is to throw all components from above into a learner class, which is able to interact with callbacks in various stages of training.

By printing out the learner class, again with custom __repr__, we can view details on the data bunch, model architecture, loss function, optimizer steppers, and callbacks.

learner = Learner(data_bunch, model, loss_fn, optimizer, callbacks)
print(learner)

(DataBunch) 
    (DataLoader) 
        (Dataset) x: (50000, 784), y: (50000,)
        (Sampler) total: 50000, batch_size: 64, shuffle: True
    (DataLoader) 
        (Dataset) x: (10000, 784), y: (10000,)
        (Sampler) total: 10000, batch_size: 128, shuffle: False
(Model)
    Reshape(1, 28, 28)
    Conv(1, 4, 5, 2)
    AvgPool(2, 1)
    BatchNorm()
    Conv(4, 16, 3, 2)
    BatchNorm()
    Flatten()
    Linear(400, 64)
    ReLU()
    Linear(64, 10)
(CrossEntropy)
(StatefulOpt) steppers: ['adam', 'l2_reg'], stats: ['ExpWeightedGrad', 'ExpWeightedSqrGrad', 'StepCount']
(Callbacks) ['TrainEval', 'LearningRateSearch', 'Recorder']

4. One Epoch Fit

To find our learning rate, simply run a one-epoch fit. Recall from Making a Callback that LearningRateSearch stops training early, either when the loss exceeds 10x the best loss seen so far or when it has tried max_iter candidates.

learner.fit(1)
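
The mechanism behind the early stop can be sketched without the library: a callback raises an exception, and the training loop catches it and returns. Everything below (`CancelTrain`, `fit`, the made-up loss sequence) is hypothetical and only mirrors the stopping rule in LearningRateSearch.

```python
class CancelTrain(Exception):
    """Hypothetical stand-in for the library's CancelTrainException."""

def fit(losses, max_iter=1000):
    """Walk through per-batch losses, stopping early the way the callback does."""
    best_loss, steps = float("inf"), 0
    try:
        for loss in losses:
            steps += 1
            # same rule as after_step: enough tries, or loss above 10x the best
            if steps >= max_iter or loss > best_loss * 10:
                raise CancelTrain()
            best_loss = min(best_loss, loss)
    except CancelTrain:
        pass  # swallow the signal so fit() returns normally
    return steps, best_loss

steps, best = fit([2.3, 1.9, 1.2, 0.8, 0.9, 9.5, 0.7])
print(steps, best)  # 6 0.8
```

The sixth loss (9.5) is more than ten times the best loss (0.8), so the loop stops there and the last value is never visited.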

5. Display Loss vs. Learning Rate

By passing the Recorder callback into the util function plot_lr_loss, we can plot loss against learning rate.

plot_lr_loss(learner.callbacks[2])

Lastly, the LearningRateSearch callback also kept track of the best learning rate candidate for our use in the final section.

lr = learner.callbacks[1].best_lr
print(f'learning rate found: {lr}')

learning rate found: 0.004283273648329838

Model Training

1. Import

As in the previous sections, import the modules, then grab the data bunch and loss function.

from stateful_optim import *

2. Data Bunch, Loss Function

data_bunch = get_data_bunch(*get_mnist_data(), batch_size=64)
loss_fn = CrossEntropy()

3. Cosine Parameter Scheduling

New callback alert! Please refer to the code documentation to familiarize yourself with the ParamScheduler callback. Here we build a custom cosine schedule for the learning rate, repeated each epoch, based on the learning rate found in the last section.

schedule = combine_schedules([0.4, 0.6], one_cycle_cos(lr/3, lr*3, lr/3))
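
As a rough illustration of what such a schedule computes, here is a self-contained sketch. `cos_anneal` and `combine` are hypothetical helpers written for this example. They assume that the schedule warms up from lr/3 to lr*3 over the first 40% of the epoch and anneals back to lr/3 over the remaining 60%, which is one reading of the call above and may differ from the library's actual behavior. The value lr = 0.004 is a rounded placeholder.

```python
import math

def cos_anneal(start, end):
    """Return f(pos) that moves from start to end along half a cosine, pos in [0, 1]."""
    def schedule(pos):
        return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2
    return schedule

def combine(fractions, schedules):
    """Chain schedules so that each one covers its fraction of the interval [0, 1]."""
    bounds = [0.0]
    for fraction in fractions:
        bounds.append(bounds[-1] + fraction)

    def schedule(pos):
        for i, phase in enumerate(schedules):
            last = i == len(schedules) - 1
            if pos <= bounds[i + 1] or last:
                # rescale the global position to this phase's own [0, 1] range
                local = (pos - bounds[i]) / fractions[i]
                return phase(min(max(local, 0.0), 1.0))
    return schedule

lr = 0.004  # placeholder, roughly the value found by the search
schedule = combine([0.4, 0.6], [cos_anneal(lr / 3, lr * 3), cos_anneal(lr * 3, lr / 3)])
values = [schedule(pos) for pos in (0.0, 0.2, 0.4, 1.0)]
print(values)
```

In this sketch the learning rate starts at lr/3, peaks at lr*3 at position 0.4, and returns to lr/3 at the end of the epoch.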

4. Model, Adam, Callbacks, Learner

As before, create the model, optimizer, and callbacks. Notice that LearningRateSearch is no longer needed and that ParamScheduler now varies the learning rate over each epoch. I also added StatsLogging to print the loss and accuracy per epoch.

model = get_conv_final_model(data_bunch)
optimizer = adam_opt(model, learning_rate=lr, weight_decay=1e-4)
callbacks = [ParamScheduler('learning_rate', schedule), StatsLogging(), Recorder()]

learner = Learner(data_bunch, model, loss_fn, optimizer, callbacks)
print(learner)

(DataBunch) 
    (DataLoader) 
        (Dataset) x: (50000, 784), y: (50000,)
        (Sampler) total: 50000, batch_size: 64, shuffle: True
    (DataLoader) 
        (Dataset) x: (10000, 784), y: (10000,)
        (Sampler) total: 10000, batch_size: 128, shuffle: False
(Model)
    Reshape(1, 28, 28)
    Conv(1, 4, 5, 2)
    AvgPool(2, 1)
    BatchNorm()
    Conv(4, 16, 3, 2)
    BatchNorm()
    Flatten()
    Linear(400, 64)
    ReLU()
    Linear(64, 10)
(CrossEntropy)
(StatefulOpt) steppers: ['adam', 'l2_reg'], stats: ['ExpWeightedGrad', 'ExpWeightedSqrGrad', 'StepCount']
(Callbacks) ['TrainEval', 'ParamScheduler', 'StatsLogging', 'Recorder']

5. Train Model

Training is as simple as calling the fit method with the number of epochs. As shown below, the validation accuracy rises to 97.3% in just 3 epochs.

learner.fit(3)

Epoch - 1
train metrics - [5.624208450317383e-06, 0.89082]
valid metrics - [2.2692584991455077e-05, 0.9622]

Epoch - 2
train metrics - [6.513986587524414e-06, 0.95746]
valid metrics - [2.0359230041503905e-05, 0.9706]

Epoch - 3
train metrics - [6.942987442016601e-06, 0.96688]
valid metrics - [1.5890932083129882e-05, 0.9731]

6. Loss and Learning Rate

Lastly, plot the loss and learning rate recorded by the Recorder callback.

learner.callbacks[3].plot_losses()

As shown below, the learning rate values make a cosine-ish cycle each epoch, showing that our ParamScheduler callback is working properly.

learner.callbacks[3].plot_parameter('learning_rate')