Image Classification STM32 Model Training

This README explains how to train an image classification model from scratch or with transfer learning, using a custom dataset. As an example, we demonstrate the workflow on the tf_flowers classification dataset.

1. Prepare the dataset

After downloading and extracting the dataset files, the dataset directory tree should look as below:

dataset_root_directory/
   class_a/
      a_image_1.jpg
      a_image_2.jpg
   class_b/
      b_image_1.jpg
      b_image_2.jpg

The names of the subdirectories under the dataset root directory are the names of the classes.

As an example, the directory tree of the Flowers dataset is shown below:

flowers/
   daisy/
   dandelion/
   roses/
   sunflowers/
   tulips/

No other dataset formats are supported, with the exception of the CIFAR-10/CIFAR-100 datasets, for which the official batch format is supported.

2. Create your training configuration file
    2.1 Overview

    All the available services, such as model training, are driven by a configuration file written in the YAML language.

    For training, the configuration file should include at least the following sections:

    • general, describes your project, including project name, directory where to save models, etc.
    • operation_mode, describes the service or chained services to be used.
    • dataset, describes the dataset you are using, including directory paths, class names, etc.
    • preprocessing, specifies the methods you want to use for rescaling and resizing the images.
    • training, specifies your training setup, including batch size, number of epochs, optimizer, callbacks, etc.
    • mlflow, specifies the folder to save MLFlow logs.
    • hydra, specifies the folder to save Hydra logs.

    This tutorial only describes the settings needed to train a model. The first part covers the basic settings. At the end of this README, you can also find more advanced settings and the supported callbacks.

    2.2 General settings

    The first section of the configuration file is the general section that provides information about your project.

    general:
       project_name: my_project
       logs_dir: logs
       saved_models_dir: saved_models
       deterministic_ops: True

    If you want your experiments to be fully reproducible, set the deterministic_ops attribute to True. Enabling deterministic_ops restricts TensorFlow to deterministic operations on the device, which may lead to a drop in training performance. Note that not all operations in the used version of TensorFlow can be computed deterministically; if your case involves any such operation, a warning message will be displayed and the attribute will be ignored.

    The logs_dir attribute is the name of the directory where the MLFlow and TensorBoard files are saved. The saved_models_dir attribute is the name of the directory where trained models are saved. These two directories are located under the top-level Hydra directory (see section 2.8 for Hydra information).

    2.3 Dataset specification

    Information about the dataset you want to use is provided in the dataset section of the configuration file, as shown in the YAML code below.

    dataset:
       name: flowers
       class_names: [daisy, dandelion, roses, sunflowers, tulips]
       training_path: ../datasets/flower_photos
       validation_path:
       validation_split: 0.15
       test_path:

    The state machine below describes the rules to follow when handling dataset paths for the training.

    (Diagram: state machine describing the dataset path handling rules for training)

    In this example, no validation set path is provided, so the available data under the training_path directory is split in two to create a training set and a validation set. By default, 80% of the data is used for the training set and the remaining 20% is used for the validation set. If you want to use a different split ratio, you need to specify in validation_split the ratio to be used for the validation set (value between 0 and 1).

    In this example, no test set path is provided to evaluate the model accuracy after training and quantization. Therefore, the validation set is used as the test set.
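
    If you have a separate test set, you can provide its path explicitly. Below is a minimal illustration that reuses the example above with a placeholder test set directory:

    dataset:
       name: flowers
       class_names: [daisy, dandelion, roses, sunflowers, tulips]
       training_path: ../datasets/flower_photos
       validation_split: 0.15
       test_path: <test-set-root-directory>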

    2.4 Dataset preprocessing

    The images from the dataset need to be preprocessed before they are presented to the network. This includes rescaling and resizing, as illustrated in the YAML code below.

    preprocessing:
       rescaling: {scale: 1/127.5, offset: -1}
       resizing: {interpolation: nearest, aspect_ratio: "fit"}
       color_mode: rgb

    The pixels of the input images are in the interval [0, 255], i.e., UINT8. If you set scale to 1/255 and offset to 0, they will be rescaled to the interval [0.0, 1.0]. If you set scale to 1/127.5 and offset to -1, they will be rescaled to the interval [-1.0, 1.0].
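
    As an illustration, keeping the same resizing and color mode as in the example above, the following settings would rescale the pixels to [0.0, 1.0]:

    preprocessing:
       rescaling: {scale: 1/255, offset: 0}
       resizing: {interpolation: nearest, aspect_ratio: "fit"}
       color_mode: rgb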

    The resizing attribute specifies the image resizing methods you want to use:

    • The value of interpolation must be one of {"bilinear", "nearest", "bicubic", "area", "lanczos3", "lanczos5", "gaussian", "mitchellcubic"}.
    • The value of aspect_ratio must be either "fit" or "crop". If you set it to "fit", the resized images will be distorted if their original aspect ratio is not the same as the resizing size. If you set it to "crop", images will be cropped as necessary to preserve the aspect ratio.

    The color_mode attribute must be one of "grayscale", "rgb" or "rgba".

    2.5 Data augmentation

    Data augmentation is an effective technique to reduce model overfitting when the dataset is too small or the classification problem is too easy for the model.

    The data augmentation functions to apply to the input images are specified in the data_augmentation section of the configuration file, as illustrated in the YAML code below.

    data_augmentation:
      random_contrast:
        factor: 0.4
      random_brightness:
        factor: 0.2
      random_flip:
        mode: horizontal_and_vertical
      random_translation:
        width_factor: 0.2
        height_factor: 0.2
      random_rotation:
        factor: 0.15
      random_zoom:
        width_factor: 0.25
        height_factor: 0.25

    The data augmentation functions with their parameter settings are applied to the input images in their order of appearance in the configuration file. Refer to the data augmentation documentation README.md for more information about the available functions and their arguments.

    A script called test_data_augment.py is available in the data_augmentation directory. This script reads your configuration file, picks some images from the dataset, applies the data augmentation functions you specified to the images, and displays before/after images side by side. We strongly encourage you to run this script to tune your data augmentation settings and make sure that they are neither too aggressive nor too weak.

    2.6 Loading a model

    Information about the model you want to train is provided in the training section of the configuration file.

    The YAML code below shows how you can use a MobileNet V2 model from the Model Zoo.

    training:
       model:
          name: mobilenet
          version: v2
          alpha: 0.35
          pretrained_weights: imagenet
          input_shape: (224, 224, 3)

    The pretrained_weights attribute is set to "imagenet", which indicates that you want to load the weights pretrained on the ImageNet dataset and do a transfer learning type of training.

    If pretrained_weights was set to "None", no pretrained weights would be loaded in the model and the training would start from scratch, i.e., from randomly initialized weights.
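
    For example, the following (illustrative) settings would train the same MobileNet V2 architecture from scratch:

    training:
       model:
          name: mobilenet
          version: v2
          alpha: 0.35
          pretrained_weights: None
          input_shape: (224, 224, 3)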

    2.7 Training setup

    The training setup is described in the training section of the configuration file, as illustrated in the example below.

    training:
       batch_size: 64
       epochs: 400
       dropout: 0.3
       optimizer: 
          Adam: {learning_rate: 0.001}
       callbacks:
          ReduceLROnPlateau:
             monitor: val_accuracy
             factor: 0.5
             patience: 10
          EarlyStopping:
             monitor: val_accuracy
             patience: 60

    The batch_size, epochs, and optimizer attributes are mandatory. All the others are optional.

    The dropout attribute only makes sense if your model includes a dropout layer.

    All the TensorFlow optimizers can be used in the optimizer subsection. All the TensorFlow callbacks can be used in the callbacks subsection, except the ModelCheckpoint and TensorBoard callbacks, which are built in and cannot be redefined.
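
    As an illustration, assuming the attributes are passed to the corresponding TensorFlow classes as keyword arguments (as in the examples above), an SGD optimizer with momentum could be specified as follows:

    training:
       batch_size: 64
       epochs: 400
       optimizer:
          SGD: {learning_rate: 0.01, momentum: 0.9}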

    A number of learning rate schedulers are provided with the Model Zoo as custom callbacks. The YAML code below shows how to use the LRCosineDecay scheduler that implements a cosine decay function.

    training:
       batch_size: 64
       epochs: 400
       optimizer: Adam
       callbacks:
          LRCosineDecay:
             initial_learning_rate: 0.01
             decay_steps: 170
             alpha: 0.001

    A variety of learning rate schedulers are provided with the Model Zoo. If you want to use one of them, just include it in the callbacks subsection. Refer to the learning rate schedulers README for a description of the available callbacks and learning rate plotting utility.

    2.8 Hydra and MLflow settings

    The mlflow and hydra sections must always be present in the YAML configuration file. The hydra section can be used to specify the name of the directory where experiment directories are saved and/or the pattern used to name experiment directories. With the YAML code below, every time you run the Model Zoo, an experiment directory is created that contains all the directories and files created during the run. The names of experiment directories are all unique as they are based on the date and time of the run.

    hydra:
       run:
          dir: ./experiments_outputs/${now:%Y_%m_%d_%H_%M_%S}

    The mlflow section is used to specify the location and name of the directory where MLflow files are saved, as shown below:

    mlflow:
       uri: ./experiments_outputs/mlruns

3. Train your model

To launch your model training using a real dataset, run the following command from the src/ folder:

python stm32ai_main.py --config-path ./config_file_examples/ --config-name training_config.yaml

The trained .h5 model can be found in the corresponding experiments_outputs/ folder.

4. Visualize training results
    4.1 Saved results

    All training and evaluation artifacts are saved under the current experiment output directory "experiments_outputs/{run_time}".

    For example, you can retrieve the plots of the accuracy/loss curves from that directory.

    4.2 Run TensorBoard

    To visualize the training curves logged by TensorBoard, go to "experiments_outputs/{run_time}" and run the following command:

    tensorboard --logdir logs

    Then open the URL http://localhost:6006 in your browser.

    4.3 Run MLFlow

    MLFlow is an API for logging parameters, code versions, metrics, and artifacts while running machine learning code, and for visualizing the results. To view and compare the results of multiple trainings, simply access the MLFlow web app by running the following command:

    mlflow ui

    Then open the displayed URL in your browser.

5. Advanced settings
    5.1 Training your own model

    You may want to train your own model rather than a model from the Model Zoo.

    This can be done using the model_path attribute of the general: section to provide the path to the model file to use, as illustrated in the example below.

    general:
       model_path: <path-to-a-Keras-model-file>    # Path to the model file to use for training
    
    operation_mode: training
    
    dataset:
       training_path: <training-set-root-directory>    # Path to the root directory of the training set.
       validation_split: 0.2                           # Use 20% of the training set to create the validation set.
       test_path: <test-set-root-directory>            # Path to the root directory of the test set.
    
    training:
       batch_size: 64
       epochs: 150
       dropout: 0.3
       frozen_layers: (0, -1)
       optimizer:
          Adam:                               
             learning_rate: 0.001
       callbacks:                    
          ReduceLROnPlateau:
             monitor: val_loss
             factor: 0.1
             patience: 10

    The model file must be a Keras model file with a '.h5' filename extension.

    The model: subsection of the training: section is not present as we are not training a model from the Model Zoo. An error will be thrown if it is present when model_path is set.

    About the model loaded from the file:

    • If some layers are frozen in the model, they will be reset to trainable before training. You can use the frozen_layers attribute if you want to freeze these layers (or different ones).
    • If you set the dropout attribute but the model does not include a dropout layer, an error will be thrown. Conversely, an error will also occur if the model includes a dropout layer but the dropout attribute is not set.
    • If the model was trained before, the state of the optimizer won't be preserved as the model is compiled before training.
    5.2 Resuming a training

    You may want to resume a training that you interrupted or that crashed.

    When running a training, the model is saved at the end of each epoch in the 'saved_models' directory under the experiment directory (see the saved_models_dir attribute in section 2.2). The model file is named 'last_augmented_model.h5'.

    To resume a training, you first need to choose the experiment you want to restart from. Then, set the resume_training_from attribute of the 'training' section to the path to the 'last_augmented_model.h5' file of the experiment. An example is shown below.

    operation_mode: training
    
    dataset:
       training_path: <training-set-root-directory>
       validation_split: 0.2
       test_path: <test-set-root-directory>
    
    training:
       batch_size: 64
       epochs: 150      # The number of epochs can be changed for resuming.
       dropout: 0.3 
       frozen_layers: (0:1)
       optimizer:
          Adam:
             learning_rate: 0.001
       callbacks:         
          ReduceLROnPlateau:
             monitor: val_accuracy
             factor: 0.1
             patience: 10
       resume_training_from: <path to the 'last_augmented_model.h5' file of the interrupted/crashed training>

    When setting the resume_training_from attribute, the model: subsection of the training: section and the model_path attribute of the general: section should not be used. An error will be thrown if you do so.

    The configuration file of the training you are resuming should be reused as is; the only exception is the number of epochs. If you make changes to the dropout rate, the frozen layers, or the optimizer, they will be ignored and the original settings will be kept. Changes made to the batch size or the callbacks section will be taken into account, but they may lead to unexpected results.


    The state of the optimizer is saved in the 'last_augmented_model.h5' file, so you will restart from where you left off. The model is called 'augmented' because it includes the rescaling and data augmentation preprocessing layers.

    There are two other model files in the 'saved_models' directory. The one called 'best_augmented_model.h5' is the best augmented model obtained since the beginning of the training. The other one, called 'best_model.h5', is the same model as 'best_augmented_model.h5' but without the preprocessing layers; it cannot be used to resume a training, and an error will be thrown if you attempt to do so.
    
    
    5.3 Transfer learning
    
    Transfer learning is a popular training methodology used to take advantage of models trained on large datasets, such as ImageNet. The Model Zoo features available to implement transfer learning are presented in the next sections.
    
    5.3.1 Using ImageNet pretrained weights
    
    Weights pretrained on the ImageNet dataset are available for the MobileNet-V1 and MobileNet-V2 models.
    
    If you want to use these pretrained weights, you need to add the pretrained_weights attribute to the model: subsection of the training: section of the configuration file and set it to "imagenet", as shown in the YAML code below.
    
    training:
       model:
          name: mobilenet
          version: v2
          alpha: 0.35
          input_shape: (224, 224, 3)
          pretrained_weights: imagenet

    By default, no pretrained weights are loaded. If you want to make it explicit that you are not using the ImageNet weights, you may add the pretrained_weights attribute and leave it unset or set to None.

    5.3.2 Using weights from another model

    When you train a model, you may want to take advantage of the weights from another model that was previously trained on another, larger dataset.

    Assume for example that you are training a MobileNet-V2 model on the Flowers dataset and you want to take advantage of the weights of another MobileNet-V2 model that you previously trained on the Plant Leaf Diseases dataset (for illustration purposes, this may not give valuable results). This can be specified using the pretrained_model_path attribute in the model: subsection as shown in the YAML code below.

    training:
       model:
          name: mobilenet
          version: v2
          alpha: 0.35
          input_shape: (224, 224, 3)
          pretrained_model_path: ../pretrained_models/mobilenetv2/ST_pretrainedmodel_public_dataset/plant-village/mobilenet_v2_0.35_224_fft/mobilenet_v2_0.35_224_fft.h5

    Weights are transferred between backbone layers (all layers but the classifier). The two models must have the same backbone: for example, you cannot transfer weights between two MobileNet-V2 models that have different alpha values, or from an FD-MobileNet model to a ResNet model.

    This weights transfer feature is available for all the models from the Model Zoo. Note that for MobileNet models, the pretrained_weights and pretrained_model_path attributes are mutually exclusive. If both are set, an error will be thrown.

    5.3.3 Freezing layers

    Once the pretrained weights have been loaded in the model to train, some layers are often frozen, that is, made non-trainable, before training the model. A commonly used approach is to freeze all the layers but the last one, which is the classifier.

    By default, all the layers are trainable. If you want to freeze some layers, then you need to add the optional frozen_layers attribute to the training: section of your configuration file. The indices of the layers to freeze are specified using the Python syntax for indexing into lists and arrays. Below are some examples.

    training:
       frozen_layers: (0:-1)    # Freeze all the layers but the last one
       
    training:
       frozen_layers: (10:120)   # Freeze layers with indices from 10 to 119
    
    training:
       frozen_layers: (150:)     # Freeze layers from index 150 to the last layer
    
    training:
       frozen_layers: (8, 110:121, -1)  # Freeze layers with index 8, 110 to 120, and the last layer

    Note that if you want to make it explicit that all the layers are trainable, you may add the frozen_layers attribute and leave it unset or set to None.

    5.3.4 Multi-step training

    In some cases, better results may be obtained using multiple training steps.

    The first training step is generally done with only a few trainable layers, typically the classifier only. Then, more and more layers are made trainable in the subsequent training steps. Some other parameters may also be adjusted from one step to another, in particular the learning rate. Therefore, a different configuration file is needed at each step.

    The model_path attribute of the general: section and the trained_model_path attribute of the training: section are available to implement such a multi-step training. At a given step, model_path is used to load the model that was trained at the previous step and trained_model_path is used to save the model at the end of the step.

    Assume for example that you are doing a 3-step training. Then, your 3 configuration files would look as shown below.

    Training step #1 configuration file (initial training):

    training:
       model:
          name: mobilenet
          version: v2
          alpha: 0.35
          input_shape: (128, 128, 3)
          pretrained_weights: imagenet
       frozen_layers: (0:-1)
       trained_model_path: ${MODELS_DIR}/step_1.h5

    Training step #2 configuration file:

    general:
       model_path: ${MODELS_DIR}/step_1.h5
    
    training:
       frozen_layers: (50:)
       trained_model_path: ${MODELS_DIR}/step_2.h5

    Training step #3 configuration file:

    general:
       model_path: ${MODELS_DIR}/step_2.h5
    
    training:
       frozen_layers: None
       trained_model_path: ${MODELS_DIR}/step_3.h5

5.4 Creating your own custom model

You can create your own custom model and have it handled like any built-in Model Zoo model. To do so, you need to modify a number of Python source code files that are all located under the //image_classification/src directory.

An example of a custom model is given in the file custom_model.py located under //image_classification/src/models/. The model is constructed in the body of the get_custom_model() function, which returns the model. Modify this function to implement your own model.

In the provided example, the get_custom_model() function takes the following arguments:

  • num_classes, the number of classes.
  • input_shape, the input shape of the model.
  • dropout, the dropout rate if a dropout layer must be included in the model.

As you modify the get_custom_model() function, you can add your own arguments. Assume for example that you want to add an argument alpha that is a float: just add it to the interface of the function, as sketched below.
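
As a minimal sketch only (the actual custom_model.py in the Model Zoo may differ), a get_custom_model() function extended with an alpha argument that scales the width of the convolutional layers could look like this:

from tensorflow import keras
from tensorflow.keras import layers

def get_custom_model(num_classes: int = None, input_shape: tuple = None,
                     dropout: float = None, alpha: float = 1.0):
    # Illustrative sketch only: a small convolutional classifier whose
    # width is scaled by the custom 'alpha' argument.
    inputs = keras.Input(shape=input_shape)
    x = layers.Conv2D(int(32 * alpha), 3, strides=2, activation="relu")(inputs)
    x = layers.Conv2D(int(64 * alpha), 3, strides=2, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    if dropout:
        x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs=inputs, outputs=outputs, name="custom_model")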

Then, your custom model can be used as any other Model Zoo model using the configuration file as shown in the YAML code below:

training:
   model:
      name: custom
      alpha: 0.5       # The argument you added to get_custom_model().
      input_shape: (128, 128, 3)
   dropout: 0.2

If you want to use transfer learning with your custom model, you need to modify the value of the argument last_layer_index in the call to the function transfer_pretrained_weights() in the file common/utils/models_utils.py. This argument needs to be set to the index of the last layer of the model backbone, i.e., the last layer before the classifier begins. Layer indices are numbered from 0 (the input layer has index 0).

After doing this, you will be able to use transfer learning as shown below:

training:
   model:
      name: custom
      alpha: 0.5
      input_shape: (128, 128, 3)
      pretrained_model_path: ${MODELS}/pretrained_model.h5
   dropout: 0.2

5.5 Train, quantize, benchmark, and evaluate your model

In case you want to train and quantize a model, you can either launch the training operation mode followed by the quantization operation on the trained model (please refer to the quantization README.md, which describes the quantization part in detail), or use a chained service such as chain_tqe, launched with the command below:

python stm32ai_main.py --config-path ./config_file_examples/ --config-name chain_tqe_config.yaml

This specific example trains a MobileNet V2 model with ImageNet pretrained weights, fine-tunes it by retraining the last seven layers except the fifth one (only as an example), and quantizes it to 8 bits, using a quantization_split (30% in this example) of the training set for calibration, before evaluating the quantized model.

In case you also want to execute a benchmark on top of the training and quantization services, it is recommended to launch the chained service called chain_tqeb, which stands for train, quantize, evaluate, benchmark, as in the example command below:

python stm32ai_main.py --config-path ./config_file_examples/ --config-name chain_tqeb_config.yaml

This specific example uses the "Bring Your Own Model" feature via model_path. It then fine-tunes the initial model by retraining all but the first twenty layers (as an example), benchmarks the float model on the STM32H747I-DISCO board using the STM32Cube.AI Developer Cloud, quantizes it to 8 bits using a quantization_split (30% in this example) of the training set for calibration, and finally evaluates and benchmarks the quantized model.