Image Classification - How to Use Your Own Datasets
===================================================
This tutorial demonstrates how to use AutoGluon with your own custom
datasets. As an example, we use a dataset from Kaggle to show the
required steps to format image data properly for AutoGluon.
Step 1: Organizing the dataset into proper directories
------------------------------------------------------
After completing this step, you will have the following directory
structure on your machine:
::
data/
├── class1/
├── class2/
├── class3/
├── ...
Here ``data`` is a folder containing the raw images categorized into
classes. For example, subfolder ``class1`` contains all images that
belong to the first class, ``class2`` contains all images belonging to
the second class, etc. We generally recommend at least 100 training
images per class for reasonable classification performance, but this
might depend on the type of images in your specific use-case.
Under each class, the following image formats are supported when
training your model:
- JPG
- JPEG
- PNG
In the same dataset, all the images should be in the same format. Note
that in image classification, we do not require that all images have the
same resolution.
You will need to organize your dataset into the above directory
structure before using AutoGluon. Below, we demonstrate how to construct
this organization for a Kaggle dataset.
Example: Kaggle dataset
~~~~~~~~~~~~~~~~~~~~~~~
Kaggle is a popular machine learning competition platform and contains
lots of datasets for different machine learning tasks including image
classification. If you don't have Kaggle account, please register one at
`Kaggle `__. Then, please follow the `Kaggle
installation `__ to obtain access
to Kaggle's data downloading API.
To find image classification datasets in Kaggle, let's go to
`Kaggle `__ and search using keyword
``image classification`` either under ``Datasets`` or ``Competitions``.
For example, we find the ``Shopee-IET Machine Learning Competition``
under the ``InClass`` tab in ``Competitions``.
We then navigate to
`Data `__
to download the dataset using the Kaggle API. Please make sure to click
the button of "I Understand and Accept" before downloading the data.
An example shell script to download the dataset to ``./data/shopeeiet/``
can be found here:
`download\_shopeeiet.sh `__.
After downloading this script to your machine, run it with:
::
import autogluon.core as ag
ag.download('https://raw.githubusercontent.com/zhanghang1989/AutoGluonWebdata/master/docs/tutorial/download_shopeeiet.sh')
!sh download_shopeeiet.sh
Now we have the desired directory structure under
``./data/shopeeiet/train/``, which in this case looks as follows:
::
shopeeiet/train
├── BabyBibs
├── BabyHat
├── BabyPants
├── ...
shopeeiet/test
├── ...
Here are some example images from this data:
.. figure:: ../../img/shopeeiet_example.png
Step 2: Split data into training/validation sets
------------------------------------------------
A fundamental step in machine learning is to split the data into
disjoint sets used for different purposes.
**Training Set**: The majority of your data should be in the training
set. This is the data use to train your model: data is used to learn the
parameters of the model, namely the weights of the neural network
classifier.
**Validation Set**: A separate validation set (sometimes also called the
dev set) is also used during AutoGluon's training process. While neural
network weights are updated based on the training data, each neural
network requires the user to specify many hyperparameters (e.g.,
learning rates, etc.). The choice of hyperparameters greatly impacts the
training process and resulting model. AutoGluon automatically tries many
different values of these hyperparameters and evaluates each
hyperparameter setting by measuring the performance of the resulting
network on the validation set.
**Test Set**: A separate set of images, possibly without available
labels. These data are never used during any part of the model
construction or learning process. If unlabeled, these may correspond to
images whose labels we would like to predict. If labeled, these images
may correspond to images we reserve for estimating the performance of
our final model.
Automatic training/validation split
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
AutoGluon automatically does Training/Validation split:
::
from autogluon.vision import ImageClassification as task
dataset = task.Dataset('./data/shopeeiet/train')
AutoGluon automatically infers how many classes there are based on the
directory structure. By default, AutoGluon automatically constructs the
training/validation set split:
- Training Set: 80% of images.
- Validation Set: 20% of images.
where the images that fall into the validation set are randomly chosen
from the training data based on the class.
Step 3: Use AutoGluon fit to generate a classification model
------------------------------------------------------------
Now that we have a ``Dataset`` object, we can use AutoGluon's default
configuration to obtain an image classification model using the
```fit`` `__
function.
Due to the large size of the Kaggle dataset, calling ``fit`` without
specifying a time limit may result in long waiting times. Run the
following commands to run ``fit`` using a time limit:
::
time_limits = 10 * 60 # 10mins
classifier = task.fit(dataset, time_limits=time_limits,
ngpus_per_trial=1)
The top-1 accuracy of the best model on the validation set is:
::
print('Top-1 val acc: %.3f' % classifier.results['best_reward'])
Using AutoGluon to Generate Predictions on Test Images
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We can ask our final model to generate predictions on the provided test
images. We first load the test data as a ``Dataset`` object and then
call ``predict``:
::
test_dataset = task.Dataset('./data/shopeeiet/test', train=False)
inds, probs, probs_all = classifier.predict(test_dataset)
``inds`` above contains the indices of the predicted class for each test
image. ``probs`` contains the confidence in these class predictions.
``probs_all`` contains the confidence in the whole of classes
predictions.
Here are the results of AutoGluon's default ``fit`` and ``predict``
under different ``time_limits`` when executed on a p3.16xlarge EC2
instance:
- The validation top-1 accuracy within 5h is 0.842, and ranks 14th
place in Kaggle competition.
- The validation top-1 accuracy within 24h is 0.846, and ranks 12th
place in Kaggle competition.
- The validation top-1 accuracy within 72h is 0.852, and ranks 9th
place in Kaggle competition.
Step 4: Submit test predictions to Kaggle
-----------------------------------------
If you wish to upload the model's predictions to Kaggle, here is how to
convert them into a format suitable for a submission into the Kaggle
competition:
::
import autogluon.core as ag
ag.utils.generate_csv(inds, './data/shopeeiet/submission.csv')
This produces a submission file located at:
``./data/shopeeiet/submission.csv``.
To see an example submission, check out ``sample submission.csv`` at
this link:
`Data `__.
To make your own submission, click
`Submission `__
and then follow the steps in the submission page (upload submission
file, describe the submission, and click the ``Make Submission``
button). Let's see how your model fares in this competition!