This is the AutoDL challenge: the final challenge in the 2019 AutoDL challenge series, part of the NeurIPS 2019 competition program! There is NO prerequisite: you need not have entered the previous challenges to enter this one.
In this new challenge, we propose datasets from all the different modalities of the previous challenges in the series: image, video, speech, text. We target these domains because deep learning (DL) methods have had great success recently in these areas. We hope that this will drive the community to explore automated designs of DL models. However, we do not impose that participants use Deep Learning. We also added tabular data (i.e. in feature vector representation), from the AutoML challenges. Raw data are provided, formatted in a uniform tensor manner, to encourage participants to submit generic algorithms. All problems are multi-label classification problems. We impose restrictions on training time and resources to push the state-of-the-art further. We provide a large number of pre-formatted public datasets and offer the possibility of formatting your own datasets in the same way.
This is a 2-phase challenge; we are presently running the first phase (the Feed-back phase). The final blind testing (the Final phase) is scheduled to start March 14, 2020. Instructions will be posted.
We remind you that this is a skill-based contest and chance should not play a role in determining the winner. To that end, only participants whose performances exceed that of the best Baseline3 entry will qualify for winning prizes in the Final phase. We will invite to the Final phase any participant having results above the worst Baseline3 entry in the Feed-back phase.
This is a challenge with code submission for multi-label classification tasks. We provide 4 baseline methods for test purposes (submitting only the baselines will not be enough to win the challenge; see the Challenge Rules tab for more details):
Baseline 0: Constant (zero) predictions
Baseline 1: Linear classifier
Baseline 2: 3D Convolutional Neural Network
Baseline 3: All winner solutions (AutoCV, AutoNLP, AutoSpeech) combined
To make a test submission, download one of the baseline methods, then click the blue "Upload a Submission" button in the upper right corner of the page and re-upload it. To submit on all datasets simultaneously and get ranked in the challenge, you must first click the orange "All datasets" tab. You may also submit on a single dataset at a time (for debugging purposes). To check the progress of your submissions, go to the "My Submissions" tab. Your best submission is shown on the leaderboard under the "Results" tab.
The starting kit contains everything you need to create your own code submission (just by modifying the file model.py) and to test it on your local computer, with the same handling programs and Docker image as those of the Codalab platform (but the hardware environment is in general different).
This includes a Jupyter notebook, tutorial.ipynb, with step-by-step instructions. The interface is simple and generic: you must supply a Python class in model.py with:
To make submissions, zip model.py (without the directory), then use the "Upload a Submission" button. That's it!
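For orientation, here is a minimal sketch of what a model.py class might look like. This is an assumption based on the evaluation pseudo-code in the Instructions (constructor taking metadata, `train`/`test` methods, a `done_training` flag); the starting kit contains the authoritative signatures.

```python
class Model:
    """Minimal sketch of the model.py interface (illustrative only;
    see the starting kit for the exact, authoritative signatures)."""

    def __init__(self, metadata):
        # Called once per dataset; this step does not consume the time budget.
        self.metadata = metadata
        self.done_training = False

    def train(self, dataset, remaining_time_budget=None):
        # Train for one short increment, then return so that test() is called
        # and a point is added to the learning curve. Setting done_training
        # to True tells the ingestion program to stop calling train/test.
        if remaining_time_budget is not None and remaining_time_budget < 60:
            self.done_training = True

    def test(self, dataset, remaining_time_budget=None):
        # Must return predictions of shape (num_test_examples, output_dim);
        # None here is only a placeholder.
        return None
```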
Each dataset in this competition comes from one of the following 5 domains: image, video, speech, text, or tabular. Every dataset is formatted in TFRecords and split into a training set (with true labels) and a test set (without true labels). Data loading is done in the ingestion program (and is thus common to all participants), which parses these TFRecords into a `tf.data.Dataset` object. Each of its examples is of the form
(example, labels)
where `example` is a dense 4-D Tensor of dtype tf.float32 and of shape
(sequence_size, row_count, col_count, num_channels)
and `labels` is a 1-D Tensor of shape
(output_dim,).
Here `output_dim` represents the number of classes of the multi-label classification task.
The metadata of each dataset contains information such as the shape of the examples, the number of examples, the number of classes, etc. This information can be accessed by calling the functions found here.
Although the domain information is not given directly in the metadata, it can be inferred from metadata by a function similar to this one.
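As a rough illustration, a heuristic in the same spirit can be written from the shape conventions visible in the data table below. This sketch is an assumption, not the official function; `has_vocabulary` is a hypothetical stand-in for checking whether the dataset provides a token-to-index map (text datasets do, speech datasets do not).

```python
def infer_domain(sequence_size, row_count, col_count, has_vocabulary=False):
    """Guess the domain from an example's shape (illustrative sketch only;
    the official helper inspects the metadata object itself)."""
    if row_count == 1 and col_count == 1:
        # 1-D sequences (sequence_size is often -1, i.e. variable length):
        # text if a token vocabulary exists, otherwise speech.
        return "text" if has_vocabulary else "speech"
    if sequence_size == 1:
        # Single frame: a flat feature vector is tabular, otherwise an image.
        return "tabular" if row_count == 1 else "image"
    return "video"
```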
Although it is straightforward to interpret the 4-D Tensor representation of each example for most domains, we had to make some manual choices to encode text datasets. The choices we made are:
The mapping from token to integer index can be accessed by calling
token_to_index = metadata.get_channel_to_index_map()
Embedding weights:
In the Docker image run by the platform (evariste/autodl:gpu-latest), built-in embedding models are provided for Chinese and English, under the path "/app/embedding". Both embedding models are from fastText (Chinese, English). In addition, pre-trained BERT weights can be found in the same folder.
Alternative way to download the Docker image:
Since the Docker image (evariste/autodl:gpu-latest) is larger than 11GB, the usual way of downloading it using docker pull could be difficult. Thus we provide an alternative way for downloading:
Download the 3 files via: https://pan.baidu.com/s/1cxDDSZRSyGT6fH82cNkGRQ
Merge them using command:
cat autodl-gpu-latest.tar.part* > autodl-gpu-latest.tar
Load the image to Docker using:
docker load < autodl-gpu-latest.tar
We provide a list of public datasets. You will have access to the data (training set and test set) AND the true labels for these datasets. Notice that the video datasets do not include a sound track.
# | Name | Type | Domain | Size | Source | Data (w/o test labels) | Test labels |
1 | Munster | Image | HWR | 18 MB | MNIST | munster.data | munster.solution |
2 | City | Image | Objects | 128 MB | Cifar-10 | city.data | city.solution |
3 | Chucky | Image | Objects | 128 MB | Cifar-100 | chucky.data | chucky.solution |
4 | Pedro | Image | People | 377 MB | PA-100K | pedro.data | pedro.solution |
5 | Decal | Image | Aerial | 73 MB | NWPU VHR-10 | decal.data | decal.solution |
6 | Hammer | Image | Medical | 111 MB | Ham10000 | hammer.data | hammer.solution |
7 | Kreatur | Video | Action | 469 MB | KTH | kreatur.data | kreatur.solution |
8 | Kreatur3 | Video | Action | 588 MB | KTH | kreatur3.data | kreatur3.solution |
9 | Kraut | Video | Action | 1.9 GB | KTH | kraut.data | kraut.solution |
10 | Katze | Video | Action | 1.9 GB | KTH | katze.data | katze.solution |
11 | data01 | Speech | Speaker | 1.8 GB | -- | data01.data | data01.solution |
12 | data02 | Speech | Emotion | 53 MB | -- | data02.data | data02.solution |
13 | data03 | Speech | Accent | 1.8 GB | -- | data03.data | data03.solution |
14 | data04 | Speech | Genre | 469 MB | -- | data04.data | data04.solution |
15 | data05 | Speech | Language | 208 MB | -- | data05.data | data05.solution |
16 | O1 | Text | Comments | 828 KB | -- | O1.data | O1.solution |
17 | O2 | Text | Emotion | 25 MB | -- | O2.data | O2.solution |
18 | O3 | Text | News | 88 MB | -- | O3.data | O3.solution |
19 | O4 | Text | Spam | 87 MB | -- | O4.data | O4.solution |
20 | O5 | Text | News | 14 MB | -- | O5.data | O5.solution |
21 | Adult | Tabular | Census | 2 MB | Adult | adult.data | adult.solution |
22 | Dilbert | Tabular | -- | 162 MB | -- | dilbert.data | dilbert.solution |
23 | Digits | Tabular | HWR | 137 MB | MNIST | digits.data | digits.solution |
24 | Madeline | Tabular | -- | 2.6 MB | -- | madeline.data | madeline.solution |
# | Name | num_train | num_test | sequence_size | row_count | col_count | num_channels | output_dim |
1 | Munster | 60000 | 10000 | 1 | 28 | 28 | 1 | 10 |
2 | City | 48060 | 11940 | 1 | 32 | 32 | 3 | 10 |
3 | Chucky | 48061 | 11939 | 1 | 32 | 32 | 3 | 100 |
4 | Pedro | 80095 | 19905 | 1 | -1 | -1 | 3 | 26 |
5 | Decal | 634 | 166 | 1 | -1 | -1 | 3 | 11 |
6 | Hammer | 8050 | 1965 | 1 | 400 | 300 | 3 | 7 |
7 | Kreatur | 1528 | 863 | 181 | 60 | 80 | 1 | 4 |
8 | Kreatur3 | 1528 | 863 | 181 | 60 | 80 | 3 | 4 |
9 | Kraut | 1528 | 863 | 181 | 120 | 160 | 1 | 4 |
10 | Katze | 1528 | 863 | 181 | 120 | 160 | 1 | 6 |
11 | data01 | 3000 | 3000 | -1 | 1 | 1 | 1 | 100 |
12 | data02 | 428 | 107 | -1 | 1 | 1 | 1 | 7 |
13 | data03 | 796 | 200 | -1 | 1 | 1 | 1 | 3 |
14 | data04 | 940 | 473 | -1 | 1 | 1 | 1 | 20 |
15 | data05 | 199 | 597 | -1 | 1 | 1 | 1 | 10 |
16 | O1 | 7796 | 1817 | -1 | 1 | 1 | 1 | 2 |
17 | O2 | 11308 | 7538 | -1 | 1 | 1 | 1 | 20 |
18 | O3 | 60000 | 40000 | -1 | 1 | 1 | 1 | 2 |
19 | O4 | 54990 | 10010 | -1 | 1 | 1 | 1 | 10 |
20 | O5 | 155952 | 72048 | -1 | 1 | 1 | 1 | 18 |
21 | Adult | 39073 | 9768 | 1 | 1 | 24 | 1 | 3 |
22 | Dilbert | 14871 | 9709 | 1 | 1 | 2000 | 1 | 5 |
23 | Digits | 35000 | 35000 | 1 | 1 | 1568 | 1 | 10 |
24 | Madeline | 4222 | 3238 | 1 | 1 | 259 | 1 | 2 |
These data were re-formatted from the original public datasets. If you use them, please make sure to acknowledge the original data donors (see "Source" in the data table) and check the terms of use.
To download all public datasets at once:
cd autodl_starting_kit_stable
python download_public_datasets.py
We provide a toolkit for participants to format their own datasets into the same format as this challenge. If you want to practice designing algorithms on your own datasets, follow these steps.
This challenge has two phases. This is the feedback phase: when you submit your code, you get immediate feedback on 5 feedback datasets. In the final test phase, you will be evaluated on several new datasets. Participants eligible for the final phase will be notified when and where to submit their code for a final blind test. The ranking in the final phase will count towards determining the winners.
Code submitted is trained and tested automatically, without any human intervention. Code submitted on "All datasets" is run on all five feedback datasets in parallel on separate compute workers, each one with its own time budget.
The identities of the datasets used for testing on the platform are concealed. The data are provided in a raw form (no feature extraction) to encourage researchers to use Deep Learning methods performing automatic feature learning, although this is NOT a requirement. All problems are multi-label classification problems. The tasks are constrained by the time budget (20 minutes/dataset).
Here is some pseudo-code of the evaluation protocol:
# For each dataset, our evaluation program calls the model constructor.
# IMPORTANT: this initialization step does not consume the total time budget,
# so one should carry out meta-learning or load pre-trained weights in this step.
# This step must nevertheless not exceed 20 min, otherwise the submission will fail.
M = Model(metadata=dataset_metadata)
# Initialize
remaining_time_budget = overall_time_budget
start_time = time.time()
# The ingestion program calls train and test multiple times:
repeat until M.done_training or remaining_time_budget < 0
{
  M.train(training_data, remaining_time_budget)
  remaining_time_budget = start_time + overall_time_budget - time.time()
  results = M.test(test_data, remaining_time_budget)
  remaining_time_budget = start_time + overall_time_budget - time.time()
  # Results are made available to the scoring program (run in a separate container)
  save(results)
}
It is the participants' responsibility to ensure that neither the "train" nor the "test" method exceeds the "remaining_time_budget". The "train" method may manage its budget by training in varying time increments. There is an incentive not to spend the whole "overall_time_budget" on the first iteration, because the metric is the area under the learning curve.
The participants can train in batches of pre-defined duration to incrementally improve their performance, until the time limit is attained. In this way we can plot learning curves: "performance" as a function of time. Each time the "train" method terminates, the "test" method is called and the results are saved, so the scoring program can use them, together with their timestamp.
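The batched-training strategy above can be sketched as follows. All names here (`train_in_increments`, `train_step`, `safety`) are illustrative assumptions, not part of the official API.

```python
import time


def train_in_increments(train_step, remaining_time_budget, increment=60.0, safety=1.2):
    """Run training steps of roughly `increment` seconds each, stopping while
    there is still time left in the budget for test() to be called."""
    start = time.time()
    while True:
        elapsed = time.time() - start
        # Do not start an increment that (with a safety margin) would not
        # fit in the remaining budget.
        if elapsed + increment * safety > remaining_time_budget:
            break
        train_step()
```

Returning from `train` after each increment lets the ingestion program call `test`, producing one timestamped point on the learning curve per increment.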
We treat both multi-class and multi-label problems alike. Each label/class is considered a separate binary classification problem, and the score for each prediction is the normalized AUC (NAUC, also known as the Gini coefficient):
2 * AUC - 1
where AUC is the usual area under the ROC curve (ROC AUC).
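As an illustration, the NAUC for one label column can be computed with the pairwise (Mann-Whitney) formulation of the AUC. This is a plain-Python sketch for clarity, not the challenge's scoring code.

```python
def nauc(y_true, y_score):
    """Normalized AUC (2 * AUC - 1) for one binary label column.

    AUC is computed as the fraction of (positive, negative) pairs where
    the positive example receives the higher score (ties count as half).
    """
    pos = [s for t, s in zip(y_true, y_score) if t]
    neg = [s for t, s in zip(y_true, y_score) if not t]
    if not pos or not neg:
        return 0.0  # AUC is undefined with a single class present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return 2 * auc - 1
```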
For each dataset, we compute the Area under Learning Curve (ALC). The learning curve is drawn as follows:
After we compute the ALC for all 5 datasets, the overall ranking is used as the final score for evaluation and shown on the leaderboard. It is computed by averaging the ranks (among all participants) of the ALC obtained on each of the 5 datasets.
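The average-rank aggregation can be sketched as follows (an illustrative re-implementation, not the platform's scoring code; the tie-handling convention is an assumption).

```python
def average_ranks(alc_scores):
    """alc_scores: {participant: [alc_dataset1, ..., alc_datasetN]}.
    Returns each participant's rank averaged over the datasets
    (rank 1 = highest ALC; tied scores share the average rank)."""
    names = list(alc_scores)
    n_datasets = len(next(iter(alc_scores.values())))
    avg = {name: 0.0 for name in names}
    for d in range(n_datasets):
        col = sorted(names, key=lambda n: -alc_scores[n][d])
        i = 0
        while i < len(col):
            # Find the run of participants tied at this score.
            j = i
            while j + 1 < len(col) and alc_scores[col[j + 1]][d] == alc_scores[col[i]][d]:
                j += 1
            rank = (i + 1 + j + 1) / 2  # average of positions i+1 .. j+1
            for k in range(i, j + 1):
                avg[col[k]] += rank / n_datasets
            i = j + 1
    return avg
```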
Examples of learning curves:
No, they can make entries that show on the leaderboard for test purposes and to stimulate participation, but they are excluded from winning prizes. Excluded entrants include: baseline0, baseline1, baseline2, baseline3, baseline3_a, baseline3_b, baseline3_c, baseline3_d, baiyu, eric, hugo.jair, juliojj, Lukasz, madclam, Pavao, shangeth, thomas, tthomas, Zhen, Zhengying.
No, except accepting the TERMS AND CONDITIONS.
Yes, until the challenge deadline.
You can download "public data" only from the Data page. The data on which your code is evaluated cannot be downloaded, it will be visible to your code only, on the Codalab platform.
To make a valid challenge entry, make sure to click first the orange button "All datasets", then click the blue button on the upper right side "Upload a Submission". This will ensure that you submit on all 5 datasets of the challenge simultaneously. You may also make a submission on a single dataset for debug purposes, but it will not count towards the final ranking.
We provide a Starting Kit in Python with step-by-step instructions in a Jupyter notebook called "tutorial.ipynb", which can be found in the github repository https://github.com/zhengying-liu/autodl_starting_kit_stable. You can also have a well rendered preview here.
Yes. Top-ranking participants will be invited to submit papers to a special issue on Automated Machine Learning of the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) and will be entered in a contest for the best paper. Deadline to be defined.
There will be 2 best paper awards of $1000 ("best paper" and "best student paper").
Yes, a 4000 USD prize pool.
 | 1st place | 2nd place | 3rd place |
Prize | 2000 USD | 1500 USD | 500 USD |
Yes, participation is by code submission.
No. You just grant to the ORGANIZERS a license to use your code for evaluation purposes during the challenge. You retain all other rights.
Yes, please download it [HERE].
We are running your submissions on Google Cloud NVIDIA Tesla P100 GPUs. In non-peak times we plan to use 10 workers, each with one NVIDIA Tesla P100 GPU (CUDA 10, cuDNN 7.5), 4 vCPUs, 26 GB of memory, and 100 GB of disk.
The PARTICIPANTS will be informed if the computational resources increase. They will NOT decrease.
This is not explicitly forbidden, but it is discouraged. We prefer if all calculations are performed on the server. If you submit a pre-trained model, you will have to disclose it in the fact sheets.
YES. The ranking of participants will be made from a final blind test made by evaluating a SINGLE SUBMISSION made on the final test submission site. The submission will be evaluated on five new datasets in a completely "blind testing" manner. The final test ranking will determine the winners.
Each execution must run in less than 20 minutes (1200 seconds) for each dataset.
Wall time.
In principle, no more than its time budget. We kill the process if the time budget is exceeded. Submissions are queued and run on a first-come, first-served basis. We are using several identical servers. Contact us if your submission is stuck for more than 24 hours. Check the execution time on the leaderboard.
Five per day (and up to a total of 100), but up to a total computational time of 5 hours (submissions taking longer will be aborted). This may be subject to change, depending on the number of participants. Please respect other users. It is forbidden to register under multiple user IDs to gain an advantage and make more submissions. Violators will be DISQUALIFIED FROM THE CONTEST.
No. Please contact us if you think the failure is due to the platform rather than to your code and we will try to resolve the problem promptly.
This should be avoided. If a submission exceeds the 20-minute time budget on a particular task (dataset), the submission handling process (the ingestion program in particular) is killed when the time budget is used up, and the predictions made so far (with their timestamps) are used for evaluation. If a submission exceeds the total compute time per day, all running tasks are killed by CodaLab, the status is marked 'Failed', and a score of -1.0 is produced.
No sorry, not for this challenge.
All problems are multi-label problems and we treat them as multiple 2-class classification problems. For a given dataset, all binary classification problems are scored with the ROC AUC and results are averaged (over all classes/binary problems). For each time step at which you save results, this gives you one point on the learning curve. The final score for one dataset is the area under the learning curve. The overall score on all 5 datasets is the average rank on the 5 datasets. For more details, go to 'Get Started' -> 'Instructions' -> 'Metrics' section.
The code was tested under Python 3.5. We are running Python 3.5 on the server and the same libraries are available.
Yes. Any Linux executable can run on the system, provided that it fulfills our Python interface and you bundle all necessary libraries with your submission.
No. We use TFRecords to format the datasets in a uniform manner, but you can use other software to process the data, including PyTorch (included in the Docker, see the following question).
evariste/autodl:gpu-latest, see the Dockerfile and some instructions on dockerhub.
When you submit code to Codalab, your code is executed inside a Docker container. This environment can be exactly reproduced on your local machine by downloading the corresponding docker image. The docker environment of the challenge contains Anaconda libraries, TensorFlow, and PyTorch (among other things).
Non-GPU users: if you are new to Docker, follow these instructions to install Docker. You may then use the Docker image evariste/autodl:cpu-latest. See details in the Starting Kit, which can be downloaded from the Instructions page. GPU users: follow these more detailed instructions.
Your last submission is shown automatically on the leaderboard. You cannot choose which submission to select. If you want another submission than the last one you submitted to "count" and be displayed on the leaderboard, you need to re-submit it.
No. If you accidentally register multiple times or have multiple accounts from members of the same team, please notify the ORGANIZERS. Teams or solo PARTICIPANTS with multiple accounts will be disqualified.
We have disabled Codalab team registration. To join as a team, just share one account with your team. The team leader is responsible for making submissions and observing the rules.
You cannot. If you need to destroy your team, contact us.
It is up to you and the team leader to make arrangements. However, you cannot participate in multiple teams.
No. If we discover that you are trying to cheat in this way you will be disqualified. All your actions are logged and your code will be examined if you win.
ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". UPSUD, CHALEARN, IDF, AND/OR OTHER ORGANIZERS AND SPONSORS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY RIGHTS. IN NO EVENT SHALL ISABELLE GUYON AND/OR OTHER ORGANIZERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE CHALLENGE. In case of dispute or possible exclusion/disqualification from the competition, the PARTICIPANTS agree not to take immediate legal action against the ORGANIZERS or SPONSORS. Decisions can be appealed by submitting a letter to the CHALEARN president, and disputes will be resolved by the CHALEARN board of directors. See contact information.
For questions of general interest, THE PARTICIPANTS should post their questions to the forum.
Other questions should be directed to the organizers.
This challenge would not have been possible without the help of many people.
Main organizers:
Other contributors to the organization, starting kit, and datasets, include:
The challenge is running on the Codalab platform, administered by Université Paris-Saclay and maintained by CKCollab LLC, with primary developers:
ChaLearn is the challenge organization coordinator. Google is the primary sponsor of the challenge and helped define the tasks, protocol, and data formats. 4Paradigm donated prizes and datasets, and contributed to the protocol, baseline methods, and beta-testing. Other institutions of the co-organizers provided in-kind contributions, including datasets, data formatting, baseline methods, and beta-testing.
Start: Dec. 14, 2019, midnight
Description: Please make submissions by clicking the 'Submit' button below. You can then view your algorithm's submission results on each dataset in the corresponding tab (Dataset 1, Dataset 2, etc.).
Color | Label | Description | Start |
---|---|---|---|
 | Dataset 1 | This tab contains submission results of your algorithm on Dataset 1. | Dec. 14, 2019, midnight |
 | Dataset 2 | This tab contains submission results of your algorithm on Dataset 2. | Dec. 14, 2019, midnight |
 | Dataset 3 | This tab contains submission results of your algorithm on Dataset 3. | Dec. 14, 2019, midnight |
 | Dataset 4 | This tab contains submission results of your algorithm on Dataset 4. | Dec. 14, 2019, midnight |
 | Dataset 5 | This tab contains submission results of your algorithm on Dataset 5. | Dec. 14, 2019, midnight |
March 14, 2020, 11:59 p.m.