Overview

This chapter describes the available models, explains the underlying dataset splits, and reports model performance on the individual evaluation and test sets using several metrics.

BirdBox accepts YOLO models in multiple formats such as .pt, .onnx and .engine. Trained models for this task can be found on the TUC-Cloud. Alternatively, you can train your own model on a custom dataset by using the code available in the BirdBox-Train repository.

Restricted Access

BirdBox-Train is currently only accessible to members of the BirdNET-Team.

To run benchmarks with a model of your choice and a custom dataset, see CLI Reference.

Available Models and Respective Datasets

The available models were trained on datasets from Zenodo.org. The datasets used are:

One model has been trained for each of those datasets individually. Additionally, there are two models that were trained on a combination of all five datasets:

Dataset Splits

Each dataset is divided into three disjoint splits, aiming for 70% training, 15% validation, and 15% test data. The counted quantity is the number of labels within a split, not the number of files. For example, a file with five aldfly annotations is counted as five, not one.
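The difference between counting labels and counting files can be illustrated with a minimal sketch. The record layout, file names, and split names below are assumptions made for this example and are not taken from the BirdBox code.

```python
from collections import Counter

# Hypothetical records, one per label: (file name, species code, split).
labels = [
    ("clip_a.wav", "aldfly", "train"),
    ("clip_a.wav", "aldfly", "train"),
    ("clip_a.wav", "aldfly", "train"),
    ("clip_a.wav", "aldfly", "train"),
    ("clip_a.wav", "aldfly", "train"),
    ("clip_b.wav", "aldfly", "val"),
    ("clip_c.wav", "aldfly", "test"),
]

# Count every label: this is the quantity the split ratios refer to.
label_counts = Counter(split for _, _, split in labels)

# Count each file once per split, for comparison only.
unique_files = {(name, split) for name, _, split in labels}
file_counts = Counter(split for _, split in unique_files)

print(dict(label_counts))  # {'train': 5, 'val': 1, 'test': 1}
print(dict(file_counts))
```

Here the training split holds five labels but only one file, so a ratio computed on files would differ considerably from one computed on labels.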

The exact quantities for each model can be examined in the corresponding Species Distribution Across Splits sections (e.g. see All-in-One or Just-Bird).

Limitations

Achieving this 70/15/15 split is not trivial, for several reasons:

Overlapping clips must never be placed in different splits, yet clips are generated with 50% overlap by default.
Temporally adjacent recordings may capture the same bird with the same background noise twice [6].
Single three-second clips often contain multiple annotations from different species.

The first and second problems can be solved by dividing the soundscape recordings into one-minute chunks before any data is split. Three-second clips with overlap are then generated only within these one-minute chunks, and all clips from one chunk are assigned to the same dataset split. Therefore, no direct overlap between splits is possible, and the problem of temporal data leakage is mitigated.
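The chunk-then-clip procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the BirdBox implementation: the function names are hypothetical, and the choice to drop a trailing partial clip is an assumption.

```python
def chunk_bounds(duration_s, chunk_s=60.0):
    """Return (start, end) pairs of consecutive chunks covering a recording."""
    bounds = []
    start = 0.0
    while start < duration_s:
        bounds.append((start, min(start + chunk_s, duration_s)))
        start += chunk_s
    return bounds


def clip_bounds(chunk_start, chunk_end, clip_s=3.0, overlap=0.5):
    """Return clip windows that lie entirely inside a single chunk."""
    hop = clip_s * (1.0 - overlap)  # 1.5 s hop for 50% overlap
    clips = []
    start = chunk_start
    # Assumption: a trailing window shorter than clip_s is dropped.
    while start + clip_s <= chunk_end:
        clips.append((start, start + clip_s))
        start += hop
    return clips


chunks = chunk_bounds(120.0)             # a two-minute recording
first_clips = clip_bounds(*chunks[0])
second_clips = clip_bounds(*chunks[1])
print(len(chunks), len(first_clips))     # 2 39
```

Because no window crosses a chunk boundary, clips from different chunks never share audio, so assigning whole chunks to splits rules out direct overlap between splits.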

However, combined with the third problem, this limits the capabilities of the dataset splitter: only entire one-minute chunks, with all the annotations they contain, can be assigned to a split.
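To see why chunk-level assignment makes the target ratios hard to hit exactly, consider a simple greedy assignment. This is only an illustrative sketch: the greedy strategy, the function name, and the chunk label counts are assumptions, and the actual dataset splitter may work differently (for example, by balancing each species separately).

```python
TARGETS = {"train": 0.70, "val": 0.15, "test": 0.15}


def assign_chunks(chunk_label_counts, targets=TARGETS):
    """Greedily assign whole chunks to splits based on total label counts."""
    total = sum(chunk_label_counts.values())
    filled = {split: 0 for split in targets}
    assignment = {}
    # Place large chunks first so that small chunks can even out the ratios.
    ordered = sorted(chunk_label_counts.items(), key=lambda item: -item[1])
    for chunk_id, n_labels in ordered:
        # Choose the split that is furthest below its target label count.
        split = max(targets, key=lambda s: targets[s] * total - filled[s])
        assignment[chunk_id] = split
        filled[split] += n_labels
    return assignment, filled


# Hypothetical number of labels per one-minute chunk.
chunk_label_counts = {"c0": 40, "c1": 25, "c2": 15, "c3": 10, "c4": 10}
assignment, filled = assign_chunks(chunk_label_counts)
print(filled)  # {'train': 75, 'val': 15, 'test': 10}
```

With 100 labels in five chunks, the result is 75/15/10 rather than 70/15/15, because a chunk cannot be divided between two splits.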

Rare Species

Some rare species occur in only one or two minutes of the original audio. Since only whole one-minute chunks can be assigned to a split, these species cannot be distributed across three splits.

One option would be more complex data augmentation, such as cutting out the rare vocalizations and inserting them into several different background recordings. The resulting synthetic data could then be split. This kind of data scaling can still be added after a first successful deployment of BirdBox.

Since this project is in its early stages, every species with fewer than 100 annotations is simply considered rare. Rare species are used only for training. The metrics during training and hyperparameter tuning (evaluation dataset) and during testing (test dataset) are computed only on splittable, frequently occurring species.
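The rare-species rule can be expressed in a few lines. The following sketch assumes a flat list with one species code per annotation; the function name and the example data are hypothetical.

```python
from collections import Counter

RARE_THRESHOLD = 100  # species with fewer annotations than this are rare


def partition_species(species_per_label, threshold=RARE_THRESHOLD):
    """Split the species set into rare and frequently occurring species."""
    counts = Counter(species_per_label)
    rare = {name for name, n in counts.items() if n < threshold}
    common = set(counts) - rare
    return rare, common


# Hypothetical annotation list: one entry per annotation.
annotations = ["aldfly"] * 250 + ["rare_species"] * 12
rare, common = partition_species(annotations)

train_classes = rare | common  # rare species are still used for training
eval_classes = common          # metrics are computed on these only
print(sorted(rare), sorted(eval_classes))  # ['rare_species'] ['aldfly']
```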

Clipping

The number of labels per species was reduced whenever a species exceeded 10,000 labels in total. This limit is soft, however, because only entire three-second clips can be kept or discarded. If a spectrogram contains a species that should be clipped but also a rare one, the entire spectrogram is kept, which adds another annotation beyond the 10,000 limit.
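A minimal sketch of such a soft limit is shown below. The function name and data layout are hypothetical, the example uses a tiny limit of 2 instead of 10,000 for readability, and the handling of clips that mix a capped species with a non-rare species is an assumption (such clips are discarded here).

```python
from collections import Counter


def clip_labels(clips, limit, rare):
    """Keep or discard whole clips; each clip is a list of species labels."""
    kept = []
    counts = Counter()
    for species_in_clip in clips:
        over_limit = any(counts[s] >= limit for s in species_in_clip)
        has_rare = any(s in rare for s in species_in_clip)
        # A clip with a capped species is dropped unless it also holds a rare one.
        if over_limit and not has_rare:
            continue
        kept.append(species_in_clip)
        counts.update(species_in_clip)
    return kept, counts


clips = [["aldfly"], ["aldfly"], ["aldfly"], ["aldfly", "rare_species"]]
kept, counts = clip_labels(clips, limit=2, rare={"rare_species"})
print(len(kept), counts["aldfly"])  # 3 3
```

The third clip is discarded because the limit is already reached, but the fourth is kept for its rare species, which pushes the capped species to three labels despite the limit of two.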

Increase of Annotation Amount

Additionally, the number of annotations in the dataset far exceeds the number of original annotations, for two reasons:

The 50% overlap leads to duplicate labels within the same dataset split.
Three-second clips cut long annotations (e.g. three minutes) into multiple short ones.

This behavior is intended. See How it works.
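Both effects can be reproduced with a short calculation. The sketch below assumes three-second clips with 50% overlap generated inside one-minute chunks, and it counts an annotation in every clip it intersects at all; the function name and the absence of a minimum-overlap threshold are assumptions.

```python
def clip_annotations(ann_start, ann_end, duration_s,
                     clip_s=3.0, overlap=0.5, chunk_s=60.0):
    """Return the clip-relative pieces of one annotation, one per clip hit."""
    hop = clip_s * (1.0 - overlap)
    pieces = []
    chunk_start = 0.0
    while chunk_start < duration_s:
        chunk_end = min(chunk_start + chunk_s, duration_s)
        start = chunk_start
        while start + clip_s <= chunk_end:
            lo = max(start, ann_start)
            hi = min(start + clip_s, ann_end)
            if hi > lo:  # the annotation intersects this clip
                pieces.append((lo - start, hi - start))
            start += hop
        chunk_start += chunk_s
    return pieces


# A one-second call lands in two overlapping clips.
short = clip_annotations(3.0, 4.0, duration_s=60.0)
# A three-minute annotation is cut into one piece per clip.
long = clip_annotations(0.0, 180.0, duration_s=180.0)
print(len(short), len(long))  # 2 117
```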

ID to Species Mapping Compatibility

Because BirdBox can run different detection models, it has to manage several ID-to-species mappings. The available mappings are defined in src/config.py. Each mapping corresponds to the conf.yaml that was used to train the respective YOLO model.

Class-ID decoding depends on the selected mapping, which is chosen as follows:

CLI: pass explicit --species-mapping (details)
Streamlit: mapping is inferred from model file name (details)

Mapping/Model Mismatch

If the species mapping does not match the selected model, the species labels in the output will be silently invalid. Always pass the --species-mapping value that corresponds to the model you are running.
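The following sketch shows why a mismatch goes unnoticed. The mapping names, their contents, and the decode helper are hypothetical and do not reflect the actual definitions in src/config.py.

```python
# Hypothetical ID-to-species mappings for two differently trained models.
SPECIES_MAPPINGS = {
    "mapping_a": {0: "aldfly", 1: "species_x"},
    "mapping_b": {0: "species_y", 1: "aldfly"},
}


def decode(class_ids, mapping_name):
    """Translate predicted class IDs into species labels."""
    mapping = SPECIES_MAPPINGS[mapping_name]
    return [mapping[class_id] for class_id in class_ids]


predicted_ids = [0, 0, 1]  # assume these came from a model trained with mapping_a

correct = decode(predicted_ids, "mapping_a")
wrong = decode(predicted_ids, "mapping_b")  # runs without any error
print(correct)  # ['aldfly', 'aldfly', 'species_x']
print(wrong)    # ['species_y', 'species_y', 'aldfly']
```

Both calls succeed and return plausible species labels, so nothing in the output itself reveals that the second result is wrong.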

References

[6] Roberts, D. R., et al. (2017). "Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure." Ecography, 40(8):913–929.