Evaluation Tool
The Evaluation Tab in BirdNET Analyzer is a tool designed to assess the performance of deep learning models on bioacoustic data. Whether you are dealing with binary or multi-label classification tasks, this interface calculates and visualizes essential performance metrics. This guide explains each component of the Evaluation Tab and offers step-by-step instructions to ensure a smooth evaluation process.
1. Overview
The Evaluation Tab works by comparing two primary inputs:
Annotation Files: Raven selection tables that provide the ground-truth labels.
Prediction Files: Files generated by BirdNET Analyzer that contain your model’s prediction scores and labels.
By aligning predictions with annotations over uniform time intervals, the system computes a range of performance metrics such as:
F1 Score
Recall
Precision
Average Precision (AP)
AUROC (Area Under the Receiver Operating Characteristic Curve)
Accuracy
These metrics help you evaluate how well your model performs on bioacoustic data.
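To make the alignment step concrete, the following Python sketch shows the general idea of binning time-stamped events into fixed-length segments. It is an illustration only, not BirdNET Analyzer's actual implementation; the `Event` class and `to_segment_table` function are invented for this example.

```python
from dataclasses import dataclass

@dataclass
class Event:
    start: float        # seconds from the start of the recording
    end: float          # seconds from the start of the recording
    label: str          # class name
    score: float = 1.0  # prediction confidence; annotations implicitly use 1.0

def to_segment_table(events, recording_duration, segment_len=3.0, min_overlap=0.5):
    """For each fixed-length segment, keep the best score per class among
    events that overlap the segment by at least min_overlap seconds."""
    n_segments = int(recording_duration // segment_len)
    table = [dict() for _ in range(n_segments)]
    for ev in events:
        for i in range(n_segments):
            seg_start, seg_end = i * segment_len, (i + 1) * segment_len
            overlap = min(ev.end, seg_end) - max(ev.start, seg_start)
            if overlap >= min_overlap:
                table[i][ev.label] = max(table[i].get(ev.label, 0.0), ev.score)
    return table

# Applying the same binning to the annotation events and the prediction events
# of a recording yields two aligned tables from which the metrics can be computed.
annotations = [Event(4.2, 7.9, "European Robin")]
predictions = [Event(3.0, 6.0, "European Robin", score=0.72)]
ann_table  = to_segment_table(annotations, recording_duration=60.0)
pred_table = to_segment_table(predictions, recording_duration=60.0)
```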
2. File Selection
Annotations
Purpose: Provide the true labels for evaluation.
How to Use: Upload one or more annotation files via the file dialog or simply drag-and-drop them into the designated area.
Predictions
Purpose: Supply the model’s prediction data.
How to Use: Upload one or more prediction files using the same drag-and-drop or file dialog method.
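If you want to sanity-check your files before uploading them, Raven selection tables are plain tab-separated text, and BirdNET's Raven-style output has the same form, so both can be inspected with pandas. The file names below are placeholders for your own files; if your prediction files are CSV instead, read them with `sep=","`.

```python
import pandas as pd

# Placeholder file names; substitute your own selection tables.
annotations = pd.read_csv("my_annotations.selections.txt", sep="\t")
predictions = pd.read_csv("my_predictions.selections.txt", sep="\t")

# Check which columns hold the start/end times, class labels and confidences,
# so you know what to pick in the column-mapping drop-downs later on.
print(annotations.columns.tolist())
print(predictions.columns.tolist())
```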
3. Column Mapping for Annotations and Predictions
Different input files may use different column names. To ensure the tool can correctly interpret your data, you can map the columns from your files to the expected parameters.
Annotations Mapping
Start Time: Marks the beginning of the annotation.
End Time: Marks the end of the annotation.
Class: Contains the label or category.
Recording: Identifies the audio file.
Duration: Indicates the total duration of the audio file.
Predictions Mapping
Start Time: Marks the beginning of the prediction.
End Time: Marks the end of the prediction.
Class: Contains the predicted label.
Confidence: Holds the confidence scores of the predictions.
Recording: Identifies the audio file.
Duration: Indicates the total duration of the audio file.
Note
The system pre-populates these fields with default column names. If your files use different column names, simply select the appropriate ones from the drop-down menus.
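The drop-down menus handle this mapping for you inside the interface. Purely as an illustration of what the mapping means, the snippet below renames some invented column names (`begin_s`, `finish_s`, `species`, `file`) to the parameters described above.

```python
import pandas as pd

# Hypothetical column names on the left; the parameters from this guide on the right.
column_map = {
    "begin_s": "Start Time",
    "finish_s": "End Time",
    "species": "Class",
    "file": "Recording",
}
annotations = pd.read_csv("my_annotations.selections.txt", sep="\t")
annotations = annotations.rename(columns=column_map)
```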
4. Class Mapping (Optional)
If there is a discrepancy between class names in your annotation and prediction files, you can reconcile these differences using a JSON mapping file.
Download Template: Click the “Download Template” button to obtain a sample JSON file that shows how to map the predicted class names to the annotation class names.
Upload Mapping File: After editing the template to match your naming conventions, upload the updated file to standardize class names across your data.
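The mapping file itself is a small JSON dictionary. The sketch below writes a hypothetical one from Python; the species names are made up, and the exact structure expected by the tool should always be taken from the downloaded template rather than from this example.

```python
import json

# Hypothetical mapping from class names as they appear in the prediction files
# (keys) to the class names used in the annotation files (values).
class_mapping = {
    "Turdus merula_Eurasian Blackbird": "Blackbird",
    "Erithacus rubecula_European Robin": "Robin",
}

with open("class_mapping.json", "w", encoding="utf-8") as f:
    json.dump(class_mapping, f, indent=2, ensure_ascii=False)
```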
5. Classes and Recordings Selection
Once you have uploaded and mapped your files, the system automatically extracts the available classes and recordings.
Select Classes: Use the checkbox group to choose specific classes for evaluation. If no selection is made, all classes are included by default.
Select Recordings: Similarly, select the recordings you wish to evaluate to focus on specific data subsets.
6. Parameters Configuration
Customize the evaluation process by adjusting the following parameters (a brief sketch of how the Threshold is applied follows this list):
Sample Duration (s): The length of each audio segment. (Default: 3 seconds – matching BirdNET’s prediction segment.)
Recording Duration: Explicitly set the recording duration. (Default: automatically inferred from your files.)
Minimum Overlap (s): The minimum time overlap between an annotation and a prediction for them to be considered a match. (Default: 0.5 seconds)
Threshold: The cut-off value to decide if a prediction is positive. (Default: 0.1)
Class-wise Metrics: Toggle this option if you want to compute performance metrics for each class individually. If disabled, metrics are averaged across all classes.
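The Minimum Overlap parameter governs the matching step sketched in the Overview; the Threshold parameter is applied afterwards to the confidence scores. The snippet below, using invented per-segment labels and scores for a single class, shows how raising the threshold trades recall for precision.

```python
import numpy as np

# Invented per-segment ground truth and confidence scores for one class.
y_true  = np.array([1, 0, 1, 1, 0, 0])
y_score = np.array([0.62, 0.05, 0.08, 0.41, 0.30, 0.02])

for threshold in (0.1, 0.5):
    y_pred = (y_score >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    print(f"threshold={threshold}: precision={tp / (tp + fp):.2f}, recall={tp / (tp + fn):.2f}")
```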
7. Metrics Selection
Select the performance metrics you want to compute and visualize. The available options include:
AUROC: Measures the probability that the model will rank a random positive case higher than a random negative one.
Advantage: Provides an overall sense of the model’s discriminative power, especially with imbalanced data.
Disadvantage: Can be challenging to interpret.
Precision: Indicates how often the model’s positive predictions are correct.
Advantage: Highlights the model’s accuracy in predicting positives.
Disadvantage: Does not account for missed positive cases.
Recall: Measures the percentage of actual positive cases the model correctly identifies.
Advantage: Ensures that most positive cases are detected.
Disadvantage: Ignores false positives, so a model can score high recall while still raising many false alarms if recall is not balanced with precision.
F1 Score: The harmonic mean of precision and recall, offering a balanced metric.
Advantage: Combines both false positives and false negatives into one score.
Disadvantage: Can be less intuitive if precision and recall values differ greatly.
Average Precision (AP): Summarizes the precision-recall curve by averaging the precision at each recall level.
Advantage: Provides a single metric across all thresholds.
Disadvantage: Can be noisy for classes with few positive cases.
Accuracy: The overall percentage of correct predictions.
Advantage: Simple to understand and calculate.
Disadvantage: May be misleading in cases of class imbalance.
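BirdNET Analyzer computes all of these metrics for you, but if you want to reproduce or verify individual numbers on the exported data, the definitions map directly onto scikit-learn (assumed to be installed). The arrays below are invented per-segment labels and scores for a single class.

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, precision_score, recall_score,
                             f1_score, average_precision_score, accuracy_score)

# Invented example data: per-segment ground truth and model confidences for one class.
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.4, 0.7, 0.3, 0.05, 0.08, 0.6])
y_pred  = (y_score >= 0.1).astype(int)   # apply the Threshold parameter

print("AUROC:    ", roc_auc_score(y_true, y_score))            # threshold-free
print("AP:       ", average_precision_score(y_true, y_score))  # threshold-free
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("Accuracy: ", accuracy_score(y_true, y_pred))
```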
8. Actions
After configuring your files and parameters, use the action buttons to execute the evaluation and visualize the results.
Calculate Metrics: Processes your input files and computes the selected performance metrics.
Plot Metrics: Generates visualizations (line/bar plots) of the computed metrics.
Plot Confusion Matrix: Displays a confusion matrix showing the correct and incorrect predictions for each class.
Plot Metrics All Thresholds: Visualizes how performance metrics change across a range of threshold values, helping you understand trade-offs (e.g., between precision and recall); see the sketch after this list.
Download Results Table: Exports a CSV file containing the computed metrics.
Download Data Table: Exports a CSV file with the processed data that details the alignment between annotations and predictions.
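The threshold sweep behind Plot Metrics All Thresholds can be approximated outside the tool as well. This is a minimal sketch using invented scores and matplotlib/scikit-learn (both assumed installed); it is not the tool's own plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score, f1_score

# Invented per-segment labels and confidences standing in for the data the tool builds.
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.4, 0.7, 0.3, 0.05, 0.08, 0.6, 0.55, 0.15])

thresholds = np.linspace(0.05, 0.95, 19)
prec = [precision_score(y_true, y_score >= t, zero_division=0) for t in thresholds]
rec  = [recall_score(y_true, y_score >= t, zero_division=0) for t in thresholds]
f1   = [f1_score(y_true, y_score >= t, zero_division=0) for t in thresholds]

plt.plot(thresholds, prec, label="Precision")
plt.plot(thresholds, rec, label="Recall")
plt.plot(thresholds, f1, label="F1")
plt.xlabel("Threshold")
plt.ylabel("Score")
plt.legend()
plt.show()
```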
9. Step-by-Step Usage
1. File Upload
Navigate to the File Selection section.
Upload your annotation and prediction files using the provided file dialog or drag-and-drop interface.
2. Column Mapping
Review and adjust the column mappings using the drop-down menus to match your file’s structure.
3. Optional Class Mapping
If your class names differ between annotation and prediction files, download the JSON template, update it, and then upload the class mapping file.
4. Select Classes and Recordings
Use the checkbox groups to select the specific classes and recordings you want to evaluate.
5. Set Parameters
Adjust the sample duration, recording duration, minimum overlap, and threshold values.
Toggle the Class-wise Metrics option if you require individual class evaluations.
6. Select Metrics
Check the boxes for the performance metrics (AUROC, Precision, Recall, F1 Score, AP, Accuracy) you wish to compute and visualize.
7. Execute Evaluation and Visualizations
Click Calculate Metrics to process the data.
Generate visualizations by clicking Plot Metrics, Plot Confusion Matrix, or Plot Metrics All Thresholds.
Download the results or processed data tables as needed.
Note
Before generating the visualizations, ensure that you have calculated the metrics by clicking the “Calculate Metrics” button.
10. Conclusion
The Evaluation Tab in BirdNET Analyzer provides a comprehensive and flexible framework to assess the performance of bioacoustic classification models. By following this guide, you can efficiently configure your inputs, adjust evaluation parameters, compute key performance metrics, and generate insightful visualizations. This tool is designed to streamline your evaluation workflow and deepen your understanding of your model’s performance.