Audio Frontends¶

The AudioFrontendLayer in birdnet_stm32.models.frontend implements five audio frontend modes, each providing a different trade-off between flexibility and deployment complexity.

Canonical names: librosa, hybrid, raw, mfcc, log_mel. Deprecated aliases: precomputed → librosa, tf → raw.

flowchart LR
    subgraph librosa ["librosa (precomputed mel)"]
        direction LR
        L1["WAV"] --> L2["Offline\nlibrosa mel"] --> L3["Mel spectrogram\n→ model"]
    end
    subgraph hybrid ["hybrid (default)"]
        direction LR
        H1["WAV"] --> H2["Offline\nSTFT |X|"] --> H3["Learned mel\nConv2D 1×1"] --> H4["Mag scaling\n→ CNN"]
    end
    subgraph raw ["raw (waveform)"]
        direction LR
        R1["WAV"] --> R2["Learned filterbank\nConv2D + BN + ReLU6"] --> R3["Mag scaling\n→ CNN"]
    end
    subgraph mfcc ["mfcc (precomputed)"]
        direction LR
        M1["WAV"] --> M2["Offline\nMFCC"] --> M3["MFCC features\n→ model"]
    end
    subgraph log_mel ["log_mel (precomputed)"]
        direction LR
        LM1["WAV"] --> LM2["Offline\nlog-mel"] --> LM3["Log-mel spectrogram\n→ model"]
    end

Frontend modes¶

`librosa` (precomputed)¶

Spectrograms are computed offline using librosa before being fed to the model. The model receives a ready-made mel spectrogram tensor.

Input: [B, num_mels, spec_width, 1] mel spectrogram
In-graph ops: magnitude scaling only (if enabled)
Pros: simplest, fastest training
Cons: frontend is not part of the TFLite model; preprocessing must be replicated on-device

`hybrid` (default)¶

The model receives a linear magnitude STFT (|STFT|). A 1×1 Conv2D applies a learned mel filter bank, optionally with magnitude scaling.

Input: [B, fft_bins, spec_width, 1] linear magnitude spectrogram
In-graph ops: mel projection (Conv2D) + magnitude scaling
Mel initialization: weights seeded from a librosa Slaney mel basis
Trainable: optionally via --frontend_trainable
Pros: complete mel-to-prediction path in TFLite; flexible; good default
Cons: requires offline STFT computation

`raw` (waveform)¶

The model receives raw waveform samples and learns the entire filterbank from scratch using Conv2D layers.

Input: [B, samples, 1] raw audio waveform
In-graph ops: learned Conv2D filterbank + BN + ReLU6 + magnitude scaling
Pros: end-to-end learnable; no preprocessing needed
Cons: highest memory usage; risk of exceeding activation limits

Raw frontend memory limit

At 22,050 Hz × 3 s = 66,150 samples, the raw input exceeds the 16-bit activation size limit (65,536) on the STM32N6 NPU. Either reduce the sample rate to 16 kHz, shorten the chunk, or use a different frontend.

`mfcc` (precomputed)¶

Mel-frequency cepstral coefficients are computed offline before being fed to the model. Useful for compact feature representations.

Input: [B, num_mfcc, spec_width, 1] MFCC features
In-graph ops: magnitude scaling only (if enabled)
Pros: compact features, well-studied in speech/audio classification
Cons: frontend is not part of the TFLite model; must be replicated on-device

`log_mel` (precomputed)¶

Log-scaled mel spectrograms are computed offline before being fed to the model.

Input: [B, num_mels, spec_width, 1] log-mel spectrogram
In-graph ops: magnitude scaling only (if enabled)
Pros: simple, standard feature representation
Cons: frontend is not part of the TFLite model; must be replicated on-device

Deployment frontends

Only hybrid and raw produce a self-contained TFLite model with the full audio-to-prediction path. The precomputed frontends (librosa, mfcc, log_mel) require matching preprocessing on the target device.

Magnitude scaling¶

Magnitude scaling is applied after the mel projection (or filterbank) and before the CNN body. It compresses the dynamic range of spectrogram values.

`pwl` (piecewise-linear) — recommended¶

Learned piecewise-linear compression using depthwise convolution branches. Quantizes cleanly — no log operations, no running statistics.

`pcen` (per-channel energy normalization)¶

Applies automatic gain control per frequency band using a learned smoothing filter. Uses pooling and convolution — generally N6-compatible but more complex than PWL.

`db` (decibels)¶

Log-scale compression: \(20 \cdot \log_{10}(\text{mag} + \epsilon)\).

Warning

Avoid db for quantized models. The log operation produces wide dynamic ranges that lead to poor INT8 quantization.

`none`¶

No magnitude scaling. Useful as a baseline for comparison only.

N6 compatibility checklist¶

When modifying or adding frontends, verify:

[ ] Channel counts are multiples of 8
[ ] No ops that expand beyond 16-bit activation limits
[ ] All ops are in the STM32N6 NPU operator set
[ ] Run stedgeai analyze on the exported TFLite to confirm
[ ] Cosine similarity > 0.95 after quantization

Audio Frontends¶

Frontend modes¶

librosa (precomputed)¶

hybrid (default)¶

raw (waveform)¶

mfcc (precomputed)¶

log_mel (precomputed)¶