BirdNET-STM32

`birdnet_stm32.audio.spectrogram`

Spectrogram computation, magnitude scaling, and normalization.

Supports mel, log-mel, MFCC, and linear STFT spectrograms and four magnitude scaling modes (none, pwl, pcen, db). All scaling is designed to be quantization-friendly for INT8 deployment on the STM32N6 NPU: every code path ends with min-max normalization to [0, 1].
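To illustrate why a fixed [0, 1] output range suits INT8 deployment, the sketch below maps such values to int8 with an affine scheme. The helper names, the scale of 1/255, and the zero point of -128 are assumptions chosen for illustration; they are not taken from birdnet_stm32 or from the STM32N6 toolchain, which may choose different parameters.

```python
import numpy as np


def quantize_unit_range_int8(x: np.ndarray) -> tuple[np.ndarray, float, int]:
    """Affine-quantize values in [0, 1] to int8 (hypothetical helper)."""
    scale = 1.0 / 255.0  # assumed: one int8 step covers 1/255 of the range
    zero_point = -128    # assumed: real value 0.0 maps to int8 -128
    q = np.round(x / scale) + zero_point
    q = np.clip(q, -128, 127).astype(np.int8)
    return q, scale, zero_point


def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map int8 codes back to approximate real values."""
    return (q.astype(np.float32) - zero_point) * scale


x = np.array([0.0, 0.25, 0.5, 1.0], dtype=np.float32)
q, scale, zero_point = quantize_unit_range_int8(x)
x_hat = dequantize(q, scale, zero_point)

# The round-trip error is bounded by half a quantization step.
max_err = float(np.max(np.abs(x - x_hat)))
```

Because the input range is known in advance, the scale and zero point can be fixed constants instead of being calibrated per recording.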

`normalize(S)`

Min-max normalize a spectrogram to [0, 1] per sample, using the global minimum and maximum of the input array.

Parameters:

Name	Type	Description	Default
`S`	`ndarray`	Spectrogram array.	required

Returns:

Type	Description
`ndarray`	Normalized spectrogram, same shape as input.

Source code in birdnet_stm32/audio/spectrogram.py

def normalize(S: np.ndarray) -> np.ndarray:
    """Normalize a spectrogram to [0, 1] per sample.

    Args:
        S: Spectrogram array.

    Returns:
        Normalized spectrogram, same shape as input.
    """
    return (S - S.min()) / (S.max() - S.min() + 1e-10)
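A short worked example of the function above, using a local copy of its one-line body so the snippet runs on its own. The input values are made up. It shows that scaling uses the global minimum and maximum, and that a constant input maps to all zeros because the 1e-10 term keeps the denominator non-zero.

```python
import numpy as np


def normalize(S: np.ndarray) -> np.ndarray:
    # Local copy of the one-line body from the source listing.
    return (S - S.min()) / (S.max() - S.min() + 1e-10)


S = np.array([[2.0, 4.0], [6.0, 10.0]])
N = normalize(S)
# Global min (2.0) maps to 0; global max (10.0) maps to just under 1
# because of the epsilon in the denominator.

flat = normalize(np.full((2, 3), 5.0))
# A constant input has zero range, so every element becomes 0.
```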

`get_spectrogram_from_audio(audio, sample_rate=24000, n_fft=512, mel_bins=64, spec_width=256, mag_scale='none', mode='mel', n_mfcc=20)`

Compute a spectrogram with optional magnitude scaling; the result is always min-max normalized to [0, 1].

Modes

'mel': Standard mel spectrogram.
'mfcc': Mel-frequency cepstral coefficients (mel -> DCT -> truncate).
'log_mel': Log-scaled mel spectrogram (log1p, quantization-friendly).
'linear': Linear STFT magnitude (also selected in 'mel' mode when mel_bins <= 0).
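As a rough illustration of the 'log_mel' post-processing (log1p followed by min-max normalization), the sketch below applies both steps to a small made-up matrix that stands in for a magnitude mel spectrogram. It skips the mel computation itself, and the input values are assumptions chosen only to span a wide dynamic range.

```python
import numpy as np

# Toy stand-in for a magnitude mel spectrogram (2 bands x 3 frames).
S = np.array([[0.0, 1.0, 10.0], [100.0, 1000.0, 10000.0]])

# log1p computes log(1 + S), which is defined at S == 0, unlike log(S).
S_log = np.log1p(S)

# Same min-max step as normalize(), written inline.
S_out = (S_log - S_log.min()) / (S_log.max() - S_log.min() + 1e-10)

# Ratio of largest to smallest non-zero value, before and after compression.
ratio_before = S.max() / S[S > 0].min()
ratio_after = S_log.max() / S_log[S_log > 0].min()
```

The compressed values keep their ordering while the ratio between the loudest and quietest non-zero bins shrinks considerably, which leaves more of the [0, 1] range available to quiet content.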

Behavior by mag_scale (applied only in 'mel' and 'linear' modes)

'none': Magnitude mel (power=1.0), then normalize to [0, 1].
'pcen': Magnitude mel, scale to 32-bit PCM range, librosa.pcen, normalize.
'pwl': Magnitude mel, pre-normalize, piecewise compression, normalize.
'db': Magnitude mel, amplitude_to_db(ref=max), normalize.
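The 'pwl' curve can be inspected in isolation. The sketch below evaluates it at its breakpoints, using the thresholds and coefficients from the `pwl` branch of the source listing for this function; the helper name `pwl` and the choice of sample points are assumptions made for illustration.

```python
import numpy as np


def pwl(x: np.ndarray) -> np.ndarray:
    """Piecewise-linear curve with the constants from the source listing."""
    t1, t2, t3 = 0.10, 0.35, 0.65
    k0, k1, k2, k3 = 0.40, 0.25, 0.15, 0.08
    return (
        k0 * x
        + k1 * np.maximum(x - t1, 0.0)
        + k2 * np.maximum(x - t2, 0.0)
        + k3 * np.maximum(x - t3, 0.0)
    )


# Evaluate at 0, at each breakpoint, and at 1 (inputs are pre-normalized).
x = np.array([0.0, 0.10, 0.35, 0.65, 1.0])
y = pwl(x)

# Slope of each linear segment between consecutive sample points.
segment_slopes = np.diff(y) / np.diff(x)
```

Each hinge term adds its coefficient to the slope of every segment to its right, so the segment slopes are the running sums of k0 through k3. The curve reaches about 0.75 at an input of 1.0, and the final normalization step rescales the result to [0, 1].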

Parameters:

Name	Type	Description	Default
`audio`	`ndarray`	1D audio array (mono).	required
`sample_rate`	`int`	Sampling rate (Hz).	`24000`
`n_fft`	`int`	FFT size for STFT.	`512`
`mel_bins`	`int`	Number of mel bands, or <=0 for linear STFT bins (magnitude).	`64`
`spec_width`	`int`	Target number of time frames (columns).	`256`
`mag_scale`	`str`	'none' \| 'db' \| 'pcen' \| 'pwl'.	`'none'`
`mode`	`str`	'mel' \| 'mfcc' \| 'log_mel' \| 'linear'.	`'mel'`
`n_mfcc`	`int`	Number of MFCC coefficients (only used when mode='mfcc').	`20`

Returns:

Type	Description
`ndarray`	Spectrogram array of shape (rows, spec_width), values in [0, 1]. The row count is mel_bins for 'mel' and 'log_mel', n_mfcc for 'mfcc', and n_fft // 2 + 1 for linear STFT.
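The output shape follows from simple arithmetic on the arguments. The sketch below reproduces it without calling librosa. It assumes spec_width > 0, librosa's default centered framing (1 + num_samples // hop_length frames), and a 3-second clip at 24 kHz as the example input; `expected_shape` is a hypothetical helper, not part of the module.

```python
def expected_shape(
    num_samples: int,
    spec_width: int = 256,
    n_fft: int = 512,
    mel_bins: int = 64,
    mode: str = "mel",
    n_mfcc: int = 20,
) -> tuple[int, int, int]:
    """Return (rows, frames, hop_length) for a given input length."""
    # Same hop rule as the function, restricted to spec_width > 0.
    hop_length = num_samples // spec_width
    if hop_length < 1:
        # Audio shorter than spec_width samples yields a zero hop,
        # which librosa is expected to reject.
        raise ValueError("audio must have at least spec_width samples")

    # Frame count under centered framing, then truncation to spec_width.
    n_frames = min(1 + num_samples // hop_length, spec_width)

    if mode == "mfcc":
        rows = min(n_mfcc, mel_bins)  # DCT output truncated to n_mfcc
    elif mode == "log_mel":
        rows = mel_bins
    elif mel_bins <= 0 or mode == "linear":
        rows = n_fft // 2 + 1  # one-sided STFT bins
    else:
        rows = mel_bins
    return rows, n_frames, hop_length


num_samples = 3 * 24000  # assumed 3 s clip at the default sample rate
mel_shape = expected_shape(num_samples)
linear_shape = expected_shape(num_samples, mode="linear")
mfcc_shape = expected_shape(num_samples, mode="mfcc")
```

With these inputs the hop length is 281 samples, which produces 257 frames before truncation to 256 columns.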

Source code in birdnet_stm32/audio/spectrogram.py

def get_spectrogram_from_audio(
    audio: np.ndarray,
    sample_rate: int = 24000,
    n_fft: int = 512,
    mel_bins: int = 64,
    spec_width: int = 256,
    mag_scale: str = "none",
    mode: str = "mel",
    n_mfcc: int = 20,
) -> np.ndarray:
    """Compute a magnitude spectrogram with optional scaling and normalization.

    Modes:
        - 'mel': Standard mel spectrogram.
        - 'mfcc': Mel-frequency cepstral coefficients (mel -> DCT -> truncate).
        - 'log_mel': Log-scaled mel spectrogram (log1p, quantization-friendly).
        - 'linear': Linear STFT magnitude (when mel_bins <= 0).

    Behavior by mag_scale (applied only in 'mel' and 'linear' modes):
        - 'none': Magnitude mel (power=1.0), then normalize to [0, 1].
        - 'pcen': Magnitude mel, scale to 32-bit PCM range, librosa.pcen, normalize.
        - 'pwl': Magnitude mel, pre-normalize, piecewise compression, normalize.
        - 'db': Magnitude mel, amplitude_to_db(ref=max), normalize.

    Args:
        audio: 1D audio array (mono).
        sample_rate: Sampling rate (Hz).
        n_fft: FFT size for STFT.
        mel_bins: Number of mel bands, or <=0 for linear STFT bins (magnitude).
        spec_width: Target number of time frames (columns).
        mag_scale: 'none' | 'db' | 'pcen' | 'pwl'.
        mode: 'mel' | 'mfcc' | 'log_mel' | 'linear'.
        n_mfcc: Number of MFCC coefficients (only used when mode='mfcc').

    Returns:
        Spectrogram array (mel_bins or n_mfcc or fft_bins, spec_width), values in [0, 1].
    """
    hop_length = (len(audio) // spec_width) if spec_width > 0 else n_fft // 2

    if mode == "mfcc":
        S_mel = librosa.feature.melspectrogram(
            y=audio,
            sr=sample_rate,
            n_fft=n_fft,
            hop_length=hop_length,
            win_length=n_fft,
            window="hann",
            n_mels=mel_bins,
            power=2.0,
            fmin=150,
            fmax=sample_rate // 2,
            htk=False,
            norm="slaney",
        )
        S_log = librosa.power_to_db(S_mel, ref=np.max)
        S = librosa.feature.mfcc(
            S=S_log,
            n_mfcc=n_mfcc,
            norm="ortho",
        )
        S = S[:, :spec_width]
        return normalize(S)

    if mode == "log_mel":
        S = librosa.feature.melspectrogram(
            y=audio,
            sr=sample_rate,
            n_fft=n_fft,
            hop_length=hop_length,
            win_length=n_fft,
            window="hann",
            n_mels=mel_bins,
            power=1.0,
            fmin=150,
            fmax=sample_rate // 2,
            htk=False,
            norm="slaney",
        )
        S = S[:, :spec_width]
        S = np.log1p(S)
        return normalize(S)

    if mel_bins <= 0 or mode == "linear":
        S = np.abs(
            librosa.stft(
                y=audio,
                n_fft=n_fft,
                hop_length=hop_length,
                win_length=n_fft,
                window="hann",
            )
        )
    else:
        S = librosa.feature.melspectrogram(
            y=audio,
            sr=sample_rate,
            n_fft=n_fft,
            hop_length=hop_length,
            win_length=n_fft,
            window="hann",
            n_mels=mel_bins,
            power=1.0,
            fmin=150,
            fmax=sample_rate // 2,
            htk=False,
            norm="slaney",
        )

    # Ensure fixed width
    S = S[:, :spec_width]

    if mag_scale == "pcen":
        S = librosa.pcen(S * (2.0**31), sr=sample_rate, hop_length=hop_length, axis=1)

    elif mag_scale == "pwl":
        Smin, Smax = S.min(), S.max()
        Snorm = (S - Smin) / (Smax - Smin + 1e-10)
        t1, t2, t3 = 0.10, 0.35, 0.65
        k0, k1, k2, k3 = 0.40, 0.25, 0.15, 0.08
        relu = lambda z: np.maximum(z, 0.0)  # noqa: E731
        S = k0 * Snorm + k1 * relu(Snorm - t1) + k2 * relu(Snorm - t2) + k3 * relu(Snorm - t3)

    elif mag_scale == "db":
        S = librosa.amplitude_to_db(S, ref=np.max)

    return normalize(S)