# Architecture

## Pipeline overview

```mermaid
flowchart TD
    A["Audio (.wav)"] --> B["Audio Frontend\nlibrosa / hybrid / raw"]
    B -->|"[B, H, W, 1]"| C["Magnitude Scaling\npwl / pcen / db / none"]
    C -->|"[B, H, W, 1]"| D["DS-CNN Body\n4 stages × depth_multiplier blocks\nchannels scaled by α"]
    D -->|"[B, H', W', C]"| E["Global Avg Pool\nDropout → Dense"]
    E -->|"[B, num_classes]"| F["Predictions"]
```
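The DS-CNN body scales each stage's channel count by the width multiplier α, and the N6 NPU prefers channel counts that are multiples of 8. A minimal sketch of how such a helper might combine the two (the `scaled_channels` name and the round-up-to-alignment policy are assumptions, not the repo's actual implementation):

```python
import math

def scaled_channels(base: int, alpha: float, align: int = 8) -> int:
    """Scale a base channel count by the width multiplier alpha,
    then round up to the nearest multiple of `align` so the N6 NPU's
    8-wide vector units stay fully utilized."""
    return max(align, int(math.ceil(base * alpha / align)) * align)
```

For example, a 64-channel stage at α = 0.5 keeps 32 channels, while α = 0.3 yields 24 rather than the misaligned 19.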

## Component boundaries

| Component | Module | Responsibility |
| --- | --- | --- |
| Audio I/O | `birdnet_stm32.audio.io` | Load, resample, and chunk WAV files |
| Spectrogram | `birdnet_stm32.audio.spectrogram` | Compute mel spectrograms (librosa path) |
| Frontend layer | `birdnet_stm32.models.frontend` | In-graph frontend (hybrid/raw modes + magnitude scaling) |
| Model builder | `birdnet_stm32.models.dscnn` | DS-CNN construction with scaling knobs |
| Data pipeline | `birdnet_stm32.data.dataset` | File discovery, upsampling, tf.data generation |
| Training | `birdnet_stm32.training.trainer` | Training loop, LR schedule, callbacks |
| Conversion | `birdnet_stm32.conversion.quantize` | PTQ, representative dataset, TFLite export |
| Validation | `birdnet_stm32.conversion.validate` | Keras vs. TFLite output comparison |
| Evaluation | `birdnet_stm32.evaluation` | Pooling, metrics (ROC-AUC, cmAP, F1), reporting |
| Deployment | `birdnet_stm32.deploy` | Config resolution, stedgeai/n6_loader wrappers |

## Data flow

### Training

```mermaid
flowchart LR
    A["WAV files"] --> B["load_file_paths\n+ upsample"]
    B --> C["data.generator\nbatches"]
    C --> D["AudioFrontendLayer\nhybrid / raw"]
    D --> E["DS-CNN\nsigmoid"]
    E --> F["Cosine LR decay\n+ early stopping"]
    F --> G[".keras checkpoint"]
```
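The cosine LR decay in the training flow follows the standard annealing curve: the learning rate starts at its maximum and falls to a floor along a half-cosine over the training horizon. A dependency-free sketch (the parameter names and the choice of a zero floor are illustrative, not the trainer's actual defaults):

```python
import math

def cosine_lr(step: int, total_steps: int,
              lr_max: float = 1e-3, lr_min: float = 0.0) -> float:
    """Cosine-annealed learning rate: lr_max at step 0,
    decaying smoothly to lr_min at total_steps."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```

At the midpoint the rate is exactly halfway between the two bounds, which is what gives cosine decay its gentle start and finish compared with step schedules.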

### Inference

```mermaid
flowchart LR
    A["Audio file"] --> B["Chunk into\nfixed-length segments"]
    B --> C["Model\nfrontend + backbone + head"]
    C --> D["Pool chunk scores\navg / max / LME"]
    D --> E["Compute metrics\nvs. ground truth"]
```
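The pooling step collapses per-chunk sigmoid scores into one file-level score per class. A minimal NumPy sketch of the three modes, assuming LME means log-mean-exp without a sharpness parameter (the actual evaluation code may parameterize it differently):

```python
import numpy as np

def pool_chunk_scores(scores: np.ndarray, mode: str = "avg") -> np.ndarray:
    """Reduce [num_chunks, num_classes] chunk scores to one
    [num_classes] score per file."""
    if mode == "avg":
        return scores.mean(axis=0)
    if mode == "max":
        return scores.max(axis=0)
    if mode == "lme":
        # log-mean-exp: a smooth compromise that sits between avg and max
        return np.log(np.mean(np.exp(scores), axis=0))
    raise ValueError(f"unknown pooling mode: {mode}")
```

By Jensen's inequality the LME score is always at least the average and at most the maximum, which makes it a useful middle ground when a species vocalizes in only a few chunks of a long file.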

### Deployment

```mermaid
flowchart LR
    A[".tflite model"] --> B["stedgeai generate\nN6-optimized binary"]
    B --> C["n6_loader.py\nserial flash"]
    C --> D["stedgeai validate\non-device inference"]
```

## Key design decisions

- **Float32 I/O:** Audio spectrograms are continuous-valued; INT8 inputs would lose meaningful precision. Only internal weights and activations are quantized.
- **PWL over PCEN/dB:** Piecewise-linear magnitude scaling quantizes cleanly (no log ops, no running statistics). PCEN is acceptable; dB should be avoided.
- **Hybrid as default frontend:** Keeps the STFT offline (cheaper) while learning the mel projection in-graph, giving the TFLite model a complete mel-to-prediction path.
- **Channel alignment to 8:** The N6 NPU vectorizes in groups of 8. Misaligned channel counts waste compute cycles or cause compilation failures.
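To make the PWL decision concrete: a piecewise-linear curve can approximate log-style magnitude compression using only comparisons, multiplies, and adds, all of which quantize cleanly to INT8. The knot positions below are hypothetical, chosen only to illustrate the shape; the repo's actual breakpoints (and whether they are learned) may differ:

```python
import numpy as np

# Hypothetical knots approximating log-style compression on [0, 1].
# Each segment is a straight line, so inference needs no log ops and
# no running statistics (unlike dB or PCEN).
_X = np.array([0.0, 0.01, 0.05, 0.2, 1.0])
_Y = np.array([0.0, 0.25, 0.50, 0.75, 1.0])

def pwl_scale(mag: np.ndarray) -> np.ndarray:
    """Piecewise-linear magnitude compression: boosts quiet bins,
    compresses loud ones, stays monotonic throughout."""
    return np.interp(np.clip(mag, 0.0, 1.0), _X, _Y)
```

The curve maps 0 to 0 and 1 to 1 while lifting small magnitudes sharply, mimicking log compression without the numeric range problems that make true dB scaling hostile to post-training quantization.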