Source Modules¶
Detailed reference for every firmware C source file. The firmware is intentionally compact — ~700 lines of application code across 5 files, plus a 230-line FFT.
main.c — Orchestrator¶
Location: firmware/Src/main.c (~310 lines)
The entry point. Initializes the board, mounts the SD card, loops over WAV files, and coordinates the STFT → NPU → output pipeline.
Key Sections¶
Board init — a 12-step sequence that must execute in order. See Building & Flashing — Init Sequence for the full table.
UART redirect — provides __io_putchar() which routes printf() output to
USART1 via HAL. The weak symbol in syscalls.c (from NPU_Validation) calls
this.
int __io_putchar(int ch)
{
HAL_UART_Transmit(&UartHandle, (uint8_t *)&ch, 1, HAL_MAX_DELAY);
return ch;
}
NPU inference — run_inference() handles the CPU ↔ NPU data transfer:
- Query input/output buffer addresses from LL_ATON.
memcpydata into the NPU input buffer. Depending onAPP_AUDIO_FRONTEND:- Hybrid:
spec_buf(STFT spectrogram) - Raw:
audio_buf(raw waveform directly) - Precomputed:
mel_buf(STFT + Mel filterbank) SCB_CleanDCache_by_Addr()— flush CPU cache so the NPU sees fresh data.LL_ATON_RT_Main()— run inference (blocking).SCB_InvalidateDCache_by_Addr()— invalidate cache so the CPU sees NPU output.memcpyscores out of the NPU output buffer.
Top-K selection — simple O(K×N) partial sort. Prints predictions with score
≥ APP_SCORE_THRESHOLD.
Benchmark timing — HAL_GetTick() (1 ms resolution) wraps each stage:
read, STFT, NPU. Per-file [BENCH] lines and an aggregate summary at the end.
Functions¶
| Function | Signature | Purpose |
|---|---|---|
main() |
int main(void) |
Entry point: init + processing loop |
run_inference() |
bool run_inference(const float *, float *) |
Copy spec to NPU, run, copy scores back |
print_top_k() |
void print_top_k(const char *, const float *, int) |
Print top-K over UART |
__io_putchar() |
int __io_putchar(int) |
UART printf redirect |
aiValidationInit() |
static void aiValidationInit(void) |
GDB breakpoint stub (do not remove) |
Error_Handler() |
void Error_Handler(void) |
HAL error hook (infinite loop) |
assert_failed() |
void assert_failed(uint8_t *, uint32_t) |
HAL assert hook (infinite loop) |
wav_reader.c — WAV File Parser¶
Location: firmware/Src/wav_reader.c (~130 lines)
Parses standard RIFF/WAVE files with PCM encoding.
How It Works¶
- Reads 12-byte RIFF header, validates
"RIFF"and"WAVE"magic. - Walks sub-chunks looking for
"fmt "and"data": fmt— extracts channels, sample rate, bits per sample.data— records the file offset and byte count.- Unknown chunks are skipped by reading their size and seeking past them.
- Validates: PCM format (tag 1), 16-bit, correct sample rate.
- Returns a
WavInfostruct with all parsed metadata.
wav_read_chunk_f32()¶
Reads a chunk of PCM16 audio from the file and converts to float32:
- Reads
num_samples × num_channels × 2bytes via FatFsf_read(). - Converts each
int16_tsample tofloat32by dividing by 32768. - For stereo: extracts channel 0 only (every other sample).
- Zero-pads if the file is shorter than
num_samples.
Data Structures¶
typedef struct {
uint16_t num_channels; // 1 = mono, 2 = stereo
uint32_t sample_rate; // e.g. 24000
uint16_t bits_per_sample; // must be 16
uint32_t data_size; // PCM data size in bytes
uint32_t num_samples; // total samples (per channel)
uint32_t data_offset; // file offset to start of PCM data
} WavInfo;
Limitations¶
- 16-bit PCM only — no float32 WAV, A-law, mu-law, or ADPCM.
- No resampling — sample rate must match
APP_SAMPLE_RATE. - First chunk only — reads from the start of the file, not seeking to arbitrary positions.
audio_stft.c — STFT Engine¶
Location: firmware/Src/audio_stft.c (~68 lines)
Computes a Hann-windowed magnitude STFT using fft.c.
stft_magnitude()¶
void stft_magnitude(const float *audio, uint32_t num_samples,
uint32_t fft_length, uint32_t hop_length,
uint32_t spec_width, float *out);
Algorithm:
- Pre-compute a Hann window of
fft_lengthsamples (done once, cached). - For each of
spec_widthtime frames: - Extract
fft_lengthsamples starting atframe × hop_length. - Multiply by the Hann window.
- Call
fft_512_real()for the FFT. - Compute magnitude:
sqrt(re² + im²)for each of 257 bins. - Store in output as
out[freq_bin * spec_width + frame](frequency-major). - Zero-fill if the audio is shorter than expected.
Output layout: [filters, spec_width] — frequency-major (each row is one
frequency bin across all time frames). This matches the expected layout layout
[B, filters, spec_width, 1] for both the hybrid and precomputed frontends.
Why frequency-major?
The TFLite model's first layer expects input shaped [B, filters, T, 1].
By storing frequency-major on the firmware side, the memcpy to the NPU
input buffer preserves the correct layout without a transpose.
Stack Usage¶
Working buffers are stack-allocated: - Hann window: 512 × 4 = 2,048 bytes - FFT work buffer: 512 × 4 = 2,048 bytes - Total: ~4 KB stack per call
audio_mel.c — Mel Filterbank¶
Location: firmware/Src/audio_mel.c (~80 lines)
Computes a triangular Mel-frequency filterbank. Only compiled and used if APP_AUDIO_FRONTEND == APP_FRONTEND_PRECOMPUTED.
mel_filterbank()¶
Algorithm:
1. A 1D array of pre-calculated APP_NUM_MELS triangular weights is generated once during startup by mel_init().
2. The mel_filterbank() applies that matrix over the frequency-major [APP_FFT_BINS, spec_width] spectrogram.
3. Produces a compacted [APP_NUM_MELS, spec_width] output array.
This reproduces librosa's Slaney-normalized mel weight matrices natively on the Cortex-M55 CPU.
fft.c — 512-Point Real FFT¶
Location: firmware/Src/fft.c (~230 lines)
A self-contained radix-2 Decimation-in-Time (DIT) FFT with zero external dependencies.
Why Not CMSIS-DSP?¶
The CMSIS-DSP arm_rfft_fast_f32 requires linking libarm_cortexM55l_math.a,
which adds ~200 KB to the binary and complicates the Makefile. Our custom
512-point FFT is 230 lines of plain C, has no dependencies, and runs in ~0.05 ms
per frame at 800 MHz. For a 256-frame spectrogram the entire STFT takes
~25–35 ms — negligible compared to the 3 s audio chunk.
Algorithm¶
- Input packing: treat 512 real samples as 256 complex pairs
(
x[2k] + j·x[2k+1]). - Bit-reversal permutation: reorder the 256 complex values for in-place butterfly computation.
- Butterfly stages: 8 stages of radix-2 DIT butterflies with precomputed
twiddle factors (
cos+j·sin). - Split-radix unpack: decompose the 256-point complex FFT result into 257 real-FFT bins (DC through Nyquist) using even/odd twiddle factors.
Output Format¶
CMSIS-DSP compatible layout:
| Index | Content |
|---|---|
buf[0] |
DC component (real) |
buf[1] |
Nyquist component (real) |
buf[2k], buf[2k+1] |
Real and imaginary parts of bin k (k = 1..255) |
Performance¶
| Metric | Value |
|---|---|
| Per-frame time | ~0.05 ms @ 800 MHz |
| 256 frames | ~13 ms (FFT only, excludes windowing and magnitude) |
| Table init | One-time, first call only |
| Static memory | ~4 KB (twiddle + bit-reversal tables) |
Limitations¶
The FFT size is hardcoded at 512 (256 complex points). The twiddle tables, bit-reversal tables, and buffer sizes are all compile-time constants. To support other FFT sizes, the code would need generalization or conditional compilation.
sd_handler.c — SD Card + FatFs¶
Location: firmware/Src/sd_handler.c (~120 lines)
Manages the SD card via BSP_SD (SDMMC2) and FatFs filesystem.
Functions¶
| Function | Purpose |
|---|---|
sd_mount() |
Init BSP SD, link FatFs diskio driver, mount filesystem |
sd_unmount() |
Unmount filesystem, de-init SD |
sd_scan_audio_dir(dir, list) |
Enumerate up to 512 .wav files in dir |
sd_write_header(path, classes, n) |
Write TSV header row to results file |
sd_append_result(path, name, scores, n) |
Append one TSV row with all class scores |
sd_scan_audio_dir()¶
Uses FatFs f_opendir / f_readdir to enumerate files:
- Non-recursive (flat directory only).
- Checks extension:
.wavor.WAV. - Stores full paths (
/audio/filename.wav) in anSdFileListstruct. - Limited to
SD_MAX_FILES(512) entries.
sd_append_result()¶
Writes scores as integer-formatted values because FatFs's f_printf doesn't
support %f. Scores are multiplied by 10,000 and written as integers (e.g.,
9230 for 0.923).
Data Structures¶
#define SD_MAX_PATH 256
#define SD_MAX_FILES 512
typedef struct {
char paths[SD_MAX_FILES][SD_MAX_PATH];
uint32_t count;
} SdFileList;
Memory usage
SdFileList is ~128 KB. It's declared static in main.c to keep it off
the stack.
Board Support Files (NPU_Validation)¶
These files come from ST's NPU_Validation example project and are not modified by the board-test workflow:
| File | Purpose |
|---|---|
misc_toolbox.c/h |
UART config, NPU config, RISAF config; UartHandle global |
mcu_cache.c/h |
CPU cache enable/clean/invalidate helpers |
npu_cache.c/h |
NPU-specific cache management |
system_clock_config.c |
Clock tree setup (HSI overdrive, PLL config) |
stm32n6xx_it.c |
Interrupt handlers (SysTick, HardFault, etc.) |
syscalls.c |
Newlib stubs (_write, _read, _sbrk) for printf/malloc |
sysmem.c |
Heap region definition |
startup_stm32n657xx.s |
Vector table, reset handler, initial stack pointer |
These provide the foundation that our application code builds on. They handle clock trees, voltage rails, cache management, and the NPU register interface.