Model Architecture¶
The BirdNET Geomodel is a multi-task neural network that predicts species occurrence from raw location and time inputs.
Design Philosophy¶
The model is designed with a key constraint: at inference time, only latitude, longitude, and week number are needed. No environmental data, no preprocessing — just three numbers.
To make this work, the model learns spatial and temporal patterns during training by jointly predicting:
- Species occurrence (primary task) — which species are present at a location/time
- Environmental features (auxiliary task) — what the environment looks like at a location
The auxiliary task acts as a regularizer, encouraging the model to learn meaningful spatial representations even when species labels are sparse.
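The joint objective can be sketched as a weighted sum of the two task losses. This is a minimal sketch, not the project's actual training code: the loss functions and the `env_weight` knob are assumptions (only the species/env task split is stated above).

```python
import math

def bce(logit: float, target: float) -> float:
    # Binary cross-entropy on one logit (species occurrence is multi-label,
    # so each species gets an independent sigmoid + BCE term)
    p = 1 / (1 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def mse(pred: float, target: float) -> float:
    return (pred - target) ** 2

def joint_loss(species_logits, species_targets, env_pred, env_target,
               env_weight: float = 1.0) -> float:
    # Primary species loss plus the auxiliary environmental regression;
    # env_weight is a hypothetical weighting parameter.
    s = sum(bce(l, t) for l, t in zip(species_logits, species_targets)) / len(species_logits)
    e = sum(mse(p, t) for p, t in zip(env_pred, env_target)) / len(env_pred)
    return s + env_weight * e
```

Because the environmental term backpropagates through the shared encoder, it shapes the spatial embedding even at locations with few species labels.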
Architecture Overview¶
```mermaid
graph TD
    subgraph Input
        A[lat, lon, week]
    end
    subgraph CircularEncoding
        B["lat → sin/cos harmonics (2 × n)"]
        C["lon → sin/cos harmonics (2 × n)"]
        D["week → sin/cos harmonics (2 × n)"]
    end
    A --> B
    A --> C
    A --> D
    subgraph Encoder
        E["Concatenate lat + lon"]
        F[Linear Projection]
        G["Residual Block × N<br/>(each FiLM-modulated)"]
        H[LayerNorm]
    end
    B --> E
    C --> E
    D --> G
    E --> F --> G --> H
    subgraph Heads
        I["Species Head<br/>(multi-label classification)"]
        J["Environmental Head<br/>(regression)"]
    end
    H --> I
    H --> J
    subgraph "Habitat Head (optional)"
        K["Habitat-Species Head<br/>(env → species)"]
        L["Learned Gate σ(W·emb + b)"]
        M["gate × direct +<br/>(1−gate) × habitat"]
    end
    J -.->|detach| K
    K --> M
    I --> M
    H --> L
    L --> M
```
Components¶
Circular Encoding¶
Raw coordinates and week numbers are poor inputs for neural networks — the model wouldn't know that longitude -180° and +180° are the same place, or that week 48 is adjacent to week 1.
Circular encoding solves this by mapping each value to sine/cosine pairs at multiple harmonics (Tancik et al., 2020):
- **Latitude**: degrees → radians, then encoded with `coord_harmonics` harmonics (default 8 → 16 features)
- **Longitude**: same as latitude (16 features)
- **Week**: mapped to \([0, 2\pi)\) over 48 weeks, then encoded with `week_harmonics` harmonics (default 8 → 16 features)
Spatial input features: \(2 \times 2 \times n\) = 32 by default (where \(n\) = coord_harmonics). Week features (16) are used for FiLM conditioning rather than concatenated.
Year-round predictions (week 0) are computed at inference time as the max across all 48 weekly predictions — no special week-0 encoding is needed.
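The wrap-around behavior can be sketched with a minimal Fourier-feature encoder. `circular_encode` is a hypothetical helper; the real implementation's feature ordering and harmonic multipliers may differ.

```python
import math

def circular_encode(value: float, period: float, n_harmonics: int = 8) -> list[float]:
    # Map a periodic value to an angle in [0, 2*pi), then emit sin/cos
    # pairs at harmonics 1..n. 8 harmonics -> 16 features.
    theta = 2 * math.pi * (value % period) / period
    feats = []
    for k in range(1, n_harmonics + 1):
        feats.extend([math.sin(k * theta), math.cos(k * theta)])
    return feats

# Wrap-around: longitude -180° and +180° encode to identical features
west = circular_encode(-180.0, 360.0)
east = circular_encode(180.0, 360.0)
```

The same property makes week 48 land next to week 1 on the circle, so seasonal patterns are continuous across the year boundary.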
Shared Encoder (SpatioTemporalEncoder)¶
The encoder maps spatial coordinates into a rich embedding, modulated by temporal information via FiLM (Feature-wise Linear Modulation; Perez et al., 2018):
- **Spatial projection**: concatenated lat+lon circular features → Linear to `embed_dim` (default 512)
- **Residual blocks**: each block applies LayerNorm → GELU → Linear → LayerNorm → GELU → Dropout → Linear with a skip connection. All LayerNorm layers use `eps=1e-4` (above the FP16 minimum normal ~6e-5) so that the epsilon retains full precision after half-precision quantization.
- **FiLM conditioning**: after each residual block, the week encoding generates per-block scale (γ) and shift (β) parameters via a two-layer MLP: \(x' = (\gamma + 1) \cdot \text{block}(x) + \beta\). The \(+1\) centers γ around identity (no scaling) at initialization, stabilizing early training. This forces the model to actively modulate spatial representations based on the time of year.
- Final LayerNorm for stable downstream processing
The pre-norm residual design ensures stable training and strong gradient flow even with many blocks.
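The FiLM identity-at-initialization property can be demonstrated with a NumPy sketch. The MLP hidden size and GELU approximation here are assumptions; only the \((\gamma + 1) \cdot \text{block}(x) + \beta\) form comes from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, week_dim = 8, 4

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def film_modulate(block_out, week_feats, W1, b1, W2, b2):
    # Two-layer MLP maps the week encoding to per-block (gamma, beta);
    # output is (gamma + 1) * block(x) + beta, so gamma = 0 means identity scale.
    h = gelu(week_feats @ W1 + b1)
    gamma, beta = np.split(h @ W2 + b2, 2)
    return (gamma + 1) * block_out + beta

# With the final FiLM layer zero-initialized, modulation is the identity:
block_out = rng.standard_normal(dim)
week = rng.standard_normal(week_dim)
W1 = rng.standard_normal((week_dim, week_dim))
b1 = np.zeros(week_dim)
W2 = np.zeros((week_dim, 2 * dim))   # zero init -> gamma = beta = 0
b2 = np.zeros(2 * dim)
out = film_modulate(block_out, week, W1, b1, W2, b2)
```

Zero-initializing the γ/β projection means early training behaves like a plain spatial encoder, and temporal modulation is learned gradually.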
Species Prediction Head¶
A multi-label classification head that outputs one logit per species:
- Residual blocks for further processing
- Low-rank bottleneck: instead of a single large Linear(hidden → n_species), the head uses Linear(hidden → bottleneck) → GELU → Linear(bottleneck → n_species)
The bottleneck (default 128) dramatically reduces parameters when n_species is large (10K+) and learns a compact species-embedding space whose dimensions can be interpreted as latent ecological niches.
Output: raw logits (apply sigmoid for probabilities).
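The parameter savings from the bottleneck are easy to verify with the counts stated above (hidden = 512, bottleneck = 128, ~12K species; biases omitted for simplicity):

```python
def head_params(hidden: int, bottleneck: int, n_species: int) -> tuple[int, int]:
    # Weight counts for a single full projection vs. the low-rank pair
    # Linear(hidden -> bottleneck) -> Linear(bottleneck -> n_species)
    full = hidden * n_species
    low_rank = hidden * bottleneck + bottleneck * n_species
    return full, low_rank

full, low_rank = head_params(512, 128, 12_000)
# full = 6,144,000 weights; low_rank = 1,601,536 weights (~3.8x smaller)
```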
Environmental Prediction Head¶
A regression head that predicts normalized environmental features (elevation, temperature, precipitation, etc.) from the shared embedding. Used during training as an auxiliary objective. When the habitat-species head is enabled (see below), the environmental head also runs during inference so its output can feed the habitat pathway.
Habitat-Species Association Head (optional)¶
Enabled with --habitat_head, this head creates an explicit pathway from predicted environment to species occurrence, making the relationship directly learnable rather than implicit in the shared encoder.
Architecture:
- **Input**: predicted environmental features from the environmental head, detached (`env_pred.detach()`) so species-loss gradients don't corrupt the env head's MSE regression objective — the env head learns clean environmental representations from MSE alone
- **Projection + residual blocks + low-rank bottleneck** → species logits (same structure as the species head)
- **Learned gate**: a per-species gate \(g = \sigma(W \cdot e + b)\) conditioned on the encoder embedding \(e\) combines the two pathways: \(\text{logits} = g \cdot \text{direct} + (1 - g) \cdot \text{habitat}\).
  The gate is initialized with zero weights and bias = 3 (\(\sigma(3) \approx 0.95\)), so the direct species head strongly dominates initially and the habitat contribution only fades in once the env and habitat heads have learned useful representations.
- **Auxiliary habitat loss**: the same species loss function is applied directly to the habitat head's logits (before gating) with weight `--habitat_weight` (default 0.5). This gives the habitat head a full-strength learning signal independent of the gate value — critical because the gate initially suppresses the habitat contribution to ~5%.
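The gating arithmetic is small enough to spell out. This sketch collapses the per-species gate to a scalar for clarity; in the real head the gate logit comes from the encoder embedding.

```python
import math

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

def gated_logit(direct: float, habitat: float, gate_logit: float) -> float:
    # Convex blend of the direct and habitat pathways for one species
    g = sigmoid(gate_logit)
    return g * direct + (1 - g) * habitat

# With zero gate weights and bias = 3, the direct pathway dominates:
g0 = sigmoid(3.0)  # ~0.95
```

At initialization the habitat pathway contributes only ~5% of the combined logit, which is exactly why the auxiliary habitat loss bypasses the gate.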
Why this helps:
- The direct species head learns spatial-temporal patterns from coordinates alone
- The habitat head learns explicit environment→species associations (e.g., "high elevation + conifer forest → Clark's Nutcracker")
- The stop-gradient ensures the env head produces stable, accurate environmental features, while the habitat head learns from those clean representations
- Together, they enable the model to predict species in unobserved areas with similar environments to observed ones — without needing data-level label propagation
Parameter overhead: the gate linear layer adds embed_dim × n_species parameters; the habitat head itself is similar in size to the environmental head. At scale 0.5 with 12K species, this adds ~3.9M parameters.
Model Scaling¶
Model capacity is controlled by a continuous scaling factor (--model_scale).
All dimensions scale linearly from the reference point at 1.0; block counts
are rounded to the nearest integer (minimum 1).
Approximate parameter counts assume ~12K species (the species head's final
layer adds bottleneck × n_species parameters, so the total varies with
vocabulary size):
| Scale | Embed Dim | Encoder Blocks | Species Head | Bottleneck | Env Head | Approx. Parameters |
|---|---|---|---|---|---|---|
| 0.25 | 128 | 1 | 128, 1 block | 32 | 64, 1 block | ~0.5M |
| 0.50 | 256 | 2 | 256, 1 block | 64 | 128, 1 block | ~1.8M |
| 0.75 | 384 | 3 | 384, 2 blocks | 96 | 192, 1 block | ~3.8M |
| 1.00 | 512 | 4 | 512, 2 blocks | 128 | 256, 1 block | ~7.2M |
| 1.25 | 640 | 5 | 640, 2 blocks | 160 | 320, 1 block | ~12.4M |
| 1.50 | 768 | 6 | 768, 3 blocks | 192 | 384, 2 blocks | ~21.2M |
| 1.75 | 896 | 7 | 896, 4 blocks | 224 | 448, 2 blocks | ~33.2M |
| 2.00 | 1024 | 8 | 1024, 4 blocks | 256 | 512, 2 blocks | ~36M |
The totals above assume ~12K species; with a larger vocabulary, the species head's bottleneck × n_species output layer grows proportionally.
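The linear scaling rule can be reproduced from the scale = 1.0 reference row. The exact rounding rules are assumptions, but this sketch matches the table's dimension columns:

```python
def scaled_dims(scale: float) -> dict:
    # Scale all widths linearly from the scale=1.0 reference
    # (embed 512, 4 encoder blocks, bottleneck 128, env head 256);
    # block counts round to the nearest integer with a floor of 1.
    return {
        "embed_dim": int(512 * scale),
        "encoder_blocks": max(1, round(4 * scale)),
        "bottleneck": int(128 * scale),
        "env_head_dim": int(256 * scale),
    }
```

For example, `scaled_dims(0.5)` reproduces the 0.50 row: embed dim 256, 2 encoder blocks, bottleneck 64, env head 128.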
Encoding Parameters¶
| Parameter | Default | Effect |
|---|---|---|
| `--coord_harmonics` | 8 | Higher values capture finer spatial patterns (more harmonics) |
| `--week_harmonics` | 8 | Higher values capture sharper weekly transitions |
**Choosing harmonics:** The default values (8 coordinate, 8 week) work well for global models. Higher harmonics add capacity for finer-grained patterns but increase input dimensionality and risk overfitting on small datasets.
References¶
Perez, E., Strub, F., de Vries, H., Dumoulin, V., & Courville, A. (2018). FiLM: Visual Reasoning with a General Conditioning Layer. In AAAI Conference on Artificial Intelligence (pp. 3942–3951).
Tancik, M., Srinivasan, P. P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J. T., & Ng, R. (2020). Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. In Advances in Neural Information Processing Systems (pp. 7537–7547).