utils.data¶
Data loading, preprocessing, and PyTorch Dataset/DataLoader utilities for BirdNET Geomodel.
Handles the full pipeline from parquet files to training-ready DataLoaders:

- H3DataLoader: load and flatten H3 cell parquet data
- H3DataPreprocessor: sinusoidal encoding, normalization, species vocab, splitting
- BirdSpeciesDataset: PyTorch Dataset wrapper
- create_dataloaders / get_class_weights: DataLoader and class-weight utilities
Classes¶
H3DataLoader¶
Load and prepare H3 cell-based species occurrence data for model training.
Functions¶
__init__(data_path)¶
Initialize the data loader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data_path | str | Path to the H3 cell parquet file. | required |
load_data()¶
Load the H3 cell data from the parquet file.
get_h3_cells()¶
Return the array of H3 cell index strings.
h3_to_latlon(h3_cells) staticmethod¶
Convert H3 cell indices to latitude/longitude arrays.
compute_jitter_std(h3_cells) staticmethod¶
Compute coordinate jitter std (degrees) from H3 cell resolution.
Returns a standard deviation equal to 40% of the average hexagon edge length (converted to degrees). With Gaussian noise at this scale, ~95% of jittered points remain inside the originating cell.
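A minimal sketch of that conversion, assuming the average hexagon edge length in km is already known (e.g. from the h3 library's per-resolution metadata) and using the mean km-per-degree-of-latitude constant:

```python
import math

KM_PER_DEG_LAT = 111.32  # mean km per degree of latitude

def jitter_std_degrees(avg_edge_km: float) -> float:
    """Std dev (degrees) = 40% of the average hexagon edge length."""
    return 0.4 * avg_edge_km / KM_PER_DEG_LAT
```

With a ~9 km edge (roughly H3 resolution 5), this yields a std of about 0.03 degrees.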
get_environmental_features()¶
Return the environmental feature columns as a DataFrame.
flatten_to_samples(ocean_sample_rate=1.0, water_threshold=0.9, include_yearly=True)¶
Flatten H3-cell × weeks to individual (lat, lon, week, species, env) samples.
For each cell, creates 48 weekly samples (weeks 1–48) and optionally one yearly sample (week 0) whose species list is the union of all weeks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ocean_sample_rate | float | Fraction of high-water cells to keep (0–1). | 1.0 |
| water_threshold | float | Cells whose water fraction exceeds this threshold are treated as high-water cells subject to ocean_sample_rate. | 0.9 |
| include_yearly | bool | If True (default), include a week-0 yearly sample per cell. Set to False to train on weekly data only. | True |
Returns:
| Type | Description |
|---|---|
| Tuple[ndarray, ndarray, ndarray, List[List[str]], DataFrame] | lats, lons, weeks, species_lists, env_features |
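The per-cell expansion can be sketched as follows (a hypothetical flatten_cell helper, not the library's API, operating on one cell's week → species mapping):

```python
from typing import Dict, List, Set, Tuple

def flatten_cell(
    lat: float,
    lon: float,
    weekly_species: Dict[int, Set[str]],
    include_yearly: bool = True,
) -> List[Tuple[float, float, int, Set[str]]]:
    """Expand one cell into 48 weekly samples plus an optional week-0 yearly sample."""
    # weeks 1-48: species observed in that week (possibly empty)
    samples = [(lat, lon, w, weekly_species.get(w, set())) for w in range(1, 49)]
    if include_yearly:
        # week 0: union of species across all weeks
        yearly: Set[str] = set().union(*weekly_species.values()) if weekly_species else set()
        samples.append((lat, lon, 0, yearly))
    return samples
```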
get_data_info()¶
Return a summary dict with counts and column names.
H3DataPreprocessor¶
Preprocess H3 cell and species occurrence data for multi-task learning.
Functions¶
__init__()¶
Initialize the preprocessor with empty state.
normalize_environmental_features(env_features, fit=True)¶
Encode environmental features with type-appropriate transformations:
- Categorical columns → one-hot encoded (NaN → all-zero row)
- Fraction columns → passed through as-is (NaN → 0)
- Continuous columns → StandardScaler (NaN → column mean before scaling)
- Constant columns → dropped
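The per-type rules above can be sketched column-by-column with plain numpy (hypothetical helpers, not the library's API; the real implementation presumably stores fit state for fit=False reuse, which is mirrored here by returning mean/std):

```python
import numpy as np

def encode_continuous(col, mean=None, std=None):
    """StandardScaler-style: NaN -> column mean, then (x - mean) / std.
    Returns (encoded, mean, std); encoded is None for constant columns (dropped)."""
    mean = float(np.nanmean(col)) if mean is None else mean
    filled = np.where(np.isnan(col), mean, col)
    std = float(filled.std()) if std is None else std
    if std == 0.0:  # constant column -> caller drops it
        return None, mean, std
    return (filled - mean) / std, mean, std

def encode_fraction(col):
    """Fraction columns pass through as-is; NaN -> 0."""
    return np.where(np.isnan(col), 0.0, col)

def encode_categorical(col, categories):
    """One-hot; values outside `categories` (including NaN) map to the all-zero row."""
    out = np.zeros((len(col), len(categories)), dtype=np.float32)
    index = {c: i for i, c in enumerate(categories)}
    for row, value in enumerate(col):
        if value in index:
            out[row, index[value]] = 1.0
    return out
```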
build_species_vocabulary(species_lists, min_obs_per_species=0, max_species=0)¶
Build vocabulary of all unique species codes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| species_lists | List[List[str]] | Per-sample lists of species codes (eBird codes for birds, iNat IDs for non-birds). | required |
| min_obs_per_species | int | If >0, exclude species observed in fewer than this many samples. Default 0 (keep all). | 0 |
| max_species | int | If >0, randomly subsample the vocabulary to at most this many species (after min-obs filtering). Uses a fixed seed for reproducibility. Default 0 (keep all). | 0 |
encode_species_multilabel(species_lists)¶
Convert species lists to a multi-label binary matrix.
NOTE: intended for small datasets only. For large datasets use encode_species_sparse() to avoid OOM on the dense matrix.
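A minimal sketch of the dense encoding (a hypothetical encode_multilabel helper taking the vocabulary explicitly), which makes the memory concern obvious — the matrix is O(n_samples × n_species) regardless of how sparse the labels are:

```python
import numpy as np

def encode_multilabel(species_lists, vocab):
    """Dense [n_samples, n_species] binary matrix from per-sample species lists."""
    index = {code: i for i, code in enumerate(vocab)}
    mat = np.zeros((len(species_lists), len(vocab)), dtype=np.float32)
    for row, codes in enumerate(species_lists):
        for code in codes:
            if code in index:  # skip codes filtered out of the vocabulary
                mat[row, index[code]] = 1.0
    return mat
```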
encode_species_sparse(species_lists)¶
Convert species lists to packed sparse index arrays.
Returns a dict with two contiguous arrays instead of a list of millions of small numpy arrays. This eliminates per-object refcount overhead that causes copy-on-write memory bloat with forked DataLoader workers.
Returns:
| Type | Description |
|---|---|
| Dict[str, ndarray] | Dict with contiguous 'values' and 'offsets' arrays mapping each sample to its species indices. |
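The packed layout is CSR-style: sample i's species indices live in values[offsets[i]:offsets[i+1]]. A self-contained sketch (a hypothetical pack_sparse helper taking the vocabulary explicitly):

```python
import numpy as np

def pack_sparse(species_lists, vocab):
    """Two flat arrays instead of millions of tiny per-sample arrays:
    values  -- concatenated species indices (int32)
    offsets -- prefix offsets, len(species_lists) + 1 entries (int64)."""
    index = {code: i for i, code in enumerate(vocab)}
    values, offsets = [], [0]
    for codes in species_lists:
        values.extend(index[c] for c in codes if c in index)
        offsets.append(len(values))
    return {
        "values": np.asarray(values, dtype=np.int32),
        "offsets": np.asarray(offsets, dtype=np.int64),
    }
```

Because both arrays are plain contiguous buffers, forked DataLoader workers share them copy-on-write without touching any Python object refcounts.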
compute_obs_density(inputs, species_lists) staticmethod¶
Compute per-sample observation density for density-stratified evaluation.
For each unique location (lat, lon), sums the total number of species detections across all samples at that location. Each sample is then assigned its location's total density. This serves as a proxy for observer effort / survey intensity.
A well-surveyed H3 cell (e.g. Central Park, NYC) will have a high density value; a poorly surveyed cell (e.g. rural Siberia) will have a low value. During validation the density is used to stratify metrics — a model that generalizes well should have similar mAP in dense and sparse strata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | Dict[str, ndarray] | Dict with 'lat', 'lon' float32 arrays. | required |
| species_lists | List[List[str]] | Per-sample lists of species codes (before encoding). | required |
Returns:
| Type | Description |
|---|---|
| ndarray | Float32 array of shape (n_samples,) with per-sample observation density. |
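The location-level aggregation can be sketched directly (a hypothetical obs_density helper): sum detections per unique (lat, lon), then broadcast each location's total back to its samples.

```python
import numpy as np

def obs_density(lats, lons, species_lists):
    """Per-sample density = total species detections at the sample's location."""
    totals = {}
    for lat, lon, codes in zip(lats, lons, species_lists):
        key = (float(lat), float(lon))
        totals[key] = totals.get(key, 0) + len(codes)
    return np.asarray(
        [totals[(float(lat), float(lon))] for lat, lon in zip(lats, lons)],
        dtype=np.float32,
    )
```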
mask_regions(inputs, targets, regions) staticmethod¶
Split data into outside-region and inside-region subsets.
Samples whose (lat, lon) falls inside any of the given bounding boxes are moved to the "inside" subset; the rest stay in "outside". This enables region hold-out experiments: train on the outside subset and evaluate spatial generalization on the inside (held-out) subset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | Dict[str, ndarray] | Dict with 'lat', 'lon', 'week' (and optionally 'obs_density') arrays. | required |
| targets | Dict[str, Any] | Dict with 'species' and 'env_features'. | required |
| regions | List[Tuple[float, float, float, float]] | List of bounding boxes, each a 4-tuple of floats. | required |
Returns:
| Type | Description |
|---|---|
| Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any], Dict[str, Any]] | The (inputs, targets) pairs for the outside and inside subsets. |
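The core of the split is a boolean any-box membership mask; the subsets are then just mask / ~mask indexing. A sketch (hypothetical inside_mask helper; the (lat_min, lat_max, lon_min, lon_max) ordering is an assumption, since the exact tuple order is not stated above):

```python
import numpy as np

def inside_mask(lats, lons, regions):
    """True where (lat, lon) falls inside any bounding box.
    Assumed box ordering: (lat_min, lat_max, lon_min, lon_max)."""
    lats, lons = np.asarray(lats), np.asarray(lons)
    mask = np.zeros(lats.shape, dtype=bool)
    for lat_min, lat_max, lon_min, lon_max in regions:
        mask |= (
            (lats >= lat_min) & (lats <= lat_max)
            & (lons >= lon_min) & (lons <= lon_max)
        )
    return mask
```

Training on rows where the mask is False and evaluating where it is True gives the region hold-out setup described above.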
propagate_env_labels(lats, lons, weeks, species_lists, env_features, k=10, max_radius_km=1000.0, min_obs_threshold=10, soft_weight=0.5, max_spread_factor=2.0, env_dist_max=2.0, range_cap_km=500.0) staticmethod¶
Propagate species labels from observed to sparse/unobserved cells.
For each sample whose species list is shorter than min_obs_threshold, find the k nearest observed samples in environmental feature space (among samples from the same week), then copy species from neighbours within max_radius_km. Per-week matching prevents seasonal species from leaking across weeks (e.g. summer migrants appearing in winter).
Uses sparse matrix operations to vectorize the species merge and range check, avoiding per-species Python loops.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| lats | ndarray | Per-sample latitudes. | required |
| lons | ndarray | Per-sample longitudes. | required |
| weeks | ndarray | Per-sample week numbers (0–48). | required |
| species_lists | List[List[str]] | Per-sample species occurrence lists (mutable). | required |
| env_features | DataFrame | Per-sample environmental feature DataFrame. | required |
| k | int | Number of nearest neighbors to consider (default 10). | 10 |
| max_radius_km | float | Geographic radius cap in km (default 1000). | 1000.0 |
| min_obs_threshold | int | Samples with fewer species than this are considered sparse and receive propagated labels (default 10). | 10 |
| soft_weight | float | Reserved for future soft-label support. | 0.5 |
| max_spread_factor | float | Restrict species propagation based on their observed geographic range. A species only propagates to a cell within distance D of the nearest original observation, where D = max_spread_factor × (observed range diameter / 2). Set to 0 to disable range filtering (default 2.0). | 2.0 |
| env_dist_max | float | Maximum Euclidean distance in standardized env-feature space between a sparse cell and its KNN neighbor for that neighbor to contribute labels. Neighbors farther away in env space are dropped even if within max_radius_km. Set to 0 to disable (default 2.0). | 2.0 |
| range_cap_km | float | Hard cap in km on the per-species propagation distance from the nearest original observation. Even if a species' bounding-box range would allow farther propagation, it is clamped to range_cap_km. Set to 0 to disable (default 500). | 500.0 |
Returns:
| Type | Description |
|---|---|
| List[List[str]] | Modified species_lists with propagated labels (also mutated in place). |
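A greatly simplified sketch of the core idea for a single week's samples — env-space KNN plus a geographic radius check. This hypothetical propagate_week helper omits the per-species range caps (max_spread_factor, range_cap_km), env_dist_max, and the sparse-matrix vectorization described above:

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi, dlmb = np.radians(lat2 - lat1), np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def propagate_week(lats, lons, env, species_lists, k=10, max_radius_km=1000.0, min_obs=10):
    """For each sparse sample, copy species from its k env-nearest well-observed
    samples (same week) that also lie within max_radius_km geographically."""
    rich = [i for i, s in enumerate(species_lists) if len(s) >= min_obs]
    for i, codes in enumerate(species_lists):
        if len(codes) >= min_obs or not rich:
            continue  # already well-observed, or nothing to propagate from
        env_dist = np.linalg.norm(env[rich] - env[i], axis=1)
        for j in np.argsort(env_dist)[:k]:
            src = rich[j]
            if haversine_km(lats[i], lons[i], lats[src], lons[src]) <= max_radius_km:
                species_lists[i] = sorted(set(species_lists[i]) | set(species_lists[src]))
    return species_lists
```

Running this once per week value reproduces the per-week matching that keeps seasonal species from leaking across weeks.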
prepare_training_data(lats, lons, weeks, species_lists, env_features, fit=True, max_obs_per_species=0, min_obs_per_species=0, max_species=0)¶
Run full preprocessing: encode inputs, normalize targets, build vocab.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| max_obs_per_species | int | If >0, cap observations so no single species contributes more than this many positive samples. Reduces the influence of hyper-common species on training. Samples are dropped randomly. Default 0 (no cap). | 0 |
| min_obs_per_species | int | If >0, exclude species observed in fewer than this many samples from the vocabulary. Default 0 (keep all). | 0 |
| max_species | int | If >0, randomly subsample the vocabulary to at most this many species. Default 0 (keep all). | 0 |
compute_species_freq_weights(species_lists, lats, lons, min_weight=0.1, pct_lo=10.0, pct_hi=90.0)¶
Compute per-species label weights via region-normalized frequency.
Citizen-science observation density varies enormously across regions. The US alone can contribute an order of magnitude more records than the Neotropics, so a naive global frequency count would assign high weights to common US species while suppressing species-rich tropical communities. Region-normalized weighting solves this by computing frequency percentile ranks within geographic bins and using the maximum regional percentile as each species' weight basis.
Algorithm:
- Partition samples into geographic bins (30° lat × 60° lon).
- Within each bin, count per-species occurrences.
- Within each bin, compute the percentile rank of every species (among species present in that bin).
- For each species, take the max percentile rank across bins.
- Map that max-regional-percentile to a weight via linear interpolation controlled by pct_lo / pct_hi.
This makes weights independent of absolute observation density: a species at the 90th percentile in Colombia gets the same weight as one at the 90th percentile in the US.
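The binning-and-percentile steps above can be sketched as follows. This is a hypothetical self-contained helper: the vocabulary is passed in explicitly, the bin arithmetic follows the stated 30°×60° grid, and the linear map from max percentile to [min_weight, 1.0] is an assumption based on the description of pct_lo / pct_hi:

```python
from collections import Counter, defaultdict
import numpy as np

def species_freq_weights(species_lists, lats, lons, vocab,
                         min_weight=0.1, pct_lo=10.0, pct_hi=90.0):
    """Max-over-bins percentile rank, mapped linearly to [min_weight, 1.0]."""
    # 1) count per-species occurrences inside each 30-deg lat x 60-deg lon bin
    bin_counts = defaultdict(Counter)
    for lat, lon, codes in zip(lats, lons, species_lists):
        bin_counts[(int(lat // 30), int(lon // 60))].update(codes)
    # 2) percentile rank within each bin; 3) max across bins
    max_pct = {code: 0.0 for code in vocab}
    for counts in bin_counts.values():
        ranked = sorted(counts, key=counts.__getitem__)  # ascending frequency
        for rank, code in enumerate(ranked):
            pct = 100.0 * rank / max(len(ranked) - 1, 1)
            if code in max_pct:
                max_pct[code] = max(max_pct[code], pct)
    # 4) linear interpolation: <= pct_lo -> min_weight, >= pct_hi -> 1.0
    t = np.clip((np.array([max_pct[c] for c in vocab]) - pct_lo) / (pct_hi - pct_lo), 0.0, 1.0)
    return (min_weight + (1.0 - min_weight) * t).astype(np.float32)
```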
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| species_lists | List[List[str]] | Per-sample species occurrence lists. | required |
| lats | ndarray | Per-sample latitudes. | required |
| lons | ndarray | Per-sample longitudes. | required |
| min_weight | float | Floor weight for rare species. | 0.1 |
| pct_lo | float | Lower percentile threshold. Default 10. | 10.0 |
| pct_hi | float | Upper percentile threshold. Default 90. | 90.0 |
Returns:
| Type | Description |
|---|---|
| ndarray | Array of shape (n_species,) with per-species weights. |
subsample_by_location(inputs, targets, fraction=1.0, random_state=42)¶
Randomly subsample a fraction of locations (and all their samples).
Subsampling is location-based: unique (lat, lon) positions are sampled, then all rows belonging to the selected locations are retained. This preserves the temporal structure within each H3 cell and keeps the data suitable for a subsequent location-based train/val/test split.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | Dict[str, ndarray] | Dict with 'lat', 'lon', 'week' arrays. | required |
| targets | Dict[str, Any] | Dict with 'species' and 'env_features'. | required |
| fraction | float | Fraction of locations to keep (0 < fraction <= 1). | 1.0 |
| random_state | int | Random seed for reproducibility. | 42 |
Returns:
| Type | Description |
|---|---|
| Tuple[Dict[str, ndarray], Dict[str, Any]] | (inputs, targets) subsets with only the selected locations. |
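Location-based subsampling amounts to sampling unique (lat, lon) pairs and building a row mask, as in this hypothetical sketch (the real method then applies the mask to every array in inputs and targets):

```python
import numpy as np

def location_mask(lats, lons, fraction=0.5, random_state=42):
    """Boolean row mask: sample unique (lat, lon) locations, keep ALL rows
    at the selected locations, preserving each cell's temporal structure."""
    coords = np.stack([np.asarray(lats), np.asarray(lons)], axis=1)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    rng = np.random.default_rng(random_state)
    n_keep = max(1, int(round(len(uniq) * fraction)))
    kept = rng.choice(len(uniq), size=n_keep, replace=False)
    return np.isin(inverse, kept)
```

Because whole locations are kept or dropped together, a later location-grouped train/val split stays leakage-free.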
subsample_by_samples(inputs, targets, fraction=1.0, random_state=42)¶
Randomly subsample a fraction of individual samples (week@location rows).
Unlike subsample_by_location, which drops entire H3 cells, this method drops individual week-rows while preserving at least some data for every location. This avoids losing small islands that have few cells but whose endemic species are important to monitor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | Dict[str, ndarray] | Dict with 'lat', 'lon', 'week' arrays. | required |
| targets | Dict[str, Any] | Dict with 'species' and 'env_features'. | required |
| fraction | float | Fraction of samples to keep (0 < fraction <= 1). | 1.0 |
| random_state | int | Random seed for reproducibility. | 42 |
Returns:
| Type | Description |
|---|---|
| Tuple[Dict[str, ndarray], Dict[str, Any]] | (inputs, targets) subsets with the selected samples. |
split_data(inputs, targets, val_size=0.1, random_state=42, split_by_location=True, **kwargs)¶
Split into train/val (optionally grouped by location to prevent leakage).
Handles both dense ndarray and sparse list-of-arrays species targets.
Returns:
| Type | Description |
|---|---|
| Tuple | (train_inputs, val_inputs, train_targets, val_targets) |
get_preprocessing_info()¶
Return a dict with species vocab size and environmental feature info.
BirdSpeciesDataset¶
Bases: Dataset
PyTorch Dataset for bird species occurrence prediction.
Species targets can be either:
- Dense: np.ndarray of shape [n_samples, n_species]
- Sparse (packed): dict with 'values' (int32) and 'offsets' (int64)
When sparse, the dense one-hot vector is materialized on the fly in the collate function, keeping resident memory proportional to the number of observations rather than samples × species.
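The on-the-fly materialization can be sketched with numpy (a hypothetical densify_batch helper; the real collate function produces a torch tensor, but the slicing logic is the same):

```python
import numpy as np

def densify_batch(values, offsets, batch_indices, n_species):
    """Build the dense one-hot block for one batch of samples from the
    packed ('values', 'offsets') representation -- only the batch is
    ever dense, never the full dataset."""
    dense = np.zeros((len(batch_indices), n_species), dtype=np.float32)
    for row, i in enumerate(batch_indices):
        dense[row, values[offsets[i]:offsets[i + 1]]] = 1.0
    return dense
```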
Functions¶
__init__(inputs, targets, n_species=0, jitter_std=0.0, species_freq_weights=None)¶
Wrap preprocessed arrays as a PyTorch Dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | Dict[str, ndarray] | Dict with 'lat', 'lon', 'week' float32 arrays. | required |
| targets | Dict[str, Any] | Dict with 'species' (dense or sparse) and 'env_features'. | required |
| n_species | int | Total number of species (required when species is sparse). | 0 |
| jitter_std | float | Standard deviation (degrees) of Gaussian noise added to lat/lon coordinates each time a sample is drawn. Set to 0.0 to disable (default). Typically derived from H3 cell resolution via compute_jitter_std. | 0.0 |
| species_freq_weights | Optional[ndarray] | Optional 1-D array of per-species label weights. When provided, positive labels use the weight instead of 1.0. | None |
__getitem__(idx)¶
Return (inputs_dict, targets_dict) for one sample.
Functions¶
create_dataloaders(train_inputs, train_targets, val_inputs, val_targets, batch_size=256, num_workers=0, pin_memory=True, n_species=0, jitter_std=0.0, species_freq_weights=None)¶
Create training and validation DataLoaders.
All data is held in memory as PyTorch tensors. Callers should subsample before calling this function if only a fraction of the data is needed (see H3DataPreprocessor.subsample_by_location).
When species targets are sparse (list of index arrays), a custom collate function builds the dense (batch, n_species) tensor once per batch instead of per sample, reducing allocation pressure ~1000×.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| jitter_std | float | Gaussian noise std (degrees) added to training coordinates each time a sample is drawn. Validation coordinates are never jittered. | 0.0 |
| species_freq_weights | Optional[ndarray] | Optional per-species label weights. Applied to training set only; validation uses binary labels. | None |
get_class_weights(species_targets, smoothing=100.0, max_weight=50.0)¶
Compute positive class weights for imbalanced species.
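The exact weighting formula is not documented above; one plausible sketch — an assumption, not the library's implementation — is a smoothed negative/positive ratio per species, capped at max_weight, which illustrates the role of both parameters:

```python
import numpy as np

def class_weights(targets, smoothing=100.0, max_weight=50.0):
    """Hypothetical smoothed pos_weight per species.
    smoothing damps the ratio for very rare species (where pos is tiny);
    max_weight caps the result so no class dominates the loss."""
    pos = targets.sum(axis=0)           # positive samples per species
    neg = targets.shape[0] - pos        # negative samples per species
    return np.minimum((neg + smoothing) / (pos + smoothing), max_weight).astype(np.float32)
```

Such a weight vector would typically feed the pos_weight argument of BCEWithLogitsLoss.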