
utils.data

Data loading, preprocessing, and PyTorch Dataset/DataLoader utilities for BirdNET Geomodel.

Handles the full pipeline from parquet files to training-ready DataLoaders:

  • H3DataLoader: Load and flatten H3 cell parquet data
  • H3DataPreprocessor: Sinusoidal encoding, normalization, species vocab, splitting
  • BirdSpeciesDataset: PyTorch Dataset wrapper
  • create_dataloaders / get_class_weights: DataLoader and class weight utilities

Classes

H3DataLoader

Load and prepare H3 cell-based species occurrence data for model training.

Functions
__init__(data_path)

Initialize the data loader.

Parameters:

Name Type Description Default
data_path str

Path to the H3 cell parquet file.

required
load_data()

Load the H3 cell data from parquet file.

get_h3_cells()

Return the array of H3 cell index strings.

h3_to_latlon(h3_cells) staticmethod

Convert H3 cell indices to latitude/longitude arrays.

compute_jitter_std(h3_cells) staticmethod

Compute coordinate jitter std (degrees) from H3 cell resolution.

Returns a standard deviation equal to 40% of the average hexagon edge length (converted to degrees). With Gaussian noise at this scale, ~95% of jittered points remain inside the originating cell.
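The rule above is easy to reproduce. A minimal sketch, assuming the cell's average edge length in kilometres is already known (the helper names `jitter_std_degrees` and `jitter` are hypothetical, not part of the module API):

```python
import numpy as np

KM_PER_DEGREE = 111.32  # approximate km per degree of latitude

def jitter_std_degrees(avg_edge_length_km: float, fraction: float = 0.4) -> float:
    """Jitter std as a fraction of the hexagon edge length, in degrees."""
    return fraction * avg_edge_length_km / KM_PER_DEGREE

def jitter(lats: np.ndarray, lons: np.ndarray, std: float,
           rng: np.random.Generator):
    """Add Gaussian noise around the cell-centre coordinates."""
    return (lats + rng.normal(0.0, std, lats.shape),
            lons + rng.normal(0.0, std, lons.shape))
```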

get_environmental_features()

Return the environmental feature columns as a DataFrame.

flatten_to_samples(ocean_sample_rate=1.0, water_threshold=0.9, include_yearly=True)

Flatten the H3-cell × week grid into individual (lat, lon, week, species, env) samples.

For each cell, creates 48 weekly samples (week 1–48) and optionally one yearly sample (week 0) whose species list is the union of all weeks.

Parameters:

Name Type Description Default
ocean_sample_rate float

Fraction of high-water cells to keep (0–1). Cells whose water_fraction exceeds water_threshold are randomly kept at this rate. Default 1.0 (keep all).

1.0
water_threshold float

water_fraction above which a cell is considered ocean. Default 0.9.

0.9
include_yearly bool

If True (default), include a week-0 yearly sample per cell. Set to False to train on weekly data only.

True

Returns:

Type Description
Tuple[ndarray, ndarray, ndarray, List[List[str]], DataFrame]

lats, lons, weeks, species_lists, env_features
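The week-0 union behaviour can be sketched for a single cell as follows (a simplified illustration, assuming each cell's species are stored as a dict of week → list; `flatten_cell` is a hypothetical helper, not the module's actual implementation):

```python
def flatten_cell(lat, lon, weekly_species, include_yearly=True):
    """Expand one cell into (lat, lon, week, species) samples.

    weekly_species: dict mapping week number (1-48) to a list of species codes.
    """
    samples = []
    for week in range(1, 49):  # one sample per week, empty list if unobserved
        samples.append((lat, lon, week, weekly_species.get(week, [])))
    if include_yearly:
        # Week 0 carries the union of all weekly species lists.
        union = sorted({sp for lst in weekly_species.values() for sp in lst})
        samples.append((lat, lon, 0, union))
    return samples
```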

get_data_info()

Return a summary dict with counts and column names.

H3DataPreprocessor

Preprocess H3 cell and species occurrence data for multi-task learning.

Functions
__init__()

Initialize the preprocessor with empty state.

normalize_environmental_features(env_features, fit=True)

Encode environmental features with type-appropriate transformations:

  • Categorical columns → one-hot encoded (NaN → all-zero row)
  • Fraction columns → passed through as-is (NaN → 0)
  • Continuous columns → StandardScaler (NaN → column mean before scaling)
  • Constant columns → dropped
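The four rules above can be sketched in a few lines of pandas. This is a simplified stand-in (z-scoring by hand rather than with a fitted StandardScaler, and `encode_env_features` is a hypothetical name), but the per-type behaviour matches the bullets:

```python
import pandas as pd

def encode_env_features(df: pd.DataFrame, categorical: set, fraction: set) -> pd.DataFrame:
    """One-hot categoricals, pass through fractions, z-score the rest, drop constants."""
    parts = []
    for col in df.columns:
        s = df[col]
        if col in categorical:
            # NaN gets no dummy column -> all-zero row
            parts.append(pd.get_dummies(s, prefix=col, dtype=float))
        elif col in fraction:
            parts.append(s.fillna(0.0).to_frame())
        else:
            s = s.fillna(s.mean())           # impute NaN with column mean
            std = s.std(ddof=0)
            if std == 0:
                continue                     # constant column -> dropped
            parts.append(((s - s.mean()) / std).to_frame())
    return pd.concat(parts, axis=1)
```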
build_species_vocabulary(species_lists, min_obs_per_species=0, max_species=0)

Build vocabulary of all unique species codes.

Parameters:

Name Type Description Default
species_lists List[List[str]]

Per-sample lists of species codes (eBird codes for birds, iNat IDs for non-birds).

required
min_obs_per_species int

If >0, exclude species observed in fewer than this many samples. Default 0 (keep all).

0
max_species int

If >0, randomly subsample the vocabulary to at most this many species (after min-obs filtering). Uses a fixed seed for reproducibility. Default 0 (keep all).

0
encode_species_multilabel(species_lists)

Convert species lists to multi-label binary matrix.

NOTE: only used for small datasets. For large datasets use encode_species_sparse() to avoid OOM on the dense matrix.

encode_species_sparse(species_lists)

Convert species lists to packed sparse index arrays.

Returns a dict with two contiguous arrays instead of a list of millions of small numpy arrays. This eliminates per-object refcount overhead that causes copy-on-write memory bloat with forked DataLoader workers.

Returns:

Type Description
Dict[str, ndarray]

{'values': int32, 'offsets': int64} where offsets[i] to offsets[i+1] gives the slice of values for sample i.
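The packed layout is the standard CSR-style "values + offsets" encoding. A minimal sketch of what the method plausibly does, assuming a `species_to_idx` vocabulary mapping (the function body here is illustrative, not the module's actual code):

```python
import numpy as np

def encode_species_sparse(species_lists, species_to_idx):
    """Pack per-sample species indices into two contiguous arrays."""
    offsets = np.zeros(len(species_lists) + 1, dtype=np.int64)
    values = []
    for i, lst in enumerate(species_lists):
        values.extend(species_to_idx[s] for s in lst if s in species_to_idx)
        offsets[i + 1] = len(values)
    return {"values": np.asarray(values, dtype=np.int32), "offsets": offsets}
```

Sample i's species are recovered as `values[offsets[i]:offsets[i + 1]]`, so the whole dataset lives in two arrays instead of millions of small Python objects.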

compute_obs_density(inputs, species_lists) staticmethod

Compute per-sample observation density for density-stratified evaluation.

For each unique location (lat, lon), sums the total number of species detections across all samples at that location. Each sample is then assigned its location's total density. This serves as a proxy for observer effort / survey intensity.

A well-surveyed H3 cell (e.g. Central Park, NYC) will have a high density value; a poorly surveyed cell (e.g. rural Siberia) will have a low value. During validation the density is used to stratify metrics — a model that generalizes well should have similar mAP in dense and sparse strata.

Parameters:

Name Type Description Default
inputs Dict[str, ndarray]

Dict with 'lat', 'lon' float32 arrays.

required
species_lists List[List[str]]

Per-sample lists of species codes (before encoding).

required

Returns:

Type Description
ndarray

Float32 array of shape (n_samples,) with per-location density.
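The group-and-sum described above can be vectorized with `np.unique` and `np.bincount`. A minimal sketch under the same inputs (not the module's actual implementation):

```python
import numpy as np

def compute_obs_density(lat, lon, species_lists):
    """Per-sample density = total species detections at that (lat, lon)."""
    counts = np.array([len(s) for s in species_lists], dtype=np.float32)
    coords = np.stack([lat, lon], axis=1)
    # Map each row to its unique-location index, then sum counts per location.
    _, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    totals = np.bincount(inverse, weights=counts)
    return totals[inverse].astype(np.float32)
```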

mask_regions(inputs, targets, regions) staticmethod

Split data into outside-region and inside-region subsets.

Samples whose (lat, lon) falls inside any of the given bounding boxes are moved to the "inside" subset; the rest stay in "outside". This enables region hold-out experiments: train on the outside subset and evaluate spatial generalization on the inside (held-out) subset.

Parameters:

Name Type Description Default
inputs Dict[str, ndarray]

Dict with 'lat', 'lon', 'week' (and optionally 'obs_density') arrays.

required
targets Dict[str, Any]

Dict with 'species' and 'env_features'.

required
regions List[Tuple[float, float, float, float]]

List of (lon_min, lat_min, lon_max, lat_max) bboxes.

required

Returns:

Type Description
Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any], Dict[str, Any]]

(inputs_outside, targets_outside, inputs_inside, targets_inside)
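The bounding-box test itself is a simple vectorized mask. A sketch of the core logic, using the (lon_min, lat_min, lon_max, lat_max) bbox order documented above (`region_mask` and `split_by_mask` are hypothetical helper names):

```python
import numpy as np

def region_mask(lat, lon, regions):
    """True where (lat, lon) falls inside any (lon_min, lat_min, lon_max, lat_max) bbox."""
    inside = np.zeros(lat.shape, dtype=bool)
    for lon_min, lat_min, lon_max, lat_max in regions:
        inside |= ((lon >= lon_min) & (lon <= lon_max) &
                   (lat >= lat_min) & (lat <= lat_max))
    return inside

def split_by_mask(arrays, mask):
    """Return (outside, inside) subsets of each array."""
    return [a[~mask] for a in arrays], [a[mask] for a in arrays]
```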

propagate_env_labels(lats, lons, weeks, species_lists, env_features, k=10, max_radius_km=1000.0, min_obs_threshold=10, soft_weight=0.5, max_spread_factor=2.0, env_dist_max=2.0, range_cap_km=500.0) staticmethod

Propagate species labels from observed to sparse/unobserved cells.

For each sample whose species list is shorter than min_obs_threshold, find the k nearest observed samples in environmental feature space (among samples from the same week), then copy species from neighbours within max_radius_km. Per-week matching prevents seasonal species from leaking across weeks (e.g. summer migrants appearing in winter).

Uses sparse matrix operations to vectorize the species merge and range check, avoiding per-species Python loops.

Parameters:

Name Type Description Default
lats ndarray

Per-sample latitudes.

required
lons ndarray

Per-sample longitudes.

required
weeks ndarray

Per-sample week numbers (0-48).

required
species_lists List[List[str]]

Per-sample species occurrence lists (mutable).

required
env_features DataFrame

Per-sample environmental feature DataFrame.

required
k int

Number of nearest neighbors to consider (default 10).

10
max_radius_km float

Geographic radius cap in km (default 1000).

1000.0
min_obs_threshold int

Samples with fewer species than this are considered sparse and receive propagated labels (default 10).

10
soft_weight float

Reserved for future soft-label support.

0.5
max_spread_factor float

Restrict species propagation based on their observed geographic range. A species will only propagate to a cell if the cell is within distance D of the nearest original observation, where D = max_spread_factor × (observed range diameter / 2). Set to 0 to disable range filtering (default 2.0).

2.0
env_dist_max float

Maximum Euclidean distance in standardized env-feature space between a sparse cell and its KNN neighbor for that neighbor to contribute labels. Neighbors further away in env space are dropped even if within max_radius_km. Set to 0 to disable (default 2.0).

2.0
range_cap_km float

Hard cap in km on the per-species propagation distance from the nearest original observation. Even if a species' bounding-box range would allow propagation farther, it is clamped to at most range_cap_km. Set to 0 to disable (default 500).

500.0

Returns:

Type Description
List[List[str]]

Modified species_lists with propagated labels (also mutated in place).
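The per-week KNN matching can be illustrated with a heavily simplified sketch. This version omits the geographic-radius cap, the per-species range filter, and the sparse-matrix vectorization described above, keeping only the env-space KNN with the `env_dist_max` cutoff (`propagate_labels` is a hypothetical name):

```python
import numpy as np

def propagate_labels(weeks, env, species_lists, k=10,
                     min_obs_threshold=10, env_dist_max=2.0):
    """Simplified per-week KNN label propagation in env-feature space."""
    species_lists = [list(s) for s in species_lists]
    for week in np.unique(weeks):
        idx = np.where(weeks == week)[0]
        dense = [i for i in idx if len(species_lists[i]) >= min_obs_threshold]
        sparse = [i for i in idx if len(species_lists[i]) < min_obs_threshold]
        if not dense:
            continue
        dense_env = env[dense]
        for i in sparse:
            # k nearest observed samples of the same week, in env space
            d = np.linalg.norm(dense_env - env[i], axis=1)
            merged = set(species_lists[i])
            for j in np.argsort(d)[:k]:
                if env_dist_max and d[j] > env_dist_max:
                    continue  # too dissimilar in env space -> dropped
                merged.update(species_lists[dense[j]])
            species_lists[i] = sorted(merged)
    return species_lists
```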

prepare_training_data(lats, lons, weeks, species_lists, env_features, fit=True, max_obs_per_species=0, min_obs_per_species=0, max_species=0)

Run full preprocessing: encode inputs, normalize targets, build vocab.

Parameters:

Name Type Description Default
max_obs_per_species int

If >0, cap observations so no single species contributes more than this many positive samples. Reduces the influence of hyper-common species on training. Samples are dropped randomly. Default 0 (no cap).

0
min_obs_per_species int

If >0, exclude species observed in fewer than this many samples from the vocabulary. Default 0 (keep all).

0
max_species int

If >0, randomly subsample the vocabulary to at most this many species. Default 0 (keep all).

0
compute_species_freq_weights(species_lists, lats, lons, min_weight=0.1, pct_lo=10.0, pct_hi=90.0)

Compute per-species label weights via region-normalized frequency.

Citizen-science observation density varies enormously across regions. The US alone can contribute an order of magnitude more records than the Neotropics, so a naive global frequency count would assign high weights to common US species while suppressing species-rich tropical communities. Region-normalized weighting solves this by computing frequency percentile ranks within geographic bins and using the maximum regional percentile as each species' weight basis.

Algorithm:

  1. Partition samples into geographic bins (30° lat × 60° lon).
  2. Within each bin, count per-species occurrences.
  3. Within each bin, compute the percentile rank of every species (among species present in that bin).
  4. For each species, take the max percentile rank across bins.
  5. Map that max-regional-percentile to a weight via linear interpolation controlled by pct_lo / pct_hi.

This makes weights independent of absolute observation density: a species at the 90th percentile in Colombia gets the same weight as one at the 90th percentile in the US.

Parameters:

Name Type Description Default
species_lists List[List[str]]

Per-sample species occurrence lists.

required
lats ndarray

Per-sample latitudes.

required
lons ndarray

Per-sample longitudes.

required
min_weight float

Floor weight for rare species.

0.1
pct_lo float

Lower percentile threshold. Default 10.

10.0
pct_hi float

Upper percentile threshold. Default 90.

90.0

Returns:

Type Description
ndarray

Array of shape (n_species,) stored as self.species_freq_weights.
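The five-step algorithm above can be sketched compactly. This is an illustrative reconstruction, not the module's code: the tie-breaking in percentile ranks and the exact interpolation (low percentile → min_weight floor, inferred from "Floor weight for rare species") are assumptions:

```python
import numpy as np
from collections import Counter, defaultdict

def species_freq_weights(species_lists, lats, lons, min_weight=0.1,
                         pct_lo=10.0, pct_hi=90.0):
    """Max regional frequency percentile -> linear weight in [min_weight, 1]."""
    # 1. Assign each sample to a 30° lat x 60° lon geographic bin.
    bins = zip(np.floor(lats / 30.0).astype(int), np.floor(lons / 60.0).astype(int))
    per_bin = defaultdict(Counter)
    for b, lst in zip(bins, species_lists):
        per_bin[b].update(lst)          # 2. per-bin occurrence counts
    # 3-4. Percentile rank within each bin; keep each species' max across bins.
    max_pct = {}
    for counter in per_bin.values():
        species, counts = zip(*counter.items())
        ranks = np.argsort(np.argsort(counts))   # 0..n-1 rank by count
        pct = 100.0 * ranks / max(len(counts) - 1, 1)
        for sp, p in zip(species, pct):
            max_pct[sp] = max(max_pct.get(sp, 0.0), p)
    # 5. Linear map: pct <= pct_lo -> min_weight, pct >= pct_hi -> 1.0.
    vocab = sorted(max_pct)
    pcts = np.array([max_pct[s] for s in vocab])
    t = np.clip((pcts - pct_lo) / (pct_hi - pct_lo), 0.0, 1.0)
    return vocab, min_weight + (1.0 - min_weight) * t
```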

subsample_by_location(inputs, targets, fraction=1.0, random_state=42)

Randomly subsample a fraction of locations (and all their samples).

Subsampling is location-based: unique (lat, lon) positions are sampled, then all rows belonging to the selected locations are retained. This preserves the temporal structure within each H3 cell and keeps the data suitable for a subsequent location-based train/val/test split.

Parameters:

Name Type Description Default
inputs Dict[str, ndarray]

Dict with 'lat', 'lon', 'week' arrays.

required
targets Dict[str, Any]

Dict with 'species' and 'env_features'.

required
fraction float

Fraction of locations to keep (0 < fraction <= 1).

1.0
random_state int

Random seed for reproducibility.

42

Returns:

Type Description
Tuple[Dict[str, ndarray], Dict[str, Any]]

(inputs, targets) subsets with only the selected locations.
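Location-based subsampling reduces to building a boolean row mask over unique coordinate pairs. A minimal sketch of that core step (`subsample_by_location_mask` is a hypothetical name; the real method applies the mask to every input/target array):

```python
import numpy as np

def subsample_by_location_mask(lat, lon, fraction=1.0, random_state=42):
    """Boolean row mask keeping `fraction` of unique (lat, lon) locations."""
    coords = np.stack([lat, lon], axis=1)
    _, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    n_locations = inverse.max() + 1
    rng = np.random.default_rng(random_state)
    n_keep = max(1, int(round(fraction * n_locations)))
    keep = rng.choice(n_locations, size=n_keep, replace=False)
    # All rows of a selected location are kept, preserving temporal structure.
    return np.isin(inverse, keep)
```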

subsample_by_samples(inputs, targets, fraction=1.0, random_state=42)

Randomly subsample a fraction of individual samples (week@location rows).

Unlike subsample_by_location, which drops entire H3 cells, this method drops individual week-rows while preserving at least some data for every location. This avoids losing small islands that have few cells but whose endemic species are important to monitor.

Parameters:

Name Type Description Default
inputs Dict[str, ndarray]

Dict with 'lat', 'lon', 'week' arrays.

required
targets Dict[str, Any]

Dict with 'species' and 'env_features'.

required
fraction float

Fraction of samples to keep (0 < fraction <= 1).

1.0
random_state int

Random seed for reproducibility.

42

Returns:

Type Description
Tuple[Dict[str, ndarray], Dict[str, Any]]

(inputs, targets) subsets with the selected samples.

split_data(inputs, targets, val_size=0.1, random_state=42, split_by_location=True, **kwargs)

Split into train/val (optionally grouped by location to prevent leakage).

Handles both dense ndarray and sparse list-of-arrays species targets.

Returns:

Type Description
Tuple

(train_inputs, val_inputs, train_targets, val_targets)

get_preprocessing_info()

Return a dict with species vocab size and environmental feature info.

BirdSpeciesDataset

Bases: Dataset

PyTorch Dataset for bird species occurrence prediction.

Species targets can be either
  • Dense: np.ndarray of shape [n_samples, n_species]
  • Sparse (packed): dict with 'values' (int32) and 'offsets' (int64)

When sparse, the dense one-hot vector is materialized on the fly in the collate function, keeping resident memory proportional to the number of observations rather than samples × species.
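The densification step is a scatter from the packed arrays into a zeroed batch matrix. Shown here with NumPy for brevity (the actual collate function builds torch tensors, for which `torch.zeros` plus the same index assignment is the direct analogue; `densify_batch` is a hypothetical name):

```python
import numpy as np

def densify_batch(values, offsets, batch_indices, n_species):
    """Materialize dense multi-hot targets for one batch from packed arrays."""
    dense = np.zeros((len(batch_indices), n_species), dtype=np.float32)
    for row, i in enumerate(batch_indices):
        # Scatter this sample's species indices into its batch row.
        dense[row, values[offsets[i]:offsets[i + 1]]] = 1.0
    return dense
```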

Functions
__init__(inputs, targets, n_species=0, jitter_std=0.0, species_freq_weights=None)

Wrap preprocessed arrays as a PyTorch Dataset.

Parameters:

Name Type Description Default
inputs Dict[str, ndarray]

Dict with 'lat', 'lon', 'week' float32 arrays.

required
targets Dict[str, Any]

Dict with 'species' (dense or sparse) and 'env_features'.

required
n_species int

Total number of species (required when species is sparse).

0
jitter_std float

Standard deviation (degrees) of Gaussian noise added to lat/lon coordinates each time a sample is drawn. Set to 0.0 to disable (default). Typically derived from H3 cell resolution via H3DataLoader.compute_jitter_std.

0.0
species_freq_weights Optional[ndarray]

Optional 1-D array of per-species label weights. When provided, positive labels use the weight instead of 1.0.

None
__getitem__(idx)

Return (inputs_dict, targets_dict) for one sample.

Functions

create_dataloaders(train_inputs, train_targets, val_inputs, val_targets, batch_size=256, num_workers=0, pin_memory=True, n_species=0, jitter_std=0.0, species_freq_weights=None)

Create training and validation DataLoaders.

All data is held in memory as PyTorch tensors. Callers should subsample before calling this function if only a fraction of the data is needed (see H3DataPreprocessor.subsample_by_location).

When species targets are sparse (list of index arrays), a custom collate function builds the dense (batch, n_species) tensor once per batch instead of per sample, reducing allocation pressure ~1000×.

Parameters:

Name Type Description Default
jitter_std float

Gaussian noise std (degrees) added to training coordinates each time a sample is drawn. Validation coordinates are never jittered.

0.0
species_freq_weights Optional[ndarray]

Optional per-species label weights. Applied to training set only; validation uses binary labels.

None

get_class_weights(species_targets, smoothing=100.0, max_weight=50.0)

Compute positive class weights for imbalanced species.
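The exact formula is not documented here; one common recipe consistent with the `smoothing` and `max_weight` parameters is a smoothed negative-to-positive ratio, clipped at the cap. A sketch under that assumption (not necessarily the module's actual formula):

```python
import numpy as np

def get_class_weights(species_targets, smoothing=100.0, max_weight=50.0):
    """Inverse-frequency positive weights, smoothed and capped (assumed formula)."""
    pos = species_targets.sum(axis=0).astype(np.float64)  # positives per species
    neg = species_targets.shape[0] - pos
    w = (neg + smoothing) / (pos + smoothing)             # smoothed neg/pos ratio
    return np.minimum(w, max_weight).astype(np.float32)   # cap rare-species weights
```

The smoothing term keeps weights near 1 for very rare species instead of letting them explode, and `max_weight` bounds the loss contribution of any single class.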