
utils.data

Data loading, preprocessing, and PyTorch Dataset/DataLoader utilities for BirdNET Geomodel.

Handles the full pipeline from parquet files to training-ready DataLoaders:

  • H3DataLoader: Load and flatten H3 cell parquet data
  • H3DataPreprocessor: Sinusoidal encoding, normalization, species vocab, splitting
  • BirdSpeciesDataset: PyTorch Dataset wrapper
  • create_dataloaders / get_class_weights: DataLoader and class weight utilities

Classes

H3DataLoader

Load and prepare H3 cell-based species occurrence data for model training.

Functions
__init__(data_path)

Initialize the data loader.

Parameters:

Name Type Description Default
data_path str

Path to the H3 cell parquet file.

required
load_data()

Load the H3 cell data from parquet file.

get_h3_cells()

Return the array of H3 cell index strings.

h3_to_latlon(h3_cells) staticmethod

Convert H3 cell indices to latitude/longitude arrays.

compute_jitter_std(h3_cells) staticmethod

Compute coordinate jitter std (degrees) from H3 cell resolution.

Returns a standard deviation equal to 40% of the average hexagon edge length (converted to degrees). With Gaussian noise at this scale, ~95% of jittered points remain inside the originating cell.
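The rule above is easy to reproduce. A minimal sketch, assuming the cell's average edge length in kilometres is already known (the helper names `jitter_std_degrees` and `jitter` are hypothetical, not part of the module API):

```python
import numpy as np

KM_PER_DEGREE = 111.32  # approximate km per degree of latitude

def jitter_std_degrees(avg_edge_length_km: float, fraction: float = 0.4) -> float:
    """Jitter std as a fraction of the hexagon edge length, in degrees."""
    return fraction * avg_edge_length_km / KM_PER_DEGREE

def jitter(lats: np.ndarray, lons: np.ndarray, std: float,
           rng: np.random.Generator):
    """Add Gaussian noise around the cell-centre coordinates."""
    return (lats + rng.normal(0.0, std, lats.shape),
            lons + rng.normal(0.0, std, lons.shape))
```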

get_environmental_features()

Return the environmental feature columns as a DataFrame.

flatten_to_samples(ocean_sample_rate=1.0, water_threshold=0.9, include_yearly=True)

Flatten the H3-cell × week grid into individual (lat, lon, week, species, env) samples.

For each cell, creates 48 weekly samples (week 1–48) and optionally one yearly sample (week 0) whose species list is the union of all weeks.

Parameters:

Name Type Description Default
ocean_sample_rate float

Fraction of high-water cells to keep (0–1). Cells whose water_fraction exceeds water_threshold are randomly kept at this rate. Default 1.0 (keep all).

1.0
water_threshold float

water_fraction above which a cell is considered ocean. Default 0.9.

0.9
include_yearly bool

If True (default), include a week-0 yearly sample per cell. Set to False to train on weekly data only.

True

Returns:

Type Description
Tuple[ndarray, ndarray, ndarray, List[List[str]], DataFrame]

lats, lons, weeks, species_lists, env_features
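The week-0 union behaviour can be sketched for a single cell as follows (a simplified illustration, assuming each cell's species are stored as a dict of week → list; `flatten_cell` is a hypothetical helper, not the module's actual implementation):

```python
def flatten_cell(lat, lon, weekly_species, include_yearly=True):
    """Expand one cell into (lat, lon, week, species) samples.

    weekly_species: dict mapping week number (1-48) to a list of species codes.
    """
    samples = []
    for week in range(1, 49):  # one sample per week, empty list if unobserved
        samples.append((lat, lon, week, weekly_species.get(week, [])))
    if include_yearly:
        # Week 0 carries the union of all weekly species lists.
        union = sorted({sp for lst in weekly_species.values() for sp in lst})
        samples.append((lat, lon, 0, union))
    return samples
```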

get_data_info()

Return a summary dict with counts and column names.

H3DataPreprocessor

Preprocess H3 cell and species occurrence data for multi-task learning.

Functions
__init__()

Initialize the preprocessor with empty state.

normalize_environmental_features(env_features, fit=True)

Encode environmental features with type-appropriate transformations:

  • Categorical columns → one-hot encoded (NaN → all-zero row)
  • Fraction columns → passed through as-is (NaN → 0)
  • Continuous columns → StandardScaler (NaN → column mean before scaling)
  • Constant columns → dropped
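The four rules above can be sketched in a few lines of pandas. This is a simplified stand-in (z-scoring by hand rather than with a fitted StandardScaler, and `encode_env_features` is a hypothetical name), but the per-type behaviour matches the bullets:

```python
import pandas as pd

def encode_env_features(df: pd.DataFrame, categorical: set, fraction: set) -> pd.DataFrame:
    """One-hot categoricals, pass through fractions, z-score the rest, drop constants."""
    parts = []
    for col in df.columns:
        s = df[col]
        if col in categorical:
            # NaN gets no dummy column -> all-zero row
            parts.append(pd.get_dummies(s, prefix=col, dtype=float))
        elif col in fraction:
            parts.append(s.fillna(0.0).to_frame())
        else:
            s = s.fillna(s.mean())           # impute NaN with column mean
            std = s.std(ddof=0)
            if std == 0:
                continue                     # constant column -> dropped
            parts.append(((s - s.mean()) / std).to_frame())
    return pd.concat(parts, axis=1)
```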
build_species_vocabulary(species_lists, min_obs_per_species=0, max_species=0)

Build vocabulary of all unique species codes.

Parameters:

Name Type Description Default
species_lists List[List[str]]

Per-sample lists of species codes (eBird codes for birds, iNat IDs for non-birds).

required
min_obs_per_species int

If >0, exclude species observed in fewer than this many samples. Default 0 (keep all).

0
max_species int

If >0, randomly subsample the vocabulary to at most this many species (after min-obs filtering). Uses a fixed seed for reproducibility. Default 0 (keep all).

0
encode_species_multilabel(species_lists)

Convert species lists to multi-label binary matrix.

NOTE: only used for small datasets. For large datasets use encode_species_sparse() to avoid OOM on the dense matrix.

encode_species_sparse(species_lists)

Convert species lists to packed sparse index arrays.

Returns a dict with two contiguous arrays instead of a list of millions of small numpy arrays. This eliminates per-object refcount overhead that causes copy-on-write memory bloat with forked DataLoader workers.

Returns:

Type Description
Dict[str, ndarray]

{'values': int32, 'offsets': int64} where offsets[i] to offsets[i+1] gives the slice of values for sample i.
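The packed layout is the standard CSR-style "values + offsets" encoding. A minimal sketch of what the method plausibly does, assuming a `species_to_idx` vocabulary mapping (the function body here is illustrative, not the module's actual code):

```python
import numpy as np

def encode_species_sparse(species_lists, species_to_idx):
    """Pack per-sample species indices into two contiguous arrays."""
    offsets = np.zeros(len(species_lists) + 1, dtype=np.int64)
    values = []
    for i, lst in enumerate(species_lists):
        values.extend(species_to_idx[s] for s in lst if s in species_to_idx)
        offsets[i + 1] = len(values)
    return {"values": np.asarray(values, dtype=np.int32), "offsets": offsets}
```

Sample i's species are recovered as `values[offsets[i]:offsets[i + 1]]`, so the whole dataset lives in two arrays instead of millions of small Python objects.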

compute_obs_density(inputs, species_lists) staticmethod

Compute per-sample observation density for density-stratified evaluation.

For each unique location (lat, lon), sums the total number of species detections across all samples at that location. Each sample is then assigned its location's total density. This serves as a proxy for observer effort / survey intensity.

A well-surveyed H3 cell (e.g. Central Park, NYC) will have a high density value; a poorly surveyed cell (e.g. rural Siberia) will have a low value. During validation the density is used to stratify metrics — a model that generalizes well should have similar mAP in dense and sparse strata.

Parameters:

Name Type Description Default
inputs Dict[str, ndarray]

Dict with 'lat', 'lon' float32 arrays.

required
species_lists List[List[str]]

Per-sample lists of species codes (before encoding).

required

Returns:

Type Description
ndarray

Float32 array of shape (n_samples,) with per-location density.
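The group-and-sum described above can be vectorized with `np.unique` and `np.bincount`. A minimal sketch under the same inputs (not the module's actual implementation):

```python
import numpy as np

def compute_obs_density(lat, lon, species_lists):
    """Per-sample density = total species detections at that (lat, lon)."""
    counts = np.array([len(s) for s in species_lists], dtype=np.float32)
    coords = np.stack([lat, lon], axis=1)
    # Map each row to its unique-location index, then sum counts per location.
    _, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    totals = np.bincount(inverse, weights=counts)
    return totals[inverse].astype(np.float32)
```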

mask_regions(inputs, targets, regions) staticmethod

Split data into outside-region and inside-region subsets.

Samples whose (lat, lon) falls inside any of the given bounding boxes are moved to the "inside" subset; the rest stay in "outside". This enables region hold-out experiments: train on the outside subset and evaluate spatial generalization on the inside (held-out) subset.

Parameters:

Name Type Description Default
inputs Dict[str, ndarray]

Dict with 'lat', 'lon', 'week' (and optionally 'obs_density') arrays.

required
targets Dict[str, Any]

Dict with 'species' and 'env_features'.

required
regions List[Tuple[float, float, float, float]]

List of (lon_min, lat_min, lon_max, lat_max) bboxes.

required

Returns:

Type Description
Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any], Dict[str, Any]]

(inputs_outside, targets_outside, inputs_inside, targets_inside)
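The bounding-box test itself is a simple vectorized mask. A sketch of the core logic, using the (lon_min, lat_min, lon_max, lat_max) bbox order documented above (`region_mask` and `split_by_mask` are hypothetical helper names):

```python
import numpy as np

def region_mask(lat, lon, regions):
    """True where (lat, lon) falls inside any (lon_min, lat_min, lon_max, lat_max) bbox."""
    inside = np.zeros(lat.shape, dtype=bool)
    for lon_min, lat_min, lon_max, lat_max in regions:
        inside |= ((lon >= lon_min) & (lon <= lon_max) &
                   (lat >= lat_min) & (lat <= lat_max))
    return inside

def split_by_mask(arrays, mask):
    """Return (outside, inside) subsets of each array."""
    return [a[~mask] for a in arrays], [a[mask] for a in arrays]
```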

propagate_env_labels(lats, lons, weeks, species_lists, env_features, k=10, max_radius_km=1000.0, min_obs_threshold=10, soft_weight=0.5, max_spread_factor=2.0, env_dist_max=2.0, range_cap_km=500.0) staticmethod

Propagate species labels from observed to sparse/unobserved cells.

For each sample whose species list is shorter than min_obs_threshold, find the k nearest observed samples in environmental feature space (among samples from the same week), then copy species from neighbours within max_radius_km. Per-week matching prevents seasonal species from leaking across weeks (e.g. summer migrants appearing in winter).

Uses sparse matrix operations to vectorize the species merge and range check, avoiding per-species Python loops.

Parameters:

Name Type Description Default
lats ndarray

Per-sample latitudes.

required
lons ndarray

Per-sample longitudes.

required
weeks ndarray

Per-sample week numbers (0-48).

required
species_lists List[List[str]]

Per-sample species occurrence lists (mutable).

required
env_features DataFrame

Per-sample environmental feature DataFrame.

required
k int

Number of nearest neighbors to consider (default 10).

10
max_radius_km float

Geographic radius cap in km (default 1000).

1000.0
min_obs_threshold int

Samples with fewer species than this are considered sparse and receive propagated labels (default 10).

10
soft_weight float

Reserved for future soft-label support.

0.5
max_spread_factor float

Restrict species propagation based on their observed geographic range. A species will only propagate to a cell if the cell is within distance D of the nearest original observation, where D = max_spread_factor × (observed range diameter / 2). Set to 0 to disable range filtering (default 2.0).

2.0
env_dist_max float

Maximum Euclidean distance in standardized env-feature space between a sparse cell and its KNN neighbor for that neighbor to contribute labels. Neighbors further away in env space are dropped even if within max_radius_km. Set to 0 to disable (default 2.0).

2.0
range_cap_km float

Hard cap in km on the per-species propagation distance from the nearest original observation. Even if a species' bounding-box range would allow propagation farther, it is clamped to at most range_cap_km. Set to 0 to disable (default 500).

500.0

Returns:

Type Description
List[List[str]]

Modified species_lists with propagated labels (also mutated in place).
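The per-week KNN matching can be illustrated with a heavily simplified sketch. This version omits the geographic-radius cap, the per-species range filter, and the sparse-matrix vectorization described above, keeping only the env-space KNN with the `env_dist_max` cutoff (`propagate_labels` is a hypothetical name):

```python
import numpy as np

def propagate_labels(weeks, env, species_lists, k=10,
                     min_obs_threshold=10, env_dist_max=2.0):
    """Simplified per-week KNN label propagation in env-feature space."""
    species_lists = [list(s) for s in species_lists]
    for week in np.unique(weeks):
        idx = np.where(weeks == week)[0]
        dense = [i for i in idx if len(species_lists[i]) >= min_obs_threshold]
        sparse = [i for i in idx if len(species_lists[i]) < min_obs_threshold]
        if not dense:
            continue
        dense_env = env[dense]
        for i in sparse:
            # k nearest observed samples of the same week, in env space
            d = np.linalg.norm(dense_env - env[i], axis=1)
            merged = set(species_lists[i])
            for j in np.argsort(d)[:k]:
                if env_dist_max and d[j] > env_dist_max:
                    continue  # too dissimilar in env space -> dropped
                merged.update(species_lists[dense[j]])
            species_lists[i] = sorted(merged)
    return species_lists
```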

prepare_training_data(lats, lons, weeks, species_lists, env_features, fit=True, max_obs_per_species=0, min_obs_per_species=0, max_species=0)

Run full preprocessing: encode inputs, normalize targets, build vocab.

Parameters:

Name Type Description Default
max_obs_per_species int

If >0, cap observations so no single species contributes more than this many positive samples. Reduces the influence of hyper-common species on training. Samples are dropped randomly. Default 0 (no cap).

0
min_obs_per_species int

If >0, exclude species observed in fewer than this many samples from the vocabulary. Default 0 (keep all).

0
max_species int

If >0, randomly subsample the vocabulary to at most this many species. Default 0 (keep all).

0
compute_species_freq_weights(species_lists, lats, lons, min_weight=0.1, pct_lo=10.0, pct_hi=90.0)

Compute per-species label weights via region-normalized frequency.

Citizen-science observation density varies enormously across regions. The US alone can contribute an order of magnitude more records than the Neotropics, so a naive global frequency count would assign high weights to common US species while suppressing species-rich tropical communities. Region-normalized weighting solves this by computing frequency percentile ranks within geographic bins and using the maximum regional percentile as each species' weight basis.

Algorithm:

  1. Partition samples into geographic bins (30° lat × 60° lon).
  2. Within each bin, count per-species occurrences.
  3. Within each bin, compute the percentile rank of every species (among species present in that bin).
  4. For each species, take the max percentile rank across bins.
  5. Map that max-regional-percentile to a weight via linear interpolation controlled by pct_lo / pct_hi.

This makes weights independent of absolute observation density: a species at the 90th percentile in Colombia gets the same weight as one at the 90th percentile in the US.

Parameters:

Name Type Description Default
species_lists List[List[str]]

Per-sample species occurrence lists.

required
lats ndarray

Per-sample latitudes.

required
lons ndarray

Per-sample longitudes.

required
min_weight float

Floor weight for rare species.

0.1
pct_lo float

Lower percentile threshold. Default 10.

10.0
pct_hi float

Upper percentile threshold. Default 90.

90.0

Returns:

Type Description
ndarray

Array of shape (n_species,) stored as self.species_freq_weights.
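The five-step algorithm above can be sketched compactly. This is an illustrative reconstruction, not the module's code: the tie-breaking in percentile ranks and the exact interpolation (low percentile → min_weight floor, inferred from "Floor weight for rare species") are assumptions:

```python
import numpy as np
from collections import Counter, defaultdict

def species_freq_weights(species_lists, lats, lons, min_weight=0.1,
                         pct_lo=10.0, pct_hi=90.0):
    """Max regional frequency percentile -> linear weight in [min_weight, 1]."""
    # 1. Assign each sample to a 30° lat x 60° lon geographic bin.
    bins = zip(np.floor(lats / 30.0).astype(int), np.floor(lons / 60.0).astype(int))
    per_bin = defaultdict(Counter)
    for b, lst in zip(bins, species_lists):
        per_bin[b].update(lst)          # 2. per-bin occurrence counts
    # 3-4. Percentile rank within each bin; keep each species' max across bins.
    max_pct = {}
    for counter in per_bin.values():
        species, counts = zip(*counter.items())
        ranks = np.argsort(np.argsort(counts))   # 0..n-1 rank by count
        pct = 100.0 * ranks / max(len(counts) - 1, 1)
        for sp, p in zip(species, pct):
            max_pct[sp] = max(max_pct.get(sp, 0.0), p)
    # 5. Linear map: pct <= pct_lo -> min_weight, pct >= pct_hi -> 1.0.
    vocab = sorted(max_pct)
    pcts = np.array([max_pct[s] for s in vocab])
    t = np.clip((pcts - pct_lo) / (pct_hi - pct_lo), 0.0, 1.0)
    return vocab, min_weight + (1.0 - min_weight) * t
```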

subsample_by_location(inputs, targets, fraction=1.0, random_state=42)

Randomly subsample a fraction of locations (and all their samples).

Subsampling is location-based: unique (lat, lon) positions are sampled, then all rows belonging to the selected locations are retained. This preserves the temporal structure within each H3 cell and keeps the data suitable for a subsequent location-based train/val/test split.

Parameters:

Name Type Description Default
inputs Dict[str, ndarray]

Dict with 'lat', 'lon', 'week' arrays.

required
targets Dict[str, Any]

Dict with 'species' and 'env_features'.

required
fraction float

Fraction of locations to keep (0 < fraction <= 1).

1.0
random_state int

Random seed for reproducibility.

42

Returns:

Type Description
Tuple[Dict[str, ndarray], Dict[str, Any]]

(inputs, targets) subsets with only the selected locations.
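Location-based subsampling reduces to building a boolean row mask over unique coordinate pairs. A minimal sketch of that core step (`subsample_by_location_mask` is a hypothetical name; the real method applies the mask to every input/target array):

```python
import numpy as np

def subsample_by_location_mask(lat, lon, fraction=1.0, random_state=42):
    """Boolean row mask keeping `fraction` of unique (lat, lon) locations."""
    coords = np.stack([lat, lon], axis=1)
    _, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    n_locations = inverse.max() + 1
    rng = np.random.default_rng(random_state)
    n_keep = max(1, int(round(fraction * n_locations)))
    keep = rng.choice(n_locations, size=n_keep, replace=False)
    # All rows of a selected location are kept, preserving temporal structure.
    return np.isin(inverse, keep)
```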

subsample_by_samples(inputs, targets, fraction=1.0, random_state=42)

Randomly subsample a fraction of individual samples (week@location rows).

Unlike subsample_by_location, which drops entire H3 cells, this method drops individual week-rows while preserving at least some data for every location. This avoids losing small islands that have few cells but whose endemic species are important to monitor.

Parameters:

Name Type Description Default
inputs Dict[str, ndarray]

Dict with 'lat', 'lon', 'week' arrays.

required
targets Dict[str, Any]

Dict with 'species' and 'env_features'.

required
fraction float

Fraction of samples to keep (0 < fraction <= 1).

1.0
random_state int

Random seed for reproducibility.

42

Returns:

Type Description
Tuple[Dict[str, ndarray], Dict[str, Any]]

(inputs, targets) subsets with the selected samples.

split_data(inputs, targets, val_size=0.1, random_state=42, split_by_location=True, **kwargs)

Split into train/val (optionally grouped by location to prevent leakage).

Handles both dense ndarray and sparse list-of-arrays species targets.

Returns:

Type Description
Tuple

(train_inputs, val_inputs, train_targets, val_targets)

get_preprocessing_info()

Return a dict with species vocab size and environmental feature info.

BirdSpeciesDataset

Bases: Dataset

PyTorch Dataset for bird species occurrence prediction.

Species targets can be either
  • Dense: np.ndarray of shape [n_samples, n_species]
  • Sparse (packed): dict with 'values' (int32) and 'offsets' (int64)

When sparse, the dense one-hot vector is materialized on the fly in the collate function, keeping resident memory proportional to the number of observations rather than samples × species.
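The densification step is a scatter from the packed arrays into a zeroed batch matrix. Shown here with NumPy for brevity (the actual collate function builds torch tensors, for which `torch.zeros` plus the same index assignment is the direct analogue; `densify_batch` is a hypothetical name):

```python
import numpy as np

def densify_batch(values, offsets, batch_indices, n_species):
    """Materialize dense multi-hot targets for one batch from packed arrays."""
    dense = np.zeros((len(batch_indices), n_species), dtype=np.float32)
    for row, i in enumerate(batch_indices):
        # Scatter this sample's species indices into its batch row.
        dense[row, values[offsets[i]:offsets[i + 1]]] = 1.0
    return dense
```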

Functions
__init__(inputs, targets, n_species=0, jitter_std=0.0, species_freq_weights=None)

Wrap preprocessed arrays as a PyTorch Dataset.

Parameters:

Name Type Description Default
inputs Dict[str, ndarray]

Dict with 'lat', 'lon', 'week' float32 arrays.

required
targets Dict[str, Any]

Dict with 'species' (dense or sparse) and 'env_features'.

required
n_species int

Total number of species (required when species is sparse).

0
jitter_std float

Standard deviation (degrees) of Gaussian noise added to lat/lon coordinates each time a sample is drawn. Set to 0.0 to disable (default). Typically derived from H3 cell resolution via H3DataLoader.compute_jitter_std.

0.0
species_freq_weights Optional[ndarray]

Optional 1-D array of per-species label weights. When provided, positive labels use the weight instead of 1.0.

None
__getitem__(idx)

Return (inputs_dict, targets_dict) for one sample.

Functions

create_dataloaders(train_inputs, train_targets, val_inputs, val_targets, batch_size=256, num_workers=0, pin_memory=True, n_species=0, jitter_std=0.0, species_freq_weights=None)

Create training and validation DataLoaders.

All data is held in memory as PyTorch tensors. Callers should subsample before calling this function if only a fraction of the data is needed (see H3DataPreprocessor.subsample_by_location).

When species targets are sparse (list of index arrays), a custom collate function builds the dense (batch, n_species) tensor once per batch instead of per sample, reducing allocation pressure ~1000×.

Parameters:

Name Type Description Default
jitter_std float

Gaussian noise std (degrees) added to training coordinates each time a sample is drawn. Validation coordinates are never jittered.

0.0
species_freq_weights Optional[ndarray]

Optional per-species label weights. Applied to training set only; validation uses binary labels.

None

get_class_weights(species_targets, smoothing=100.0, max_weight=50.0)

Compute positive class weights for imbalanced species.
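The exact formula is not documented here; one common recipe consistent with the `smoothing` and `max_weight` parameters is a smoothed negative-to-positive ratio, clipped at the cap. A sketch under that assumption (not necessarily the module's actual formula):

```python
import numpy as np

def get_class_weights(species_targets, smoothing=100.0, max_weight=50.0):
    """Inverse-frequency positive weights, smoothed and capped (assumed formula)."""
    pos = species_targets.sum(axis=0).astype(np.float64)  # positives per species
    neg = species_targets.shape[0] - pos
    w = (neg + smoothing) / (pos + smoothing)             # smoothed neg/pos ratio
    return np.minimum(w, max_weight).astype(np.float32)   # cap rare-species weights
```

The smoothing term keeps weights near 1 for very rare species instead of letting them explode, and `max_weight` bounds the loss contribution of any single class.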