Combining Data
utils/combine.py joins the H3 environmental grid (from Stage 1) with the processed GBIF observations (from Stage 2) to produce a training-ready dataset.
How It Works
- Load the H3 grid: read the GeoParquet file containing the environmental features.
- Stream GBIF observations: read the processed CSV in chunks.
- Map observations to cells: map each observation's (lat, lon) to its containing H3 cell using `h3.latlng_to_cell()`.
- Aggregate by week: for each cell, group observations by BirdNET week number (1–48), producing a list of species codes per week.
- Write outputs: write the combined parquet and a taxonomy CSV.
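
The mapping and aggregation steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in utils/combine.py: `birdnet_week`, `aggregate_by_week`, and the toy `grid_cell` function are hypothetical names, the four-weeks-per-month formula is an assumed reading of the 48-week convention, the species codes are placeholders, and deduplicating codes within a week is also an assumption.

```python
from collections import defaultdict
from datetime import date


def birdnet_week(day: date) -> int:
    """Map a date to a week in 1..48 (assumed: four 'weeks' per month)."""
    # Days 1-7 -> 1, 8-14 -> 2, 15-21 -> 3, day 22 onward -> 4.
    week_in_month = min(4, (day.day - 1) // 7 + 1)
    return (day.month - 1) * 4 + week_in_month


def grid_cell(lat: float, lon: float) -> str:
    """Toy stand-in for an H3 lookup: label of a 1-degree grid square."""
    return f"{int(lat // 1)}_{int(lon // 1)}"


def aggregate_by_week(observations, cell_of=grid_cell):
    """Group species codes by cell and week.

    observations: iterable of (lat, lon, date, species_code) tuples.
    Returns {cell_id: {week: sorted list of unique species codes}}.
    """
    grouped = defaultdict(lambda: defaultdict(set))
    for lat, lon, day, code in observations:
        grouped[cell_of(lat, lon)][birdnet_week(day)].add(code)
    return {
        cell: {week: sorted(codes) for week, codes in weeks.items()}
        for cell, weeks in grouped.items()
    }


# Placeholder observations: two in early January, one at the end of December.
sample = [
    (52.52, 13.40, date(2023, 1, 3), "sp_a"),
    (52.53, 13.41, date(2023, 1, 5), "sp_b"),
    (52.52, 13.40, date(2023, 12, 30), "sp_a"),
]
result = aggregate_by_week(sample)
```

With the h3 package (v4 API) installed, `cell_of` could instead wrap `h3.latlng_to_cell(lat, lon, resolution)`, where the resolution would need to match the one used to build the grid in Stage 1.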
CLI Options
python utils/combine.py \
--geodata data/global_50km_ee.parquet \
--gbif ./outputs/gbif_processed.csv.gz \
--output ./outputs/combined.parquet \
--valid_classes Aves Mammalia Amphibia \
--workers 16
| Flag | Description |
|---|---|
| `--geodata` | H3 GeoParquet from geoutils.py |
| `--gbif` | Processed GBIF CSV from gbifutils.py |
| `--output` | Output path for combined parquet |
| `--valid_classes` | Taxonomic classes to include (default: all) |
| `--workers` | Parallel worker processes for H3 cell computation (default: 1) |
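
For orientation, a parser with these flags might look like the sketch below. It is not taken from utils/combine.py: `build_parser` is a hypothetical name, treating the three path flags as required is an assumption, and representing "all classes" as `None` is one possible way to encode the documented default.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Join the H3 grid with processed GBIF observations."
    )
    # Assumption: the three path arguments are mandatory.
    parser.add_argument("--geodata", required=True, help="H3 GeoParquet from geoutils.py")
    parser.add_argument("--gbif", required=True, help="Processed GBIF CSV from gbifutils.py")
    parser.add_argument("--output", required=True, help="Output path for combined parquet")
    # Assumption: None stands for "include all taxonomic classes".
    parser.add_argument("--valid_classes", nargs="*", default=None,
                        help="Taxonomic classes to include (default: all)")
    parser.add_argument("--workers", type=int, default=1,
                        help="Parallel worker processes for H3 cell computation")
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args([
    "--geodata", "data/global_50km_ee.parquet",
    "--gbif", "./outputs/gbif_processed.csv.gz",
    "--output", "./outputs/combined.parquet",
    "--valid_classes", "Aves", "Mammalia", "Amphibia",
    "--workers", "16",
])
```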
Output Files
Combined Parquet
Each row is an H3 cell with:
| Column | Description |
|---|---|
| `h3_index` | H3 cell identifier |
| `geometry` | Cell polygon |
| Environmental columns | `elevation_m`, `temperature_c`, etc. |
| `week_1` … `week_48` | List of species codes observed in that week |
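
As an example of working with this layout, the sketch below counts distinct species per week for a single row. It assumes a row has already been loaded into a plain dict (for instance from `geopandas.read_parquet(...)`, which is itself an assumption about how you read the file); `weekly_richness`, the cell identifier, and the species codes are hypothetical placeholders.

```python
def weekly_richness(row: dict) -> dict:
    """Return {week: number of distinct species codes} for one cell row."""
    counts = {}
    for week in range(1, 49):
        codes = row.get(f"week_{week}")
        # A missing column or a null value counts as no observations.
        counts[week] = len(set(codes)) if codes is not None else 0
    return counts


# Placeholder row with only two of the 48 week columns filled in.
example_row = {
    "h3_index": "cell_a",
    "elevation_m": 34.0,
    "week_1": ["sp_a", "sp_b", "sp_a"],
    "week_2": [],
}
richness = weekly_richness(example_row)
```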
Taxonomy CSV
Auto-generated alongside the parquet (with _taxonomy.csv suffix):
| Column | Description |
|---|---|
| `species_code` | eBird species code or iNat ID |
| `scientificName` | Binomial scientific name |
| `commonName` | Common name (if available) |
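
The sketch below shows one way to locate and index the taxonomy file. The naming rule in `taxonomy_path` (replacing the `.parquet` extension with `_taxonomy.csv`) is an assumed reading of the suffix described above, and `taxonomy_path`, `load_taxonomy`, and the sample rows are hypothetical.

```python
import csv
import io
from pathlib import Path


def taxonomy_path(output: str) -> Path:
    """Assumed naming: 'combined.parquet' -> 'combined_taxonomy.csv'."""
    out = Path(output)
    return out.with_name(out.stem + "_taxonomy.csv")


def load_taxonomy(handle) -> dict:
    """Index taxonomy rows by species_code."""
    return {row["species_code"]: row for row in csv.DictReader(handle)}


# In-memory placeholder standing in for the generated CSV file.
example_csv = io.StringIO(
    "species_code,scientificName,commonName\n"
    "sp_a,Genus alpha,Example bird\n"
    "sp_b,Genus beta,\n"
)
taxonomy = load_taxonomy(example_csv)
```

A lookup such as `taxonomy["sp_a"]["scientificName"]` then resolves the codes stored in the week columns; an empty `commonName` indicates that no common name was available.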