utils.gbifutils¶
GBIF data processing utilities.
Functions for reading, filtering, and transforming GBIF Darwin Core Archive records into a clean CSV suitable for downstream H3 aggregation.
Functions¶
date_to_week(day, month)
¶
Convert day/month arrays to BirdNET week numbers (1-48, 4 weeks per month). Works with both scalar and vectorized (numpy array) inputs.
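A minimal sketch of this kind of conversion, assuming days 1-7 map to the first week of each month, 8-14 to the second, and so on, with everything past day 21 capped at week 4 (the actual binning used may differ):

```python
import numpy as np

def date_to_week(day, month):
    # Hypothetical sketch: 4 weeks per month, so weeks run 1-48 over the year.
    # np.asarray lets the same code accept scalars or numpy arrays.
    day = np.asarray(day)
    month = np.asarray(month)
    week_in_month = np.minimum((day - 1) // 7, 3)  # 0..3 within the month
    return (month - 1) * 4 + week_in_month + 1
```

For example, January 1 falls in week 1 and December 31 in week 48, matching BirdNET's 48-week calendar.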
estimate_rows(zip_archive, file_path, sample_rows=10000)
¶
Estimate the total number of rows in a zipped CSV file by sampling.
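One plausible sketch of such an estimate, assuming it divides the file's uncompressed size by the average byte length of the first `sample_rows` lines (the real heuristic may differ):

```python
import io
import zipfile

def estimate_rows(zip_archive, file_path, sample_rows=10_000):
    # Hypothetical sketch: sample leading lines, measure their mean byte
    # length, and extrapolate from the uncompressed file size.
    info = zip_archive.getinfo(file_path)
    sampled = 0
    sampled_bytes = 0
    with zip_archive.open(file_path) as fh:
        text = io.TextIOWrapper(fh, encoding="utf-8", errors="replace")
        for line in text:
            sampled += 1
            sampled_bytes += len(line.encode("utf-8"))
            if sampled >= sample_rows:
                break
    if sampled == 0:
        return 0
    return int(info.file_size / (sampled_bytes / sampled))
```

This avoids decompressing the whole member, which matters for multi-gigabyte GBIF archives.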
load_taxonomy(taxonomy_path)
¶
Load taxonomy CSV.
Returns:

| Name | Type | Description |
|---|---|---|
| `valid_names` | `set` | All valid scientific names (including synonyms). |
| `common_names` | `dict` | Mapping of sciName to common name (English). |
process_gbif_file(gbif_zip_path, file, output_csv_path, valid_classes=None, taxonomy_path=None, max_rows=None, n_workers=None)
¶
Process a GBIF Darwin Core Archive zip using parallel workers.
Reads raw byte blocks from the zip sequentially, then distributes parsing and filtering across `n_workers` processes.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `gbif_zip_path` | `str` | Path to the GBIF Darwin Core Archive zip file. | *required* |
| `file` | `str` | Name of the CSV/TSV file inside the zip. | *required* |
| `output_csv_path` | `str` | Output path for the processed CSV. | *required* |
| `valid_classes` | `list[str] \| None` | List of taxonomic classes to keep. | `None` |
| `taxonomy_path` | `str \| None` | Path to taxonomy CSV for filtering. | `None` |
| `max_rows` | `int \| None` | Maximum number of rows to process. | `None` |
| `n_workers` | `int \| None` | Number of parallel worker processes. | `None` |