utils.gbifutils

GBIF Darwin Core Archive processing and filtering.

Functions for reading, filtering, and transforming GBIF Darwin Core Archive records into a clean CSV suitable for downstream H3 aggregation.

Functions

date_to_week(day, month)

Convert day/month arrays to BirdNET week numbers (1-48, 4 weeks per month). Works with both scalar and vectorized (numpy array) inputs.
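A minimal sketch of such a conversion, assuming each month is split into four bins of roughly a week each (the exact day-binning used by the real function is an assumption here):

```python
import numpy as np

def date_to_week(day, month):
    # Hypothetical sketch: 48 weeks total, 4 per month.
    # Assumed binning: days 1-7 -> offset 0, 8-14 -> 1, 15-21 -> 2, 22+ -> 3.
    day = np.asarray(day)
    month = np.asarray(month)
    offset = np.minimum((day - 1) // 7, 3)
    return (month - 1) * 4 + offset + 1
```

Because the arithmetic is pure numpy, the same function accepts scalars or whole arrays of days and months.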

estimate_rows(zip_archive, file_path, sample_rows=10000)

Estimate the total number of rows in a zipped CSV file by sampling.

load_taxonomy(taxonomy_path)

Load taxonomy CSV.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| valid_names | set | All valid scientific names (including synonyms). |
| common_names | dict | Mapping of sciName to common name (English). |

process_gbif_file(gbif_zip_path, file, output_csv_path, valid_classes=None, taxonomy_path=None, max_rows=None, n_workers=None)

Process a GBIF Darwin Core Archive zip using parallel workers.

Reads raw byte blocks from the zip sequentially, then distributes parsing and filtering across n_workers processes.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| gbif_zip_path | str | Path to the GBIF Darwin Core Archive zip file. | required |
| file | str | Name of the CSV/TSV file inside the zip. | required |
| output_csv_path | str | Output path for the processed CSV. | required |
| valid_classes | list[str] \| None | List of taxonomic classes to keep. | None |
| taxonomy_path | str \| None | Path to taxonomy CSV for filtering. | None |
| max_rows | int \| None | Maximum number of rows to process. | None |
| n_workers | int \| None | Number of parallel worker processes; defaults to min(cpu_count - 1, 8). | None |