utils.gbifutils¶
GBIF data processing utilities.
Functions for reading, filtering, and transforming GBIF Darwin Core Archive records into a clean CSV suitable for downstream H3 aggregation.
Functions¶
date_to_week(day, month)
¶
Convert day/month arrays to BirdNET week numbers (1-48, 4 weeks per month). Works with both scalar and vectorized (numpy array) inputs.
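A minimal sketch of this kind of conversion, assuming days 1-7 map to the first week of each month, 8-14 to the second, and so on, with everything past day 21 capped at week 4 (the actual binning used may differ):

```python
import numpy as np

def date_to_week(day, month):
    # Hypothetical sketch: 4 weeks per month, so weeks run 1-48 over the year.
    # np.asarray lets the same code accept scalars or numpy arrays.
    day = np.asarray(day)
    month = np.asarray(month)
    week_in_month = np.minimum((day - 1) // 7, 3)  # 0..3 within the month
    return (month - 1) * 4 + week_in_month + 1
```

For example, January 1 falls in week 1 and December 31 in week 48, matching BirdNET's 48-week calendar.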
estimate_rows(zip_archive, file_path, sample_rows=10000)
¶
Estimate the total number of rows in a zipped CSV file by sampling.
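One plausible sketch of such an estimate, assuming it divides the file's uncompressed size by the average byte length of the first `sample_rows` lines (the real heuristic may differ):

```python
import io
import zipfile

def estimate_rows(zip_archive, file_path, sample_rows=10_000):
    # Hypothetical sketch: sample leading lines, measure their mean byte
    # length, and extrapolate from the uncompressed file size.
    info = zip_archive.getinfo(file_path)
    sampled = 0
    sampled_bytes = 0
    with zip_archive.open(file_path) as fh:
        text = io.TextIOWrapper(fh, encoding="utf-8", errors="replace")
        for line in text:
            sampled += 1
            sampled_bytes += len(line.encode("utf-8"))
            if sampled >= sample_rows:
                break
    if sampled == 0:
        return 0
    return int(info.file_size / (sampled_bytes / sampled))
```

This avoids decompressing the whole member, which matters for multi-gigabyte GBIF archives.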
load_taxonomy(taxonomy_path)
¶
Load taxonomy CSV.
Returns:

| Name | Type | Description |
|---|---|---|
| `valid_names` | `set` | All valid scientific names (including synonyms). |
| `common_names` | `dict` | Mapping of sciName to common name (English). |
process_gbif_file(gbif_zip_path, file, output_csv_path, valid_classes=None, taxonomy_path=None, max_rows=None, n_workers=None)
¶
Process a GBIF Darwin Core Archive zip using parallel workers.
Reads raw byte blocks from the zip sequentially, then distributes parsing and filtering across `n_workers` processes.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `gbif_zip_path` | `str` | Path to the GBIF Darwin Core Archive zip file. | *required* |
| `file` | `str` | Name of the CSV/TSV file inside the zip. | *required* |
| `output_csv_path` | `str` | Output path for the processed CSV. | *required* |
| `valid_classes` | `list[str] \| None` | List of taxonomic classes to keep. | `None` |
| `taxonomy_path` | `str \| None` | Path to taxonomy CSV for filtering. | `None` |
| `max_rows` | `int \| None` | Maximum number of rows to process. | `None` |
| `n_workers` | `int \| None` | Number of parallel worker processes. | `None` |