Consider reading in data with {duckplyr} #91

@lawalter

Reading in HIP data is slow. Switching to a data lake model with {duckplyr} would allow better inter-download (#46) and inter-state deduplication, and may also speed up reading.

Running migbirdHIP::read_hip() on the 3.2 million records of 2025-2026 season data takes ~5.4 minutes. In comparison, the following {duckplyr} code, which reads the file and splits it into columns but does not include several data checks etc., takes only 40 seconds.

# Fixed-width field widths, named with the package's internal field names
fwfpos <- 
  c(1, 15, 1, 20, 3, 60, 20, 2, 10, 10, 10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 100) |> 
  setNames(migbirdHIP:::REF_FIELDS_ALL)

# Read the raw file through DuckDB, with no header row
raw_data <- 
  duckplyr::read_file_duckdb(
    path = my_path,
    table_function = "read_csv_auto",
    options = list(header = FALSE)
  )

# Split each record into fields by position, then collect into memory
season_data <-
  raw_data |> 
  duckplyr::as_duckdb_tibble(prudence = "lavish") |> 
  tidyr::separate_wider_position(
    cols = tidyselect::everything(),
    widths = fwfpos,
    too_few = "align_start", 
    too_many = "drop") |> 
  duckplyr::collect()
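
For anyone unfamiliar with the column delimitation step, here is a minimal sketch of how tidyr::separate_wider_position() splits a single text column using a named widths vector. The toy_widths layout, the column0 column name, and the sample strings are made up for illustration; the real widths and field names come from fwfpos above.

```r
library(tidyr)

# Hypothetical three-field layout (not the real HIP layout)
toy_widths <- c(id = 1, code = 3, name = 5)

# One text column holding fixed-width records, as a stand-in for raw_data
toy <- tibble::tibble(column0 = c("1ABCAlice", "2XYZBruce"))

# Split the single column into one column per named width
toy_split <-
  toy |>
  separate_wider_position(
    cols = column0,
    widths = toy_widths,
    too_few = "align_start",
    too_many = "drop"
  )

print(toy_split)
```

The names of the widths vector become the output column names, which is why fwfpos is passed through setNames() before the split.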

Labels

workflow: Improvement to processing speed or methodology