Consider reading in data with `{duckplyr}` #91

Reading in HIP data is slow. Switching to a data lake model with `{duckplyr}` would allow better inter-download (#46) and inter-state deduplication, and may also make reading faster.

Running `migbirdHIP::read_hip()` on the 3.2 million records of 2025-2026 season data takes **~5.4 minutes**. In comparison, running the following `{duckplyr}` reading and column delimitation functions takes only **40 seconds**. Note that this code omits several data checks that `read_hip()` performs, so the comparison is not like-for-like.

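```r
library(tidyr) # separate_wider_position(), everything()

# Field widths for the fixed-width HIP records, named by field
fwfpos <- 
  c(1, 15, 1, 20, 3, 60, 20, 2, 10, 10, 10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 100) |> 
  setNames(migbirdHIP:::REF_FIELDS_ALL)

# Read the raw file(s) with DuckDB; `my_path` points to the HIP data
raw_data <- 
  duckplyr::read_file_duckdb(
    path = my_path,
    table_function = "read_csv_auto",
    options = list(header = FALSE)
  )

# Split each record into fields by position, then materialize the result
season_data <-
  raw_data |> 
  duckplyr::as_duckdb_tibble(prudence = "lavish") |> 
  separate_wider_position(
    cols = everything(),
    widths = fwfpos,
    too_few = "align_start", 
    too_many = "drop") |> 
  duckplyr::collect()
```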
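
For anyone unfamiliar with how a named widths vector like `fwfpos` maps onto a fixed-width record, here is a minimal base R sketch of the idea. The field names, widths, and sample record are made up for illustration (the real ones come from `migbirdHIP:::REF_FIELDS_ALL` and the vector above), and `split_fixed_width()` is a hypothetical helper, not part of `{migbirdHIP}`, `{tidyr}`, or `{duckplyr}`.

```r
# Hypothetical field names and widths, for illustration only
widths <- c(title = 1, firstname = 15, middle = 1, lastname = 20)

# Split one fixed-width line into named, whitespace-trimmed fields
split_fixed_width <- function(line, widths) {
  ends <- cumsum(widths)        # last character position of each field
  starts <- ends - widths + 1   # first character position of each field
  fields <- trimws(substring(line, starts, ends))
  names(fields) <- names(widths)
  fields
}

# A made-up record, left-justified and padded to 1 + 15 + 1 + 20 = 37 characters
line <- sprintf("%s%-15s%s%-20s", "1", "Ada", "M", "Lovelace")

record <- split_fixed_width(line, widths)
print(record)
```

Each field starts one character after the previous field ends, so the cumulative sum of the widths gives the end positions. This sketch trims padding and handles a single line at a time; it is only meant to show the position arithmetic, not to replace the vectorized `separate_wider_position()` call.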
