This vignette shows how to ingest many CSV files at
once via convert_standard(), deriving
sample_id automatically from filenames
(sample_id_from_filenames = TRUE). We’ll create a set of example files
named by sample (e.g., R25P01_01_002616.csv), ingest them in a single
call, and derive run_id and plate_id from the sample_id pattern.

Pattern: R<run>P<plate>_<rest>. We’ll keep the part before the first
underscore and then peel off the R.. and P.. parts.
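As a small local illustration of that split, here is a base R sketch that works on plain character vectors, independent of phiperio. The helper name parse_sample_id is hypothetical and exists only for this example.

```r
# Hypothetical helper: split "R<run>P<plate>_<rest>" into its parts (base R only).
parse_sample_id <- function(sample_id) {
  # Keep everything before the first underscore, e.g. "R25P01"
  rp <- sub("_.*$", "", sample_id)
  data.frame(
    sample_id = sample_id,
    run_id    = sub("P.*$", "", rp),    # text before the first P, e.g. "R25"
    plate_id  = sub("^[^P]*", "", rp),  # text from the first P onward, e.g. "P01"
    stringsAsFactors = FALSE
  )
}

parse_sample_id(c("R25P01_01_002616", "R27P02_02_002625"))
```

This version assumes the run label itself never contains a capital P; if it could, a stricter pattern would be needed.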
Each file represents one sample. Columns are long-format:
peptide_id, exist,
fold_change.
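# Assumed setup: phiperio for convert_standard() / get_counts(),
# dplyr for the verbs (mutate, distinct, arrange, collect) used below
library(phiperio)
library(dplyr)

tmp_dir <- withr::local_tempdir()
file_names <- c(
  "R25P01_01_002616.csv",
  "R25P01_01_002617.csv",
  "R25P02_01_002618.csv",
  "R25P02_02_002619.csv",
  "R26P01_03_002620.csv",
  "R26P01_04_002621.csv",
  "R26P02_05_002622.csv",
  "R27P01_01_002623.csv",
  "R27P02_01_002624.csv",
  "R27P02_02_002625.csv"
)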
# Simple helper to write a tiny two-row CSV
write_one <- function(path) {
  dat <- data.frame(
    peptide_id = c("p1", "p2"),
    exist = c(1, 0),
    fold_change = c(1.2, 0.9),
    stringsAsFactors = FALSE
  )
  utils::write.csv(dat, file.path(tmp_dir, path), row.names = FALSE)
}
invisible(lapply(file_names, write_one))

# Peek at the first file
one_file <- file.path(tmp_dir, file_names[[1]])
read.csv(one_file, stringsAsFactors = FALSE)
#>   peptide_id exist fold_change
#> 1         p1     1         1.2
#> 2         p2     0         0.9

You see two rows: one per peptide_id for this sample.
All files have the same column layout; only the filename differs (and
will become sample_id).
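To make the filename-to-sample_id mapping concrete, here is a base R sketch of the same idea on a throwaway directory. It is not how convert_standard() is implemented; it only assumes that the filename stem (the basename without the .csv extension) becomes sample_id. It writes its own two demo files so it runs on its own, and demo_dir and read_with_id are hypothetical names.

```r
# Self-contained setup: two tiny CSVs in a fresh temporary directory
demo_dir <- tempfile("phip_demo_")
dir.create(demo_dir)
for (f in c("R25P01_01_002616.csv", "R25P01_01_002617.csv")) {
  write.csv(
    data.frame(peptide_id = c("p1", "p2"), exist = c(1, 0),
               fold_change = c(1.2, 0.9)),
    file.path(demo_dir, f),
    row.names = FALSE
  )
}

# Read one file and tag every row with the filename stem as sample_id
read_with_id <- function(path) {
  dat <- read.csv(path, stringsAsFactors = FALSE)
  stem <- tools::file_path_sans_ext(basename(path))
  data.frame(sample_id = stem, dat, stringsAsFactors = FALSE)
}

paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
per_file <- lapply(paths, read_with_id)

# Every file must share the same column layout before stacking
stopifnot(length(unique(lapply(per_file, names))) == 1)

combined <- do.call(rbind, per_file)
unique(combined$sample_id)
```

The layout check is the useful part to carry over to real data: stacking files with mismatched columns fails late and with a less helpful message.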
We point convert_standard() at the directory and set
sample_id_from_filenames = TRUE. phiperio will derive
sample_id from each filename stem (e.g.,
R25P01_01_002616).

pd <- convert_standard(
  data_long_path = tmp_dir,
  sample_id_from_filenames = TRUE,
  peptide_library = FALSE,   # set TRUE if you need peptide annotations
  materialise_table = FALSE, # view is fine for exploration
  auto_expand = FALSE
)
#> Skipping ANALYZE - raw_combined is a view.
#> [06:12:04] INFO Constructing <phip_data> object
#> -> create_data()
#> [06:12:04] INFO Validating <phip_data>
#> -> validate_phip_data()
#> [06:12:04] INFO Checking structural requirements (shape & mandatory columns)
#> [06:12:04] INFO Checking outcome family availability (exist / fold_change /
#> raw_counts)
#> [06:12:04] INFO Checking collisions with reserved names
#> - subject_id, sample_id, timepoint, peptide_id, exist,
#> fold_change, counts_input, counts_hit
#> [06:12:04] INFO Ensuring all columns are atomic (no list-cols)
#> [06:12:04] INFO Checking key uniqueness
#> [06:12:04] INFO Validating value ranges & types for outcomes
#> [06:12:04] INFO Assessing sparsity (NA/zero prevalence vs threshold)
#> - warn threshold: 50%
#> [06:12:04] INFO Checking peptide_id coverage against peptide_library
#> [06:12:04] INFO Checking full grid completeness (peptide * sample)
#> [06:12:04] OK Counts table is a full peptide * sample grid
#> [06:12:04] OK Validating <phip_data> - done
#> -> elapsed: 0.198s
#> [06:12:04] OK Constructing <phip_data> object - done
#> -> elapsed: 0.199s

Check distinct sample IDs:
get_counts(pd) |>
  distinct(sample_id) |>
  arrange(sample_id) |>
  collect()
#> # A tibble: 10 × 1
#>    sample_id
#>    <chr>
#>  1 R25P01_01_002616
#>  2 R25P01_01_002617
#>  3 R25P02_01_002618
#>  4 R25P02_02_002619
#>  5 R26P01_03_002620
#>  6 R26P01_04_002621
#>  7 R26P02_05_002622
#>  8 R27P01_01_002623
#>  9 R27P02_01_002624
#> 10 R27P02_02_002625

Our filenames have the shape
R<run>P<plate>_<rest>. We can extract those
parts with one regex replace and two regex extracts
(regexp_replace() and regexp_extract() are DuckDB SQL functions, not base R):
pd_with_meta <- pd |>
  mutate(
    # Keep the part before first underscore: e.g., "R25P01"
    rp = regexp_replace(sample_id, '_.*$', ''),
    # run_id = chunk starting with R up to P
    run_id = regexp_extract(rp, 'R[^P]+'),
    # plate_id = chunk starting with P
    plate_id = regexp_extract(rp, 'P.+')
  )

get_counts(pd_with_meta) |>
  distinct(sample_id, run_id, plate_id) |>
  arrange(sample_id) |>
  collect()
#> # A tibble: 10 × 3
#>    sample_id        run_id plate_id
#>    <chr>            <chr>  <chr>
#>  1 R25P01_01_002616 R25    P01
#>  2 R25P01_01_002617 R25    P01
#>  3 R25P02_01_002618 R25    P02
#>  4 R25P02_02_002619 R25    P02
#>  5 R26P01_03_002620 R26    P01
#>  6 R26P01_04_002621 R26    P01
#>  7 R26P02_05_002622 R26    P02
#>  8 R27P01_01_002623 R27    P01
#>  9 R27P02_01_002624 R27    P02
#> 10 R27P02_02_002625 R27    P02

Now you have per-sample run_id and plate_id
derived from the filename, alongside the original measurements.
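Once run_id and plate_id are ordinary columns, standard grouping applies. The sketch below counts samples per run and plate on a plain local data frame in base R. The sample IDs are the ten from this vignette; in practice you would start from the collected sample-level table instead of retyping them.

```r
# The ten sample IDs used in this vignette, as a plain character vector
sample_id <- c(
  "R25P01_01_002616", "R25P01_01_002617", "R25P02_01_002618",
  "R25P02_02_002619", "R26P01_03_002620", "R26P01_04_002621",
  "R26P02_05_002622", "R27P01_01_002623", "R27P02_01_002624",
  "R27P02_02_002625"
)

# Local equivalent of the run/plate extraction (base R regex)
rp <- sub("_.*$", "", sample_id)
meta <- data.frame(
  sample_id = sample_id,
  run_id    = sub("P.*$", "", rp),
  plate_id  = sub("^[^P]*", "", rp),
  stringsAsFactors = FALSE
)

# Count samples for each run/plate combination that occurs in the data
counts <- aggregate(
  list(n_samples = meta$sample_id),
  by  = list(run_id = meta$run_id, plate_id = meta$plate_id),
  FUN = length
)
counts <- counts[order(counts$run_id, counts$plate_id), ]
counts
```

A table like this is a quick way to spot plates with unexpectedly few or many samples before any downstream analysis.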
To recap: keep one CSV per sample in a single directory and use
convert_standard(..., sample_id_from_filenames = TRUE) to
ingest them all at once. Use string operations on sample_id to pull out
run/plate metadata as needed. DuckDB keeps the workflow fast even with
many files and millions of rows.
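Because the run/plate extraction depends on the filename pattern, it can be worth checking a real directory for non-conforming names before ingesting it. The following base R sketch assumes that run and plate labels are numeric, as in the examples here. The file list is made up for illustration, and the last two names are deliberately invalid.

```r
# Illustrative file list; in practice use list.files(<your directory>)
files <- c(
  "R25P01_01_002616.csv",
  "R27P02_02_002625.csv",
  "notes.csv",
  "P01_R25_x.csv"
)

# Assumed pattern: R<digits>P<digits>_<rest>.csv
ok <- grepl("^R[0-9]+P[0-9]+_.+\\.csv$", files)

# Names that would not yield a usable run_id / plate_id
files[!ok]
```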