---
title: "Importing multiple files with phiperio"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Importing multiple files with phiperio}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(phiperio)
library(dplyr)
```

# Overview

This vignette shows how to ingest **many CSV files at once** via
`convert_standard()`, deriving `sample_id` automatically from filenames
(`sample_id_from_filenames = TRUE`). We’ll:

1. Create a temporary directory with multiple tiny CSVs (names like
   `R25P01_01_002616.csv`).
2. Peek at one file to see the expected long format.
3. Import all files in one call.
4. Derive `run_id` and `plate_id` from the `sample_id` pattern.

> Pattern: `R<run>P<plate>_...`. We’ll split on the underscore and then
> peel off the `R..` and `P..` parts.

# 1. Create a bunch of example files

Each file represents one sample. Columns are long-format: `peptide_id`,
`exist`, `fold_change`.

```{r}
tmp_dir <- withr::local_tempdir()

file_names <- c(
  "R25P01_01_002616.csv",
  "R25P01_01_002617.csv",
  "R25P02_01_002618.csv",
  "R25P02_02_002619.csv",
  "R26P01_03_002620.csv",
  "R26P01_04_002621.csv",
  "R26P02_05_002622.csv",
  "R27P01_01_002623.csv",
  "R27P02_01_002624.csv",
  "R27P02_02_002625.csv"
)

# Simple helper to write a tiny two-row CSV
write_one <- function(path) {
  dat <- data.frame(
    peptide_id = c("p1", "p2"),
    exist = c(1, 0),
    fold_change = c(1.2, 0.9),
    stringsAsFactors = FALSE
  )
  utils::write.csv(dat, file.path(tmp_dir, path), row.names = FALSE)
}

invisible(lapply(file_names, write_one))
```

# 2. Inspect one file so you know what’s inside

```{r}
one_file <- file.path(tmp_dir, file_names[[1]])
read.csv(one_file, stringsAsFactors = FALSE)
```

You see two rows: one per `peptide_id` for this sample. All files have the same
column layout; only the filename differs (and will become `sample_id`).

# 3. Import all files in one call

We point `convert_standard()` at the directory and set
`sample_id_from_filenames = TRUE`. phiperio will:

- read all CSVs in the directory,
- derive `sample_id` from the filename stem (e.g., `R25P01_01_002616`),
- union the rows into one DuckDB-backed table.

```{r}
pd <- convert_standard(
  data_long_path = tmp_dir,
  sample_id_from_filenames = TRUE,
  peptide_library = FALSE, # set TRUE if you need peptide annotations
  materialise_table = FALSE, # view is fine for exploration
  auto_expand = FALSE
)
```

Check distinct sample IDs:

```{r}
get_counts(pd) |>
  distinct(sample_id) |>
  arrange(sample_id) |>
  collect()
```

# 4. Derive run_id and plate_id from sample_id

Our filenames have the shape `R<run>P<plate>_<rest>`. We can extract those
parts with a couple of string splits:

```{r}
pd_with_meta <- pd |>
  mutate(
    # Keep the part before first underscore: e.g., "R25P01"
    rp = regexp_replace(sample_id, '_.*$', ''),
    # run_id = chunk starting with R up to P
    run_id = regexp_extract(rp, 'R[^P]+'),
    # plate_id = chunk starting with P
    plate_id = regexp_extract(rp, 'P.+')
  )

get_counts(pd_with_meta) |>
  distinct(sample_id, run_id, plate_id) |>
  arrange(sample_id) |>
  collect()
```

Now you have per-sample `run_id` and `plate_id` derived from the filename,
alongside the original measurements.

# Summary

- Put your per-sample CSV/Parquet files in one directory.
- Call `convert_standard(..., sample_id_from_filenames = TRUE)` to ingest them
  all at once.
- Parse `sample_id` to pull out run/plate metadata as needed.

DuckDB keeps the workflow fast even with many files and millions of rows.
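
Before parsing `sample_id` inside the database, it can help to try the same idea on a few filenames in plain R. The chunk below is a minimal sketch under stated assumptions: `parse_sample_id()` is a hypothetical helper and not part of phiperio, and it uses base R `sub()` in place of the DuckDB functions `regexp_replace()` and `regexp_extract()` from step 4. It also assumes the stem is simply the filename without its directory and extension.

```{r}
# Hypothetical helper (not part of phiperio): split sample IDs shaped like
# "R25P01_01_002616" into run and plate parts using base R only.
parse_sample_id <- function(sample_id) {
  # Keep everything before the first underscore, e.g. "R25P01"
  rp <- sub("_.*$", "", sample_id)
  data.frame(
    sample_id = sample_id,
    # Leading "R" plus everything up to (not including) the first "P"
    run_id = sub("^(R[^P]+).*$", "\\1", rp),
    # From the first "P" to the end of the prefix
    plate_id = sub("^[^P]*(P.+)$", "\\1", rp),
    stringsAsFactors = FALSE
  )
}

# Assumed stem derivation: drop the directory and the file extension
example_files <- c(
  "R25P01_01_002616.csv",
  "R26P02_05_002622.csv",
  "R27P01_01_002623.csv"
)
stems <- tools::file_path_sans_ext(basename(example_files))

parsed <- parse_sample_id(stems)
parsed
```

Note that `sub()` returns its input unchanged when the pattern does not match, so a filename that deviates from the expected shape shows up here as an unparsed value rather than an error. That makes this a quick way to spot irregular names before importing a large directory.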