| Title: | PhIP-Seq Data Import and Validation Tools |
|---|---|
| Description: | Provides utilities to import, validate, and manage PhIP-Seq datasets, including standardized conversion pipelines, data checks, and access to cached peptide metadata. |
| Authors: | Mateusz Kolek [aut, cre, cph] (ORCID: <https://orcid.org/0000-0001-6470-4830>), Alon Alexander [ctb, cph] |
| Maintainer: | Mateusz Kolek <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.5.0 |
| Built: | 2026-06-04 07:51:24 UTC |
| Source: | https://github.com/Polymerase3/phiperio |
data_long
Appends/overwrites a column (default: "exist") filled with 1L on
the lazy data_long table. Preserves laziness; no collection is forced.
add_exist(phip_data, exist_col = "exist", overwrite = FALSE)

| Argument | Description |
|---|---|
| phip_data | A <phip_data> object. |
| exist_col | Name of the existence column to append/overwrite. |
| overwrite | If FALSE and the column exists, abort with a phiperio-style error. |
Modified <phip_data> with updated data_long.
pd <- load_example_data()
pd <- add_exist(pd, overwrite = TRUE) # overwrites if present
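The overwrite rule described for add_exist() can be illustrated on a plain data frame. This is only a minimal sketch: add_flag_column() is a hypothetical helper written for this illustration, it is not part of phiperio, and it works eagerly in memory rather than on a lazy table.

```r
# Hypothetical helper (not part of phiperio): mimics the documented
# overwrite rule of add_exist() on an in-memory data frame.
add_flag_column <- function(df, exist_col = "exist", overwrite = FALSE) {
  stopifnot(is.data.frame(df), is.character(exist_col), length(exist_col) == 1L)
  # Refuse to touch an existing column unless the caller opts in
  if (exist_col %in% names(df) && !isTRUE(overwrite)) {
    stop(
      sprintf("Column '%s' already exists; set overwrite = TRUE to replace it.", exist_col),
      call. = FALSE
    )
  }
  # Every row present in the table is flagged with integer 1
  df[[exist_col]] <- rep(1L, nrow(df))
  df
}

obs <- data.frame(
  sample_id = c("s1", "s1"),
  peptide_id = c("p1", "p2"),
  stringsAsFactors = FALSE
)
obs <- add_flag_column(obs)

# A second call without overwrite = TRUE fails, as the column now exists
second_try <- try(add_flag_column(obs), silent = TRUE)
```

Calling add_flag_column(obs, overwrite = TRUE) replaces the column instead of raising an error.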
Closes any open database connections held by a phip_data
object. This includes the main data_long backend connection and any
peptide-library connection stored in attributes or metadata. The method is
idempotent and safe to call multiple times.
## S3 method for class 'phip_data'
close(con, ...)

| Argument | Description |
|---|---|
| con | A valid |
| ... | Unused (for S3 generic compatibility). |
The input phip_data object, invisibly.
pd <- load_example_data()
close(pd)
convert_legacy() ingests the original three-file PhIP-Seq input
(binary exist matrix, samples metadata, optional timepoints map).
Paths can be supplied directly or via a single YAML config; explicit
arguments always override the YAML. The function normalises the chosen
DuckDB storage, validates every file, and returns a ready-to-use
phip_data object.
convert_legacy(
  exist_file = NULL,
  fold_change_file = NULL,
  samples_file = NULL,
  input_file = NULL,
  hit_file = NULL,
  timepoints_file = NULL,
  extra_cols = NULL,
  output_dir = NULL,
  peptide_library = TRUE,
  n_cores = 8,
  materialise_table = TRUE,
  config_yaml = NULL
)

| Argument | Description |
|---|---|
| exist_file | Path to the exist CSV (peptide x sample binary matrix). Required unless given in |
| fold_change_file | Path to the fold_change CSV (peptide x sample numeric matrix). Required unless given in |
| samples_file | Path to the samples CSV (sample metadata). Required unless given in |
| input_file, hit_file | Paths to the raw_counts CSV (peptide x sample integer matrix). Required unless given in |
| timepoints_file | Path to the timepoints CSV (subject <-> sample mapping). Optional for cross-sectional data. |
| extra_cols | Character vector of extra metadata columns to retain. |
| output_dir | Deprecated. Ignored with a warning. |
| peptide_library | logical, defining if the |
| n_cores | Integer >= 1. Number of CPU threads DuckDB may use while reading and writing files. |
| materialise_table | Logical. If |
| config_yaml | Optional YAML file containing any of the above parameters (see example). |
Input files are validated in two stages:
Fast-fail checks (paths, extensions, and required arguments) run during path resolution.
Data validation (required columns, uniqueness, value ranges, etc.) is
centralized in validate_phip_data().
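The first stage (fast-fail checks on paths and extensions) might look roughly like the sketch below. check_input_path() is a hypothetical helper written for illustration only, and the set of allowed extensions is an assumption; the package's internal checks may differ.

```r
# Hypothetical fast-fail check (not the package's internal code):
# verifies that a required path is supplied, exists, and has an
# allowed extension, then returns it in absolute form.
check_input_path <- function(path, arg, allowed_ext = "csv") {
  if (is.null(path)) {
    stop(sprintf("`%s` is required.", arg), call. = FALSE)
  }
  if (!file.exists(path)) {
    stop(sprintf("`%s` does not exist: %s", arg, path), call. = FALSE)
  }
  ext <- tolower(tools::file_ext(path))
  if (!ext %in% allowed_ext) {
    stop(sprintf("`%s` must have one of these extensions: %s",
                 arg, paste(allowed_ext, collapse = ", ")), call. = FALSE)
  }
  normalizePath(path)
}

# Usage on a throwaway CSV
tmp <- tempfile(fileext = ".csv")
writeLines("peptide_id,s1", tmp)
resolved <- check_input_path(tmp, "exist_file")

# A missing file is rejected before any data is read
missing_try <- try(check_input_path(tempfile(fileext = ".csv"), "exist_file"),
                   silent = TRUE)
unlink(tmp)
```

Failing at this stage keeps the more expensive content checks from running on inputs that cannot be read at all.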
A validated phip_data object whose data_long slot is backed by a
DuckDB connection.
## 1. Direct-path usage (package example files)
ext <- system.file("extdata", package = "phiperio")
pd <- convert_legacy(
  exist_file = file.path(ext, "exist.csv"),
  samples_file = file.path(ext, "samples_meta.csv"),
  timepoints_file = file.path(ext, "samples2ind_timepoints.csv"),
  peptide_library = FALSE
)

## 2. YAML-driven usage (explicit args override YAML)
pd <- convert_legacy(
  config_yaml = file.path(ext, "config.yaml"),
  peptide_library = FALSE
)
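The precedence rule stated for convert_legacy() (explicit arguments always override the YAML) can be sketched with base R lists. resolve_args() is a hypothetical helper, and the config list merely stands in for a parsed YAML file; the package's actual merging logic is not shown in this manual.

```r
# Hypothetical precedence sketch (not the package's internal code):
# values from a parsed config are used unless the caller supplied
# a non-NULL explicit argument with the same name.
resolve_args <- function(config, explicit) {
  # NULL means "not supplied"; drop those so they cannot erase config values
  explicit <- Filter(Negate(is.null), explicit)
  utils::modifyList(config, explicit)
}

# Stand-in for the contents of a YAML config file
config <- list(
  exist_file = "exist.csv",
  samples_file = "samples_meta.csv",
  n_cores = 8
)

# Arguments as passed by the caller; exist_file was left at its NULL default
explicit <- list(exist_file = NULL, n_cores = 2, peptide_library = FALSE)

resolved <- resolve_args(config, explicit)
```

Here resolved keeps exist_file and samples_file from the config, while n_cores and peptide_library take the explicitly supplied values.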
phip_data object
convert_standard() ingests a "long" table of PhIP-Seq read
counts / enrichment statistics, optionally expands it to the full
sample_id x peptide_id grid, and registers the result in DuckDB.
The function returns a fully initialised phip_data object that can be
queried with the tidy API used throughout the package.
convert_standard(
  data_long_path,
  sample_id = NULL,
  peptide_id = NULL,
  subject_id = NULL,
  timepoint = NULL,
  exist = NULL,
  fold_change = NULL,
  counts_input = NULL,
  counts_hit = NULL,
  sample_id_from_filenames = FALSE,
  n_cores = 8,
  materialise_table = TRUE,
  auto_expand = FALSE,
  peptide_library = TRUE
)

| Argument | Description |
|---|---|
| data_long_path | Character scalar. File or directory containing the long-format PhIP-Seq data. Allowed extensions are |
| sample_id, peptide_id, subject_id, timepoint, exist, fold_change, counts_input, counts_hit | Optional character strings. Supply these only if your column names differ from the defaults ( |
| sample_id_from_filenames | Logical. If |
| n_cores | Integer >= 1. Number of CPU threads DuckDB may use while reading and writing files. |
| materialise_table | Logical. If |
| auto_expand | Logical. If |
| peptide_library | Logical. If |
Paths are resolved to absolute form before any work begins, and explicit checks confirm existence as well as extension validity.
An S3 object of class phip_data containing:
data_long: The (possibly expanded) long-format table.
peptide_library: Loaded peptide-library metadata (if peptide_library = TRUE).
meta: List with DuckDB connection handles.
create_data() for the object constructor.
dplyr::tbl() to query DuckDB tables lazily.
# Basic import, auto-detecting default column names
phip_obj <- convert_standard(
  data_long_path = get_example_path("phip_mixture"),
  n_cores = 4,
  materialise_table = TRUE
)

# Import a CSV and rename columns
tmp_csv <- tempfile(fileext = ".csv")
utils::write.csv(
  data.frame(
    sample = c("s1", "s1"),
    pep = c("p1", "p2"),
    exist = c(1, 0),
    stringsAsFactors = FALSE
  ),
  tmp_csv,
  row.names = FALSE
)
phip_mem <- convert_standard(
  data_long_path = tmp_csv,
  sample_id = "sample",
  peptide_id = "pep",
  peptide_library = FALSE,
  materialise_table = FALSE
)
Creates a fully-validated S3 object that bundles the tidy
PhIP-Seq counts (data_long), a peptide-library annotation table, and
other metadata. The data itself is validated via validate_phip_data().
create_data(
  data_long,
  peptide_library = TRUE,
  auto_expand = TRUE,
  materialise_table = TRUE,
  meta = list()
)

| Argument | Description |
|---|---|
| data_long | A tidy data frame (or |
| peptide_library | A data frame with one row per |
| auto_expand | Logical. If |
| materialise_table | Logical. If |
| meta | Optional named list of metadata flags to pre-populate the |
An object of class "phip_data".
## minimal constructor call
tidy_counts <- data.frame(
  sample_id = c("s1", "s1"),
  peptide_id = c("p1", "p2"),
  exist = c(1, 0),
  stringsAsFactors = FALSE
)
pd <- create_data(
  data_long = tidy_counts,
  peptide_library = FALSE,
  materialise_table = FALSE
)
sample_id * peptide_id grid
Create the full Cartesian product of samples and peptides and
join back per-sample metadata. For rows introduced by the expansion,
numeric/integer columns are filled with 0 and logical columns with
FALSE, unless overridden via fill_override.
expand_data(
  x,
  key_col = "sample_id",
  id_col = "peptide_id",
  fill_override = NULL,
  add_exist = FALSE,
  exist_col = "exist",
  validate = TRUE,
  ...
)

| Argument | Description |
|---|---|
| x | A |
| key_col | Name(s) of the sample identifier column(s). Character scalar or vector, e.g. |
| id_col | Name of the peptide identifier column. Default |
| fill_override | Optional named list of fill values for introduced rows, e.g. |
| add_exist | If |
| exist_col | Name for the existence flag. If this column already exists, it will be overwritten. |
| validate | Logical; if |
| ... | Reserved for future extensions; currently unused. |
Updates x$data_long in place (preserving laziness unless you later
compute() / collect()).
The updated <phip_data> object.
pd <- load_example_data()
pd <- expand_data(pd, fill_override = list(fold_change = NA_real_))
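The grid expansion and default fill rule described for expand_data() can be approximated in base R on an in-memory data frame. This is a simplified sketch under stated assumptions: expand_grid_sketch() is a hypothetical helper, it handles a single key column, it works eagerly rather than lazily, and it omits the join of per-sample metadata and the fill_override argument.

```r
# Hypothetical sketch (not the package's implementation): build the full
# sample x peptide grid and fill only the rows introduced by the expansion.
expand_grid_sketch <- function(df, key_col = "sample_id", id_col = "peptide_id") {
  grid <- expand.grid(
    unique(df[[key_col]]),
    unique(df[[id_col]]),
    KEEP.OUT.ATTRS = FALSE,
    stringsAsFactors = FALSE
  )
  names(grid) <- c(key_col, id_col)

  # Marker column distinguishes observed rows from introduced ones
  df$.observed <- TRUE
  out <- merge(grid, df, by = c(key_col, id_col), all.x = TRUE)
  introduced <- is.na(out$.observed)
  out$.observed <- NULL

  # Default fills: FALSE for logical, 0 for integer/numeric columns
  for (col in setdiff(names(out), c(key_col, id_col))) {
    if (is.logical(out[[col]])) {
      out[[col]][introduced] <- FALSE
    } else if (is.integer(out[[col]])) {
      out[[col]][introduced] <- 0L
    } else if (is.numeric(out[[col]])) {
      out[[col]][introduced] <- 0
    }
  }
  out
}

sparse <- data.frame(
  sample_id = c("s1", "s1", "s2"),
  peptide_id = c("p1", "p2", "p1"),
  exist = c(1L, 0L, 1L),
  fold_change = c(2.5, 0, 1.2),
  stringsAsFactors = FALSE
)
full <- expand_grid_sketch(sparse)
```

In this toy input the pair (s2, p2) is absent, so the expanded table gains exactly one row, filled with zeros.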
Exports the data_long table from a phip_data object to disk in Apache
Parquet format.
export_parquet(x, path)

| Argument | Description |
|---|---|
| x | A <phip_data> object or a data frame. |
| path | File path (character) to save the output |
NULL (invisibly).
The export is performed directly and efficiently from the database/lazy table without reading all data into memory.
pd <- load_example_data()
out_path <- tempfile(fileext = ".parquet")
export_parquet(pd, out_path)
unlink(out_path)
Quick accessor for the data_long slot of a phip_data
object.
get_counts(x)

| Argument | Description |
|---|---|
| x | A valid |
A tibble or lazy table with one row per peptide * sample pair.
pd <- load_example_data()
tbl <- get_counts(pd)
Returns the path to an example PhIP-Seq dataset shipped with phiperio.
get_example_path(name = c("phip_mixture"))

| Argument | Description |
|---|---|
| name | Character scalar. Name of the example dataset. Currently supported: |
A character scalar with an absolute path to the file.
sim_path <- get_example_path("phip_mixture")
# phip_obj <- convert_standard(sim_path)
Accesses the meta slot, which holds flags such as whether the
table is a full peptide * sample grid, the available outcome columns, etc.
get_meta(x)

| Argument | Description |
|---|---|
| x | A valid |
A named list.
pd <- load_example_data()
meta <- get_meta(pd)
This function uses the phiperio logging utilities for
consistent, ASCII-only progress messages and timing. Long-running steps are
bracketed with .ph_with_timing(), and informational/warning/error
messages are emitted via .ph_log_info(), .ph_log_ok(), .ph_warn(),
and .ph_abort().
Downloads the RDS once, sanitizes types (logical, character, numeric), and writes into a DuckDB cache on disk.
Subsequent calls return a lazy tbl_dbi without loading into R memory.
get_peptide_library(force_refresh = FALSE)

| Argument | Description |
|---|---|
| force_refresh | Logical. If |
Caching: A persistent DuckDB database is created under the user cache
directory (via tools::R_user_dir("phiperio", "cache")). You can override
this location with options(phiperio.cache_dir = "..."). The
force_refresh argument bypasses the fast path and rebuilds the cache.
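The location lookup described above can be sketched with base R. resolve_cache_dir() is a hypothetical helper written for illustration; it only combines the documented option name with tools::R_user_dir() (available in R >= 4.0.0) and does not create any directory or database.

```r
# Hypothetical sketch (not the package's internal code): prefer the
# user-set option, otherwise fall back to the per-user cache directory.
resolve_cache_dir <- function() {
  default_dir <- tools::R_user_dir("phiperio", which = "cache")
  getOption("phiperio.cache_dir", default = default_dir)
}

# Without the option, the per-user default is returned
old <- options(phiperio.cache_dir = NULL)
default_location <- resolve_cache_dir()

# With the option set, the override wins
options(phiperio.cache_dir = tempdir())
custom_location <- resolve_cache_dir()

# Restore the previous option state
options(old)
```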
Sanitization: Columns are stripped of attributes, list-columns are
flattened, textual "NaN" and numeric NaN are coerced to NA. Binary 0/1
fields are converted to logical, "TRUE"/"FALSE" (case-insensitive) are
converted to logical, and numeric-looking character columns (beyond trivial
0/1) are converted to numeric. All other atomic types are preserved.
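The coercion rules listed above can be approximated for a single atomic column as follows. sanitize_column() is a hypothetical helper, the order of the checks is an assumption, and list-column flattening is left out; the package's actual sanitizer may behave differently on edge cases.

```r
# Hypothetical sketch (not the package's sanitizer): applies the
# documented coercions to one atomic column.
sanitize_column <- function(x) {
  x <- as.vector(x)  # drop attributes such as names or factor levels
  if (is.character(x)) {
    x[!is.na(x) & x == "NaN"] <- NA_character_  # textual "NaN" becomes NA
    keep <- !is.na(x)
    vals <- x[keep]
    if (length(vals) == 0L) return(x)
    # "TRUE"/"FALSE" in any letter case become logical
    if (all(toupper(vals) %in% c("TRUE", "FALSE"))) {
      out <- rep(NA, length(x))
      out[keep] <- toupper(vals) == "TRUE"
      return(out)
    }
    # Binary "0"/"1" fields become logical
    if (all(vals %in% c("0", "1"))) return(x == "1")
    # Other numeric-looking strings become numeric
    if (!anyNA(suppressWarnings(as.numeric(vals)))) return(as.numeric(x))
    return(x)
  }
  if (is.numeric(x)) {
    x[is.nan(x)] <- NA  # numeric NaN becomes NA
    vals <- x[!is.na(x)]
    if (length(vals) > 0L && all(vals %in% c(0, 1))) return(x == 1)
  }
  x
}

flags <- sanitize_column(c("true", "FALSE", NA))
scores <- sanitize_column(c("1.5", "NaN", "2"))
binary <- sanitize_column(c(0, 1, NaN))
labels <- sanitize_column(c("a", "NaN"))
```

Columns that match none of the rules, such as labels here, keep their original type with only "NaN" replaced by NA.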
Integrity check: If a SHA-256 checksum is provided, a warning is logged when the downloaded file’s checksum does not match the expected value.
A dplyr::tbl_dbi pointing to the peptide_meta table. The returned
object carries an attribute "duckdb_con" with the open DBI connection.
dplyr::tbl(), DBI::dbConnect(), duckdb::duckdb()
lib <- get_peptide_library()
Convenience helper to quickly load a shipped example dataset
(by default "phip_mixture") into a <phip_data> object, suitable for downstream
analysis and visualization. This function wraps
convert_standard, automatically supplying the correct
parameters for the included example data.
load_example_data(name = c("phip_mixture", "small_mixture"))

| Argument | Description |
|---|---|
| name | Character scalar. Name of the shipped example dataset. Currently supported: |
A <phip_data> object created from the chosen example dataset.
# Load the example data shipped with the package:
ex <- load_example_data()
# ex is now a <phip_data> object ready for analysis

# Specify the dataset name explicitly
ex2 <- load_example_data("small_mixture")
# Use with downstream analysis/plotting functions as needed
phip_data object
Merge or join a phip_data object
## S3 method for class 'phip_data'
merge(x, y, ...)

| Argument | Description |
|---|---|
| x | A |
| y | A data-frame-like object or another |
| ... | Arguments forwarded to either |
A new phip_data whose data_long contains the merged / joined
tibble.
pd <- load_example_data()
merged <- merge(pd, pd, by = c("sample_id", "peptide_id"))
phip_data
dplyr joins for phip_data
## S3 method for class 'phip_data'
left_join(x, y, ...)

## S3 method for class 'phip_data'
right_join(x, y, ...)

## S3 method for class 'phip_data'
inner_join(x, y, ...)

## S3 method for class 'phip_data'
full_join(x, y, ...)

## S3 method for class 'phip_data'
semi_join(x, y, ...)

## S3 method for class 'phip_data'
anti_join(x, y, ...)

| Argument | Description |
|---|---|
| x | A |
| y | A |
| ... | Passed to the corresponding |
A phip_data object with updated data_long.
pd <- load_example_data()
joined <- dplyr::left_join(pd, pd, by = c("sample_id", "peptide_id"))