--- title: "Prevalence of Presence (POP) Analysis in PhIP-seq" output: rmarkdown::html_vignette: toc: true toc_depth: 3 number_sections: false vignette: > %\VignetteIndexEntry{Prevalence of Presence (POP) Analysis in PhIP-seq} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, warning = FALSE, message = FALSE ) ``` ## Overview POP (Prevalence Of Presence) measures how often each feature is detected across the samples in a group. For each `(rank, feature, group pair)` the analysis asks: does the fraction of samples carrying this feature differ between the two groups? This vignette walks through the complete POP workflow in **phiper**: 1. Computing prevalence and p-values with `compute_pop()` 2. Visualising results as scatter plots with `scatter_static()` and `scatter_interactive()` 3. Volcano plots with `volcano_static()` and `volcano_interactive()` --- ## Setup ```{r load} library(phiper) ``` Load the bundled example dataset. It contains two patient groups (`A`, `B`) measured at two timepoints (`T1`, `T2`) across 1 000 simulated peptides. ```{r data} pd <- load_example_data() pd ``` > **Note.** The example data are entirely simulated and have no biological > meaning. They exist solely to demonstrate the API. --- ## Computing prevalence: `compute_pop()` ### Unpaired design — group comparison The workhorse function is `compute_pop()`. Supply a `phip_data` object, the rank(s) at which prevalence should be aggregated, and the binary grouping column(s). ```{r compute-basic} pop_group <- compute_pop( pd, rank_cols = "peptide_id", group_cols = "group" ) ``` The result is a plain `data.frame` with one row per `(rank, feature, group pair)`: ```{r result-head} head(pop_group) ``` ### What's in the output? | Column | Description | |---|---| | `rank` | Rank at which the feature is defined (e.g. `"peptide_id"`) | | `feature` | Feature identifier | | `group_col` | Name of the grouping column | | `group1`, `group2` | The two group labels being compared | | `n1`, `N1` | Positives and total samples in group 1 | | `prop1`, `percent1` | Prevalence in group 1 (proportion and percentage) | | `n2`, `N2` | Positives and total samples in group 2 | | `prop2`, `percent2` | Prevalence in group 2 | | `ratio` | `prop1 / prop2` (with small-sample epsilon correction) | | `delta_ratio` | Signed fold-change: `prop1/prop2 − 1` (or its negative) | | `p_raw` | Fisher's exact test p-value (unpaired) | | `n_peptides` | Number of peptides contributing to this feature | ```{r result-cols} names(pop_group) nrow(pop_group) ``` ### Multiple group columns Pass a character vector to `group_cols` to test several grouping variables in one call. ```{r compute-multi-group} pop_multi <- compute_pop( pd, rank_cols = "peptide_id", group_cols = c("group", "timepoint") ) # One block of rows per group column table(pop_multi$group_col) ``` ### Multiple ranks `rank_cols` accepts any column present in the peptide library. Pass `"peptide_id"` for peptide-level results, or taxonomic rank columns (e.g. `"family"`, `"genus"`) for higher-level aggregation. > **Note on this toy example.** The synthetic peptides do not map to real > library annotations, so all higher-rank columns are empty. In a real > PhIP-seq experiment, `family`- and `genus`-level rows reflect the > taxonomic breadth of each patient's reactivity. ```{r compute-multi-rank, eval=FALSE} pop_tax <- compute_pop( pd, rank_cols = c("peptide_id", "family", "genus"), group_cols = "group" ) table(pop_tax$rank) ``` ### k-of-n presence threshold By default a sample is called positive for a feature if at least one of its contributing peptides is present (`pop_k_min = 1`). Raise this threshold to require, say, at least two peptides before calling a sample positive — useful for filtering out singleton hits. ```{r compute-kmin} pop_k2 <- compute_pop( pd, rank_cols = "peptide_id", group_cols = "group", pop_k_min = 2L ) # Higher k_min → fewer positives mean(pop_k2$n1) < mean(pop_group$n1) ``` --- ## Paired design — timepoint comparison When the same subjects are measured at two timepoints, the paired design uses **McNemar's exact binomial test** instead of Fisher's exact test, which is more powerful because it accounts for within-subject correlation. Set `paired` to the name of the column that links samples from the same subject across timepoints. ```{r compute-paired} pop_paired <- compute_pop( pd, rank_cols = "peptide_id", group_cols = "timepoint", paired = "subject_id" ) head(pop_paired) ``` The paired output does not include `ratio` or `delta_ratio` — the test statistic is the McNemar discordant-pair ratio, which is reflected in `p_raw`. ```{r paired-cols} names(pop_paired) ``` --- ## Scatter plots Scatter plots compare the prevalence of every feature in group 1 (x-axis) against group 2 (y-axis). Points on the diagonal represent features with equal prevalence across groups; deviations indicate differential carriage. ### `scatter_static()` — ggplot2 The simplest call takes the `compute_pop()` result directly. Coloring is determined automatically from BH-corrected p-values: `"significant (BH)"`, `"nominal only"`, `"not significant"`. ```{r scatter-basic} scatter_static(pop_group) ``` Use `pair` to restrict the plot to a specific group contrast and `xlab`/`ylab` to label the axes. ```{r scatter-pair} scatter_static( pop_group, pair = c("A", "B"), xlab = "Group A (%)", ylab = "Group B (%)", alpha = 0.05 ) ``` Pass `rank` to display a single rank when the result contains multiple ranks. ```{r scatter-rank, eval=FALSE} scatter_static(pop_tax, rank = "family", pair = c("A", "B")) ``` Graphical parameters are passed via `...`: | Parameter | Default | Effect | |---|---|---| | `point_size` | 2 | Point diameter | | `point_alpha` | 0.85 | Point opacity | | `jitter_width_pp` | 0 | Horizontal jitter (percentage points) | | `jitter_height_pp` | 0 | Vertical jitter (percentage points) | | `font_size` | 12 | Base font size | ```{r scatter-custom} scatter_static( pop_group, pair = c("A", "B"), xlab = "Group A (%)", ylab = "Group B (%)", point_size = 1.5, point_alpha = 0.6, jitter_width_pp = 0.5, jitter_height_pp = 0.5 ) ``` > **`color_by`** — `scatter_static()` also accepts a `color_by` named vector > (e.g. `c("species" = "Staphylococcus aureus")`) to highlight points by > peptide-library metadata. This requires a real peptide library with matching > annotations and is not demonstrated here. ### `scatter_interactive()` — plotly The interactive version mirrors the static API and returns a plotly widget suitable for HTML reports and Shiny dashboards. Hovering over a point shows its feature identifier, raw counts, prevalence percentages, and p-values. ```{r scatter-interactive, eval=FALSE} scatter_interactive( pop_group, pair = c("A", "B"), xlab = "Group A (%)", ylab = "Group B (%)", alpha = 0.05 ) ``` An optional `background_df` argument lets you overlay a second set of points (e.g. all peptides from a different comparison) as a grey reference layer: ```{r scatter-interactive-bg, eval=FALSE} scatter_interactive( pop_group, pair = c("A", "B"), background_df = pop_multi[pop_multi$group_col == "timepoint", ], show_background = TRUE, background_name = "timepoint background" ) ``` --- ## Volcano plots Volcano plots display **log₂ ratio** (x-axis) against **−log₁₀(p)** (y-axis), making it easy to spot features with both large effect sizes and small p-values. ### `volcano_static()` — ggplot2 ```{r volcano-basic} volcano_static(pop_group) ``` Filter to a specific contrast and rank with `pair` and `rank`: ```{r volcano-pair} volcano_static( pop_group, pair = c("A", "B"), rank = "peptide_id" ) ``` #### Cutoffs `fc_cut` (absolute log₂ fold-change) and `p_cut` (p-value) control where the dashed reference lines are drawn and how significance categories are assigned. ```{r volcano-cutoffs} volcano_static( pop_group, pair = c("A", "B"), fc_cut = 1.5, p_cut = 0.01 ) ``` #### BH correction Set `p_mode = "bh"` to apply Benjamini–Hochberg correction per-plot. The y-axis then displays −log₁₀(p_BH). ```{r volcano-bh} volcano_static( pop_group, pair = c("A", "B"), p_mode = "bh", p_cut = 0.05 ) ``` > **`color_by`** — like the scatter plots, `volcano_static()` accepts a > `color_by` argument to highlight features by peptide-library metadata. > This requires real library annotations and is not demonstrated here. ### `volcano_interactive()` — plotly ```{r volcano-interactive, eval=FALSE} volcano_interactive( pop_group, pair = c("A", "B"), p_mode = "bh", p_cut = 0.05, fc_cut = 1 ) ``` Hovering over a point shows the feature identifier, rank, group labels, log₂ ratio, and −log₁₀(p). --- ## Putting it all together A typical POP analysis runs in three steps: ```{r full-pipeline, eval=FALSE} # 1. Compute prevalence pop <- compute_pop( pd, rank_cols = c("peptide_id", "family"), group_cols = "group" ) # Optional paired comparison across timepoints pop_paired <- compute_pop( pd, rank_cols = "peptide_id", group_cols = "timepoint", paired = "subject_id" ) # 2. Scatter: prevalence in group A vs group B scatter_static(pop, pair = c("A", "B"), xlab = "Group A (%)", ylab = "Group B (%)") scatter_interactive(pop, pair = c("A", "B"), xlab = "Group A (%)", ylab = "Group B (%)") # 3. Volcano: effect size and significance volcano_static(pop, pair = c("A", "B"), p_mode = "bh") volcano_interactive(pop, pair = c("A", "B"), p_mode = "bh") ``` --- ## Session info ```{r session-info} sessionInfo() ```