--- title: "Beta Diversity Analysis in PhIP-seq" output: rmarkdown::html_vignette: toc: true toc_depth: 3 number_sections: false vignette: > %\VignetteIndexEntry{Beta Diversity Analysis in PhIP-seq} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.5, warning = FALSE, message = FALSE ) ``` ## Overview Beta diversity quantifies **between-sample** dissimilarity of antibody reactivity profiles. In PhIP-seq, each sample is represented by a vector of peptide presence/absence calls or continuous enrichment scores. Comparing these vectors reveals how similar or different individual immune repertoires are — and whether those differences are explained by group membership, disease status, or timepoint. This vignette walks through the complete beta diversity workflow in **phiper**: 1. Computing pairwise distances with `compute_distance()` 2. Unconstrained ordination (PCoA) with `compute_pcoa()`, `plot_pcoa()`, and `plot_scree()` 3. Feature–axis associations with `compute_pcoa_feature_associations()` 4. Constrained ordination (CAP / db-RDA) with `compute_capscale()` and `plot_cap()` 5. Permutation-based testing (PERMANOVA) with `compute_permanova()` 6. Homogeneity of dispersion with `compute_dispersion()` and `plot_dispersion()` 7. Non-linear embedding (t-SNE) with `compute_tsne()` and `plot_tsne()` --- ## Setup ```{r load} library(phiper) library(dplyr) ``` Load the bundled example dataset. It contains two patient groups (`A`, `B`) measured at two timepoints (`T1`, `T2`) across 1 000 simulated peptides. ```{r data} pd <- load_example_data() pd ``` > **Note.** The example data are entirely simulated and have no biological > meaning. They exist solely to demonstrate the API. Extract a one-row-per-sample metadata table — you will need it to annotate ordination plots later: ```{r meta} meta <- pd$data_long |> select(sample_id, subject_id, group, timepoint) |> distinct() |> collect() glimpse(meta) ``` --- ## Step 1 — Distance matrix `compute_distance()` builds a sample × feature abundance matrix from a `phip_data` object and computes pairwise distances between samples. The normalised abundance matrix is attached to the returned `dist` object as the `"abundances"` attribute — this is reused downstream by `compute_pcoa_feature_associations()` and `compute_capscale()`. ### Choosing a distance and normalisation The right combination depends on the type of data you want to compare: | Scenario | `value_col` | `method_normalization` | `distance` | |---|---|---|---| | Binary presence/absence | `"exist"` | `"auto"` (→ `"none"`) | `"jaccard"` | | Continuous enrichment | `"fold_change"` | `"hellinger"` | `"bray"` | | Raw counts | `"counts_hits"` | `"relative"` | `"bray"` | We use fold-change data with Hellinger normalisation and Bray-Curtis dissimilarity throughout this vignette. ```{r compute-distance} d <- compute_distance( pd, value_col = "fold_change", method_normalization = "hellinger", distance = "bray" ) class(d) # a standard dist object attr(d, "Size") # number of samples dim(attr(d, "abundances")) # rows = samples, cols = features ``` Parallelised computation via the optional `parallelDist` package is requested with `n_threads`: ```{r n-threads, eval=FALSE} d <- compute_distance( pd, value_col = "fold_change", method_normalization = "hellinger", distance = "bray", n_threads = 4L ) ``` --- ## Step 2 — Unconstrained ordination (PCoA) ### Computing PCoA `compute_pcoa()` wraps `stats::cmdscale()` and returns an S3 object of class `"beta_pcoa"` containing sample coordinates, eigenvalues, variance explained, and eigenvalue diagnostics. ```{r compute-pcoa} pcoa_res <- compute_pcoa(d, neg_correction = "none", n_axes = 5L) names(pcoa_res) # components of the result pcoa_res$var_explained # % variance per axis pcoa_res$eigen_diagnostics ``` The `eigen_diagnostics` tibble reports how many eigenvalues are negative and how large they are relative to the positive ones. Non-Euclidean distances (such as Bray-Curtis) routinely produce a small fraction of negative eigenvalues; a ratio below ~0.1 is generally acceptable. When negative eigenvalues are substantial, apply a Lingoes or Cailliez correction: ```{r neg-correction, eval=FALSE} pcoa_corr <- compute_pcoa(d, neg_correction = "lingoes", n_axes = 5L) pcoa_corr$correction_infos ``` ### Scree plot `plot_scree()` visualises how variance is distributed across axes. It helps decide how many axes are worth inspecting. ```{r scree-bar} plot_scree(pcoa_res, n_axes = 5L, type = "bar") ``` ```{r scree-line} plot_scree(pcoa_res, n_axes = 5L, type = "line") ``` ### Feature associations `compute_pcoa_feature_associations()` computes associations between individual features (peptides) and PCoA axes. It needs both the `dist` object (to access the `"abundances"` attribute) and the `pcoa_res` object. Call it **before** joining any metadata into `pcoa_res$sample_coords`, as the function expects only numeric coordinate columns. ```{r feature-assoc} feat_assoc <- compute_pcoa_feature_associations( dist_obj = d, pcoa_result = pcoa_res, top_features = 20L, association_method = "correlation" ) feat_assoc ``` Three association methods are available: | Method | Description | |---|---| | `"weighted_average"` | Abundance-weighted centroid of sample scores | | `"correlation"` | Pearson correlation between feature abundance and axis score | | `"regression"` | Regression slope of axis scores on feature abundance | ### Plotting PCoA `plot_pcoa()` requires group and/or time columns to already be present in `pcoa_res$sample_coords`. Join the metadata table before plotting: ```{r join-meta-pcoa} pcoa_res$sample_coords <- left_join( pcoa_res$sample_coords, meta, by = "sample_id" ) ``` #### Basic plot — colour by group ```{r plot-pcoa-group} plot_pcoa( pcoa_res, group_col = "group" ) ``` #### Adding a time factor When `time_col` is supplied, points are **shaped** by time in addition to being coloured by group. ```{r plot-pcoa-group-time} plot_pcoa( pcoa_res, group_col = "group", time_col = "timepoint" ) ``` #### Centroids and centroid trajectories `show_centroids = TRUE` (the default) overlays centroid points for each group × time combination. Set `connect_centroids = "group"` to draw paths that connect centroids along time within each group — useful for tracking longitudinal change. ```{r plot-pcoa-centroids} plot_pcoa( pcoa_res, group_col = "group", time_col = "timepoint", centroid_by = "group_time", connect_centroids = "group" ) ``` #### Confidence ellipses `ellipse_by` accepts a character vector, so multiple ellipse types can be overlaid simultaneously. Possible values are `"group"`, `"time"`, and `"group_time"`. ```{r plot-pcoa-ellipses} plot_pcoa( pcoa_res, group_col = "group", time_col = "timepoint", show_centroids = FALSE, ellipse_by = c("group", "group_time") ) ``` #### Plotting alternative axes ```{r plot-pcoa-axes23} plot_pcoa( pcoa_res, axes = c(2, 3), group_col = "group" ) ``` --- ## Step 3 — Constrained ordination (CAP / db-RDA) Constrained ordination asks: **how much of the distance structure is explained by known metadata variables?** `compute_capscale()` wraps `vegan::capscale()` (distance-based RDA) and runs per-term permutation tests via `vegan::anova.cca()`. ```{r compute-cap} cap_res <- compute_capscale( dist_obj = d, ps = pd, formula = ~ group + timepoint, permutations = 99L ) cap_res$variance_partition cat( "R\u00b2 =", round(cap_res$r2, 3), "| adj. R\u00b2 =", round(cap_res$r2_adj, 3), "\n" ) cap_res$perm_terms ``` The `variance_partition` tibble shows how much of the total inertia is constrained by the formula terms vs. left unconstrained. `perm_terms` gives the per-variable permutation test results. ### Plotting CAP As with PCoA, join metadata into `cap_res$sample_coords` before plotting: ```{r join-meta-cap} cap_res$sample_coords <- left_join( cap_res$sample_coords, meta, by = "sample_id" ) ``` ```{r plot-cap} plot_cap( cap_res, group_col = "group", time_col = "time", centroid_by = "group_time", connect_centroids = "group", ellipse_by = "group" ) ``` `plot_cap()` accepts the same `axes`, `centroid_by`, `connect_centroids`, and `ellipse_by` arguments as `plot_pcoa()`. --- ## Step 4 — PERMANOVA PERMANOVA (permutational ANOVA) tests whether group centroids differ significantly in multivariate distance space. `compute_permanova()` runs a global model and all pairwise post-hoc contrasts, with optional BH adjustment within each contrast scope. ```{r compute-permanova} perm_res <- compute_permanova( dist_obj = d, ps = pd, group_col = "group", time_col = "timepoint", subject_col = "subject_id", permutations = 99L, p_adjust = "BH" ) perm_res ``` The `scope` column distinguishes three levels of inference: | `scope` | Meaning | |---|---| | `"global"` | Full model (group, time, interaction) | | `"group_pairwise"` | All pairwise comparisons between group levels | | `"time_pairwise"` | All pairwise comparisons between time levels | `p_adjust` is applied **within** each scope separately, so global and pairwise p-values are adjusted independently. > **Repeated measures.** When `subject_col` is provided and subjects appear at > multiple timepoints, permutations for time contrasts are stratified by subject. > This is a simplified approximation; complex longitudinal designs may require > custom permutation schemes via `permute::how()`. --- ## Step 5 — Beta dispersion PERMANOVA assumes homogeneous within-group dispersion (equal spread around centroids). If groups differ in dispersion rather than in location, a significant PERMANOVA result can be misleading. `compute_dispersion()` tests this assumption using `vegan::betadisper()` and `vegan::permutest()`. ```{r compute-dispersion} disp_res <- compute_dispersion( dist_obj = d, ps = pd, group_col = "group", time_col = "timepoint", permutations = 99L, p_adjust = "BH" ) disp_res$tests # permutation test results per scope head(disp_res$distances) # per-sample distances to centroid ``` The returned object is of class `"beta_dispersion"` and contains two tibbles: | Component | Content | |---|---| | `$tests` | Permutation test p-values per scope and contrast | | `$distances` | Per-sample distance-to-centroid, scope, and contrast | ### Plotting dispersion `plot_dispersion()` overlays violins, boxplots, and jittered points. Select a `scope` and `contrast` that appear in `disp_res$distances`. #### Group dispersion ```{r plot-dispersion-group} plot_dispersion( disp_res, scope = "group", contrast = "" ) ``` #### Time dispersion ```{r plot-dispersion-time} plot_dispersion( disp_res, scope = "time", contrast = "" ) ``` Layer visibility is controlled individually: ```{r plot-dispersion-style} plot_dispersion( disp_res, scope = "group", contrast = "", show_violin = FALSE, show_box = TRUE, show_points = TRUE ) ``` --- ## Step 6 — t-SNE t-SNE is a non-linear embedding method that preserves local neighbourhood structure. It is useful for spotting clusters that linear methods miss. **Axes are not interpretable** and embeddings vary with the random seed and the `perplexity` parameter — always set a `seed`. ```{r compute-tsne} tsne_res <- compute_tsne( ps = pd, dist_obj = d, dims = 3L, # compute 3 dimensions for both 2D and 3D views perplexity = 15, meta_cols = c("group", "timepoint"), seed = 42L ) head(tsne_res) attr(tsne_res, "tsne_params") ``` ### 2D plot ```{r plot-tsne-2d-group} plot_tsne(tsne_res, view = "2d", colour = "group") ``` ```{r plot-tsne-2d-time} plot_tsne(tsne_res, view = "2d", colour = "timepoint") ``` Axis labels (`t-SNE 1`, `t-SNE 2`) carry no quantitative meaning — do not read them the way you would PCoA axes. ### 3D interactive plot `view = "3d"` returns a `plotly` widget. It is best explored interactively in an HTML report or the RStudio viewer. ```{r plot-tsne-3d, eval=FALSE} plot_tsne(tsne_res, view = "3d", colour = "group") ``` --- ## Putting it all together A complete beta diversity analysis from data object to inference: ```{r full-pipeline, eval=FALSE} library(phiper) library(dplyr) pd <- load_example_data() meta <- pd$data_long |> select(sample_id, subject_id, group, timepoint) |> distinct() |> collect() # 1. Distances d <- compute_distance(pd, value_col = "fold_change", method_normalization = "hellinger", distance = "bray") # 2. PCoA pcoa_res <- compute_pcoa(d, n_axes = 5L) plot_scree(pcoa_res) pcoa_res$sample_coords <- left_join(pcoa_res$sample_coords, meta, by = "sample_id") plot_pcoa(pcoa_res, group_col = "group", time_col = "time", centroid_by = "group_time", connect_centroids = "group") # 3. Feature associations compute_pcoa_feature_associations(d, pcoa_res, top_features = 30L) # 4. CAP cap_res <- compute_capscale(d, ps = pd, formula = ~ group + timepoint, permutations = 999L) cap_res$perm_terms cap_res$sample_coords <- left_join(cap_res$sample_coords, meta, by = "sample_id") plot_cap(cap_res, group_col = "group", time_col = "time") # 5. PERMANOVA compute_permanova(d, ps = pd, group_col = "group", time_col = "time", subject_col = "subject_id", permutations = 999L, p_adjust = "BH") # 6. Dispersion disp_res <- compute_dispersion(d, ps = pd, group_col = "group", time_col = "time", permutations = 999L, p_adjust = "BH") disp_res$tests plot_dispersion(disp_res, scope = "group", contrast = "") # 7. t-SNE tsne_res <- compute_tsne(pd, d, dims = 3L, perplexity = 15, seed = 42L, meta_cols = c("group", "timepoint")) plot_tsne(tsne_res, view = "2d", colour = "group") plot_tsne(tsne_res, view = "3d", colour = "group") ``` --- ## Session info ```{r session-info} sessionInfo() ```