---
title: "Alpha Diversity Analysis in PhIP-seq"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    number_sections: false
vignette: >
  %\VignetteIndexEntry{Alpha Diversity Analysis in PhIP-seq}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  fig.width = 7,
  fig.height = 4.5,
  warning   = FALSE,
  message   = FALSE
)
```

## Overview

Alpha diversity quantifies the **within-sample** variety of reactive peptides.
In PhIP-seq, each sample's enrichment profile can be characterised by how many
unique peptide species are enriched (richness), how evenly that enrichment is
distributed (evenness), or how dominated it is by a single peptide
(dominance). These properties change across disease conditions, timepoints, and
patient subgroups, making alpha diversity a useful summary statistic for
exploratory analysis and hypothesis testing.

This vignette walks through the complete alpha diversity workflow in **phiper**:

1. Computing diversity metrics with `compute_alpha()`
2. Visualising distributions with `plot_alpha()` and
   `plot_alpha_interactive()`
3. Testing group differences with `compute_alpha_significance()`
4. Summarising statistical results with `plot_alpha_significance()`

---

## Setup

```{r load}
library(phiper)
```

Load the bundled example dataset. It contains two patient groups (`A`, `B`)
measured at two timepoints (`T1`, `T2`).

```{r data}
pd <- load_example_data()
pd
```

---

## Computing alpha diversity

### Basic usage: peptide-level diversity by group

The workhorse function is `compute_alpha()`. At minimum, supply your
`<phip_data>` object, the grouping column(s), and the rank(s) at which
diversity should be computed.

```{r compute-basic}
alpha_group <- compute_alpha(
  pd,
  group_cols = "group",
  ranks      = "peptide_id"
)
```

The result is a **named list** with S3 class `"phip_alpha_diversity"`. Each
element is one data frame — one per `group_col` plus, optionally, an
interaction element.

```{r result-structure}
class(alpha_group)
names(alpha_group)
head(alpha_group$group)
```

### What's in the output?

Each row is one `(sample, rank)` pair. The five diversity columns are:

| Column | Metric | Notes |
|---|---|---|
| `richness` | Number of enriched peptides | Integer; 0 for empty samples |
| `shannon_diversity` | Shannon entropy H | Base configurable; `NA` impossible |
| `simpson_diversity` | Simpson index (1 − Σp²) | Range [0, 1) |
| `pielou_evenness` | Pielou's J = H / ln(S) | `NA` when richness ≤ 1 |
| `berger_parker_dominance` | max(p) | `NA` when richness = 0 |

```{r output-cols}
names(alpha_group$group)
```

### Multiple grouping columns

Pass a character vector to `group_cols` to get one result per group variable.

```{r compute-multi}
alpha_both <- compute_alpha(
  pd,
  group_cols = c("group", "timepoint"),
  ranks      = "peptide_id"
)
names(alpha_both)
```

### Multiple ranks

Ranks are columns in the peptide library — for example `peptide_id` for
peptide-level diversity, or `family` / `genus` for taxonomic-level diversity.

```{r compute-ranks}
alpha_tax <- compute_alpha(
  pd,
  group_cols = "group",
  ranks      = c("peptide_id", "family", "genus")
)
# Each element now has rows for all three ranks
dplyr::count(alpha_tax$group, rank)
```

### Group interactions

Set `group_interaction = TRUE` to also compute diversity on the combined
`group × timepoint` label. Set `interaction_only = TRUE` to skip per-variable
tables and return only the interaction.

```{r compute-interaction}
alpha_inter <- compute_alpha(
  pd,
  group_cols        = c("group", "timepoint"),
  ranks             = "peptide_id",
  group_interaction = TRUE
)
names(alpha_inter)

# Returns only the interaction table
alpha_inter_only <- compute_alpha(
  pd,
  group_cols        = c("group", "timepoint"),
  ranks             = "peptide_id",
  group_interaction = TRUE,
  interaction_only  = TRUE
)
names(alpha_inter_only)
```

### Selecting a metric subset

If you only need richness and Shannon diversity, pass the `metrics` argument.
This is faster and keeps the output tidy.

```{r compute-subset}
alpha_light <- compute_alpha(
  pd,
  group_cols = "group",
  ranks      = "peptide_id",
  metrics    = c("richness", "shannon")
)
names(alpha_light$group)
```

### Shannon base

By default, Shannon diversity is computed in natural log (nats). Use
`shannon_base = "log2"` (bits) or `"log10"` (hartleys) for comparability with
other tools.

```{r shannon-base}
alpha_ln   <- compute_alpha(pd, group_cols = "group", ranks = "peptide_id",
                             metrics = "shannon", shannon_base = "ln")
alpha_log2 <- compute_alpha(pd, group_cols = "group", ranks = "peptide_id",
                             metrics = "shannon", shannon_base = "log2")

# log2 values are ln values divided by ln(2)
head(alpha_ln$group$shannon_diversity / log(2) - alpha_log2$group$shannon_diversity, 3)
```

### Presence modes

`mode` controls what `n` represents in every diversity formula:

| Mode | Rule | Required argument |
|---|---|---|
| `"binary"` (default) | `exist > 0` | — |
| `"threshold"` | `abundance_col > threshold` | `threshold` |
| `"abundance"` | raw `abundance_col` value | `abundance_col` |

#### `mode = "binary"` (default)

Binary mode uses the pre-computed `exist` flag. A peptide counts as present
whenever `exist > 0`.

```{r mode-binary}
alpha_bin <- compute_alpha(
  pd,
  group_cols = "group",
  ranks      = "peptide_id",
  mode       = "binary",
  metrics    = "richness"
)
summary(alpha_bin$group$richness)
```

#### `mode = "threshold"`

Threshold mode re-defines "present" using a continuous abundance column
(`fold_change` by default). Only peptides whose fold-change exceeds `threshold`
are counted. Raising the threshold sharpens the definition of a "hit" and
typically reduces richness.

```{r mode-threshold}
# Loose threshold: fold_change > 10
alpha_thr10 <- compute_alpha(
  pd,
  group_cols = "group",
  ranks      = "peptide_id",
  mode       = "threshold",
  threshold  = 10,
  metrics    = "richness"
)

# Strict threshold: fold_change > 100
alpha_thr100 <- compute_alpha(
  pd,
  group_cols = "group",
  ranks      = "peptide_id",
  mode       = "threshold",
  threshold  = 100,
  metrics    = "richness"
)

# Richness drops as the threshold increases
data.frame(
  mode      = c("binary", "threshold > 10", "threshold > 100"),
  mean_richness = c(
    mean(alpha_bin$group$richness),
    mean(alpha_thr10$group$richness),
    mean(alpha_thr100$group$richness)
  )
)
```

Supply `abundance_col` to use a different column than `fold_change`:

```{r mode-threshold-col, eval=FALSE}
alpha_thr_counts <- compute_alpha(
  pd,
  group_cols    = "group",
  ranks         = "peptide_id",
  mode          = "threshold",
  abundance_col = "counts_hits",
  threshold     = 50
)
```

#### `mode = "abundance"`

Abundance mode does not binarise. Instead, `n` for each peptide is its raw
value in `abundance_col`. Diversity metrics are computed from these continuous
weights, making richness equivalent to the count of peptides with non-zero
abundance and Shannon diversity a weighted entropy.

> **Tip.** Abundance mode operates on a local data frame rather than a
> DuckDB-backed table. Collect the relevant columns first with `dplyr::collect()`
> before passing them to `compute_alpha()`.

```{r mode-abundance}
# Collect the long table to a local data frame
dl <- dplyr::collect(pd$data_long)

alpha_abund <- compute_alpha(
  dl,
  group_cols    = "group",
  ranks         = "peptide_id",
  mode          = "abundance",
  abundance_col = "fold_change",
  metrics       = c("richness", "shannon", "simpson")
)
head(alpha_abund$group[, c("sample_id", "group", "richness",
                            "shannon_diversity", "simpson_diversity")])
```

Compare mean Shannon diversity under binary vs. abundance mode: the
abundance-weighted version is typically lower because a few dominant peptides
drive the distribution.

```{r mode-compare}
data.frame(
  mode          = c("binary", "abundance (fold_change)"),
  mean_shannon  = c(
    mean(alpha_bin$group$shannon_diversity),
    mean(alpha_abund$group$shannon_diversity)
  )
)
```

When aggregating to a higher rank (e.g. `family`), `abundance_agg` controls
how peptide-level values are reduced within each rank category:

```{r mode-abundance-agg, eval=FALSE}
# Sum fold-changes across all peptides per family (collected data frame)
alpha_abund_fam <- compute_alpha(
  dl,
  group_cols    = "group",
  ranks         = "family",
  mode          = "abundance",
  abundance_col = "fold_change",
  abundance_agg = "sum"   # or "mean" (default) or "max"
)
```

### Accessing result attributes

Metadata about the computation is stored as attributes.

```{r attributes}
attr(alpha_group, "ranks")
attr(alpha_group, "mode")
attr(alpha_group, "metrics")
attr(alpha_group, "n_samples")
```

---

## Visualising alpha diversity

### `plot_alpha()` — static boxplots

Pass the result of `compute_alpha()` to `plot_alpha()`. The minimum
required argument beyond `x` is the name of the metric to plot.

```{r plot-basic}
plot_alpha(
  alpha_group,
  metric    = "richness",
  group_col = "group"
)
```

### All five metrics

```{r plot-all-metrics, fig.height=3}
for (m in c("richness", "shannon_diversity", "simpson_diversity",
            "pielou_evenness", "berger_parker_dominance")) {
  print(plot_alpha(alpha_group, metric = m, group_col = "group"))
}
```

### Faceting across multiple ranks

When diversity was computed at multiple ranks, set `facet_by_rank = TRUE` (the
default) to get one panel per rank.

> **Note on this toy example.** The peptides shipped with `load_example_data()`
> are synthetic identifiers that do not map to any entry in the peptide library.
> As a result, all taxonomic rank columns (`family`, `genus`, …) are empty for
> every peptide, so richness at those ranks is 0 for all samples. In a real
> PhIP-seq experiment where peptides carry proper library annotations, each rank
> would show non-zero diversity reflecting the taxonomic breadth of each
> patient's reactivity.

```{r plot-facets}
plot_alpha(
  alpha_tax,
  metric        = "richness",
  group_col     = "group",
  facet_by_rank = TRUE,
  ncol          = 3
)
```

Disable faceting and restrict to a single rank with `filter_ranks`.

```{r plot-filter-rank}
plot_alpha(
  alpha_tax,
  metric        = "richness",
  group_col     = "group",
  filter_ranks  = "peptide_id",
  facet_by_rank = FALSE
)
```

### Filtering and ordering groups

Use `filter_groups` to keep a subset and `x_order` to control axis order.

```{r plot-filter-order}
plot_alpha(
  alpha_both$timepoint,
  metric        = "shannon_diversity",
  group_col     = "timepoint",
  filter_groups = c("T1", "T2"),
  x_order       = c("T1", "T2"),
  x_labels      = c(T1 = "Baseline", T2 = "Follow-up")
)
```

### Customising appearance

Simpson diversity values in this dataset are tightly clustered near 1 (range
0.993–0.997). Setting `y_range` to the actual data range rather than the
theoretical `c(0, 1)` reveals the within-group spread that would otherwise be
invisible.

```{r plot-custom}
plot_alpha(
  alpha_group,
  metric        = "simpson_diversity",
  group_col     = "group",
  custom_colors = c(A = "#E41A1C", B = "#377EB8"),
  y_range       = c(0.99, 1),
  x_tickangle   = 30,
  show_grids    = FALSE,
  point_size    = 2.5,
  point_alpha   = 0.6
)
```

### Plotting the interaction table

When `group_interaction = TRUE`, the interaction element stores the combined
label in a column called `phip_interaction`.

```{r plot-interaction}
plot_alpha(
  alpha_inter,
  metric           = "richness",
  group_col        = "phip_interaction",
  interaction_only = TRUE
)
```

### `plot_alpha_interactive()` — plotly

The interactive version mirrors the static API and returns a plotly widget,
making it suitable for HTML reports and Shiny apps.

```{r plot-interactive, eval=FALSE}
plot_alpha_interactive(
  alpha_group,
  metric    = "richness",
  group_col = "group"
)
```

All display parameters — `custom_colors`, `x_order`, `y_range`, `filter_groups`,
`facet_by_rank` — behave identically to the static version.

---

## Statistical testing

`compute_alpha_significance()` runs **global** and **pairwise** hypothesis tests
on every `(rank, metric)` combination in the alpha diversity result.

### Default: Kruskal-Wallis + Wilcoxon

```{r significance}
sig <- compute_alpha_significance(
  alpha_group,
  group_col = "group"
)
sig
```

The return value is a list of class `"phip_alpha_significance"` with two
tibbles.

#### Global test

One row per `(rank, metric)`:

```{r sig-global}
sig$global
```

#### Pairwise comparisons

One row per `(rank, metric, pair)` with raw and adjusted p-values, Cohen's d,
and significance stars:

```{r sig-pairwise}
sig$pairwise
```

### Choosing the test

```{r sig-anova-tukey}
sig_anova <- compute_alpha_significance(
  alpha_group,
  group_col     = "group",
  global_test   = "anova",
  pairwise_test = "tukey"
)
sig_anova$global
```

### p-value adjustment

Any method accepted by `p.adjust()` is supported:

```{r sig-padj}
sig_bonf <- compute_alpha_significance(
  alpha_group,
  group_col       = "group",
  p_adjust_method = "bonferroni"
)
sig_bonf$pairwise[, c("metric", "group1", "group2", "p_raw", "p_adj", "stars")]
```

### Restricting to a subset of metrics

```{r sig-subset}
sig_rich <- compute_alpha_significance(
  alpha_group,
  group_col = "group",
  metric    = c("richness", "shannon_diversity")
)
sig_rich$global
```

### Multi-rank significance

When alpha was computed at multiple ranks, significance is tested at every
rank automatically:

```{r sig-multirank}
sig_tax <- compute_alpha_significance(
  alpha_tax,
  group_col = "group",
  metric    = "richness"
)
dplyr::select(sig_tax$global, rank, metric, statistic, p_value)
```

---

## Visualising significance

`plot_alpha_significance()` offers two display modes.

### Table mode

Returns the filtered pairwise tibble — useful for including in reports or
further processing:

```{r sig-table}
plot_alpha_significance(
  sig,
  metric      = "richness",
  type        = "table",
  p_threshold = 0.05
)
```

Set `p_threshold = 1` to retrieve all pairs regardless of significance.

```{r sig-table-all}
plot_alpha_significance(
  sig,
  metric      = "richness",
  type        = "table",
  p_threshold = 1
)
```

### Heatmap mode

Produces a Cohen's d heatmap with significance stars overlaid. Non-significant
pairs (p_adj > threshold) are shown in grey.

```{r sig-heatmap}
plot_alpha_significance(
  sig,
  metric      = "richness",
  type        = "heatmap",
  p_threshold = 1     # show all pairs; colour by effect size
)
```

### Heatmap across metrics

Loop over metrics to produce one heatmap per metric:

```{r sig-heatmap-loop, eval=FALSE}
for (m in attr(sig, "metrics")) {
  print(
    plot_alpha_significance(sig, metric = m, type = "heatmap", p_threshold = 1)
  )
}
```

---

## Significance brackets on boxplots

Pass a `"phip_alpha_significance"` object to `plot_alpha()` via the
`significance` argument, and set `show_significance = TRUE` to overlay pairwise
brackets automatically. The `ggsignif` package must be installed.

```{r brackets, eval=requireNamespace("ggsignif", quietly = TRUE)}
plot_alpha(
  alpha_group,
  metric            = "richness",
  group_col         = "group",
  significance      = sig,
  show_significance = TRUE,
  sig_p_threshold   = 0.05
)
```

Tune the bracket geometry with `sig_step_increase` and `sig_tip_length`:

```{r brackets-tuned, eval=requireNamespace("ggsignif", quietly = TRUE)}
plot_alpha(
  alpha_group,
  metric            = "shannon_diversity",
  group_col         = "group",
  significance      = sig,
  show_significance = TRUE,
  sig_p_threshold   = 0.1,
  sig_step_increase = 0.08,
  sig_tip_length    = 0.005
)
```

---

## Putting it all together

A typical analysis runs in four steps:

```{r full-pipeline, eval=FALSE}
# 1. Compute diversity at peptide and family level
alpha <- compute_alpha(
  pd,
  group_cols        = c("group", "timepoint"),
  ranks             = c("peptide_id", "family"),
  group_interaction = TRUE
)

# 2. Visualise
plot_alpha(alpha, metric = "richness",    group_col = "group")
plot_alpha(alpha, metric = "pielou_evenness", group_col = "timepoint",
                     x_order = c("T1", "T2"),
                     x_labels = c(T1 = "Baseline", T2 = "Follow-up"))

# 3. Test
sig <- compute_alpha_significance(alpha, group_col = "group")

# 4. Report
sig$global
plot_alpha_significance(sig, type = "heatmap", p_threshold = 1)
plot_alpha(alpha, metric = "richness", group_col = "group",
                     significance = sig, show_significance = TRUE)
```

---

## Session info

```{r session-info}
sessionInfo()
```