---
title: "Introduction to segtest Version 2.0"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to segtest Version 2.0}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
set.seed(1)
```

```{r setup}
library(segtest)
```

The `segtest` package offers a suite of tools for testing segregation distortion in F1 polyploid populations across diverse meiotic models. These methods support autopolyploids (full polysomic inheritance), allopolyploids (full disomic inheritance), and segmental allopolyploids (partial preferential pairing). Double reduction is optionally modeled fully in tetraploids and partially (at simplex loci only) in higher ploidies. A user-specified maximum proportion of outliers allows the method to accommodate moderate double reduction at non-simplex loci. Offspring genotypes may be known or modeled using genotype likelihoods to account for genotype uncertainty. Parent data may or may not be provided, at your option. Parents can have different (even) ploidies, at your option. Details of the methods may be found in Gerard et al. (2025a) and Gerard et al. (2025b).

Additional functions include those that generate gamete and genotype frequencies under different models of meiosis, functions that simulate genotype (log) likelihoods, and "competing" tests for segregation distortion.

The main functions are:

- `seg_multi()`: Run the likelihood ratio test for segregation distortion in parallel at many loci.
- `multidog_to_g`: Format the genotyping output from `updog::multidog()` to be compatible with the input of `seg_multi()`.
- `seg_lrt()`: Test for segregation distortion for any even ploidy.
- `gamfreq()`: Gamete frequencies.
- `gf_freq()`: Genotype frequencies of an F1 population of polyploids.
- `drbounds()`: Upper bounds on the double reduction rate(s) based on two different
   extreme models of meiosis.
- `simgl()`: Simulate genotype log-likelihoods given a vector of genotype counts.

# Gamete genotype frequencies

`gamfreq()` will generate gamete frequencies under different models of meiosis. `gf_freq()` will generate genotype frequencies under the same models by convolving the output from `gamfreq()`. We focus on `gamfreq()`, as `gf_freq()` uses the same models but applied separately to each parent.

For autopolyploids, specify `type = "polysomic"` and, optionally, the amount of double reduction via `alpha`. `alpha` is a vector of length `floor(ploidy / 4)` where element `i` is the probability a gamete has `i` pairs of identical by double reduction alleles. The upper bounds for `alpha` can be found via `drbounds()`. E.g., for a parental octoploid with genotype 4 with no and moderate levels of double reduction:
```{r}
drbounds(ploidy = 8) ## DR bounds
gamfreq(g = 4, ploidy = 8, type = "polysomic") ## no DR
gamfreq(g = 4, ploidy = 8, alpha = c(0.1, 0.01), type = "polysomic") ## Some DR
```

For allopolyploids, the possible gamete frequencies can be found in `seg`. E.g., for a parental octoploid with genotype 4, the possible gamete frequencies are
```{r}
seg[seg$ploidy == 8 & seg$g == 4 & seg$mode %in% c("disomic", "both"), "p"]
```
Note that you also need to filter for the `mode` to be either `"disomic"` or `"both"` (both disomic and polysomic). The total number of possible allopolyploid distributions is `n_pp_mix()`.
```{r}
n_pp_mix(g = 4, ploidy = 8)
```
You can specify one of these distributions via a 1-of-3 vector. E.g.
```{r}
gamfreq(g = 4, ploidy = 8, gamma = c(1, 0, 0), type = "mix")
gamfreq(g = 4, ploidy = 8, gamma = c(0, 1, 0), type = "mix")
gamfreq(g = 4, ploidy = 8, gamma = c(0, 0, 1), type = "mix")
```

Segmental allopolyploids are mixtures of the possible allopolyploid segregation distributions. E.g., an equal mixture of the three for an octoploid with genotype 4 is
```{r}
gamfreq(g = 4, ploidy = 8, gamma = c(1, 1, 1)/3, type = "mix") 
```

At simplex, loci, there is only one possible allopolyploid segregation distribution:
```{r}
n_pp_mix(g = 1, ploidy = 8)
gamfreq(g = 1, ploidy = 8, gamma = 1, type = "mix")
n_pp_mix(g = 7, ploidy = 8)
gamfreq(g = 7, ploidy = 8, gamma = 1, type = "mix")
```
You can account for double reduction at these loci by including `beta`. The upper bound of which can be found via `beta_bounds()`.
```{r}
beta_bounds(ploidy = 8)
gamfreq(g = 1, ploidy = 8, gamma = 1, beta = 0.03, type = "mix")
gamfreq(g = 7, ploidy = 8, gamma = 1, beta = 0.03, type = "mix")
```

# Simulating data

Let's suppose we have some genotype frequencies we want to simulate individual data from:
```{r, fig.alt="Bar plot of genotype frequencies."}
gf <- gf_freq(
  p1_g = 2, 
  p1_ploidy = 6,
  p1_gamma = c(0.7, 0.3), 
  p1_type = "mix",
  p2_g = 4,
  p2_ploidy = 6, 
  p2_gamma = c(0.5, 0.5), 
  p2_type = "mix")
plot(gf, type = "h", xlab = "Genotype", ylab = "Frequency")
```

To simulate genotype counts, just use `multinom()` from the `stats` package. Let's simulate data from 10 individuals.
```{r}
x <- c(rmultinom(n = 1, size = 10, prob = gf))
x
```
To simulate genotype (log) likelihoods, insert these genotype counts into `simgl()`.
```{r}
gl <- simgl(nvec = x)
gl
```

# Testing for segregation distortion

You can test for segregation distortion using `seg_lrt()`. E.g., let's test for it using the data (both known genotypes and genotype likelihoods) we simulated from the previous section:
```{r}
## With known genotypes
sout1 <- seg_lrt(x = x, p1_ploidy = 6, p2_ploidy = 6, p1 = 2, p2 = 4)
sout1$p_value
## With genotype likelihoods
sout2 <- seg_lrt(x = gl, p1_ploidy = 6, p2_ploidy = 6, p1 = 2, p2 = 4)
sout2$p_value
```

My recommendation is to always use the genotype log-likelihoods. But `seg_lrt()` allows for known genotypes, if that situation works best for you.

The default (`model = "seg"`) is to assume your organism is a segmental allopolyploid, and to account for possible double reduction at simplex loci. But you should absolutely use other models if you have more information on your organism:

- `"seg"`: General segmental allopolyploid with possible double reduction at simplex loci.
- `"allo_pp"`: Segmental allopolyploid with complete bivalent pairing (no double reduction).
- `"allo"`: Pure allopolyploid (disomic inheritance).
- `"auto"`: Pure autopolyploid with complete bivalent pairing (no double reduction).
- `"auto_dr"`: Pure autopolyploid with possible multivalent pairing (some double reduction).
- `"auto_allo"`: Same null hypothesis as in polymapR.

We allow for some non-valid genotypes via the `ob` argument. This is the upper bound on the proportion of outliers. By default, this is set to 0.03. You can set this to `0` (or set `outlier = FALSE`) if you want any outliers to indicate segregation distortion.

Make sure that the log-likelihoods are base $e$. If they are base 10, you'll get the wrong $p$-value:
```{r}
gl10 <- gl / log(10)
seg_lrt(x = gl10, p1_ploidy = 6, p2_ploidy = 6, p1 = 2, p2 = 4)$p_value
```

Don't mess with the technical arguments (`ntry`, `opt`, `optg`, `df_tol`). These have to do with the optimization and how to approximate the degrees of freedom of the test. Except possibly `ntry`. You could increase that if you are seeing weird results. But then let me know, because I haven't seen any bad behavior with `ntry = 3` (the default).

## References

Gerard D, Thakkar M, & Ferrão LFV (2025a). "Tests for segregation distortion in tetraploid F1 populations." _Theoretical and Applied Genetics_, *138*(30), p. 1--13. [doi:10.1007/s00122-025-04816-z](https://doi.org/10.1007/s00122-025-04816-z).

Gerard, D, Ambrosano, GB, Pereira, GdS, & Garcia, AAF (2025b). "Tests for segregation distortion in higher ploidy F1 populations." _bioRxiv_, p. 1--20. [bioRxiv:2025.06.23.661114](https://doi.org/10.1101/2025.06.23.661114)