--- title: "Parallel Testing" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Parallel Testing} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) set.seed(1) ``` ```{r setup} library(segtest) ``` # Introduction We provide a walk-through of how to run tests for segregation distortion at many loci in parallel. Details of the methods may be found in Gerard et al. (2025a) and Gerard et al. (2025b). # Analysis Data sets `ufit`, `ufit2`, and `ufit3` contain the genotyping output of `updog::multidog()` using three different models. - `ufit`: Uses the `norm` model without specifying who the parents are. - `ufit2`: Uses the `f1pp` model, specifying the parents. - `ufit3`: Uses the `f1` model, specifying the parents. You can convert this genotyping output to what `seg_multi()` expects using `multidog_to_g()`. If you did *not* use the `f1pp` or `f1` models, use ether the `all_gl` (to run tests using genotype log-likelihoods) or `all_g` (to run tests assuming genotypes are known) options, and you must specify the ID's of the parents. ```{r} o1 <- multidog_to_g(ufit, ploidy = 4, type = "all_g", p1 = "indigocrisp", p2 = "sweetcrisp") o2 <- multidog_to_g(ufit, ploidy = 4, type = "all_gl", p1 = "indigocrisp", p2 = "sweetcrisp") ``` If you *did* use the `f1pp` or `f1` models, use either the `off_gl` (to run tests using genotype log-likelihoods) or `off_g` (to run tests assuming genotypes are known) options. ```{r} o3 <- multidog_to_g(ufit2, ploidy = 4, type = "off_g") o4 <- multidog_to_g(ufit2, ploidy = 4, type = "off_gl") o5 <- multidog_to_g(ufit3, ploidy = 4, type = "off_g") o6 <- multidog_to_g(ufit3, ploidy = 4, type = "off_gl") ``` We would recommend *always* using genotype log-likelihoods. But the option is there for known genotypes. Parallelization support is provided through the future package. You specify how to implement parallelization using `future::plan()`. Then you run `seg_multi()`. Then you shut down the parallelization with `future::plan()`. The most common plan is `future::multisession()`, where you specify the number of parallel processes with the `workers` argument. You can get the maximum number of workers via `future::availableCores()` ```{r} future::availableCores() ``` So a typically workload looks something like this: ```{r, echo = TRUE, eval = FALSE} future::plan(future::multisession(workers = 2)) mout1 <- seg_multi( g = o2$g, p1 = o2$p1, p2 = o2$p2, p1_ploidy = 4, p2_ploidy = 4, ret_out = TRUE) future::plan(future::sequential()) ``` ```{r, echo = FALSE, eval = TRUE} future::plan(future::sequential()) mout1 <- seg_multi( g = o2$g, p1 = o2$p1, p2 = o2$p2, p1_ploidy = 4, p2_ploidy = 4, ret_out = TRUE) ``` The output is a data frame. The most important parts of this are the `snp` and `p_value` columns. ```{r} mout1[c("snp", "p_value")] ``` It looks like only SNP `12_31029646` is in possible segregation distortion. Let's compare the expected frequencies (from the `q0` column) against the observed modes. ```{r} round(mout1$q0[mout1$snp == "12_31029646"][[1]], digits = 4) o1$g["12_31029646", ] / sum(o1$g["12_31029646", ]) ``` So SNP `12_31029646` likely got flagged because there are too many individuals with a genotype of 2, and not enough with a genotype of 0. If we would have run `outlier = FALSE`, then SNP `2_20070837` would also have been flagged for possible segregation distortion ```{r, echo = TRUE, eval = FALSE} future::plan(future::multisession(workers = 2)) mout2 <- seg_multi( g = o2$g, p1 = o2$p1, p2 = o2$p2, p1_ploidy = 4, p2_ploidy = 4, outlier = FALSE) future::plan(future::sequential()) ``` ```{r, echo = FALSE, eval = TRUE} future::plan(future::sequential()) mout2 <- seg_multi( g = o2$g, p1 = o2$p1, p2 = o2$p2, p1_ploidy = 4, p2_ploidy = 4, outlier = FALSE) ``` ```{r} mout2[, c("snp", "p_value")] ``` Let's look at the posterior mode genotypes of SNP `2_20070837`: ```{r} o1$g["2_20070837", ] o1$p1["2_20070837"] o1$p2["2_20070837"] ``` So SNP `2_20070837` likely got flagged because of two individuals that are very likely a genotype of 3, which is impossible if the parents have genotypes 0 and 2. Including `outlier = TRUE` previously made it so that we would ignore this discrepancy (up to a point). You can get the posterior probability that an individual is an outlier if you set `ret_out = TRUE` (which I previously did). ```{r, fig.alt="Scatterplot of SNP index (x-axis) by outlier probability (y-axis)."} graphics::plot( mout1$outprob[mout1$snp == "2_20070837"][[1]], ylim = c(0, 1), ylab = "Outlier Probability") ``` # `multi_lrt()` The older function `multi_lrt()`, only provides support for tetraploids. You can still use it instead of `seg_multi()` for tetraploids. ```{r, echo = TRUE, eval = FALSE} future::plan(future::multisession(workers = 2)) mout3 <- multi_lrt( g = o2$g, p1 = o2$p1, p2 = o2$p2 ) future::plan(future::sequential()) ``` ```{r, echo = FALSE, eval = TRUE} future::plan(future::sequential()) mout3 <- multi_lrt( g = o2$g, p1 = o2$p1, p2 = o2$p2 ) ``` It's a little faster, but it cannot account for outliers, so it will mostly be the same as setting `outlier = FALSE`. ```{r, fig.alt="Scatterplot of p-values from seg_lrt() (x-axis) by multi_lrt() (y-axis)."} plot( mout2$p_value, mout3$p_value, xlab = "seg_multi(outlier = FALSE)", ylab = "multi_lrt()") abline(0, 1, lty = 2, col = 2) ``` The one discrepancy above is caused by the new way we calculate the degrees of freedom: ```{r} i <- which.max(abs(mout2$p_value - mout3$p_value)) mout2$df[[i]] mout3$df[[i]] ``` The two approaches should give almost equivalent results most of the time, but the new way is a little better. ## References Gerard D, Thakkar M, & Ferrão LFV (2025a). "Tests for segregation distortion in tetraploid F1 populations." _Theoretical and Applied Genetics_, *138*(30), p. 1--13. [doi:10.1007/s00122-025-04816-z](https://doi.org/10.1007/s00122-025-04816-z). Gerard, D, Ambrosano, GB, Pereira, GdS, & Garcia, AAF (2025b). "Tests for segregation distortion in higher ploidy F1 populations." _bioRxiv_, p. 1--20. [bioRxiv:2025.06.23.661114](https://doi.org/10.1101/2025.06.23.661114)