---
title: "dpcR package - an overview"
author: "Micha&#322; Burdukiewicz, Stefan R&ouml;diger"
date: "2016-12-31"
output: 
  rmarkdown::html_vignette:
    toc: true
bibliography: "dpcr.bib"
vignette: >
  %\VignetteIndexEntry{dpcR package - an overview}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r eval=TRUE,echo=FALSE}
library(knitr)
opts_chunk$set(fig.width=6, fig.height=6)

library(ggplot2)
library(xtable)
library(dpcR)

size_mod <- -2
cool_theme <- theme(plot.background=element_rect(fill = "transparent",
                                                 colour = "transparent"),
                    panel.grid.major = element_line(colour="lightgrey", linetype = "dashed"),
                    panel.background = element_rect(fill = "white", colour = "black"),
                    legend.background = element_rect(fill="NA"),
                    legend.position = "bottom",
                    axis.text = element_text(size=12 + size_mod),
                    axis.title.x = element_text(size=15 + size_mod, vjust = -0.1), 
                    axis.title.y = element_text(size=15 + size_mod, vjust = 1),
                    strip.text = element_text(size=15 + size_mod, face = "bold"),
                    strip.background = element_rect(fill = "#9ecae1", colour = "black"),
                    legend.text = element_text(size=12 + size_mod), 
                    legend.title = element_text(size=15 + size_mod),
                    plot.title = element_text(size=20 + size_mod))

load("vig.RData")
```

**dpcR** is an R package designed to perform analysis of digital PCR (dPCR) experiments. This vignette covers important features of the package and should be used as an addendum to the manual.

Our **dpcR** package employs the nomenclature of the MIQE guidelines for dPCR (@huggett_digital_2013, @huggett_considerations_2014). $\lambda$ is the mean number of molecules per partition. Total number of partitions is given by $n$. $k$ is the number of positive partitions.

$$
\lambda = - \ln \left(1 - \frac{k}{n} \right)
$$

Firstly, we describe `dpcr` class, a parent class for all classes designed to contain dPCR data. Further, we closer inspect `adpcr` which is a class responsible for array-based dPCR experiments (all experiments, where output data has precise localization in two dimensions). Often these types of dPCR experiments are called champer digital PCR (cdPCR).

Real-time quantitative qPCR (qPCR) experiments do not follow the fundamental assumption of dPCR reactions (the mean number of template molecule per partition smaller than 1). However, it is possible to 'convert' the qPCR into a dPCR. We present a tool to analyze results from *high-throughput* qPCR systems by using the dPCR analysis methodology as implemented in the **dpcR** package. The `qpcr2pp` function converts qPCR amplification curve data to a `qdpcr` object (see section about `qdpcr`). The calculation of the Cq values from the 
amplification curves is internally done via functions from the **qpcR** package by @ritz_qpcr_2008.


# `dpcr` object

The key class of the **dpcR** package is `dpcr`. It has the following slots:  
* *.Data* - matrix containing data from dPCR runs (see '*.Data* slot' subsection). It is further specified by the *type* slot.  
* *n* - number of partitions read in each run.  
* *exper* - name of the experiment.  
* *replicate* - name (or more conveniently ID) of a replicate.  
* *assay* - name of the assay.  
* *type* - name of the data (see '*type* slot' subsection)  

Although, this class is designed to contain results from all dPCR experiments, the user will interact mostly with its inheritors as `adpcr` or `dpcr`.

`dpcr` is a S4 object. Below is shown how to extract elements from the slots of a S4 object:

```{r eval=TRUE}
# Below we have S4 object
s4 <- sim_adpcr(m = 100, n = 496, times = 100, pos_sums = FALSE, n_panels = 3)
# Is it a dpcr object?
class(s4)
# Yes, it is. Let's see what we have in type slot
slot(s4, "type")
# We can use also shorter notation
s4@type
```

## `dpcr` objects management (`bind_dpcr`, `extract_run`)

All `dpcr` objects should be managed using special functions provided by this package: `bind_dpcr` and `extract_run`. The former binds `dpcr` objects, the latter extracts parts of the `dpcr` object. It is important to use this functions, because they preserve other attributes important for `dpcr` objects as number of partitions, names of experiments, assays and technical replicates.

```{r eval=TRUE}
# Create single adpcr object. The following code is also true for 
# other objects inhering from dpcr, as dpcr or qdpcr
single_run <- sim_adpcr(m = 100, n = 765, times = 100, pos_sums = FALSE, n_panels = 1)
two_runs <- bind_dpcr(single_run, single_run)
three_runs <- bind_dpcr(single_run, single_run, single_run)
# It is also possible to bind a list of dpcr objects... 
three_runs_list <- bind_dpcr(list(single_run, single_run, single_run))
# ... which may be useful in do.call statements
dpcr_list <- do.call(bind_dpcr, lapply(5L:10*10, function(n_template)
  sim_adpcr(m = n_template, n = 765, times = 100, pos_sums = FALSE, n_panels = 1)))
```

`bind_dpcr` may be seen as the analogue of the R function [cbind](https://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html). The main difference is the lack of recycling. If two objects with uneven number of data points are bound together, the shorter is completed with missing values (NA).

```{r eval=TRUE}
longer_run <- sim_adpcr(m = 10, n = 15, times = 100, pos_sums = FALSE, n_panels = 1)
shorter_run <- sim_adpcr(m = 10, n = 10, times = 100, pos_sums = FALSE, n_panels = 1)
shortest_run <- sim_adpcr(m = 10, n = 5, times = 100, pos_sums = FALSE, n_panels = 1)
# Expect informative message after binding
res <- bind_dpcr(longer_run, shorter_run, shortest_run)
# Print the whole data
slot(res, ".Data")
```

`extract_run` is an equivalent of [Extract](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Extract.html). It extracts one or more runs from the `dpcr` objects preserving other properties (as an appropriate replicate ID and so on).

```{r eval=TRUE}
five_runs <- sim_adpcr(m = 2, n = 10, times = 100, pos_sums = FALSE, n_panels = 5)
print(five_runs)

# Extract runs by the index
only_first_run <- extract_run(five_runs, 1)
only_first_and_second_run <- extract_run(five_runs, c(1, 2))
# See if proper replicated were extracted
slot(only_first_and_second_run, "replicate")
no_first_run <- extract_run(five_runs, -1)
slot(no_first_run, "replicate")
# Extract runs by the name
run_Experiment1.3 <- extract_run(five_runs, "Experiment1.3")
slot(run_Experiment1.3, "replicate")
run_Experiment1.3and5 <- extract_run(five_runs, c("Experiment1.3", "Experiment1.5"))
slot(run_Experiment1.3and5, "replicate")
```

Since the *.Data* slot inherits from the matrix class, it is possible to also use normal *Extract* operator ('['). For more information about this topic, see '*.Data* slot' subsection.



## *.Data* slot

Digital PCR data is always a matrix (see `?matrix` in R). Columns and rows represent respectively individual runs and their data points. Since data points are not always equivalent to partitions, the true number of partitions for each run is always defined in the slot *n*.  

```{r eval=TRUE}
# Create two array dPCR experiments. Mind the difference in the n parameter.
sample_adpcr <- bind_dpcr(sim_adpcr(m = 100, n = 765, times = 100, pos_sums = FALSE, n_panels = 1), 
                          rename_dpcr(sim_adpcr(m = 100, n = 763, times = 100, pos_sums = FALSE, 
                                           n_panels = 1), 
                                 exper = "Experiment2"))
```

In the code chunk above, we created two array dPCR experiments with respectively 765 and 763 partitions. Inspect the last five data points:

```{r eval=TRUE}
# It's possible to manipulate data points from dpcr object using all functions that work for matrices
tail(sample_adpcr)
```

Both experiments have 765 data points. We see that in case of a shorter experiment, two data points have the value NA. If we analyze the *n* slot:

```{r eval=TRUE}
slot(sample_adpcr, "n")
```

We see the expected number of partitions. It is especially important in case of fluorescence dPCR data, where one droplet may have assignments of few data points.

The important feature of *.Data* is inheritance from `matrix` class, which opens numerous possibilities for data manipulation. 
```{r eval=TRUE}
# Quickly count positive partitions
colSums(sample_adpcr > 0)
# Baseline fluorescence data
sim_dpcr(m = 3, n = 2, times = 5, fluo = list(0.1, 0)) - 0.05
```


## *type* slot

Data from dPCR experiments may have several types:  
* **ct** (**c**ycle **t**hreshold): cycle threshold of each partition.  
* **fluo**: fluorescence intensity of each partition.  
* **nm** (**n**umber of **m**olecules): number of molecules in each partition (usually such precise data come only from simulations).  
* **np** (**n**umber of **p**ositives): status (positive (1) or negative(0)) of each partition.  
* **tnp** (**t**otal **n**umber of **p**ositives): total number of positive partitions in the run (*.Data* in this case is matrix with single row and number of columns equal to the number of runs).  

In case of **fluo** and **tnp** types, the number of data points in *.Data* slot is hardly ever equal to the real number of partitions `dpcr`.

```{r eval=TRUE}
# Inspect all types of data

# Cq
# Load package with qPCR data
library(chipPCR)
qpcr2pp(data = C127EGHP[, 1L:6], type = "ct")

# fluo
sim_dpcr(m = 3, n = 2, times = 5, fluo = list(0.1, 0)) - 0.05

# nm
sim_adpcr(m = 235, n = 765, times = 100, pos_sums = FALSE, n_panels = 3)

# np
binarize(sim_adpcr(m = 235, n = 765, times = 100, pos_sums = FALSE, n_panels = 3))

# tnp
sim_adpcr(m = 235, n = 765, times = 100, pos_sums = TRUE, n_panels = 3)
```

# dpcR workflow

<embed src="workflow.svg" type="image/svg+xml" />

## Read data

The `read_dpcr` function is responsible for importing data from external file sources into R working space. The additional parameters allow specification of the format type and other details.

Additionally, we advise to use other packages belonging to the pcRuniversum such as [RDML](https://cran.r-project.org/package=RDML) or dedicated packages such as ReadqPCR ([available from bioconductor.org](https://www.bioconductor.org/packages/release/bioc/html/ReadqPCR.html)). Further information can be found in @pabinger_survey_2014.

```{r eval=TRUE}
# Generate some data from 15x16 array. Let's presume, that we have results from two plates
sample_runs <- matrix(rpois(480, lambda = 1.5), ncol = 2)
# Check its class - it's a typical R structure
class(sample_runs)
# Save it to adpcr object
adpcr_experiments <- create_dpcr(sample_runs, n = c(240L, 240L), type = "nm", adpcr = TRUE)
class(adpcr_experiments)
```


## Summary method

The `Summary` method produces a tabular summary of any `dpcr` object. It calculates &#955; and its confidence intervals using both Dube's and Bhat's method. The estimated number of template molecules (m) is computed using the following relationship:

$$
m = n \lambda
$$

```{r eval=TRUE}
summary(six_panels)

# Save summary data without printing it
summ <- summary(six_panels, print = FALSE)
# Print only the summary table
summ[["summary"]]

# Extract results for Dube's method
summ[["summary"]][summ[["summary"]][["method"]] == "dube", ]
```

## Show method

Since the `dpcr` objects tends to have many data points, which would quickly clutter the console, the `show` function prints only first 5 rows of *.Data*, number of omitted rows and the type of the object.

```{r eval=TRUE}

sample_dpcr <- sim_dpcr(m = 3, n = 10, times = 5)
# Standard show method...
show(sample_dpcr)
# ... which is an equivalent of:
sample_dpcr
# If you want to see all data points:
slot(sample_dpcr, ".Data")
```

# `adpcr` class

If the data output of an dPCR system has exact locations in two-dimensional space, it belongs to the `adpcr` class. It is the case for all dPCR experiments conducted on panels, arrays and so on. The `adpcr` object inherits from `dpcr` objects, but has special slots specifying the dimensions and their names.

The planar representation of `adpcr` objects is created by the `adpcr2panel` function.

```{r eval=TRUE}
adpcr2panel(six_panels)[["Experiment3.1"]][1L:6, 1L:6]
```

## Array visualisation

Data from dPCR arrays can be easily visualized using the `plot_panel` function.

```{r eval=TRUE}
# Remember, you can plot only single panel at once 
plot_panel(extract_run(adpcr_experiments, 1), main = "Experiment 1")
```

The same data can be visualized easily in binarized form (positive/negative partitions).
```{r eval=TRUE}
plot_panel(binarize(extract_run(adpcr_experiments, 1)), main = "Experiment 1")
```

The `plot_panel` function returns invisibly coordinates, that are compatible with **graphics** and **ggplot2** packages.
```{r eval=TRUE}
# Extract graphical coordinates
panel_data <- plot_panel(extract_run(adpcr_experiments, 1), plot = FALSE)
ggplot_coords <- cbind(panel_data[["ggplot_coords"]], value = as.vector(extract_run(adpcr_experiments, 1)))

# Plot panel using different graphics package
library(ggplot2)
ggplot(ggplot_coords[, -5], aes(x = x, y = y, fill = value)) +
  geom_tile()

```


## Array testing

The `test_panel` function is useful for testing the randomness of the spatial distribution of template molecules over the array. This function was implemented to test if technical flaws may corrupt the result of an dPCR experiment. This may occur during incorrect filling of the chamber, defects of the chamber or contaminations.

```{r eval=TRUE}
# The test_panel function performs a test for each experiment in apdr object.
test_panel(six_panels)
```

Further tests are available in the **spatstat** package, which is also utilizing S4 object system. The data exchange between the **dpcR** and the **spatstat** packages is streamlined by the `adpcr2ppp` function, which converts `adpcr` object to [ppp](https://www.rdocumentation.org/packages/spatstat/versions/1.16-1/topics/ppp.object?) class taking into account spatial coordinates of positive partitions.

# `qdpcr` class

The unique feature of **dpcR** package is conversion of qPCR data to dPCR-like experiments. The qPCR data should be in a format as used by the **qpcR** package, where columns represents particular experiments and one column contains cycle number. For pre-processing of raw amplification curve data we recommend the **chipPCR** package (@roediger2015chippcr).

```{r eval=TRUE}
# Load chiPCR package to access C317.amp data
library(chipPCR)

# Convert data to qdpcr object
qdat <- qpcr2pp(data = C317.amp, type = "np", Cq_range = c(10, 30))
```

`qdpcr` inherits from `dpcr` objects and may be analyzed using above mentioned methods. Moreover, the converted data may visualized using the `plot` method.

```{r eval=TRUE}
plot(qdat)
```

# Data import

The import functions accessible under `read_dpcr` cover common dPCR systems. To extend the scope of our software, we introduced a universal dPCR data exchange format REDF (raw exchange digital PCR format). 

## REDF

REDF (Raw Exchange Digital PCR Format) has a tabular structure. Each dPCR run (represented by 
a row) is described using following properties:

1. **experiment**: name of the experiment to which run belongs.
2. **replicate**: name or ID of the replicate of the 
experiment.
3. **assay**: name of the assay to which experiment belongs. Often 
name of the channel used to detect positive partitions.
4. **k**: number of positive partitions (integer).
5. **n**: total number of partitions (integer).
6. **v**: volume of the partition (nL).
7. **uv**: uncertainty of partition's volume (nL).
8. **threshold**: partitions with **k** equal or higher than threshold are treated as positve.
9. **panel_id**: id or name of the panel. This column should be included only in case of array-based digital PCR experiments.

# Comparison of the experiments

## GLM

Generalized Linear Models (GLM) are linear models for data, where the response variables may have non-normal distributions (as for example binomial distributed positive partitions in dPCR experiments). Using GLM we can describe relationships between results of dPCR:
  
  $$\log{Y} = \beta^T X$$
  
  where $Y$ are counts, $X$ are experiments names (categorical data) and $\beta$ are linear model coefficients for every experiment. Moreover, $\exp{\beta} = \lambda$.

Estimated means copies per partitions obtained from the model are compared each other using multiple t-test.

```{r eval=TRUE}
# Compare experiments using GLM

# 1. Perform test
comp <- test_counts(six_panels)

# 2. See summary of the test
summary(comp)

# 3. Plot results of the test 
plot(comp, aggregate = FALSE)

# 4. Aggregate runs to their groups
plot(comp, aggregate = TRUE)

# 5. Extract coefficients for the further usage
coef(comp)
```

The Poisson regression on binary data (positive/negative partition) can be used only when the concentration of template molecules in samples is small (positive partitions contain rarely more than 1 template particle). Higher concentrations requires binomial regression.

## Multiple tests

The dPCR experiments may also be compared pairwise using the uniformly most powerful (UMP) ratio test [@fay_2010]. Furthermore, computed p-values are adjusted using Benjamini & Hochberg correction [@benjamini_1995] to control family-wise error rate.

The UMP ratio test has following null-hypothesis:
	
$$
H_0: \frac{\lambda_1}{\lambda_2} = 1
$$

The generally advised Wilson's confidence intervals [@brown_2001] are computed independently for every dPCR experiment. The confidence intervals are adjusted using Dunn -- &#352;id&#225;k correction to ensure that they simultaneously contain true value of $lambda$:

$$
\alpha_{\text{adj}} = 1 - (1 - \alpha)^\frac{1}{T}
$$

For example, the 0.95 significance levels means, that probability of the all real values being in the range of its respective confidence intervals is 0.95.

```{r eval=TRUE}
#1. Perform multiple test comparison using data from the previous example
comp_ratio <- test_counts(six_panels, model = "ratio")

#2. See summary of the test
summary(comp_ratio)

#3. Plot results of the test 
plot(comp_ratio, aggregate = FALSE)

#4. Aggregate runs to their groups
plot(comp_ratio, aggregate = TRUE)

#5. Extract coefficients for the further usage
coef(comp)

# Compare results of two methods
par(mfrow=c(2,1))
plot(comp, aggregate = FALSE)
title("GLM")
plot(comp_ratio, aggregate = FALSE)
title("Ratio")
par(mfrow=c(1,1))


```


## Comparison of frameworks
Two approaches presented above were compared in a simulation approach over 150.000 simulated array dPCR experiments. Each simulation contained six reactions. Three of them had roughly the same amount of molecules per plate and other three had experiments with 10 to 50 molecules more.
Experiments were compared using GLM and MT frameworks.

```{r eval=TRUE,echo=FALSE}
ggplot(data=madpcr_comp, aes(x = value, fill = method)) +
  geom_density(alpha = 0.3) + 
  scale_fill_discrete("Confidence intervals:") + 
  scale_y_continuous("Density") + 
  scale_x_continuous("Fraction of wrongly assigned experiments") + 
  cool_theme
```

On average, 2.03 and 1.98 reactions were assessed to a wrong group by respectively GLM and MT.

A single GLM comparison took roughly 183 times longer than MT (on average 1.10 seconds versus 0.006 seconds on the Intel i7-2600 processor). The difference grows with the number of experiments and number of partitions (data not shown).


## Probability coverage of confidence intervals

  Average coverage probability is the proportion of the time that the interval contains the true value of $\lambda$.
  
  In the example below, we simulated 1\e{6} droplet dPCR experiments (2\e{4} droplets each) for each level of $\lambda$ (1.2\e{7} experiments total). We computed the average probability coverage of CI obtained by three methods: Dube's[@dube_mathematical_2008], Bhat's[@bhat_single_2009] and by MT ($\alpha = 0.95$).
  
  To assess simultaneous coverage probability, we randomly divided experiments into 2000 groups (500 experiments each) for each possible value of $\lambda$. We counted frequency of groups in which all confidence intervals contain the true value of $\lambda$.
  
  
```{r eval=TRUE,echo=FALSE}
ggplot(m_coverage2, aes(x = prop, y = value, fill = method)) +
  geom_bar(stat="identity", position = "dodge") +
  scale_y_continuous("Probability coverage") + 
  scale_x_discrete(expression(lambda)) +
  scale_fill_discrete("Confidence intervals:") + 
  geom_hline(yintercept = 0.95, colour = "black", size = 1, linetype = 5) +
  facet_wrap(~ coverage, nrow = 2) + 
  cool_theme
```

The dashed black line marks 0.95 border.

```{r eval=TRUE,echo=FALSE,results="asis"}
dat <- as.data.frame(aggregate(value ~ method + coverage, m_coverage2, mean))

colnames(dat) <- c("Method name", "Type of coverage", "Value")
print(xtable(dat), 
      include.rownames = FALSE, type = "html")
```
  
  
# References