---
title: "Process inputs and generate harmonized outputs"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Process inputs and generate harmonized outputs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

In the vignette *Simple example of data processing with Rmonize*, we ran through 
an example with no processing errors. In practice, you may encounter processing 
errors and need to troubleshoot the process or the inputs. This vignette covers 
two tasks: running the main processing function, `harmo_process()`, which uses 
the prepared input elements to produce harmonized datasets, and identifying and 
correcting errors in the Data Processing Elements (DPE).

## Load packages

```{r eval=FALSE}
# Load relevant packages
library(Rmonize)
library(tidyverse) # Collection of R packages for data science
```

> **TIP:** You must install and load any packages whose functions are used in individual algorithms in the Data Processing Elements (see below). Otherwise, `harmo_process()` will generate an error when processing data.
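
One way to check this before processing is sketched below. The vector 
`required_pkgs` is a hypothetical list; replace it with the packages that your 
own DPE algorithms call.

```{r eval=FALSE}
# Hypothetical list: replace with the packages used in your DPE algorithms
required_pkgs <- c("Rmonize", "tidyverse")

# requireNamespace() returns FALSE (instead of stopping) if a package is missing
is_installed <- vapply(
  required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Install before processing: ", paste(missing_pkgs, collapse = ", "))
}
```

Note that this only checks installation; the packages still need to be loaded 
with `library()` before running `harmo_process()`.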

## Read input elements

The input elements needed for processing are the input datasets, DataSchema, and 
DPEs, which are available from the examples included in the package. This example
uses cleaned input datasets and a DataSchema that has no errors. We will start 
with a version of the DPE that has the correct structure and required columns (see 
[online documentation](https://maelstrom-research.github.io/Rmonize-documentation/glossary/index.html)), 
but has some issues that will create errors in processing.

```{r eval=FALSE}
# Get the input datasets
dataset_study1 <- Rmonize_examples$input_dataset_study1
dataset_study2 <- Rmonize_examples$input_dataset_study2
dataset_study3 <- Rmonize_examples$input_dataset_study3
dataset_study4 <- Rmonize_examples$input_dataset_study4
dataset_study5 <- Rmonize_examples$input_dataset_study5

# Get the DataSchema
dataschema <- Rmonize_examples$DataSchema

# Get the Data Processing Elements
dpe_with_errors <- Rmonize_examples$`Data_Processing_Element_with errors`
# This version contains some examples of potential processing errors.
```

In the examples, the DPE contains processing instructions for all five datasets, 
but you can prepare separate DPE documents for each input dataset as well.

## Process data

If more than one input dataset is being processed at the same time, they are 
provided as a named list of data frames (which we refer to as a 'dossier') with 
names matching the `input_dataset` values specified in the DPE. The 
`dataschema_variable` values in the DPE must also match the variables in the 
DataSchema provided. The output of `harmo_process()` is a 'harmonized dossier': 
a list of harmonized dataset(s) with the same name(s) as the input dataset(s) 
and with specific attributes/metadata.

If there are basic issues with any of the input elements (e.g., elements have 
incorrect structure or are missing required columns), the process will stop and 
print a message about the issue. If there are errors in running individual 
algorithms, as in this example, the overall process will run, but a message 
will be printed about 'error' statuses being present.

```{r eval=FALSE}
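# Create an input dossier
# Names are set explicitly so they match the input_dataset values in the DPE
input_dossier <- dossier_create(list(
  dataset_study1 = dataset_study1,
  dataset_study2 = dataset_study2,
  dataset_study3 = dataset_study3,
  dataset_study4 = dataset_study4,
  dataset_study5 = dataset_study5))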

# Run processing function
harmonized_dossier_with_errors <- harmo_process(
  object = input_dossier,
  dataschema = dataschema,
  data_proc_elem = dpe_with_errors,
  harmonized_col_dataset = 'adm_study_id') # Identifies the harmonized variable 
                                           # to use as dataset identifiers
```

```{r,fig.cap="Subset of processing information printed in the console, including messages about errors in running individual algorithms.", out.width="80%", fig.align="center",echo=FALSE}
knitr::include_graphics("images/vig4_fig01.png")
```


> **TIP:** The function `harmo_process()` can be run on one study at a time 
(datasets and DPEs) or subsets of the variables (DataSchema and DPEs) to help 
isolate issues.
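
A minimal sketch of isolating a single study is shown below. It assumes the 
example objects in `Rmonize_examples` used in this vignette and that 
`dossier_create()` accepts a named list. The object names `dpe_study4`, 
`dossier_study4`, and `harmonized_study4` are illustrative only.

```{r eval=FALSE}
library(Rmonize)
library(dplyr)

# Keep only the processing rules written for one input dataset
dpe_study4 <- 
  Rmonize_examples$`Data_Processing_Element_with errors` %>%
  filter(input_dataset == "dataset_study4")

# Build a dossier that contains only the matching input dataset
dossier_study4 <- dossier_create(list(
  dataset_study4 = Rmonize_examples$input_dataset_study4))

# Process the single study; messages now relate to this study only
harmonized_study4 <- harmo_process(
  object = dossier_study4,
  dataschema = Rmonize_examples$DataSchema,
  data_proc_elem = dpe_study4,
  harmonized_col_dataset = 'adm_study_id')
```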

> **TIP:** The dataset names in the input dossier list must match the 
`input_dataset` values in the DPE.
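
The matching requirements can be checked before processing with a small helper 
such as the one sketched below. `find_dpe_mismatches()` is a hypothetical 
function, not part of Rmonize, and the toy DPE contains made-up values. It only 
relies on the DPE columns `input_dataset` and `dataschema_variable`.

```{r eval=FALSE}
# Hypothetical helper (not part of Rmonize): lists DPE values that have no
# match in the other input elements
find_dpe_mismatches <- function(dpe, dossier_names, dataschema_names) {
  list(
    datasets_missing_from_dossier =
      setdiff(unique(dpe$input_dataset), dossier_names),
    variables_missing_from_dataschema =
      setdiff(unique(dpe$dataschema_variable), dataschema_names)
  )
}

# Toy DPE with one misspelled dataset name, for illustration only
toy_dpe <- data.frame(
  dataschema_variable = c("adm_study_id", "pm_birthweight"),
  input_dataset = c("dataset_study1", "dataset_study_4")
)

mismatches <- find_dpe_mismatches(
  dpe = toy_dpe,
  dossier_names = c("dataset_study1", "dataset_study4"),
  dataschema_names = c("adm_study_id", "pm_birthweight")
)

mismatches$datasets_missing_from_dossier
#> [1] "dataset_study_4"
```

With the objects in this vignette, the call would look like 
`find_dpe_mismatches(dpe_with_errors, names(input_dossier), dataschema$Variables$name)`, 
assuming the DataSchema stores its variable names in `Variables$name`.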

## Check for and correct errors in the DPE

Typically, any processing errors at this stage come from errors in the DPEs 
(e.g., incorrectly written algorithms or misspecified input variables). If there 
are any processing errors, the associated harmonized dataset created is empty 
(no harmonized variable values) but has attributes from the input elements, which 
can be used to identify errors. Individual errors are printed in the console 
during processing and can also be extracted with `show_harmo_error()`.

```{r eval=FALSE}
# To identify processing errors to correct in the DPE
show_harmo_error(
  harmonized_dossier_with_errors,
  show_warnings = TRUE) # Can be informative, but can also be turned off, e.g., 
                        # if there are known warnings produced by processing algorithms
```

This prints the specific DataSchema variables and input datasets affected by the 
errors and information about the specific error.

```{r,fig.cap="Subset of output from show_harmo_error() printed in the console.", out.width="80%", fig.align="center",echo=FALSE}
knitr::include_graphics("images/vig4_fig02.png")
```

In this example, there are errors in generating the DataSchema variables 
`preg_gestational_age_del` and `pm_birthweight` in `dataset_study4`. These 
variables have the specific errors `could not find function "flor"` and 
`object 'bbb_weight' not found`, respectively. In this case, inspecting the error 
messages and the associated rows in the DPE indicates that both errors come from 
typos. The function `flor` used in the algorithm for `preg_gestational_age_del` 
in `dataset_study4` should instead be `floor`, and the input variable 
`bbb_weight` used for `pm_birthweight` in `dataset_study4` should instead be 
`bb_weight`.
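
Both messages are standard R errors, so they can be reproduced outside of 
Rmonize. The sketch below uses a toy data frame with made-up values and does not 
show how `harmo_process()` evaluates algorithms internally. It only illustrates 
what each message means. The messages shown are those of an English locale.

```{r eval=FALSE}
# Toy data standing in for an input dataset (made-up values)
toy_dataset <- data.frame(bb_weight = c(3150.4, 2980.9))

# Misspelled function name: 'flor' instead of 'floor'
msg_function <- tryCatch(
  flor(38.6),
  error = function(e) conditionMessage(e))
msg_function
#> [1] "could not find function \"flor\""

# Misspelled variable name: 'bbb_weight' instead of 'bb_weight'
msg_variable <- tryCatch(
  with(toy_dataset, bbb_weight),
  error = function(e) conditionMessage(e))
msg_variable
#> [1] "object 'bbb_weight' not found"

# The corrected versions run without error
floor(38.6)
#> [1] 38
with(toy_dataset, bb_weight)
#> [1] 3150.4 2980.9
```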

You can use these error messages to make necessary corrections in the DPE Excel 
file and import the updated version of the DPE without errors (this approach is 
usually simpler and clearer for documenting versions of the DPE than making 
changes to the DPE in R).

```{r,fig.cap="Example of locating the errors in the DPE document.", out.width="80%", fig.align="center",echo=FALSE}
knitr::include_graphics("images/vig4_fig03.png")
```

The processing is then rerun with the updated DPE and checked for processing 
errors as many times as needed.

```{r eval=FALSE}
# Get corrected DPEs with changes made based on error messages
dpe_no_errors <- 
  Rmonize_examples$`Data_Processing_Element_no errors` %>%
  as_data_proc_elem()

# Run processing function
harmonized_dossier <- harmo_process(
  object = input_dossier,
  dataschema = dataschema,
  data_proc_elem = dpe_no_errors,
  harmonized_col_dataset = 'adm_study_id' # Identifies the harmonized variable 
                                          # to use as dataset identifiers
)

# Confirm there are no errors
show_harmo_error(
  harmonized_dossier,
  show_warnings = TRUE
)
```

Typos are common sources of errors in the DPE, as are mis-specifications in 
algorithms. Filling out the DPE can take some time and attention to detail, but 
it has benefits: it comprehensively documents the processing done, provides a 
clear way to communicate with others about the processing, and makes it easier 
to identify issues and revise the processing. See the 
[DPE documentation](https://maelstrom-research.github.io/Rmonize-documentation/dpe/index.html) 
for more details about the DPE.

Once there are no processing errors, the harmonized dossier can be used to extract 
harmonization outputs (see the vignette *Summarize and secure harmonized outputs*).

## Save the harmonized dossier

> **TIP:** The harmonized dossier should be saved as an RDS file (`.rds`) so that 
the structure and all metadata are preserved and it can be easily used in other 
Rmonize functions.

```{r eval=FALSE}
# Save the harmonized dossier as an RDS file
# WARNING: This script creates a folder 'tmp' in the working directory.
output_path <- file.path('tmp', basename(tempdir()))
dir.create(output_path, recursive = TRUE) # Also creates 'tmp' if it does not exist
saveRDS(harmonized_dossier, file.path(output_path, "harmonized_dossier.rds"))
```
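
The saved file can be read back with `readRDS()`. The sketch below uses a toy 
list with made-up content in place of a harmonized dossier, and a temporary 
file, to illustrate that attributes survive the round trip.

```{r eval=FALSE}
# Toy object standing in for a harmonized dossier (made-up content)
toy_dossier <- list(dataset_study1 = data.frame(adm_study_id = c(1, 1)))
attr(toy_dossier, "note") <- "metadata stored as an attribute"

# Save to a temporary file and read it back
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy_dossier, rds_path)
restored <- readRDS(rds_path)

# Structure and attributes are unchanged
identical(restored, toy_dossier)
#> [1] TRUE
```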