MetabolomicsPipeline-vignette

Joel Parker

05/31/2024

Installation

You can install the MetabolomicsPipeline package from Bioconductor using the following code. Note that, at the time of writing, this package is not yet available on Bioconductor.

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("MetabolomicsPipeline")

Alternatively, you can install the current version, which is hosted on GitHub, using the following code.

if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}
devtools::install_github("datalifecycle-ua/MetabolomicsPipeline", build_vignettes = TRUE)

Once the MetabolomicsPipeline package is installed, we can load it into the environment. Note that we also load the “table1” package. It is not required by MetabolomicsPipeline, but it provides convenient functions for creating tables; specifically, it is useful for showing the number of samples in each of our experimental groups. We recommend installing it using install.packages("table1"). The “SummarizedExperiment” and “ggplot2” packages are required by the MetabolomicsPipeline package and will be installed automatically (if not already installed) when installing MetabolomicsPipeline using the commands above.

knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.width = 10,
    fig.height = 10,
    warning = FALSE
)

# Tables
library(table1)

# Load Metabolomics Pipeline
library(MetabolomicsPipeline)

library(SummarizedExperiment)

# Figures
library(ggplot2)

Introduction

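The purpose of the MetabolomicsPipeline package is to streamline the analysis of metabolomics experiments. In this vignette, we demonstrate how to use the MetabolomicsPipeline package for data loading, data processing, exploratory analysis, subpathway analysis, pairwise analysis, and boxplots and line plots.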

Data Description

In this vignette, we use a dataset of 86 samples (42 males, 44 females) from three treatment groups, collected at three different time points.

Metabolomics data as a SummarizedExperiment

Metabolomics experiments leverage multiple data tables for analysis. The datasets needed for the downstream analysis are:

1.) Sample metadata

2.) Chemical annotation

3.) Peak data (samples x metabolites).
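To make the relationship between the three tables concrete, here is a minimal base R sketch with toy data. The sample names, metabolite names, and values are made up for illustration, and the layout (one row per sample in the peak data, keyed by PARENT_SAMPLE_NAME, with one column per CHEM_ID) is an assumption based on the arguments used later in this vignette.

```r
# Toy sample metadata: one row per sample (hypothetical values)
sample_meta <- data.frame(
    PARENT_SAMPLE_NAME = c("S1", "S2", "S3"),
    GROUP_NAME = c("Control", "treat1", "treat2")
)

# Toy chemical annotation: one row per metabolite (hypothetical values)
chem_anno <- data.frame(
    CHEM_ID = c("35", "50"),
    CHEMICAL_NAME = c("met_a", "met_b"),
    SUB_PATHWAY = c("path_1", "path_1")
)

# Toy peak data: one row per sample, one column per CHEM_ID
peak <- data.frame(
    PARENT_SAMPLE_NAME = c("S1", "S2", "S3"),
    `35` = c(100, 120, NA),
    `50` = c(10, 8, 12),
    check.names = FALSE
)

# Every sample in the peak data should have metadata, and every
# metabolite column should have an annotation row
stopifnot(setequal(peak$PARENT_SAMPLE_NAME, sample_meta$PARENT_SAMPLE_NAME))
stopifnot(all(setdiff(names(peak), "PARENT_SAMPLE_NAME") %in% chem_anno$CHEM_ID))

# A SummarizedExperiment stores assays as features x samples, so the
# peak table is transposed to metabolites (rows) x samples (columns)
peak_mat <- t(as.matrix(peak[, chem_anno$CHEM_ID]))
colnames(peak_mat) <- peak$PARENT_SAMPLE_NAME
peak_mat
```

The two identifier columns are what tie the tables together: the sample identifier links the peak data to the sample metadata, and the chemical identifier links it to the chemical annotation.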

Data loading

The MetabolomicsPipeline package provides a convenient way to load each of these datasets together as a SummarizedExperiment using create_met_se(). In the chunk below, we:

  1. Load the demo sample metadata, chemical annotation, and peak data into a SummarizedExperiment.

  2. Create a table of the sample distribution.

################################################################################
### Load Data ##################################################################
################################################################################

# If the sample metadata, chemical annotation, and peak data are stored in an
# .xlsx file, you can use:

# dat <- load_met_excel(
#     path,
#     raw_sheet = "Peak Area Data",
#     chemical_sheet = "Chemical Annotation",
#     sample_meta = "Sample Meta Data",
#     normalized_peak = "Log Transformed Data",
#     sample_names = "PARENT_SAMPLE_NAME",
#     chemicalID = "CHEM_ID"
# )

# Get demo sample metadata
data("demoSampleMeta", package = "MetabolomicsPipeline")

# Get demo chemical annotation file
data("demoChemAnno", package = "MetabolomicsPipeline")

# Get demo peak data
data("demoPeak", package = "MetabolomicsPipeline")

dat <- create_met_se(chemical_annotation = demoChemAnno,
                     sample_metadata = demoSampleMeta,
                     peak_data = demoPeak,
                     chemical_id = "CHEM_ID",
                     sample_names = "PARENT_SAMPLE_NAME")


################################################################################
### Create Table 1 #############################################################
################################################################################
# Create table 1
tbl1 <- table1(~ GROUP_NAME + TIME1 | Gender,
  data = colData(dat)
)

# Display table 1
tbl1
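|            | Female (N=44) | Male (N=42) | Overall (N=86) |
|------------|---------------|-------------|----------------|
| GROUP_NAME |               |             |                |
| Control    | 14 (31.8%)    | 14 (33.3%)  | 28 (32.6%)     |
| treat1     | 15 (34.1%)    | 14 (33.3%)  | 29 (33.7%)     |
| treat2     | 15 (34.1%)    | 14 (33.3%)  | 29 (33.7%)     |
| TIME1      |               |             |                |
| End        | 14 (31.8%)    | 15 (35.7%)  | 29 (33.7%)     |
| Onset      | 15 (34.1%)    | 14 (33.3%)  | 29 (33.7%)     |
| PreSymp    | 15 (34.1%)    | 13 (31.0%)  | 28 (32.6%)     |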

Data Processing

The MetabolomicsPipeline package contains tools for processing the peak data to prepare it for downstream analysis:

1.) Median standardization (median_standardization())

2.) Minimum value imputation (min_val_impute())

3.) Log transformation (log_transformation())

# Median standardization
dat <- median_standardization(dat, assay = "peak")

# Min value imputation
dat <- min_val_impute(dat, assay = "median_std") 

# log transformation
dat <- log_transformation(dat, assay = "min_impute")
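To illustrate what these three steps do to the numbers, here is a minimal base R sketch on a toy matrix. It assumes the conventional definitions (divide each metabolite by its median, replace missing values with the metabolite's minimum observed value, then take the natural log); the package functions may differ in details such as the log base, and the toy values are made up.

```r
# Toy peak matrix: samples in rows, metabolites in columns (hypothetical values)
peak_mat <- matrix(
    c(100, 120, NA, 80,
      10, 8, 12, NA),
    nrow = 4,
    dimnames = list(paste0("S", 1:4), c("met_a", "met_b"))
)

# 1. Median standardization: divide each metabolite by its median
med <- apply(peak_mat, 2, median, na.rm = TRUE)
median_std <- sweep(peak_mat, 2, med, "/")

# 2. Minimum value imputation: replace each missing value with the
#    smallest observed value of the same metabolite
min_impute <- apply(median_std, 2, function(x) {
    x[is.na(x)] <- min(x, na.rm = TRUE)
    x
})

# 3. Log transformation (natural log is an assumption here)
log_norm <- log(min_impute)
log_norm
```

After step 1 every metabolite has a median of 1, so after step 3 values are centered near 0 and metabolites measured on very different scales become comparable.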

Exploratory Analysis

In data exploration, we use several methods to help us better understand the underlying patterns in the data without using a formal hypothesis test. In this pipeline, we are going to focus on two methods of data exploration:

A.) Principal component analysis

B.) Heatmaps

Principal Component Analysis (PCA)

In general, principal component analysis (PCA) reduces the number of variables in a dataset while preserving as much information from the data as possible. At a high level, PCA is constructed such that the first principal component (PC) accounts for the largest amount of variance within the data. The second PC accounts for the largest remaining variance, and so on. Additionally, each of the PCs produced by PCA is uncorrelated with the other principal components. PCA therefore allows us to visualize sources of variation in the data. The metabolite_pca function has three arguments; the chunk below passes the SummarizedExperiment and meta_var, the sample metadata variable used to label the points in the plot.

###############################################################################
### Run PCA ###################################################################
###############################################################################

# Define PCA label from metadata
meta_var <- "Gender"

# Run PCA
pca <- metabolite_pca(dat,
    meta_var = meta_var
)


# Show PCA
pca

plot of chunk ExploratoryAnalysis_PCA
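The two PCA properties described above (variance explained decreases from one PC to the next, and the PCs are mutually uncorrelated) can be checked with base R's prcomp() on simulated data. This is only an illustration of PCA itself, not of how metabolite_pca is implemented; the simulated columns m1 to m4 are hypothetical.

```r
set.seed(42)
n <- 60

# Simulate four "metabolites"; m1 and m2 share a common signal
shared <- rnorm(n)
x <- cbind(
    m1 = shared + rnorm(n, sd = 0.3),
    m2 = shared + rnorm(n, sd = 0.3),
    m3 = rnorm(n),
    m4 = rnorm(n)
)

# PCA on centered and scaled data
pca_fit <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each PC (non-increasing)
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
round(var_explained, 3)

# Correlation between PC scores: off-diagonal entries are numerically zero
pc_cor <- cor(pca_fit$x)
round(pc_cor, 3)
```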

Suppose you notice a variable with clearly separated groups that is not a variable of interest. In that case, consider stratifying your downstream analysis by the values of that variable. For example, we will stratify the downstream analysis by male/female in our vignette data.

Heatmaps

For our heatmap, the x-axis will be the samples, and the y-axis will be the metabolites. The values determining the colors will be the log normalized peak values for each metabolite in each observation. We can group the observations by the experimental conditions. Grouping the experimental conditions in a heatmap is another way of visualizing sources of variation within our data.

We can use the metabolite_heatmap function to create the heatmaps, which requires the following arguments.

In the chunks below, we make three heatmaps of increasing complexity.

################################################################################
### Run Heatmaps ###############################################################
################################################################################

# Heatmap with one group
metabolite_heatmap(dat,
    top_mets = 50,
    group_vars = "GROUP_NAME",
    strat_var = NULL,
    caption = "Heatmap Arranged By Group",
    Assay = "normalized",
    GROUP_NAME
)

plot of chunk ExploratoryAnalysis_heatmap





# Heatmap with two groups
metabolite_heatmap(dat,
    top_mets = 50,
    group_vars = c("GROUP_NAME", "TIME1"),
    strat_var = NULL,
    caption = "Heatmap Arranged By Group and TIME",
    Assay = "normalized",
    GROUP_NAME, desc(TIME1)
)

plot of chunk ExploratoryAnalysis_heatmap



# Heatmap with 2 group and stratified
metabolite_heatmap(dat,
    top_mets = 50,
    group_vars = c("GROUP_NAME", "TIME1"),
    strat_var = "Gender",
    caption = "Heatmap Arranged By Group and TIME",
    Assay = "normalized",
    GROUP_NAME, desc(TIME1)
)
#> [[1]]

plot of chunk ExploratoryAnalysis_heatmap

#> 
#> [[2]]

plot of chunk ExploratoryAnalysis_heatmap

Subpathway Analysis

In the chemical annotation file, each metabolite belongs to a subpathway, and each subpathway belongs to a superpathway. There are several metabolites within each subpathway and several subpathways within each superpathway. We can use an analysis of variance (ANOVA) model to test for a difference in peak intensities between the treatment groups at the metabolite level. However, since multiple metabolites are within a subpathway, it is challenging to test whether the treatment affected the peak data at the subpathway level. For this, we use a combined Fisher probability test, which combines the p-values from independent tests to test the hypothesis of an overall effect. The combined Fisher probability is helpful for testing a model at the subpathway level based on the p-values from the model at the metabolite level.

Combined Fisher Analysis

We will test at the subpathway level by combining the p-values for each metabolite within the subpathway for each model. We use a combination function given by \(\tilde{X}\), which combines the p-values, resulting in a chi-squared test statistic.

$$ \tilde{X} = -2\sum_{i=1}^k \ln(p_i) $$

where \(k\) is the number of metabolites in the subpathway and \(p_i\) is the p-value for metabolite \(i\). Under the null hypothesis, \(\tilde{X}\sim \chi^2_{2k}\), so we can get a p-value from \(P(X \geq\tilde{X})\), where \(X\) follows a \(\chi^2_{2k}\) distribution. Notice that smaller p-values lead to a larger \(\tilde{X}\).
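The formula above can be written in a few lines of base R. This is a minimal sketch of Fisher's method itself; the helper name fisher_combined is hypothetical and is not a function from the MetabolomicsPipeline package.

```r
# Combine k independent p-values with Fisher's method.
# fisher_combined is a hypothetical helper written for this illustration.
fisher_combined <- function(p) {
    stopifnot(is.numeric(p), length(p) >= 1, all(p > 0 & p <= 1))
    k <- length(p)
    # Test statistic: -2 * sum of log p-values
    statistic <- -2 * sum(log(p))
    # Upper tail of a chi-squared distribution with 2k degrees of freedom
    p_value <- pchisq(statistic, df = 2 * k, lower.tail = FALSE)
    list(statistic = statistic, df = 2 * k, p_value = p_value)
}

# Example: three metabolite-level p-values from one subpathway (made up)
fisher_combined(c(0.04, 0.20, 0.01))
```

A useful sanity check is that combining a single p-value returns that same p-value, since a chi-squared distribution with 2 degrees of freedom has upper tail probability \(e^{-x/2}\).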

Assumptions

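Since we first test each metabolite using ANOVA, we make the standard ANOVA assumptions for each metabolite: the observations are independent, the residuals are approximately normally distributed, and the variance is constant across groups.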

In addition to the assumptions of the ANOVA models at the metabolite level, Fisher’s combined probability test assumes that the metabolites within the subpathway are independent.

For more about the Combined Fisher Probability and other methods that can address this problem, see:

Loughin, Thomas M. “A systematic comparison of methods for combining p-values from independent tests.” Computational statistics & data analysis 47.3 (2004): 467-485.

Models

To test our hypothesis at the subpathway level, we first have to form our hypothesis at the metabolite level. For each metabolite, we test three models.

1.) Interaction: \(\log(\text{Peak}) = \text{Treatment} + \text{Time} + \text{Treatment}*\text{Time}\)

2.) Parallel: \(\log(\text{Peak}) = \text{Treatment} + \text{Time}\)

3.) Single: \(\log(\text{Peak}) = \text{Treatment}\)

For the interaction model, we focus only on the interaction term “Treatment*Time” to test whether there is a significant interaction between the treatment and the time variable. The parallel model tests whether the time variable explains a significant amount of the metabolite variance with treatment included, and the single model tests whether the treatment explains a significant proportion of the variance for each metabolite.
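As a rough illustration of the three models for a single metabolite, the sketch below fits them with lm() on simulated data and pulls out the p-value of the term each model focuses on. The data are random noise, the variable names are hypothetical, and extracting the terms from a sequential anova() table is an assumption; subpathway_analysis may compute these p-values differently.

```r
set.seed(1)

# Simulated design: 3 treatments x 3 time points x 3 replicates
toy <- data.frame(
    treatment = factor(rep(c("Control", "treat1", "treat2"), each = 9)),
    time = factor(rep(rep(c("PreSymp", "Onset", "End"), each = 3), times = 3))
)
toy$log_peak <- rnorm(nrow(toy)) # pure noise, for illustration only

# 1. Interaction model: focus on the treatment:time term
interaction_fit <- lm(log_peak ~ treatment * time, data = toy)
interaction_p <- anova(interaction_fit)["treatment:time", "Pr(>F)"]

# 2. Parallel model: focus on the time term, with treatment included
parallel_fit <- lm(log_peak ~ treatment + time, data = toy)
parallel_p <- anova(parallel_fit)["time", "Pr(>F)"]

# 3. Single model: focus on the treatment term
single_fit <- lm(log_peak ~ treatment, data = toy)
single_p <- anova(single_fit)["treatment", "Pr(>F)"]

c(interaction = interaction_p, parallel = parallel_p, single = single_p)
```

Repeating this for every metabolite in a subpathway yields one vector of p-values per model, which is what the combined Fisher test then summarizes.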

We test at the subpathway level using the Combined Fisher Probability method to combine the p-values from each model for all metabolites within the subpathway. To run the subpathway analysis, we use the “subpathway_analysis” function, which requires the following arguments.

Results Summaries

With the MetabolomicsPipeline package, we provide three different ways to summarize the results from the subpathway analysis.

  1. Number of significant subpathways by model type (subpath_by_model)

  2. Percentage of significant subpathways within superpathways (subpath_within_superpath)

  3. Metabolite model results within a specified subpathway (met_within_sub)

################################################################################
## Stratified Analysis #########################################################
################################################################################

# Stratified Analysis
stratified <- subpathway_analysis(dat,
    treat_var = "GROUP_NAME",
    block_var = "TIME1",
    strat_var = "Gender",
    Assay = "normalized"
)


################################################################################
### Results Plots ##############################################################
################################################################################

# 1. significant subpathways by model type
subpath_by_model(stratified)
Significant Pathways by Model

| Model Type  | Female | Male |
|-------------|--------|------|
| Interaction | 6      | 84   |
| Parallel    | 21     | 12   |
| Single      | 19     | 2    |
| None        | 64     | 12   |

# 2. Percentage of significant subpathways within superpathways
subpath_within_superpath(stratified)
Proportion of significant subpathways within superpathways

| Super Pathway                     | Percent Significant (Female) | Percent Significant (Male) |
|-----------------------------------|------------------------------|----------------------------|
| Xenobiotics                       | 5 / 5 (100%)                 | 5 / 5 (100%)               |
| Amino Acid                        | 13 / 16 (81.25%)             | 16 / 16 (100%)             |
| Peptide                           | 3 / 5 (60%)                  | 3 / 5 (60%)                |
| Nucleotide                        | 3 / 8 (37.5%)                | 8 / 8 (100%)               |
| Lipid                             | 17 / 53 (32.08%)             | 46 / 53 (86.79%)           |
| Carbohydrate                      | 2 / 8 (25%)                  | 7 / 8 (87.5%)              |
| Cofactors and Vitamins            | 2 / 11 (18.18%)              | 9 / 11 (81.82%)            |
| Energy                            | 0 / 2 (0%)                   | 2 / 2 (100%)               |
| Partially Characterized Molecules | 0 / 1 (0%)                   | 1 / 1 (100%)               |

# 3. Metabolites within subpathway
tables <- met_within_sub(stratified,
    subpathway = "Partially Characterized Molecules"
)

### Females
tables[[1]]
Metabolites within Partially Characterized Molecules (Female)

| Metabolite Name | Interaction P-Value | Parallel P-Value | Single P-Value |
|---|---|---|---|
| glutamine_degradant* | 0.488 | 0.630 | 0.285 |
| glucuronide of C14H22O4 (1)* | 0.431 | 0.970 | 0.332 |
| glucuronide of C10H18O2 (11)* | 0.927 | 0.288 | 0.671 |
| glucuronide of C10H18O2 (12)* | 0.996 | 0.436 | 0.390 |
| glycine conjugate of C10H14O2 (1)* | 0.999 | 0.061 | 0.781 |
| pentose acid* | 0.154 | 0.772 | 0.104 |
| branched-chain, straight-chain, or cyclopropyl 10:1 fatty acid (1)* | 0.891 | 0.405 | 0.909 |
| branched-chain, straight-chain, or cyclopropyl 10:1 fatty acid (2)* | 0.962 | 0.708 | 0.999 |
| glycine conjugate of C6H10O2 (2)* | 0.659 | 0.979 | 0.090 |
| glycine conjugate of C6H10O2 (3)* | 0.981 | 0.509 | 0.706 |
| branched-chain, straight-chain, or cyclopropyl 12:1 fatty acid* | 0.899 | 0.274 | 0.717 |
| bilirubin degradation product, C17H18N2O4 (1)** | 0.189 | 0.238 | 0.193 |
| bilirubin degradation product, C17H18N2O4 (2)** | 0.410 | 0.143 | 0.243 |
| bilirubin degradation product, C17H18N2O4 (3)** | 0.591 | 0.140 | 0.261 |
| bilirubin degradation product, C17H20N2O5 (1)** | 0.821 | 0.098 | 0.771 |
| bilirubin degradation product, C17H20N2O5 (2)** | 0.881 | 0.155 | 0.675 |

### Males
tables[[2]]
Metabolites within Partially Characterized Molecules (Male)

| Metabolite Name | Interaction P-Value | Parallel P-Value | Single P-Value |
|---|---|---|---|
| glutamine_degradant* | 0.100 | 0.011 | 0.844 |
| glucuronide of C14H22O4 (1)* | 0.505 | 0.388 | 0.377 |
| glucuronide of C10H18O2 (11)* | 0.413 | 0.546 | 0.533 |
| glucuronide of C10H18O2 (12)* | 0.013 | 0.739 | 0.837 |
| glycine conjugate of C10H14O2 (1)* | 0.028 | 0.691 | 0.403 |
| pentose acid* | 0.058 | 0.039 | 0.482 |
| branched-chain, straight-chain, or cyclopropyl 10:1 fatty acid (1)* | 0.258 | 0.805 | 0.351 |
| branched-chain, straight-chain, or cyclopropyl 10:1 fatty acid (2)* | 0.045 | 0.516 | 0.161 |
| glycine conjugate of C6H10O2 (2)* | 0.322 | 0.360 | 0.917 |
| glycine conjugate of C6H10O2 (3)* | 0.160 | 0.469 | 0.748 |
| branched-chain, straight-chain, or cyclopropyl 12:1 fatty acid* | 0.015 | 0.502 | 0.001 |
| bilirubin degradation product, C17H18N2O4 (1)** | 0.225 | 0.003 | 0.778 |
| bilirubin degradation product, C17H18N2O4 (2)** | 0.070 | 0.037 | 0.846 |
| bilirubin degradation product, C17H18N2O4 (3)** | 0.031 | 0.135 | 0.647 |
| bilirubin degradation product, C17H20N2O5 (1)** | 0.294 | 0.278 | 0.510 |
| bilirubin degradation product, C17H20N2O5 (2)** | 0.229 | 0.626 | 0.333 |

Pairwise Analysis

We can look at the pairwise comparisons for all experimental groups at the metabolite level. We will use the metabolite_pairwise function within the MetabolomicsPipeline package, which requires the following arguments:

Log Fold-Change Heatmap

We will produce a heatmap of the log fold changes for the metabolites with a significant overall p-value (which tested if the treatment group means were equal under the null hypothesis). The heatmap colors will only show if the log fold-change is greater than log(2) or less than log(.5). Therefore, this heatmap will only focus on comparisons with a fold change of two or greater. The met_est_heatmap function will produce an interactive heatmap using the results from the pairwise analysis.
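The threshold described above is symmetric on the log scale, because log(0.5) equals -log(2). The short base R sketch below shows which log fold changes would pass it; the metabolite names and values are made up, and the code illustrates only the thresholding rule, not the internals of met_est_heatmap.

```r
# Hypothetical log fold changes for four metabolites
log_fc <- c(
    met_a = log(3),   # three-fold increase
    met_b = log(0.4), # 2.5-fold decrease
    met_c = log(1.5), # 1.5-fold increase
    met_d = log(0.8)  # 1.25-fold decrease
)

# Colored only if log FC > log(2) or log FC < log(0.5),
# which is the same as |log FC| > log(2)
shown <- abs(log_fc) > log(2)

data.frame(
    fold_change = round(exp(log_fc), 2),
    log_fc = round(log_fc, 3),
    shown = shown
)
```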

P-Value Heatmap

Similar to the log fold-change heatmap, the p-value heatmap only includes metabolites with a significant overall p-value, and its cells are only colored if the pairwise comparison is significant. We use the met_p_heatmap function to create an interactive p-value heatmap.

For both the log fold-change heatmap and the p-value heatmap, there is an option to produce an interactive plot using plotly or to create a static heatmap using pheatmap. To produce these heatmaps we use the following arguments:

################################################################################
#### Run Pairwise Comparisons ##################################################
################################################################################

strat_pairwise <- metabolite_pairwise(dat,
    form = "GROUP_NAME*TIME1",
    strat_var = "Gender"
)


###############################################################################
## Create Estimate Heatmap #####################################################
################################################################################

met_est_heatmap(strat_pairwise$Female, dat,
    interactive = FALSE,
    SUB_PATHWAY = "SUB_PATHWAY",
    CHEMICAL_NAME = "CHEMICAL_NAME",
    plotlyTitle = "Estimate Heatmap",
    main = "Log fold change heatmap", show_rownames = FALSE
)

plot of chunk PairwiseAnalysis



################################################################################
## Create P-value Heatmap ######################################################
################################################################################
# Female
met_p_heatmap(strat_pairwise$Female, dat,
    interactive = FALSE, show_rownames = FALSE,
    plotlyTitle = "Pvalue Heatmap",
    main = "Pvalue Heatmap"
)

plot of chunk PairwiseAnalysis

Boxplots and Lineplots

Visualizations of the data can help us see the underlying trends. Two useful visualizations are boxplots and line plots, which we create with the subpathway_boxplots and subpathway_lineplots functions. The main utility of these functions is that they allow you to focus on the metabolites within a subpathway. For both functions, the arguments are:

Boxplots and Lineplots steps

################################################################################
### BoxPlots ###################################################################
################################################################################

subpathway_boxplots(dat,
    subpathway = "Lactoyl Amino Acid", block_var = TIME1,
    treat_var = GROUP_NAME, Assay = "normalized",
    CHEMICAL_NAME = "CHEMICAL_NAME",
    SUB_PATHWAY = "SUB_PATHWAY", Gender == "Female"
)

plot of chunk BoxPlotsAndLinePlots



################################################################################
## Line plots ##################################################################
################################################################################

# Set up data
dat$TIME1 <- as.numeric(factor(dat$TIME1,
    levels = c("PreSymp", "Onset", "End")
))

# Create line plots
subpathway_lineplots(dat,
    subpathway = "Lactoyl Amino Acid",
    block_var = TIME1, treat_var = GROUP_NAME,
    Assay = "normalized",
    CHEMICAL_NAME = "CHEMICAL_NAME",
    SUB_PATHWAY = "SUB_PATHWAY", Gender == "Female"
) +
    xlab("Time")
#> `geom_smooth()` using formula = 'y ~ x'

plot of chunk BoxPlotsAndLinePlots
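The line plot chunk above converts TIME1 to a number before plotting. The small base R example below shows why the levels are listed explicitly: without them, factor() orders the labels alphabetically, which would not match the chronological order of the time points. The toy vector is made up for illustration.

```r
# Toy time labels (hypothetical ordering of samples)
time1 <- c("Onset", "End", "PreSymp", "Onset")

# Explicit levels give the chronological coding PreSymp = 1, Onset = 2, End = 3
time_numeric <- as.numeric(factor(time1,
    levels = c("PreSymp", "Onset", "End")
))
time_numeric # 2 3 1 2

# Default (alphabetical) levels give End = 1, Onset = 2, PreSymp = 3
time_default <- as.numeric(factor(time1))
time_default # 2 1 3 2
```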

Session Info

sessionInfo()
#> R version 4.5.0 (2025-04-11)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] ggplot2_3.5.2               SummarizedExperiment_1.39.0
#>  [3] Biobase_2.69.0              GenomicRanges_1.61.0       
#>  [5] GenomeInfoDb_1.45.4         IRanges_2.43.0             
#>  [7] S4Vectors_0.47.0            BiocGenerics_0.55.0        
#>  [9] generics_0.1.4              MatrixGenerics_1.21.0      
#> [11] matrixStats_1.5.0           MetabolomicsPipeline_0.99.1
#> [13] table1_1.4.3               
#> 
#> loaded via a namespace (and not attached):
#>  [1] sandwich_3.1-1       readxl_1.4.5         rlang_1.1.6         
#>  [4] magrittr_2.0.3       multcomp_1.4-28      compiler_4.5.0      
#>  [7] mgcv_1.9-3           systemfonts_1.2.3    vctrs_0.6.5         
#> [10] reshape2_1.4.4       stringr_1.5.1        pkgconfig_2.0.3     
#> [13] crayon_1.5.3         fastmap_1.2.0        backports_1.5.0     
#> [16] XVector_0.49.0       labeling_0.4.3       rmarkdown_2.29      
#> [19] UCSC.utils_1.5.0     purrr_1.0.4          xfun_0.52           
#> [22] jsonlite_2.0.0       flashClust_1.01-2    DelayedArray_0.35.1 
#> [25] broom_1.0.8          cluster_2.1.8.1      R6_2.6.1            
#> [28] stringi_1.8.7        RColorBrewer_1.1-3   car_3.1-3           
#> [31] cellranger_1.1.0     estimability_1.5.1   Rcpp_1.0.14         
#> [34] knitr_1.50           zoo_1.8-14           Matrix_1.7-3        
#> [37] splines_4.5.0        tidyselect_1.2.1     rstudioapi_0.17.1   
#> [40] dichromat_2.0-0.1    abind_1.4-8          codetools_0.2-20    
#> [43] lattice_0.22-7       tibble_3.2.1         plyr_1.8.9          
#> [46] withr_3.0.2          coda_0.19-4.1        evaluate_1.0.3      
#> [49] gridGraphics_0.5-1   survival_3.8-3       xml2_1.3.8          
#> [52] pillar_1.10.2        ggpubr_0.6.0         carData_3.0-5       
#> [55] DT_0.33              plotly_4.10.4        scales_1.4.0        
#> [58] xtable_1.8-4         leaps_3.2            glue_1.8.0          
#> [61] pheatmap_1.0.12      emmeans_1.11.1       scatterplot3d_0.3-44
#> [64] lazyeval_0.2.2       tools_4.5.0          data.table_1.17.4   
#> [67] ggsignif_0.6.4       fs_1.6.6             mvtnorm_1.3-3       
#> [70] grid_4.5.0           tidyr_1.3.1          nlme_3.1-168        
#> [73] Formula_1.2-5        cli_3.6.5            kableExtra_1.4.0    
#> [76] textshaping_1.0.1    S4Arrays_1.9.1       viridisLite_0.4.2   
#> [79] svglite_2.2.1        dplyr_1.1.4          gtable_0.3.6        
#> [82] rstatix_0.7.2        yulab.utils_0.2.0    digest_0.6.37       
#> [85] SparseArray_1.9.0    ggrepel_0.9.6        ggplotify_0.1.2     
#> [88] TH.data_1.1-3        FactoMineR_2.11      htmlwidgets_1.6.4   
#> [91] farver_2.1.2         htmltools_0.5.8.1    factoextra_1.0.7    
#> [94] lifecycle_1.0.4      httr_1.4.7           multcompView_0.1-10 
#> [97] MASS_7.3-65