%\VignetteIndexEntry{The TargetSearch Package} %\VignetteDepends{TargetSearchData} %\VignetteKeywords{preprocess, GC-MS, analysis} %\VignettePackage{TargetSearch} %\VignetteEngine{knitr::knitr} \documentclass{article} <>= BiocStyle::latex() @ % missing definitions \definecolor{fgcolor}{rgb}{0.251, 0.251, 0.251} \providecommand{\hlsng}[1]{\textcolor[rgb]{0.251,0.627,0.251}{#1}} \providecommand{\hldef}[1]{\textcolor[rgb]{0.251,0.251,0.251}{#1}} \usepackage{booktabs} \definecolor{NoteBlue}{RGB}{45, 28, 156} \newcommand{\Note}[2][Note]{\textsl{\textcolor{NoteBlue}{#1: #2}}} \title{The TargetSearch Package} \author{Alvaro Cuadros-Inostroza} \author{Jan Lisec} \author{Henning Redestig} \author{Matthew A Hannah} \affil{Max Planck Institute for Molecular Plant Physiology, Potsdam, Germany} \newcommand{\Rmethod}[1]{{\Rfunction{#1}}} \begin{document} \maketitle \begin{abstract} This document describes how to use \Biocpkg{TargetSearch} to preprocess GC-MS data. \end{abstract} \packageVersion{\Sexpr{BiocStyle::pkg_ver("TargetSearch")}} Report issues on \url{https://github.com/acinostroza/TargetSearch/issues} \newpage \tableofcontents \newpage \section{Supplied files} This section describes the files that have to be prepared before running \Biocpkg{TargetSearch}. These comprise a sample definition file, a reference library file and a retention marker definition file. Throughout this manual, we will use the example files provided by the package \Biocpkg{TargetSearchData}. \subsection{NetCDF files} \Biocpkg{TargetSearch} can currently read only NetCDF files. Many GC-MS software vendors are able to convert their proprietary raw chromatograms to NetCDF. There are however two issues you have to keep in mind: \begin{itemize} \item Make sure that you export the CDF files as nominal mass, not centroid nor profile mode, as \Biocpkg{TargetSearch} cannot handle this type of data. Nevertheless, if such an option is not available, there files can be converted to nominal mass automatically. \item It is recommended to baseline correct your chromatograms before exporting to NetCDF to increase processing speed. If not, \Biocpkg{TargetSearch} implements two methods for baseline correction that you can use at the cost of performance. \end{itemize} Though only NetCDF is supported, it is still possible to use other formats, such as mzXML, mzData or mzML using the package \Biocpkg{mzR}. This is not yet implemented but it will in the next release. \subsection{Sample Information} After exporting the CDF files into a convenient location, prepare a tab-delimited text file describing your samples. It must be contain (at least) the two columns entitled: ``CDF\_FILE'' and ``MEASUREMENT\_DAY'', or, column names that match ``cdf'' and ``day'', respectively. If more than one column matches the pattern, then the first (leftmost) is taken. Addictionally, if a column named ``SAMPLE\_NAME'' is found, this column will be used as sample names instead of the CDF files (useful if your CDF files are not human readable friendly). Other columns such as sample group, treatment, etc. may be additionally included to aid sample sub-setting and downstream analyses. An example is shown in table~\ref{sample:file}. \begin{table}[ht] \begin{tabular}{lll} \toprule CDF\_FILE & MEASUREMENT\_DAY & TIME\_POINT \\ \midrule 7235eg08.cdf & 7235 & 1 \\ 7235eg11.cdf & 7235 & 1 \\ 7235eg26.cdf & 7235 & 1 \\ 7235eg04.cdf & 7235 & 3 \\ 7235eg30.cdf & 7235 & 3 \\ 7235eg32.cdf & 7235 & 3 \\ ...&&\\ \bottomrule \end{tabular} %\caption{Sample file example, \file{samples.txt}} \caption{Sample file example} \label{sample:file} \end{table} To import the sample list into R, use the function \Rfunction{ImportSamples()} specifying the options \Robject{CDFpath} (directory containing the NetCDF files) and \Robject{RIpath} (directory where the transformed cdf files, containing the retention index corrected apex data, also called RI-files, will be saved). This function will warn you if CDF samples are not found in the chosen path. <>= library(TargetSearchData) library(TargetSearch) cdf.path <- tsd_data_path() sample.file <- tsd_file_path("samples.txt") samples <- ImportSamples(sample.file, CDFpath = cdf.path, RIpath = ".") @ \bioccomment{The functions \Rfunction{tsd\_data\_path()} and \Rfunction{tsd\_file\_path()} are convenience functions that retrieve file path to \Rpackage{TargetSearchData} data files.} Alternatively, you could create a \Robject{tsSample} object by using the sample class methods. <>= cdffiles <- dir(cdf.path, pattern="cdf$") # take the measurement day info from the cdf file names. days <- sub("^([[:digit:]]+).*$","\\1",cdffiles) # sample names smp_names <- sub("\\.cdf", "", cdffiles) # add some sample info smp_data <- data.frame(CDF_FILE =cdffiles, GROUP = gl(5,3)) # create the sample object samples <- new("tsSample", Names = smp_names, CDFfiles = cdffiles, CDFpath = cdf.path, RIpath = ".", days = days, data = smp_data) @ \subsection{Retention index markers} To align different chromatograms using RI markers, a tab-delimited definition file for these markers has to be provided. At a minimum it should contain three columns specifying the search window (lower and upper threshold) and the fixed standard value for each marker (table \ref{rim:file}). Further, a characteristic ion mass ($m/z$) has to be provided either in an additional column or as an argument \Robject{mass} to function \Rfunction{ImportFameSettings()}. <>= rim.file <- tsd_file_path("rimLimits.txt") rimLimits <- ImportFameSettings(rim.file, mass = 87) @ This will import the limits in \file{rimLimits.txt} file and set the marker mass to 87 for all markers. \begin{table}[ht] \begin{tabular}{rrr} \toprule LowerLimit & UpperLimit & RIstandard \\ \midrule 230 & 280 & 262320\\ 290 & 340 & 323120\\ 350 & 400 & 381020\\ \bottomrule \end{tabular} % \caption{Retention time definition file example, \file{rimLimits.txt}} \caption{Retention time definition file example} \label{rim:file} \end{table} If you do not use RI markers, you can skip this part by setting the parameter \Robject{rimLimits} to \Robject{NULL} in \Rfunction{RIcorrect()} function. Please note that in this case, no retention time correction will be performed. To make sure that you have correctly set the time ranges for the RI markers, the function \Rfunction{checkRimLim()} can be used. This function plots the $m/z$ of all the markers in (a subset of) the samples. By default, the option \Rcode{single} is \Rcode{TRUE}, which means only a (randomly chosen) sample will be used for the figure (the reason is we usually do not need to see all the samples, and it is computationally expensive to display too many samples at the same time). To display multiple samples, set the said option to \Rcode{FALSE}. In this case, it is recommended to use \emph{subsetting} to select specific sample to review (in particular in large datasets). The example below demonstrates the multiple display option restricted to the first three samples. Some basic plot styling is also possible by passing arguments to \Rfunction{matplot()}, such as, \Rcode{col} and \Rcode{lty}. The generated plot is shown in figure~\ref{fig:checkRimLim}. Note that the gray area indicates the search window of each marker. <>= checkRimLim(samples[1:3], rimLimits, single=FALSE, col=2:4, lty=1) @ \section{Pre-processing} \subsection{Conversion of CDF files to TargetSearch format} Starting from \Biocpkg{TargetSearch} 1.42, a CDF4 custom format was introduced. This format has many advantages over the typical CDF3 that is exported by commercial software, mainly compression and fast data access, which can speed up the internal \Biocpkg{TargetSearch} computations (specially in plotting functions). Another advantage is that you can store baseline corrected data in this file, which is not possible to do in CDF file. And finally, the conversion is required if the CDF files are not in nominal mass format. The main disadvantage is that this format is unique to \Biocpkg{TargetSearch} and it is unlikely to be used elsewhere, and the extra storage space required for these files. To transform the files, simply run the function \Rfunction{ncdf4Convert()}. In the example below, the files are converted to CDF4 format and saved to the working directory. If the parameter \Rcode{path} is missing, then the files will be placed in the same directory as the original CDF files. <>= samples <- ncdf4Convert(samples, path=".") @ If you need to perform baseline correction on the CDF files, please check the following section instead. \Note{Although this conversion is optional, it is highly recommended. It is still possible to use the classic workflow at the cost of computation speed.} \subsection{Baseline correction} \Biocpkg{TargetSearch} expects baseline corrected NetCDF files, which is usually performed by your GC-MS vendor's software. If that is the case, you may continue reading the following subsection. However, if your chromatograms are not baseline corrected, \Biocpkg{TargetSearch} provides two baseline correction methods. The first one is implemented in \Rfunction{baselineCorrection()} and is based on the work of Chang et al. \cite{chang07}. This method is referred as ``classic'' in the man-pages as it has been present since the beginning. The second method called ``quantiles'', introduced in \Biocpkg{TargetSearch} 1.42.0, estimates the baseline by computing a given quantile in a sliding window (the quantile is $50\%$ bby default). This method is implemented in \Rfunction{baselineCorrectionQuant()}. A wrapper function called \Rfunction{baseline()} is also provided, which has the advantage that it can take a CDF file name as input (the other functions take matrices). \Note{These functions are meant to be run internally and only exposed for advanced users.} To apply baseline correction to the chromatograms, we simply call the function \Rfunction{ncdf4Convert()}, similarly as the section above, but with extra parameters that are required so the baseline correction procedure takes place. An example of such as call is shown below. <>= samples <- ncdf4Convert(samples, path=".", baseline=TRUE, bsline_method="quantiles", ...) @ Here, the parameter \Robject{baseline} is set to \Robject{TRUE} (by default is \Robject{FALSE}) and we choose the method ``quantiles''. The $\ldots$ contain the parameters passed to \Rfunction{baselineCorrectionQuant()}. Similarly, if the chosen method is ``classic'', then the extra parameters will be passed to \Rfunction{baselineCorrectionQuant()}. Please refer to the manpages for details on these methods. \subsection{Peak detection} Initially, \Biocpkg{TargetSearch} identifies the local apex intensities in all chromatograms, finds the retention time (RT) of the RI markers and converts RT to RI using linear interpolation \cite{VandenDool1963}. This is done by the function \Rfunction{RIcorrect()}. Parameters to this function are sample and retention time limits objects, the intensity threshold (\Rcode{IntThreshold = 100}), the peak picking method (\Robject{pp.method}) and a \Robject{Window} parameter that will be used by said method. The function will return a matrix with the retention times of all RI markers and create a file with all extracted peaks of the respective NetCDF file, which will we call \emph{RI file}. This file can be a tab-delimited file (RI file) (old format) or a custom binary file (default format). The difference between them is that the binary file allows much faster parsing compared to the other format. \Note{Formerly, we would use this function for baseline correction, but as of \Biocpkg{TargetSearch} 1.42.0, this has to be done as explained in the section above.} <>= RImatrix <- RIcorrect(samples, rimLimits, IntThreshold = 100, pp.method = "ppc", Window = 15) @ There are three peak picking methods available: \emph{smoothing} implements the algorithm used by Tagfinder \cite{Luedemann2008}, \emph{gaussian} uses basically the same algorithm, but the difference is the smoother is gaussian based, and the \emph{ppc} algorithm, which is an adaptation of the function \Rfunction{ppc.peaks} from the package \CRANpkg{ppc}. \bioccomment{This package seems to be deprecated. It is not in CRAN anymore}. This method simply searches for local maxima within a given time window. The \Robject{Window} parameter sets the window size of the chosen peak picking method in terms of number of scan points. \warning[Note]{The number of points actually used is $2 \times Window + 1$}. \subsection{RI correction} After running \Rfunction{RIcorrect()}, your chromatograms have been already retention time corrected. However, it is necessary to check that the RI markers detection was actually correct. To do so, \Biocpkg{TargetSearch} examines the generated RI matrix \Robject{RImatrix} with the function \Rfunction{FAMEoutliers()}. It creates a PDF report of the RI markers including information about possible outliers that the user can remove or not. Alternatively, RI markers can be checked manually using the function \Rfunction{plotFAME} (Figure~\ref{fig:plotFAME}). <>= outliers <- FAMEoutliers(samples, RImatrix, threshold = 3) @ Here, \Robject{threshold} sets the number of standard deviations a value has to be away from the mean before it will be considered an outlier. In this example, however, no outliers were detected. <>= plotFAME(samples, RImatrix, 1) @ Note that \Biocpkg{TargetSearch} assumes that the RI markers are injected together with the biological samples. A different approach is to inject them separately. If this is the case, please look at the vignette \emph{RI correct extra} and its accompanying file \file{RetentionIndexCorrection.R} for a workaround. In case of outliers, there is no single method to fix them as this really depends on the particular measurement conditions. However, there are few possibilities i) If the time deviation from the mean is small, for example, less than one second, as a rule of thumb the outliers can usually be ignored. ii) if the outliers are due to chromatogram shift, then it is also fine to ignore them, because that is the point of the markers. iii) On the contrary, if the shift is because of a problem with the marker itself (eg, overloaded peak, low intense peak) and \textbf{\emph{not}} because of a chromatogram shift, then do not ignore the outliers. In this case, you can manually set the retention time of the outlier to the mean/median, or choose another peak (which is stable across samples) as a new marker, or even just remove the RI marker from the list. \section{Library Search} \subsection{Reference Library File} The \emph{reference library} file contains the information of the metabolites or mass spectral tags (MSTs) that will be searched for in the chromatograms. A public spectra database could be found here \url{http://gmd.mpimp-golm.mpg.de/} at \emph{The Golm Metabolome Database} \cite{Kopka2005,Hummel2007,Hummel2013}. Required information is the metabolite name (``Name''), expected retention time index (``RI''), selective masses (``SEL\_MASS''), most abundant masses (``TOP\_MASS''), spectrum (``SPECTRUM'') and RI deviations (``Win\_1'', ``Win\_2'', ``Win\_3''). See example in table~\ref{library:file}. The columns ``Name'' and ``RI'' are mandatory and you have at least to include one of the columns ``SEL\_MASS'', ``TOP\_MASS'' or ``SPECTRUM'' in the file (see below). The RI deviation columns are optional. \begin{table}[ht] \scalebox{0.85}{% \begin{tabular}{lllll} \toprule Name & RI & Win\_1 & SEL\_MASS & SPECTRUM\\ \midrule Pyruvic acid & 222767 & 4000 & 89;115;158;174;189 & 85:7 86:14 87:7 88:5 8... \\ Glycine (2TMS) & 228554 & 4000 & 86;102;147;176;204 & 86:26 87:19 88:8 89:4...\\ Valine & 271500 & 2000 & 100;144;156;218;246 & 85:8 86:14 87:6 88:5 8...\\ Glycerol (3TMS) & 292183 & 2000 & 103;117;205;293 & 85:14 86:2 87:16 88:13...\\ Leucine & 306800 & 1500 & 102;158;232;260 & 158:999 159:148 160:45...\\ Isoleucine & 319900 & 1500 & 102;103;158;163;218 & 90:11 91:2 92:1 93:1 9...\\ Glycine & 325000 & 2000 & 86;100;174;248;276 & 85:6 86:245 87:24 88:12...\\ \bottomrule \end{tabular} }%end scalebox% \caption{Reference Library example} \label{library:file} \end{table} In this file, masses and intensities must be positive integers. RIs and RI deviations can be any positive real number. The selective and most abundant masses list must be delimited by semicolon (;). The spectrum is described by a list of mass and intensity pair. Every mass-intensity pair is separated by colon (:) and different pairs are separated by spaces. The function \Rfunction{ImportLibrary()} imports the reference file. <>= lib.file <- tsd_file_path("library.txt") lib <- ImportLibrary(lib.file, RI_dev = c(2000,1000,200), TopMasses = 15, ExcludeMasses = c(147, 148, 149)) @ Here we set the RI window deviations to 2000, 1000 and 200 RI units. Since ``Win\_1'' column is already in the file, the first value (2000) is ignored. Also, the 15th most abundant masses are taken but excluding the masses 147, 148 and 149 (common confounding masses) \subsection{Library Search Algorithm} The library search is performed in three steps. First, for every metabolite, selective masses are searched in a given time window around the expected RI. This is done by the function \Rfunction{medianRILib()}. This function calculates the median RI of the selective masses and return new library object with the updated RI. The time deviation is given either in the library file (column ``Win\_1'') or when the library is imported (see \Rfunction{ImportLibrary()}). <>= lib <- medianRILib(samples, lib) @ It is also possible to examine visually the RI deviation of the metabolites by setting the parameter \Robject{makeReport=TRUE}, which creates a pdf report like the one shown in figure~\ref{fig:medianLib}. This may help to set or update the expected RI deviation. <>= resPeaks <- FindPeaks(RIfiles(samples), refLib(lib, w = 1, sel = TRUE)) plotRIdev(lib, resPeaks, libID = 1:9) @ In the second step, the function \Rfunction{sampleRI()} searches the selective masses again, but using the updated RI and the RI deviation defined in the library object (``Win\_2''). After that, the intensities of the selected masses are normalised to the median of the day, and then used to extract other masses with correlated apex profiles. The masses for which the Pearson correlation coefficient is above \Robject{r\_thres} are taken as metabolite markers and their RIs are averaged on a per sample basis. This average RI represents the exact position where the metabolite elutes in the respective sample, which is returned in a matrix form. <>= cor_RI <- sampleRI(samples, lib, r_thres = 0.95, method = "dayNorm") @ The third step will look up for all the masses (selective and most abundant masses) in all the samples. This is done by the function \Rfunction{peakFind()}. It returns a \Rclass{tsMSdata} object with the intensities and RI of every mass (rows) and every sample (columns) that were search for. <>= peakData <- peakFind(samples, lib, cor_RI) @ The intensity and RI slots can be accessed by using the \Rmethod{Intensity} and \Rmethod{retIndex} methods. Each slot is a list, where every component of the list is a matrix, which is related to a metabolite in the library. There should be as many components as metabolites in the library. In every matrix, columns are samples and rows are masses. <>= met.RI <- retIndex(peakData) met.Intensity <- Intensity(peakData) # show the intensity values of the first metabolite. met.Intensity[[1]] @ \section{Metabolite Profile} The function \Rfunction{Profile} makes a profile of the MS data by averaging all the normalised mass intensities whose Pearson coefficient is greater that \Robject{r\_thresh}. <>= MetabProfile <- Profile(samples, lib, peakData, r_thres = 0.95, method = "dayNorm") @ A \Rclass{msProfile} object is returned. The averaged intensities and RI matrices that can be obtained by \Rmethod{Intensity} and \Rmethod{retIndex} methods. The profile information is represented by a \Rclass{data.frame} in the \Rmethod{info} slot (accessible by \Rmethod{profileInfo} method). The columns are: \begin{description} \item[Name] The metabolite/analyte name. \item[Lib\_RI] The expected RI (library RI). \item[Mass\_count] The number of correlating masses. \item[Non\_consecutive\_Mass\_count] Same as above, but not counting the consecutive masses. \item[Sample\_Count\_per\_Mass] The number of samples in which a correlating mass was found. The numbers are separated by semi-colon (;) and each number corresponds to one correlating masses (same order). If all the numbers are the same, only that unique number is shown. \item[Masses] The correlating masses. \item[RI] The average RI. \item[Score\_all\_masses] The similarity score calculated using the average intensity of all the masses that were searched for, regardless of whether they are correlating masses. \item[Score\_cor\_masses] Same as above, but only correlating masses are considered. \end{description} As metabolites with similar selective masses and RIs can be present in metabolite libraries, it is necessary to reduce redundancy. This is performed by the function \Rfunction{ProfileCleanUp} which selects peaks for which the RI gap is smaller than \Robject{timeSplit} and computes the Pearson correlation between them. When two metabolites within such a time-group are tightly correlated (given by \Robject{r\_thres}) only the one with more correlated masses is retained. <>= finalProfile <- ProfileCleanUp(MetabProfile, timeSplit = 500, r_thres = 0.95) @ The function returns a \Rclass{msProfile} object. The \Rmethod{info} slot is similar as described above, but extra columns with a ``Cor\_'' prefix (e.g., ``Cor\_Name'') are included. They provide information about metabolite redundancy. \section{Peaks and Spectra Visualisation} Finally, it may be of interest to check the chromatographic peak of selected metabolites and compare the median spectra of the metabolites, i.e., the median intensities of the selected masses across all the samples, with the reference spectra of the library. There are two functions to do so: \Rfunction{plotPeak} and \Rfunction{plotSpectra}. For example, we can compare the median spectrum of ``Valine'' against its spectrum reference. Here we look for the library index of ``Valine'' and plot the spectra comparison in a ``head-tail'' plot (figure~\ref{fig:plotSpectra}). <>= grep("Valine", libName(lib)) plotSpectra(lib, peakData, libID = 3, type = "ht") @ To look at the chromatographic peak of ``Valine'' in a given sample, we use the functions \Rfunction{peakCDFextraction} to extract the raw chromatogram and \Rfunction{plotPeak} to plot the peak (figure~\ref{fig:plotPeak}). <>= # we select the first sample sample.id <- 1 cdf.file <- tsd_file_path(cdffiles[sample.id]) rawpeaks <- peakCDFextraction(cdf.file) # which.met=3 (id of Valine) plotPeak(samples, lib, MetabProfile, rawpeaks, which.smp=sample.id, which.met=3, corMass=FALSE) @ Refer to the documentation of the functions \Rfunction{plotPeak} and \Rfunction{plotSpectra} for further options not covered here. \section{TargetSearch GUI} \warning{The graphical user interface (GUI) has been removed from \Biocpkg{TargetSearch}. Please use the regular functions.} Nevertheless, the old GUI is still available in my github repository. Please follow the instructions if you want to install it. \url{https://github.com/acinostroza/TargetSearchGUI} \section{Untargeted search} Although \Biocpkg{TargetSearch} was designed to be targeted oriented, it is possible to perform untargeted searches. The basic idea is to create a library that contains evenly distributed ``metabolites'' in time and every ``metabolite'' uses the whole range of possible masses as selective masses. An example: <>= metRI <- seq(200000, 300000, by = 5000) metMZ <- 85:250 metNames <- paste("Metab",format(metRI,digits=6), sep = "_") @ Here we define a set of metabolites located every 5000 RI units from each other in the RI range 200000-300000, with selective masses in the range of 85-300, and assign them a name. After that, we create an \Robject{tsLib} object. <>= metLib <- new("tsLib", Name = metNames, RI = metRI, selMass = rep(list(metMZ), length(metRI)), RIdev = c(3000, 1500, 500)) @ Now we can use this library object to perform a targetive search with this library on the \emph{E. coli} samples as we did before. <>= metLib <- medianRILib(samples, metLib) metCorRI <- sampleRI(samples, metLib) metPeakData <- peakFind(samples, metLib, metCorRI) metProfile <- Profile(samples, metLib, metPeakData) metFinProf <- ProfileCleanUp(metProfile, timeSplit = 500) @ The \Robject{metFinProf} object can be used to create a new library by taking only the metabolites that have, for example, more than 5 correlating masses and using the correlating masses as selective ones. <>= sum( profileInfo(metFinProf)$Mass_count > 5) tmp <- profileInfo(metFinProf)$Mass_count > 5 metRI <- profileInfo(metFinProf)$RI[tmp] metNames <- as.character( profileInfo(metFinProf)$Name[tmp] ) metMZ <- sapply(profileInfo(metFinProf)$Masses[tmp], function(x) as.numeric(unlist(strsplit(x,";"))) ) metLib <- new("tsLib", Name = metNames, RI = metRI, selMass = metMZ, RIdev = c(1500, 750, 250)) @ After the new library object is created, the process can be repeated as shown above. Finally, by using the function \Rfunction{writeMSP}, it is possible to export the spectrum of the unknown metabolites to MSP format used by NIST mass spectra search (\url{http://www.nist.gov/srd/mslist.htm}), so the unknown spectra can be search against known metabolite spectra databases. \bibliography{biblio} \section{Session info} Output of \Rfunction{sessionInfo()} on the system on which this document was compiled. <>= toLatex(sessionInfo()) @ \end{document} % vim: set ts=4 sw=4 et: