%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \VignetteIndexEntry{Sample Size Calculation}
% \VignetteKeywords{SELDI, MALDI, Protein Expression Analysis, Clinical Proteomics, Mass Spectrometry, Sample Size, Multiple Testing, Technical replicates}
% \VignettePackage{clippda}
\documentclass{article}
\usepackage{graphicx}  % standard LaTeX graphics tool
\usepackage{Sweave}
\usepackage{amsfonts}
% these should probably go into a dedicated style file
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}
%%%%%%%%%%%%%%%%%%%%%%%%%
% Our added packages and definitions
\usepackage{hyperref}
\usepackage{amsmath}
\usepackage{color}
\usepackage{comment}
\usepackage[authoryear,round]{natbib}
\parindent 0in
\definecolor{red}{rgb}{1, 0, 0}
\definecolor{green}{rgb}{0, 1, 0}
\definecolor{blue}{rgb}{0, 0, 1}
\definecolor{myblue}{rgb}{0.25, 0, 0.75}
\definecolor{myred}{rgb}{0.75, 0, 0}
\definecolor{gray}{rgb}{0.5, 0.5, 0.5}
\definecolor{purple}{rgb}{0.65, 0, 0.75}
\definecolor{orange}{rgb}{1, 0.65, 0}
\def\RR{\mbox{\it I\hskip -0.177em R}}
\def\ZZ{\mbox{\it I\hskip -0.177em Z}}
\def\NN{\mbox{\it I\hskip -0.177em N}}
\newtheorem{theorem}{Theorem}
\newtheorem{procedure}{Procedure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}

\title{\Rpackage{clippda}: A package for clinical proteomic profiling data analysis}
\author{Stephen Nyangoma}
\maketitle

\begin{center}
Cancer Research UK Clinical Trials Unit, Institute for Cancer Studies, School of Cancer Sciences, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK
\end{center}

\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Overview}

This package is still under development, but it is intended to provide a range of tools for analysing clinical genomics, methylation, and proteomics data. The method used is suitable for analysing data from single-channel microarray experiments and mass spectrometry data (especially in situations where there is no one-to-one correspondence between cases and controls), with non-standard repeated expression measurements arising from technical replicates. Its main application will be to mass spectrometry data analysis. Mass spectrometry approaches, such as Matrix-Assisted Laser Desorption/Ionization (MALDI) and Surface-Enhanced Laser Desorption/Ionization Time-of-Flight (SELDI-TOF), are increasingly being used to search for biomarkers of potentially curable early-stage cancer. Currently, I am mainly concerned with the problem of finding the number of biological samples required when designing clinical proteomic profiling studies, to ensure adequate power for differential peak detection. However, we have included a number of tools for the pre-processing of repeated peaks data, including tools for checking the number of replicates for each sample, for checking the consistency of the peaks between replicate spectra, and for data formatting and ``averaging''. In the future, the \Rpackage{clippda} package will offer additional tools, including functions for differential-expression analysis of peaks from studies with technically replicated assays, and for pre-processing three-dimensional Liquid Chromatography (LC) - Mass Spectrometry (MS) datasets. To be able to detect clinically-relevant differences between cases and controls with adequate statistical power, a given number of biological samples must be analysed.
While mass spectrometry technology is imperfect and is affected by multiple sources of variation, little consideration has been given to sample size requirements for studies using this technology. Sample size calculations are typically based on the distribution of the measurement of interest, the between-subject (biological) variability, the clinically meaningful size of the difference in expression values between the cases and controls, the power required to detect this difference, and the significance level (i.e. the acceptable false-positive rate). In proteomic profiling studies, sample size determination needs to consider several additional factors which make sample size calculations less straightforward than for many other studies. We need to account for experimental error and make adjustments for the number of experimental replicates. In addition, proteomic profiling studies seek to simultaneously detect the association of multiple peaks with a given outcome; thus, adjustments which control the type I error rate must be made to the sample size. However, the problem of controlling the number of false positives is complicated in a proteomic profiling study. The number of peaks (and of differentially-expressed proteins), often between 45 and 150, is normally much lower than the number of genes considered in microarray studies (typically thousands). Thus the parameters which control the FDR, such as the significance level, the power and the proportion of true positives, must be specially defined for proteomic profiling studies: parameters appropriate to microarray studies cannot be transferred directly to proteomic profiling studies. It is common to find that a significance cut-off of $\alpha = 0.05$ is used. For proteomics datasets, this level of significance has been shown to adequately control the tail probability of the proportion of false positives (TPPFP) (\cite{Birkneretal2006}), using a resampling-based empirical Bayes method which controls the ratio of false positives to total rejections. This approach does not require the use of a fixed value of the clinically important difference for all peaks (as required in microarrays; see, e.g., \cite{DobinandSimon2005}), and is suitable for comparative analyses where these differences may vary across chip-types, for example. Our applications focus on the clinical proteomics of cancer. Most gene expression and proteomics analyses involving human specimens are observational case-control studies by design, and this fact immediately raises the key issue of possible biases and confounding factors in the populations from which the samples are drawn (\cite{BoguskiandMcIntosh2003}). Moreover, most clinical studies are unbalanced, in that the number of biological samples having specific phenotypic attributes differs between the cases and controls. Sample imbalance has been found to have a huge impact on sample size requirements in microarray studies (\cite{Kunetal2006}), but has not been a subject of study in proteomic profiling studies. Unfortunately, in the majority of clinical genomic and proteomic profiling studies, the distinction between observational and experimental designs is not made, and the methods for analyzing experimental (randomized) clinical studies are applied arbitrarily. This software implements a method (\cite{Nyangomaetal2009}) for calculating the sample size required for planning proteomic profiling studies which adjusts for the effect of sample imbalances and confounding factors.
Perhaps the main challenge in constructing a sample size formula for planning proteomic profiling studies is the fact that confounders must be appropriately adjusted for. However, typically no confounder information is available at the design stage of a study. In this case, two options are available:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1. Ignore the confounder information and use a sample size calculation formula such as that adopted by \cite{Cairnsetal2009}. However, this underestimates the sample size required, which may have an adverse effect on the quality of the conclusions drawn from the study.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2. Introduce a method which makes it simple to include adjustments for expected heterogeneities in sample size calculations.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We use a sample size formula which is based on a modified version of the Generalized Estimating Equations (GEE) method (\cite{Nyangomaetal2009}). It takes into account the experimental design of repeated peak measurements, and it adequately adjusts for all sources of variability, heterogeneity and imbalance in the data. In this method, it is assumed that the joint distribution of the peak data and the covariates is known. Thus the usual Fisher information matrix is replaced by the expected (with respect to covariates) Fisher information matrix; the covariate information enters the formula as a function of the multinomial proportions of subjects with specific attributes, while the repeated peak information enters as a linear function of the intraclass correlations between the intensities of the replicates. This method makes it straightforward for a clinician to plan what proportion of samples with given attributes to include in an experiment. Typically, the covariate information need not be available for the confounding effects to be specified in a sample size calculation. For any proteomic profiling study, the effect of covariates on the sample size calculation may be simulated from an appropriate univariate normal distribution with mean $Z=2+c$, where $0 \leq c \leq 1$; a sketch of such a simulation is given below. In these simulations, values of $Z \approx 2$ (with a small variance, $\approx 0.03$) indicate that the study is balanced; departures from these values represent various degrees of imbalance. As far as we know, there is only one reference on the sample size requirements for proteomic profiling studies: \cite{Cairnsetal2009}; but there is an extensive literature on sample size in microarray studies (e.g. \cite{DobinandSimon2005, Jorstadetal2007, Kunetal2006, WeiandBumgarner2004}).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Defining guidelines for sample size requirements in proteomic profiling studies is complex, since there are multiple ways of setting up experiments: for example, there are many SELDI chip-types (e.g. IMAC, CM10, H50, and Q10), several biofluids (e.g. serum, urine, and plasma), a number of possible control groups (e.g. healthy or diseased controls), and various sample-types (e.g. cancers which are localized in different body organs). This results in a difficult {\em multifactorial} problem, and the use of pilot data is warranted.
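To make the simulation of adjustment factors concrete, here is a minimal sketch. It is not a \Rpackage{clippda} function; the degree of imbalance and the variance used here are hypothetical choices within the ranges discussed above:
<<eval=FALSE>>=
# A minimal sketch, not part of the clippda workflow: simulate plausible
# heterogeneity-adjustment factors Z for a somewhat unbalanced study.
# The degree of imbalance (c.imb) and the variance are hypothetical.
c.imb <- 0.5
Z.sim <- rnorm(1000, mean = 2 + c.imb, sd = sqrt(0.03))
summary(Z.sim)  # values near 2 indicate balance; larger values, imbalance
@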
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
This method is implemented in the \Rpackage{clippda} package, which offers a number of functions for data pre-processing, formatting, inference, and sample size calculations for unbalanced SELDI or MALDI proteomic data with technical replicates. Unlike sample size calculations for microarrays, the method requires multiple parameters, including the intraclass correlation, the protein variance, and the clinically important differences. The intraclass correlation is usually fixed in many applications involving repeated measurements, and we follow this approach. We avoid defining the clinically important differences as fold-changes; instead, we rely on a well-defined interval to determine what is clinically important. The sample sizes required for planning clinical proteomic profiling studies are the elements of the plane described by the clinically important differences versus the protein variances. There is an option for plotting sample size parameters from past experiments on the grid described by the clinically important differences versus protein variances, onto which sample size contours have been superimposed. This information is crucial for establishing general guidelines for sample size calculations in proteomic profiling studies, as it gives the bounds for sample size requirements. Thus the grid may be used as a tool for integrating sample size information from disparate sources. The control of biological variance is found to be critical in determining sample size. In particular, the parameters used in our calculations are derived from summary statistics of peaks with ``medium'' biological variation. Although the method for specifying the confounder effect (\cite{Nyangomaetal2009}) is particularly important at the design stage of a clinical proteomics study, it may also be employed when one wants to reuse the data from a previous study (in which there is no complete record of covariates) as the ``pilot'' data for a new study. The data from previous studies can be used to determine the range of possible values for the intraclass correlation, the within- and between-sample variances, and the size of the differences to be estimated. Then the method given in Nyangoma {\em et al.} (2009) may be used to specify the range of possible values for the confounding effects, which can be used in sample size calculations to set guidelines for the required sample size when designing a proteomic profiling study.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
From a clinical proteomic profiling study we have protein expression measurements, and separate records of phenotypic characteristics of the samples under study. However, these datasets are stored in formats which cannot be used directly in statistical analyses. Moreover, there is often missing phenotypic information for certain individuals. Thus data management and storage become key issues in proteomic profiling data analyses. We define a new object class for clinical proteomic profiling data, known as \texttt{aclinicalProteomicsData}, which can be used to combine both expression data and phenotypic information from a clinical proteomic study into a single coherent and well-documented data structure which can be used directly in analyses.
An \texttt{aclinicalProteomicsData} class \texttt{object} has \texttt{slots} for storing: ``raw'' SELDI data, covariates, phenotypic data, classes for phenotypic variables, and the number of peaks of interest (e.g. peaks detected by the Biomarker Wizard software).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
It should be noted that there are two packages in Bioconductor for handling \texttt{SELDI} datasets. The \Rpackage{PROcess} package (\cite{li2005}) provides functions for pre-processing raw \texttt{SELDI} mass spectra, while \Rpackage{caMassClass} (\cite{tuszynski2003}) provides tools for pre-processing and classification of \texttt{SELDI} datasets. Thus, mass spectrometry datasets pre-processed using these packages can be analysed using the \Rpackage{clippda} package, to provide information on differential protein expression and on the sample size required to plan a proteomic profiling study.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Getting started}

{\bf Installing the package.} There are instructions for installing packages at \url{http://www.bioconductor.org/docs/install/}. For Windows users, first download the appropriate file for your platform (e.g. a .zip file) from the Bioconductor website \url{http://www.bioconductor.org/}. Next, start R and select the \texttt{Packages} menu, then select \texttt{Install package from local zip file...}. Find and highlight the location of the zip file and click on {\tt open}. Alternatively, you can download the package from your nearest \texttt{CRAN} mirror site. Simply set the \texttt{CRAN} mirror appropriately, and then use the command \texttt{install.packages} from within an R session.\\

{\bf Loading the package.} To load the \Rpackage{clippda} package in your R session, type \texttt{library(clippda)}. \\

{\bf Help files.} Detailed information on \Rpackage{clippda} package functions can be obtained from the help files. For example, to view the help file for the function \texttt{clippda} in a browser, use \texttt{help.start} followed by \texttt{?clippda}.\\

{\bf Case study.} We provide data from sera samples from liver cancer patients and non-cancer controls.\\

{\bf Sweave.} This document was generated using the \Robject{Sweave} function from the R \Rpackage{tools} package. The source (.Rnw) file is in the \texttt{/inst/doc} directory of the \Rpackage{clippda} package.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Software Application: proteomic profiling of liver serum}
\subsection{Introduction}

The package for calculating the sample size required for planning a proteomic profiling study is called \Rpackage{clippda}. It contains a number of user-level functions for pre-processing proteomic data with repeated peak measurements, to put the data in a format amenable to analysis using conventional tools for repeated-measures analysis. The function for sample size calculations is called \Robject{sampleSize}. There are plots for visualizing the results of your calculations: \Robject{sampleSizeContourPlots} draws a grid for calculating sample size using parameter values that we have found to be clinically relevant, while \Robject{sampleSize3DscatterPlots} displays the sample sizes corresponding to the range of the clinically important parameters.
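If the package is not yet installed, the following minimal sketch shows how it may be obtained from within an R session; it assumes the standard Bioconductor installation script and is not run when building this vignette:
<<eval=FALSE>>=
# A sketch of installation from within R (not run when building this
# vignette); assumes the standard Bioconductor installation script.
source("http://bioconductor.org/biocLite.R")
biocLite("clippda")
@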
We begin by loading the necessary package:
<<>>=
library(clippda)
options(width=60)
@
We use the function \Robject{install.packages}, or the function \Robject{biocLite}, to get the necessary analysis and data packages from the R and Bioconductor repositories.

\subsection{The \Rpackage{clippda} package and initial data filtering}

The dataset used for illustration is from the proteomic profiling of liver cancer patients vs non-cancer controls. There were $60$ sera samples from patients with liver tumors, and $69$ from non-cancer controls. In addition to the information on the tumor class (exposure) of the samples, gender information was available. The samples were divided into two aliquots, pre-processed using the SELDI protocol, and then assayed on IMAC chips. Although all the samples (except the QC, or quality control, samples) were meant to be assayed in duplicate, there were cases of mislabelling, resulting in some samples having no replicates and others having more than two replicates. So we developed a pre-processing tool to detect samples with incorrect numbers of replicates. The samples without replicates were discarded, while samples having more than two replicates were checked for consistency of peaks; the two replicates with the most similar peak information were used in further analyses. We first illustrate how to pre-process the repeated peak data to get the duplicate peak measurements which are used in sample size calculations. \Robject{liverRawData} is the raw data from the Biomarker Wizard, \Robject{liverdata} is its pre-processed version, and \Robject{liver\_pheno} is a dataframe of sample phenotypic information.

\subsubsection{Data pre-processing}
<<>>=
data(liverdata)
data(liverRawData)
data(liver_pheno)
liverdata[1:4,]
liverRawData[1:4,]
@
A short description of the data is given by:
<<>>=
names(liverdata)
dim(liverdata)
@
Here are the samples from the ``raw'' data for which the number of replicates is not 2, the intended number of replicates (in most cases this is due to mislabelling):
<<>>=
no.peaks <- 53
no.replicates <- 2
checkNo.replicates(liverRawData,no.peaks,no.replicates)
@
These samples must be pre-processed to: (i) discard the information on samples which have no replicate data; and (ii) for samples with more than two replicate expression measurements, retain only the duplicates with the most similar peak information for use in subsequent analyses. We can use the wrapper for the pre-processing functions, \Robject{preProcRepeatedPeakData}, to pre-process the repeated raw mass spectrometry data from the Biomarker Wizard. It identifies and discards information on samples that have no replicates. Then, for the samples with two or more replicates, it selects and returns the data for the two replicates with the most similar expression pattern. This is done as follows:
<<>>=
threshold <- 0.80
Data <- preProcRepeatedPeakData(liverRawData, no.peaks, no.replicates, threshold)
@
Only the sample with ID 250 has no replicates, and it has been omitted from the data to be used in subsequent analyses.
This fact may be verified by using the \texttt{R} command \texttt{setdiff}:
<<>>=
setdiff(unique(liverRawData$SampleTag),unique(liverdata$SampleTag))
setdiff(unique(Data$SampleTag),unique(liverdata$SampleTag))
@
Now filter out the samples with conflicting replicate peak information using the \Robject{spectrumFilter} function:
<<>>=
TAGS <- spectrumFilter(Data,threshold,no.peaks)$SampleTag
NewRawData2 <- Data[Data$SampleTag %in% TAGS,]
dim(Data)
dim(liverdata)
dim(NewRawData2)
@
The output is similar to the \Robject{liverdata} included in this package. In the case of this dataset (the liver data), all technical replicates have coherent peak information, since no sample information has been discarded by the spectrum filter. Let us have a look at what the pre-processing does to samples with more than 2 replicate spectra. Both samples with IDs 25 and 40 have more than 2 replicates:
<<>>=
length(liverRawData[liverRawData$SampleTag == 25,]$Intensity)/no.peaks
length(liverRawData[liverRawData$SampleTag == 40,]$Intensity)/no.peaks
@
Take correlations of the log-intensities to find which two of the replicates have the most coherent peak information:
<<>>=
Mat1 <- matrix(liverRawData[liverRawData$SampleTag == 25,]$Intensity,53,3)
Mat2 <- matrix(liverRawData[liverRawData$SampleTag == 40,]$Intensity,53,4)
cor(log2(Mat1))
cor(log2(Mat2))
@
We see that the first two columns of both \texttt{Mat1} and \texttt{Mat2}, which contain the raw intensities for the samples with IDs 25 and 40, respectively, have the most similar peak information (the most highly correlated spectra). The function that picks the most similar duplicate spectra is \Robject{mostSimilarTwo}. Let us check that it correctly identifies the first two columns of both \texttt{Mat1} and \texttt{Mat2} as having the most coherent peak information:
<<>>=
Mat1 <- matrix(liverRawData[liverRawData$SampleTag == 25,]$Intensity,53,3)
Mat2 <- matrix(liverRawData[liverRawData$SampleTag == 40,]$Intensity,53,4)
sort(mostSimilarTwo(cor(log2(Mat1))))
sort(mostSimilarTwo(cor(log2(Mat2))))
@
Next, check that the pre-processed data, \Robject{NewRawData2}, contain similar information to \Robject{liverdata} (the already pre-processed data included in \Rpackage{clippda}):
<<>>=
names(NewRawData2)
dim(NewRawData2)
names(liverdata)
dim(liverdata)
setdiff(NewRawData2$SampleTag,liverdata$SampleTag)
setdiff(liverdata$SampleTag,NewRawData2$SampleTag)
summary(NewRawData2$Intensity)
summary(liverdata$Intensity)
@
\subsection{Data formatting}

Most computations in this package are performed on data arranged in a format that can be handled by the \Rpackage{limma} package function \Robject{duplicateCorrelation}; this format is generated using the function \Robject{sampleClusteredData}.
<<>>=
JUNK_DATA <- sampleClusteredData(NewRawData2,no.peaks)
head(JUNK_DATA)[,1:5]
@
Each column of this dataframe holds the duplicate expression measurements for one sample, with the two replicate values of each peak in consecutive rows. The first column of this dataframe, for example, is given by:
<<>>=
as.vector(t(matrix(liverdata[liverdata$SampleTag %in% 156,]$Intensity,53,2))[,1:5])
length(as.vector(t(matrix(liverdata[liverdata$SampleTag %in% 156,]$Intensity,53,2))))
as.vector(t(matrix(NewRawData2[NewRawData2$SampleTag %in% 156,]$Intensity,53,2))[,1:5])
length(as.vector(t(matrix(NewRawData2[NewRawData2$SampleTag %in% 156,]$Intensity,53,2))))
@
This is the vector of duplicate expression measurements of the 53 peaks on the sample with ID 156.
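Given this layout (one column per sample, with the duplicate measurements of each peak in consecutive rows), the consensus correlation between technical replicates could, in principle, be estimated with \Rpackage{limma}. The following is a minimal sketch, assuming an intercept-only design; it is not part of the \Rpackage{clippda} workflow:
<<eval=FALSE>>=
# Minimal sketch (not run): estimating the consensus correlation between
# duplicate peak measurements with limma, assuming the layout produced by
# sampleClusteredData (ndups = 2 duplicates of each peak, spacing = 1).
library(limma)
design <- matrix(1, ncol(JUNK_DATA), 1)   # intercept-only design
corfit <- duplicateCorrelation(log2(as.matrix(JUNK_DATA)), design,
                               ndups = 2, spacing = 1)
corfit$consensus.correlation              # intraclass correlation estimate
@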
We recommend that both the expression data and the phenotypic information be combined into a single coherent and well-documented data structure, by storing them in objects of the \texttt{aclinicalProteomicsData} class. We describe this in detail in the next section.

\section{Combining expression data and phenotypic information into a single object of the aclinicalProteomicsData class}

A clinical proteomic profiling study results in protein expression data, but there are often also records of phenotypic characteristics for the samples under study. Thus it is convenient to combine the datasets into a single well-annotated data structure with coherent dimensions across the data slots, as is done, for example, in objects of the \texttt{ExpressionSet} class in the \Rpackage{Biobase} package. In the \Rpackage{clippda} package, we have defined a new \texttt{object} class for clinical proteomic profiling data, known as \texttt{aclinicalProteomicsData}, which can be used to combine both expression data and phenotypic information from a clinical proteomic study into a single data structure which can be used directly in analyses. An \texttt{aclinicalProteomicsData} class \texttt{object} has \texttt{slots} for storing: a matrix of raw SELDI data (\texttt{rawSELDIdata}), a character vector of sample covariates (\texttt{covariates}), a matrix of phenotypic data (\texttt{phenotypicData}), a character vector storing classes for the phenotypic variables (\texttt{variableClass}), and a numeric value for the number of peaks of interest (\texttt{no.peaks}). I will now explain how to construct objects of the \texttt{aclinicalProteomicsData} class, which are the input data for many generic functions in the \Rpackage{clippda} package. First, we need to pre-process the ``raw'' replicate expression data from the Biomarker Wizard software to remove the inconsistencies caused by sample mislabelling. The code for doing this was given in the last section, and the pre-processed data were stored in the object \texttt{NewRawData2}. A data object of the \texttt{aclinicalProteomicsData} class is then created as follows:
<<>>=
OBJECT=new("aclinicalProteomicsData")
OBJECT@rawSELDIdata=as.matrix(NewRawData2)
#OBJECT@rawSELDIdata=as.matrix(liverdata)
OBJECT@covariates=c("tumor" , "sex")
OBJECT@phenotypicData=as.matrix(liver_pheno)
OBJECT@variableClass=c('numeric','factor','factor')
OBJECT@no.peaks=no.peaks
OBJECT
@
An object of the \texttt{aclinicalProteomicsData} class thus holds: a matrix of expression values, a matrix of phenotypic information, a character vector of covariates of interest, a character vector of classes for the phenotypic variables, and a numeric value for the number of peaks of interest.

\subsection{Extracting data from an \texttt{aclinicalProteomicsData} class object}

We can extract data from the various slots of an object of the \texttt{aclinicalProteomicsData} class using the command \texttt{object@slotname}. The \texttt{rawSELDIdata} may be obtained from an \texttt{aclinicalProteomicsData} class object using the command \texttt{proteomicsExprsData(object)}, and the \texttt{phenotypicData} using the command \texttt{proteomicspData(object)}. Here is an example of how to extract the expression and phenotypic datasets from an \texttt{aclinicalProteomicsData} object. We only show the first few rows of these datasets.
<<>>=
head(proteomicsExprsData(OBJECT))
head(proteomicspData(OBJECT))
@
\section{Sample size calculations}
\subsection{Obtaining the parameters}

Now we will use the data from an \texttt{aclinicalProteomicsData} class object to calculate the sample size required when planning clinical proteomic profiling studies. There are two possible sources of the data used in sample size calculations:

1. A pilot clinical proteomic profiling study.

2. Data {\em reused} from previous clinical proteomic profiling studies.

Because of the difficulty with reproducibility in proteomic profiling studies, I find it more reasonable to {\em reuse} data from previous studies (with hypotheses similar to those of the proposed study) as the pilot data. This is especially useful if it is possible to obtain data from multiple ``previous studies''. In this way, there is ``replication'', and one can use these multiple datasets to determine the range of plausible estimates of the parameters, which may be used to set up guidelines on the sample size required when planning a proteomic profiling study. The downside of this method is that for some studies the data on certain risk factors may be missing or incomplete, giving rise to incomplete sets of covariates. In this case, one can employ the method proposed by Nyangoma {\em et al.} (2009) to estimate the admissible effects of confounders.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
First, the data are transformed by taking logarithms to base two, and then used in calculating the sample size parameters. The parameters required in sample size calculations are: the biological variance, the technical variance, the difference to be estimated, the intraclass correlation between technical replicates, the significance level, and the power that needs to be attained. The power, the significance level, and the intraclass correlation (if known) are usually pre-specified. The rest of the parameters are computed from the pilot data (or data from previous studies) within the procedure for computing the sample size (\texttt{sampleSize}), as follows: (i) the biological variance, the differences to be estimated, and the significance of the peaks may be obtained using the command \Robject{betweensampleVariance(object)}; (ii) the technical variance is calculated using the command \texttt{sampletechnicalVariance(object)}.

\subsection{Graphical representation to aid in the selection of adjustment factors for the effect of sample imbalance}

Some graphics may aid in understanding the consequences of using certain designs when planning a clinical proteomics study. A plot of the $Z$ values against the ratios of the corresponding proportions of cases in the study to the proportions of controls (i.e. the odds of being a case) is the simplest tool that highlights the effects of sample imbalance on the sample size required when planning a proteomics study. The odds is a measure of the fold-change in the proportions of cases to controls. We feel that a choice of the proportions of cases to controls which gives odds greater than a 3-fold change indicates an extreme imbalance which is unlikely to be of practical use. This would lead to choices of $Z=2+c$, where $0\leq c\leq 1$. We have plotted several $Z$ values against their corresponding odds, and highlighted on this plot some values of the odds and the corresponding $Z$ for designs which are likely to be used in practice when planning a profiling study, for example: a balanced design (i.e. 1:1) and unbalanced designs (i.e. 1:3, 1:4 and 1:5).
This plot can be produced in \texttt{clippda} using the command:
<<>>=
probs=seq(0,1,0.01) # provide hypothetical proportions of cases vs controls
ZvaluescasesVcontrolsPlots(probs)
@
This command saves the plot as a pdf file, ZvaluescasesVcontrolsPlots.pdf, in the working directory. We append this plot here:
\begin{center}
\includegraphics{ZvaluescasesVcontrolsPlots.pdf}
\end{center}
If the other parameters (e.g. intraclass correlation, protein variance, significance level) are held constant, a balanced design gives rise to the minimum $Z$ value, $Z=2$, and hence the smallest sample size, compared to unbalanced designs. This means that you get the same power at a smaller sample size if a study is balanced. The visualization method becomes more complex as the dimension of the covariate space increases. When we have a binary exposure and a binary confounder, a cross-classification of individuals into these categories gives rise to a 3-dimensional (3D) multinomial distribution of the number of cases. When repeated multinomial sampling is done under a specific scheme, the proportions of elements in the various cells vary in the 3D space, and the $Z$ values describe an ellipsoid with respect to any 2D subspace of the 3D multinomial space. Three hundred individuals were proportionately sampled 10000 times in the ratio 1:1:1:1. The same individuals were again sampled 10000 times following an unbalanced design, 1:2:1:7. We first plotted the densities of the $Z$ values generated from these experiments, using the following command:
<<>>=
nsim=10000;nobs=300;proposeddesign=c(1,2,1,7);balanceddesign=c(1,1,1,1)
ZvaluesfrommultinomPlots(nsim, nobs, proposeddesign, balanceddesign)
@
The density plot indicates that the $Z$ values from a balanced design cluster around $Z=2$, while those of an unbalanced design cluster around $Z=2+c$, where here $c\approx 2$.
\begin{center}
\includegraphics{ZvaluesDensityPlots.pdf}
\end{center}
Thus, even in the simplest unbalanced design, with a binary exposure and one binary confounder, the choice of $Z=2$ leads to the smallest sample size, and this choice is appropriate only if the study is balanced. If some degree of imbalance is expected, then larger values of $Z$ must be chosen. The $Z$ values plotted against the elements of any 2D subspace of the 3D space of multinomial probabilities lead to the ellipsoids shown in the figure below:
\begin{center}
\includegraphics{Zvalues3DPlots.pdf}
\end{center}
Again, we can clearly see that $Z=2$ for a balanced study, and that $Z$ attains higher values for various degrees of imbalance.

\subsection{Sample size calculations}

The sample size can be computed using the function \Robject{sampleSize}. This function first computes the input parameters for the sample size calculation, using the function \texttt{sampleSizeParameters}. The other useful input parameter is the heterogeneity correction factor, $Z$. If full covariate information is known, then $Z$ is the second diagonal element of the expected Fisher information matrix obtained using the command \texttt{fisherInformation(object)}. However, if there is an incomplete set of sample covariates (or if no covariates can be found), it would be useful to derive a range of plausible values of $Z$ using the method proposed by Nyangoma {\em et al.} (2009). Here, we use the covariates that were available to us (i.e. cancer class, sex, and age) to gauge the degree of data imbalance. This estimate may be used as a guideline when setting up the range of possible values of the adjustment factors for data imbalance (i.e. the $Z$ values).
In these computations, it is assumed that only duplicate data are available.
<<>>=
intraclasscorr <- 0.60
signifcut <- 0.05
Data=OBJECT
sampleSizeParameters(Data, intraclasscorr, signifcut)
Z <- as.vector(fisherInformation(Data)[2,2])/2
Z
sampleSize(Data, intraclasscorr, signifcut)
@
Note that the value of $Z$ is only about $4\%$ above the nominal $Z$ value, $Z=2$, which is used in classical sample size calculations, but it has a noticeable effect on the sample size. For example, at $70\%$ power and a $5\%$ level of significance, the sample size has increased from 411 to 426. We have seen worse imbalances in many studies, where heterogeneities are above $40\%$ of the nominal value. For this dataset, the biological variance (\texttt{bioVar} = 3.2) is particularly large, and this has inflated the number of biological samples required. We believe that the sample size required could be much lower if the biological variance could be appropriately controlled.

\section{Display of the results}

The following commands will produce a contour plot of sample sizes over a range of values for clinically-important differences versus protein variances. The plot will be saved in your working directory. The sample size contours are generated from the outer product of the differences (0.1 to 0.5) and the variances (0.2 to 4.0). Combinations of the parameters in this region give sample size values that are likely to be achieved in practice. In contrast, parameters outside these ranges result in sample sizes that are too large to be of practical use. The intraclass correlations used are 0.70 and 0.90, the powers used are 0.70 and 0.90, and the significance level considered is 0.05. On the grid, we have plotted a number of sample sizes we computed from real-life data from several SELDI and MALDI proteomic profiling studies, for example: late-stage cancer (MALDI - green), early-stage cancer (SELDI - blue, MALDI - red), and colorectal serum with different SELDI chip-types (IMAC - purple, CM10 - gray, Q10 - orange). The description of the data can be found in \cite{Nyangomaetal2009}. These results can be used to formulate a general guideline for the number of samples required to plan a proteomic profiling study. You will immediately see that using fewer than 50 samples will not result in any meaningful estimation of differences. For late-stage cancer, you would need the fewest samples, even when using a very variable sample-type such as urine. You typically need more samples (over 200) to estimate differences between early-stage cancer and non-cancer controls. The output will contain four contour plots describing various combinations of the above parameters. You will notice that, as expected, for more power you need larger sample sizes.
<<>>=
m <- 2
DIFF <- seq(0.1,0.50,0.01)
VAR <- seq(0.2,4,0.1)
beta <- c(0.90,0.80,0.70)
alpha <- 1 - c(0.001, 0.01,0.05)/2
Corr <- c(0.70,0.90)
Z <- 2.4
Indicator <- 1
observedPara <- c(1,0.4) # the (variance, difference) pair from pilot data
#observedPara <- data.frame(var=c(0.7,0.5,1.5),diFF=c(0.37,0.33,0.43))
sampleSizeContourPlots(Z,m,DIFF,VAR,beta,alpha,observedPara,Indicator)
@
You may input parameters from your pilot study into the plot, e.g. the following hypothetical values:
<<>>=
observedVAR=1
observedDIFF=0.4
@
The result will be plotted as a triangle in your plot, as shown in the sketch below.
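As a brief illustration, these hypothetical pilot values can be passed back into the contour plot through the \texttt{observedPara} argument used earlier; the order (variance, then difference) follows the call above:
<<eval=FALSE>>=
# Sketch (not run): re-draw the contour plot with the hypothetical pilot
# values, supplied via observedPara in the order c(variance, difference).
observedPara <- c(observedVAR, observedDIFF)
sampleSizeContourPlots(Z, m, DIFF, VAR, beta, alpha, observedPara, Indicator)
@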
You may set these values to 0 if you do not have pilot data, in which case the values computed from my pilot studies (dotted on the plot) may be used to draw guidelines for the sample size required to plan a cancer proteomic profiling study. If you have what are believed to be appropriate clinically important differences, you can also determine the required sample sizes from the sample size contour plots. For example, if we assume that the clinically important difference is 0.35, then the sample sizes for given values of the protein variances are the values at the intersection of the sample size contours and the horizontal line with an intercept of 0.35 on the ``difference'' axis of the grid of differences versus variances (i.e. the blue dotted line). We include the plot generated using the above commands:
\begin{center}
\includegraphics{SampleSizeContourPlot.pdf}
\end{center}
You may also visualize your results using 3D scatterplots, through the \Robject{sampleSize3DscatterPlots} function, as follows:
<<>>=
Z <- 2.460018
m <- 2
DIFF <- seq(0.1,0.50,0.01)
VAR <- seq(0.2,4,0.1)
beta <- c(0.90,0.80,0.70)
alpha <- 1 - c(0.001, 0.01,0.05)/2
observedDIFF <- 0.4
observedVAR <- 1.0
observedSampleSize <- 80
Indicator <- 1
Angle <- 60
sampleSize3DscatterPlots(Z,m,DIFF,VAR,beta,alpha,observedDIFF,observedVAR,observedSampleSize,Angle,Indicator)
@
A plot from our examples is given below:
\begin{center}
\includegraphics{SampleSize3DscatterPlot.pdf}
\end{center}

\bibliography{LittBib}
\bibliographystyle{abbrvnat}

\end{document}