Information

What to do if microarray t-test, ANOVA, SAM and LImma show various selected significant genes?


Need advice: how to approach discrepancy in differential microarray gene expression test results: what to do if ANOVA, ttest, SAM and Limma procedures show different results and especially more discrepant when using log transformations and normalization operations?


I wouldn't expect different methods to give the same results. Further, why are you even testing non-normalized datasets, the results of that are completely and utterly useless for any purpose other than showing that normalization is important. In addition, an T-test is a special case of an ANOVA (and of course limma is itself using a moderated T-test, though it's going to have significantly more power than the others), so I have to ask myself exactly what sort of design you're using that it's appropriate to substitute one for the other and still get apparently vastly different results.

In general, I would strongly encourage you to work with a local bioinformatician or statistician for what is presumably your first microarray analysis, particularly if you don't have a strong statistics background.


Tessa A. Morris General Microarray Data Analysis

The purpose of today's lab was to normalize the within and between chip data and then perform a statistical analysis on each time point in order to determine if there is anything significant in the data.

Helpful Links from Dr. Dahlquist

Methods

Setup Wiki

Your User Page: Set up your individual user page on this wiki (accessible via your username at the top of the page). Your user page should take the form of a résumé or, in academic circles, a curriculum vitae. Dr. Dahlquist has such pages, both within this wiki (user page) and online in general. You may use those as starting points. As students, your information may be different from ours. OpenWetware automatically fills in your user page with automated content that may not apply to you. You will need to delete any unneeded information from the automated content and add the following:

  1. Name
  2. Contact Information (e-mail address and LMU mail address)
  3. Personal Information: Education, Major, Expected graduation year, Upper division courses taken, Career interests and goals
  4. Independent research projects: Title of project, Mentor's name, Presentation, Publications
  5. Work experience: Position/title, employer, dates, responsibilities
  6. Personal interests/hobbies: What is your favorite aspect of biology and why? What is your favorite aspect of mathematics and why?
Practice your Wiki Skills

The previous sections listed the content that you need to provide on the wiki. In formatting your pages, demonstrate all of the following skills. Find a way to make integrate them naturally into the content (e.g., do not say “Here is an image.” and put just any image on the page).

  1. Every time you edit a page (whether it is a content page or discussion page), enter a meaningful description of your change in the Summary field at the bottom of the editor. This allows other users to easily see (say via the Special:RecentChanges or history pages) what has happened to the page since they last visited it.
  2. Create a new Wiki page: [[new page title]] — When you include a non-existent link in a page (say, your user page), the software can tell that this page doesn't exist and colors it red instead of blue/purple. When you click on the red link, you are then given the option to edit (and thus create) the page.
    • We suggest you practice this by creating your Week 2 journal entry page. The name for the page should be in the format "username Week 2" (i.e., that is the text you put between the square brackets when you link to this page).
  3. Link to a page within our Wiki: [[page title|optional visible label]]
    • Go to the People and link your name to your own user page.
  4. Link to an external Web page: http://address or [http://address visible label]
    • The second form of the link is preferred because it looks neater on the page.
  5. Use headings: === title === (number of equals signs indicates heading level)
    • By convention, start your largest heading with two equals signs. The single equals sign is for the title of the page and is automatically created when you create the page.
  6. Create a bulleted list: *
    • Note that you can create sub-bullets underneath by using multiple asterisks, e.g., **, ***, etc.
  7. Create a numbered list: #
    1. Note that you can create numbered sub-lists by using multiple number signs, e.g., ##, ###, etc.
    2. You can also mix bullets and numbers, e.g., *#, #*, or even #*#, etc.
    3. Do not skip lines between your bulleted or numbered lists, or the wiki will not interpret your syntax correctly.
    • Use the image on your page: [[Image:exact-name-of-image-file]]
    • REMEMBER: DO NOT SUBMIT COPYRIGHTED WORK WITHOUT PERMISSION! We suggest you include an image of yourself that would be suitable for a professional resume.
    • Link to the file you uploaded on your Wiki page: [[Media:exact-name-of-uploaded-file|visible label]]
    • REMEMBER: DO NOT SUBMIT COPYRIGHTED WORK WITHOUT PERMISSION! We suggest that you include something professional, such as the Word or PDF version of your paper resume, a scientific paper you have written, etc.
    • Throughout the course, you will use the category [[Category:BIOL98-04/S15]] for all of the pages you create.
    • You can fulfill this by posting your comment on Dr. Dahlquist's user talk page.
    • Create your template page like you would create any other new wiki page, but using the prefix Template: as part of the page name. For example, your template should be called [[Template:username]].
    • Click on the link and put content on this page that you will want to use over and over again. At the minimum, you should use it to create a set of navigation links that you will use in each week's journal entry. Each week as part of your journal assignment, you will be asked to create a link to your user page, the assignment page, your journal entry page, and the shared journal page, as well as add the category "BIOL368/F14" to your page. If you put these links on your template and then invoke the template on your journal page, this will automatically be taken care of for you. You may also wish to include any other links that you would find useful.
    • Once you have added and saved the content to your Template page, you need to use your template on your user page. To do so, invoke the template by using the following syntax: <> in the place you wish the content of the template page to appear. This will "expand" the template to its full contents on the actual page.
    Electronic Lab

    For all of the computer work that you do for research, you will keep an electronic laboratory notebook that records all the manipulations you perform on the data, the results you get, and the interpretations of the data. You should name this page with your username and the phrase Electronic Lab Notebook. You can keep one long page, organized by the date that you did the work. It is preferable to copy and paste protocols to this page, and then modify them with the specifics of what you did. The idea behind any lab notebook is that you or someone else can reproduce what you did using only the information contained in your notebook.

    File Extensions Note

    It is important to make the file extension type visible before starting work with different file types. Before starting work on the computer:

    Set Google Chrome to Prompt for the Location to Save Downloaded Files
    • Open the Settings window.
      • Click on the link at the bottom of the page that says "Advanced Settings".
      • Check the box that says "Ask where to save each file before downloading".
      • You could also change the default Download location to your Desktop, so that will be the first choice when it prompts you where to save the file.
      • Your settings are automatically saved.
      Steps 1-3: Generating Log2 Ratios with GenePix Pro
      • The protocol for gridding and generating the intensity (log2 ratio) data with GenePix Pro 6.1 is found on this page.
      • This protocol will generate a *.gpr file for each chip which is then fed into the normalization protocol below.
      Steps 4-5: Within- and Between-chip Normalization

      A more detailed protocol can be found on this page.

      Installing R 3.1.0 and the limma package

      The following protocol was developed to normalize GCAT and Ontario DNA microarray chip data from the Dahlquist lab using the R Statistical Software and the limma package (part of the Bioconductor Project).

      • The normalization procedure has been verified to work with version 3.1.0 of R released in April 2014 (link to download site) and and version 3.20.1 of the limma package ( direct link to download zipped file) on the Windows 7 platform.
        • Note that using other versions of R or the limma package might give different results.
        • Note also that using the 32-bit versus the 64-bit versions of R 3.1.0 will give different results for the normalization out in the 10 -13 or 10 -14 decimal place. The Dahlquist Lab is standardizing on using the 64-bit version of R.

        Running the Normalization Scripts

        • Create a folder on your Desktop to store your files for the microarray analysis procedure.
        • Download the zipped file that contains the .gpr files and save it to this folder (or move it if it saved in a different folder).
          • Unzip this file using 7-zip. Right-click on the file and select the menu item, "7-zip > Extract Here".

          Within Array Normalization for the Ontario Chips

          • Launch R x64 3.1.0 (make sure you are using the 64-bit version).
          • Change the directory to the folder containing the targets file and the GPR files for the Ontario chips by selecting the menu item File > Change dir. and clicking on the appropriate directory. You will need to click on the + sign to drill down to the right directory. Once you have selected it, click OK.
          • In R, select the menu item File > Source R code. and select the Ontario_Chip_Within-Array_Normalization_modified_20150514.R script.
            • You will be prompted by an Open dialog for the Ontario targets file. Select the file Ontario_Targets_wt-dCIN5-dGLN3-dHAP4-dHMO1-dSWI4-dZAP1-Spar_20150514.csv and click Open.
            • Wait while R processes your files.

            Within Array Normalization for the GCAT Chips and Between Array Normalization for All Chips

            • These instructions assume that you have just completed the Within Array Normalization for the Ontario Chips in the section above.
            • In R, select the menu item File > Source R code. and select the Within-Array_Normalization_GCAT_and_Merged_Ontario-GCAT_Between-Chip_Normalization_modified_20150514.R script.
              • You will be prompted by an Open dialog for the GCAT targets file. Select the file GCAT_Targets.csv and click Open.
              • Wait while R processes your files.
              • Save these files to LionShare and/or to a flash drive.

              Visualizing the Normalized Data

              Create MA Plots and Box Plots for the GCAT Chips

              Input the following code, line by line, into the main R window. Press the enter key after each block of code.

              • If you get a message saying "NaNs produced" this is OK, proceed to the next step.
              • If you get these Warning messages, it's OK:
              • Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality. ). Once the graphs have been saved, close the window. To continue with the rest of the code, type the letter "c" and press Enter
              • Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality. ). Once the graphs have been saved, close the window. To continue with the rest of the code, type the letter "c" and press Enter.
              • Maximize the window in which the plots have appeared. You may not want to actually maximize them because you might lose the labels on the x axis, but make them as large as you can. Save the plots as a JPEG (File>Save As>JPEG>100% quality. ). Once the graphs have been saved, close the window.

              Create MA Plots and Box Plots for the Ontario Chips

              Input the following code, line by line, into the main R window. Press the enter key after each block of code.

              • Warning message: "NaNs produced" is OK.
              • Warning messages are OK:
              • After entering the call browser() below, maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality. ). Once the graphs have been saved, close the window and press Enter for the next set of graphs to appear.
              • To continue generating plots, type the letter c and press enter.
              • Remember, that after entering the call readline('Press Enter to continue'), maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality. ). Once the graphs have been saved, close the window and press Enter for the next set of graphs to appear.
              • To continue generating plots, type the letter c and press enter.
              • To continue generating plots, type the letter c and press enter.
              • Warnings are OK.
              • Zip the files of the plots together and upload to LionShare with file name 20150518_Microarray_Analysis_TM.zip
                • Also include the output files GCAT_and_Ontario_Final_Normalized_Data and GCAT_and_Ontario_Within_Array_Normalization
                Step 6: Statistical Analysis
                • For the statistical analysis, we will begin with the file "GCAT_and_Ontario_Final_Normalized_Data.csv" that you generated in the previous step.
                • Open this file in Excel and Save As an Excel Workbook *.xlsx . It is a good idea to add your initials and the date (yyyymmdd) to the filename as well.
                • Rename the worksheet with the data "Compiled_Normalized_Data".
                  • Type the header "ID" in cell A1.
                  • Insert a new column after column A and name it "Standard Name". Column B will contain the common names for the genes on the microarray.
                    • Copy the entire column of IDs from Column A.
                    • Paste the names into the "Value" field of the ORF List <-> Gene List tool in YEASTRACT. Then, click on the "Transform" button.
                    • Select all of the names in the "Gene Name" column of the resulting table.
                    • Copy and paste these names into column B of the *.xlsx file. Save your work.
                    • Type a "1" in cell A2 and a "2" in cell A3.
                    • Select both cells. Hover your mouse over the bottom-right corner of the selection until it makes a thin black + sign. Double-click on the + sign to fill the entire column with a series of numbers from 1 to 6189 (the number of genes on the microarray).
                    • Copy the first three columns of the "Compiled_Normalized_Data" sheet and paste it into the first three columns of the "Rounded_Normalized_Data" sheet.
                    • Copy the first row of the "Compiled_Normalized_Data" sheet and paste it into the first row of the "Rounded_Normalized_Data" sheet.
                    • In cell D2, type the equation =ROUND(Compiled_Normalized_Data!D2,4) .
                    • Copy and paste this equation in the rest of the cells of row 2.
                    • Select all of the cells of row 2 and hover your mouse over the bottom right corner of the selection. When the cursor changes to a thin black "plus" sign, double-click on it to paste the equation to all the rows in the worksheet. Save your work.
                    • Go back to the "Rounded_Normalized_Data" sheet and Select All and Copy.
                    • Click on cell A1 of the "Master_Sheet" worksheet. Select Paste special > Paste values to paste the values, but not the formulas from the previous sheet. Save your work.
                    • There will be some #VALUE! errors in cells where there was missing data for genes that existed on the Ontario chips, but not the GCAT chips.
                      • Select the menu item Find/Replace and Find all cells with "#VALUE!" and replace them with a single space character. Record how many replacements were made to your electronic lab notebook. Save your work.

                      Within-strain ANOVA

                      • The purpose of the witin-stain ANOVA test is to determine if any genes had a gene expression change that was significantly different than zero at any timepoint.
                      • Each student in the lab group will be assigned one strain to analyze from this point forward.
                        • Anu & Natalie (wt)
                        • Tessa & Trixie (dcin5)
                        • McGee & Nicole (dgln3)
                        • Monica & Grace (dhap4)
                        • Wyllie & Dr. D (dswi4)
                        1. Create a new worksheet, naming it "dCIN5_ANOVA"
                        2. Copy all of the data from the "Master_Sheet" worksheet for your strain and paste it into your new worksheet.
                        3. At the top of the first column to the right of your data, create five column headers of the form dCIN5_AvgLogFC_(TIME) where (TIME) is 15, 30, 60, 90, and finally 120.
                        4. In the cell below the dCIN5_AvgLogFC_t15 header, type =AVERAGE(
                        5. Then highlight all the data in row 2 associated with dCIN5 and t15, press the closing paren key (shift 0),and press the "enter" key.
                        6. This cell now contains the average of the log fold change data from the first gene at t=15 minutes.
                        7. Click on this cell and position your cursor at the bottom right corner. You should see your cursor change to a thin black plus sign (not a chubby white one). When it does, double click, and the formula will magically be copied to the entire column of 6188 other genes.
                        8. Repeat steps (4) through (8) with the t30, t60, t90, and the t120 data.
                        9. Now in the first empty column to the right of the dCIN5_AvgLogFC_t120 calculation, create the column header dCIN5_ss_HO.
                        10. In the first cell below this header, type =SUMSQ(
                        11. Highlight all the LogFC data in row 2 for your dCIN5 (but not the AvgLogFC), press the closing paren key (shift 0),and press the "enter" key.
                        12. In the next empty column to the right of dCIN5_ss_HO, create the column headers dCIN5_ss_(TIME) as in (3).
                        13. Make a note of how many data points you have at each time point for your strain. For dHAP4, it will be "3", but for the wild type it will be "4" or "5". Count carefully. Also, make a note of the total number of data points. For dHAP4, this number will be 15, but for wt it should be 23 (double-check).
                        14. In the first cell below the header dCIN5_ss_t15, type =SUMSQ(<range of cells for logFC_t15>)-<number of data points>*<AvgLogFC_t15>^2 and hit enter.
                          • The phrase <range of cells for logFC_t15> should be replaced by the data range associated with t15.
                          • The phrase <number of data points> should be replaced by the number of data points for that timepoint (4).
                          • The phrase <AvgLogFC_t15> should be replaced by the cell number in which you computed the AvgLogFC for t15, and the "^2" squares that value.
                          • Upon completion of this single computation, use the Step (7) trick to copy the formula throughout the column.
                        15. Repeat this computation for the t30 through t120 data points. Again, be sure to get the data for each time point, type the right number of data points, and get the average from the appropriate cell for each time point, and copy the formula to the whole column for each computation.
                        16. In the first column to the right of dCIN5_ss_t120, create the column header dCIN5_SS_full.
                        17. In the first row below this header, type =sum(<range of cells containing "ss" for each timepoint>) and hit enter.
                        18. In the next two columns to the right, create the headers dCIN5_Fstat and dCIN5_p-value.
                        19. Recall the number of data points from (13): call that total n.
                        20. In the first cell of the dCIN5_Fstat column, type =((20-5)/5)*(<dCIN5_ss_HO>-<dCIN5_SS_full>)/<dICN5_SS_full> and hit enter.
                          • Replace the phrase dCIN5_ss_HO with the cell designation.
                          • Replace the phrase <dCIN5_SS_full> with the cell designation.
                          • Copy to the whole column.
                        21. In the first cell below the dCIN5_p-value header, type =FDIST(<dCIN5_Fstat>,5,n-5) replacing the phrase <dCIN5_Fstat> with the cell designation and the "n" as in (13) with the number of data points total. Copy to the whole column.
                        22. Before we move on to the next step, we will perform a quick sanity check to see if we did all of these computations correctly.
                          • Click on cell A1 and click on the Data tab. Select the Filter icon (looks like a funnel). Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
                          • Click on the drop-down arrow on your dCIN5_p-value column. Select "Number Filters". In the window that appears, set a criterion that will filter your data so that the p value has to be less than 0.05.
                          • Excel will now only display the rows that correspond to data meeting that filtering criterion. A number will appear in the lower left hand corner of the window giving you the number of rows that meet that criterion. We will check our results with each other to make sure that the computations were performed correctly.

                        Calculate the Bonferroni and p value Correction

                        1. Now we will perform adjustments to the p value to correct for the multiple testing problem. Label the next two columns to the right with the same label, dCIN5_Bonferroni_p-value.
                        2. Type the equation =<dCIN5_p-value>*6189 , Upon completion of this single computation, use the Step (10) trick to copy the formula throughout the column.
                        3. Replace any corrected p value that is greater than 1 by the number 1 by typing the following formula into the first cell below the second dCIN5_Bonferroni_p-value header: =IF(AL2>1,1,AL2) . Use the Step (10) trick to copy the formula throughout the column.

                        Calculate the Benjamini & Hochberg p value Correction

                        1. Insert a new worksheet named "dCIN5_B&H".
                        2. Copy and paste the "MasterIndex" and "ID" columns from your previous worksheet into the first two columns of the new worksheet.
                        3. For the following, use Paste special > Paste values. Copy your unadjusted p values from your ANOVA worksheet and paste it into Column C.
                        4. Select all of columns A, B, and C. Sort by ascending values on Column C. Click the sort button from A to Z on the toolbar, in the window that appears, sort by column C, smallest to largest.
                        5. Type the header "Rank" in cell D1. We will create a series of numbers in ascending order from 1 to 6189 in this column. This is the p value rank, smallest to largest. Type "1" into cell D2 and "2" into cell D3. Select both cells A2 and A3. Double-click on the plus sign on the lower right-hand corner of your selection to fill the column with a series of numbers from 1 to 6189.
                        6. Now you can calculate the Benjamini and Hochberg p value correction. Type dCIN5_B-H_p-value in cell E1. Type the following formula in cell E2: =(C2*6189)/D2 and press enter. Copy that equation to the entire column.
                        7. Type "dCIN5_B-H_p-value" into cell F1.
                        8. Type the following formula into cell F2: =IF(E2>1,1,E2) and press enter. Copy that equation to the entire column.
                        9. Select columns A through F. Now sort them by your MasterIndex in Column A in ascending order.
                        10. Copy column F and use Paste special > Paste values to paste it into the next column on the right of your ANOVA sheet.
                        • Upload the .xlsx file that you have just created to LionShare. Send Dr. Dahlquist an e-mail with the link to the file. File name GCAT_and_Ontario_Final_Normalized_Data_20150518_TM.xlsx

                        Sanity Check: Number of genes significantly changed

                        • Go to your dCIN5_ANOVA worksheet.
                        • Select row 1 (the row with your column headers) and select the menu item Data > Filter > Autofilter (The funnel icon on the Data tab). Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
                        • Click on the drop-down arrow for the unadjusted p value. Set a criterion that will filter your data so that the p value has to be less than 0.05.
                          • How many genes have p < 0.05? and what is the percentage (out of 6189)?
                          • How many genes have p < 0.01? and what is the percentage (out of 6189)?
                          • How many genes have p < 0.001? and what is the percentage (out of 6189)?
                          • How many genes have p < 0.0001? and what is the percentage (out of 6189)?
                          • How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 6189)?
                          • How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 6189)?

                          Data & Observations

                          • 1995 / 6189 32.23%
                          • 1157 / 6189 18.69%
                          • 566 / 6189 9.15%
                          • 280 / 6189 4.52%
                          • 109 / 6189 1.76%
                          • 1117 / 6189 18.05%
                          • unadjusted p-value: 7.49666059264864E-08
                          • Bonferroni-corrected p-value: 0.000463968324079024
                          • B-H-corrected p-value: 0.0000386640270065854
                          • dCIN5_AvgLogFC_15: 4.046975
                          • dCIN5_AvgLogFC_30: 3.39825
                          • dCIN5_AvgLogFC_60: 4.2347
                          • dCIN5_AvgLogFC_90: -2.8035
                          • dCIN5_AvgLogFC_120: -0.948275

                          Files Uploaded or Updated

                          Initial_Statistical_Analysis_Slide_TM.pptx statistical analysis for timepoints in general
                          GCAT_and_Ontario_Final_Normalized_Data_20150518_TM.xlsx (Lionshare) contains Within-strain ANOVA, Bonferroni and p value Correction, and Benjamini & Hochberg p value Correction


                          1 Background

                          Microarray technology is a powerful genomic approach that enables researchers to quantify the expression levels of large numbers of genes simultaneously in one single experiment. Arrays can be single-channel (one-colour, cf. Affymetrix technology), which quantify the absolute expression of genes in specific experimental conditions, or two channel (two-colour, cf. cDNA technology). A key purpose of a two-colour microarray experiment is the identification of genes which are differentially expressed in two samples. Although this technology has given an enormous scientific potential in the comprehension of gene regulation processes, many sources of systematic variation can affect the measured gene expression levels. The purpose of data normalisation is to minimise the effects of experimental and/or technical variations, so that meaningful biological comparisons can be made and true biological changes can be found within one and among multiple experiments. Several approaches have been proposed and shown to be effective and beneficial in the reduction of systematic errors within and between arrays, both for single- and for double-channel technology [1-3]. Some authors proposed normalisation of the hybridisation intensities, while others preferred to normalise the intensity ratios. Some used global, linear methods, while others used local, non-linear methods. Some suggested using spike-in controls, or housekeeping genes, or invariant genes, while others preferred all the genes on the array.

                          In general, microarray normalisation can be divided into normalisation within arrays, for the correction of dye effects, and across arrays, for the balance of the distribution differences among experiments. Several pre-processing techniques recently proposed for two-channel technology allow the joint normalisation within and across experiments, as reported in the original papers ([4] for the vsn/glog and [5] for the q-splines). Glog and q-spline transformations, in fact, are performed on the gene expression matrix where the two channels are considered separately, allowing systematic bias reduction within and across arrays. Although several normalisation procedures have been proposed, it is still unclear which method uniformly outperforms the others under different experimental conditions. Recent works [6-8] compare, through simulated data, normalisation methods in terms of bias, variance, mean square error or leave-one-out cross-validation classification error. If we consider the two-channel technology, Park et al. [7] show that, in some cases, intensity dependent normalisation performs better than the simpler global normalisation, while [3,9] raised the concern that removal of spatial effects may add additional noise to normalised data, suggesting that a safe alternative is to remove the intensity effect only at a local level. Thus, the evaluation of normalisation's effects in microarray data analysis is still an important issue, since subsequent analyses, such as tests for differential expression, could be highly dependent on the choice of the normalisation procedure. For example, Durbin et al. [10] show that the log-transformed expression ratio has a greatly inflated variance for expression values close to 0. This effect penalises differential expression, especially for high expression levels. Hypothesis tests for differential expression may in fact be more effectively performed on data that have been transformed so as to have constant variance. Hoffman and colleagues [11] compare the effect of different normalisations on the identification of differentially expressed genes within Affymetrix technology and using a real dataset. They observe, by comparing lists of genes, that the normalisation has a profound influence on the detection of differentially expressed genes.

                          Moreover, the MicroArray Quality Control (MAQC) [12] project, which is specifically designed to address reproducibility of microarray technology by comparing results obtained across different array platforms, chooses the statistical analysis on the base of the normalisation and gene selection technique as the crucial steps in order to improve reproducibility [13].

                          When microarray experiments are adopted with diagnostic purposes, this result appears to be fundamental because scientists are looking for a list of a few pathology marker genes. Marker genes can be defined as genes whose expression profiles are discriminating between case and control samples. It is likely to suppose that the complete list of markers of a condition is composed by hundreds of genes, highly correlated and mostly implicated in few signalling cascades. Only few of them lie upstream these signalling cascades and are responsible of the differential expression of all the others genes. Hence, if different pre-processing have an impact on the identification of differentially expressed genes, they could lead to different lists of markers. The aim of this work is to compare and evaluate the impact of various normalisation procedures proposed for two-channel array technology on the identification of marker genes. We shall use both simulated and real data derived by cDNA and oligo microarray (two-colour technique).

                          The use of a simulation approach allows us to study the sensitivity and specificity of the tests after normalisation and to compare different approaches' performances. However, simulation of DNA microarray data can be questioned, mainly because (i) the relation between expression and experimental factors involved is not theoretically established, and (ii) the statistical distribution of differential expression given by various causes across genes is still controversial. In order to address such issues, we adopt two different classes of simulation models.

                          Although we found a limited difference of sensitivities and specificities for the tests after each normalisation, the study highlights a strong impact in terms of gene ranking, resulting in different levels of agreement between competing normalisations. Finally, we show that the combination of two normalisations, such as glog and lowess, that handle different aspects of microarray data, is able to outperform the other individual techniques.


                          Methods

                          Here the extension of the EB-LNN model assuming gene-specific variances[8, 30] by β-divergence, which we have called the β-EB approach, for the identification of DE genes, is discussed. The simulated and real microarray gene expression datasets that we have analyzed to investigate the performance of the proposed method are also described.

                          Empirical Bayes hierarchical model

                          If the transcript-specific parameter θ t = ( μ t , θ t ∗ ) , where μ t and θ t ∗ are the location and scale parameters respectively, then the conditional likelihood of the t th transcript’s expression measurement y t= (yt 1,yt 2,…,y tn) can be expressed as ∏ i = 1 n f obs y ti | θ t (t=1,2,…,T). The location parameter μ t follows the prior distribution, Π(μ t|θ), where θ is the hyper-parameter specifying the prior distribution. The predictive likelihood of y t (unconditional on the location parameter μ t) is obtained by integrating over the location parameter, μ t, as follows:

                          When expression measurements between two groups (for example, different cell types) are compared for transcript t, the measurements are partitioned into two user defined groups G1 and G2 of sizes n1 and n2 respectively, where n1 + n2 =n. If there is no significant difference between the means of the two groups, the gene is assumed to be equivalently expressed (EE) otherwise, it is assumed to be a DE gene. If the t th transcript is DE, the two groups will have different mean expression levels, μ t ( j ) , j = 1 , 2 . Given the values of μ t ( j ) , j = 1 , 2 and θ t ∗ , the conditional likelihood of y t = y t ( 1 ) : y t ( 2 ) is written as follows:

                          because components of y tare independent of each other. Assuming that the group means μ t ( j ) , j = 1 , 2 (such that μ t ( 1 ) ≠ μ t ( 2 ) ) independently originate from Π(μ t|θ), then the predictive likelihood of y t (unconditional on the location parameters μ t ( j ) , j = 1 , 2 ) is obtained as a mean of the conditional likelihood of y t(2) over the prior distribution of μ t ( 1 ) and μ t ( 2 ) as follows:

                          Because it is unknown whether the t th gene is EE or DE between the two groups, the final likelihood of y t(unconditional on the location parameters) becomes a mixture of two distributions (1) and (3) as follows:

                          Here, p0 and p1 are the mixing proportions of the EE and DE transcripts in the two user defined groups respectively, such that p0 + p1 = 1. The posterior probability of differential expression (PPDE) is calculated by Bayes rule using the estimates of p0, f0 and f1 as follows:

                          It should be noted here that θ and θ t ∗ in equations (1)-(5) are assumed to be exactly the same.

                          Maximum β-likelihood estimation of mixture distribution using an EM-like algorithm to calculate β-posterior probabilities of differential expressions

                          Box and Cox[38] proposed a family of power transformations of the dependent variable in regression analysis to robustify the normality assumption. By choosing an appropriate value of λ in the transformation,

                          the standard linear regression model with the normality assumption fits well to a wide range of data. Inspired by this idea, Basu et al[36] and Minami and Eguchi[37] proposed a robust and efficient method for estimating model parameter θ by minimizing a density power divergence in a general framework of statistical modeling and inference. They[36, 37] have also shown that minimizer of density power divergence is equivalent to the maximizer of β-likelihood function. According to the current problem in this paper, the β-likelihood function for θ given the values of the mixing parameter p0 = 1 − p1 and the gene specific scale parameter θ t ∗ for all t can be written as

                          where f(.) is the mixture of distributions as defined in (4) and l β ( θ ) = 1 1 + β ∫ f β + 1 ( y | θ , θ t ∗ , p 0 ) d y − β − 1 β which is independent of observations. Because the gradient of (6) can be converted as follows,

                          the maximum β-likelihood estimator (β-MLE) of θ can be regarded as a weighted (quasi-) likelihood estimator. Then the weight of gene t is described as a power function of its likelihood, f β ( y t | θ , θ t ∗ , p 0 ) , where f(.) is defined by equation (4). Thus, the genes with low likelihoods have unexpected expression patterns and have low weights because the normal density function produces smaller outputs for larger inputs. By assigning low weights to outliers, the inference becomes robust. It is obvious from (7) that β-MLE reduces to the classical MLE for β = 0. Because the expression pattern (EE or DE) of each gene is unknown, it is difficult to optimize both the classical log-likelihood function and the proposed β-likelihood function for directly estimating θ. To overcome this problem, we consider the EM-like algorithm to obtain β-MLE of θ treating the mixture distribution (4) as an incomplete-data density. The hyper-parameters θ and the mixing proportion p0 are estimated by EM algorithm as follows:

                          The hyperparameters, θ,p0 are estimated by the EM algorithm in two steps. E-step: Compute the Q-function which is defined by the conditional expectation of the complete-data β-likelihood with respect to the conditional distribution of missing data (Z) given the observed data (Y) and the current estimated parameter value θ β ( j ) as follows:

                          where k = 0 for y t belongs to EE pattern and k = 1 for y t belongs to DE pattern. Here

                          which does not depend on observations,

                          is the posterior probability of k th pattern for gene t and the value of p1 = 1 − p0 is updated by a separate EM formulation as follows:

                          For β → 0 , the proposed Q-function Q β ( θ | θ ( j ) ) reduces to the standard Q-function Q(θ|θ (j) ) of the standard empirical Bayes approaches[8, 30].

                          M-step: Find θ (j + 1) by maximizing the proposed Q-function as defined in (8). Continue EM iterations up to the convergence of successive estimates of θ. The estimate of θ after convergence is taken to be the β-MLE of θ according to the EM properties.

                          The tuning parameter, β, controls the balance between the robustness and efficiency of the estimators. By setting a tentative value for β0, the optimal value is estimated by maximizing the predictive β0-likelihood via a five-fold cross validation. The dataset is divided into five subsets by transcripts. For each value of β, the predictive β0-likelihood of each subset is calculated based on the maximum β-likelihood estimates of the parameters based on the rest of the data. Finally, the β value that maximizes the average predictive β0-likelihood is selected as the optimal value of β. For more information about β-selection, please see[39, 40].

                          Then, based on the estimate values of the model parameters, we can compute the PPDE between two groups of y t using equation (5) for all t. However, PPDE of contaminated gene using equation (5) might be produced misleading result, since PPDE of y t depends on the estimate values of parameters and measurements of y t. To overcome this problem, we detect contaminated genes using β-weight function and replace the contaminated measurements in y tby its group means. Then we compute the PPDE of contaminated y t using equation (5) also. The PPDE based on β-MLE, we call β-PPDE in this paper. The detail discussion for computation of β-PPDE under LNN model is discussed below in the LNN model.

                          The LNN model

                          In this paper, we use the LNN (log-normal-normal) hierarchical model for computing the posterior probability of differential expressions. In the LNN model, log-transformed gene expression measurements are assumed to follow normal distribution for each gene with the transcript-specific parameter θ t = ( μ t , θ t ∗ ) , where μ t is the transcript-specific mean and θ t ∗ = σ t 2 is the transcript-specific variance for gene t[8, 30]. A conjugate prior for μ t is assumed to follow the normal with some underlying mean μ0and variance τ 0 2 that is, Π μ t | θ ∼ N ( μ 0 , τ 0 2 ) , where θ = ( μ 0 , τ 0 2 ) . By integrating as in (1), the density f0(·) for an n-dimensional input becomes Gaussian with the mean vector μ0 = ( μ 0 , μ 0 , … , μ 0 ) t and an exchangeable covariance matrix as follows:

                          where I n is an n × n identity matrix and M nis a matrix of ones.

                          The gene specific variance σ t 2 is computed separately assuming prior distribution for σ t 2 as scale-inverse χ 2 ( ν ∗ , σ ∗ 2 ) , where ν is the degrees of freedom and σ ∗ 2 is the scaled parameter. Yang et al.[30] proposed that σ t 2 could be estimated by a Bayes estimator defined as,


                          2 METHODS

                          2.1 Definition of the model

                          In the problem we are considering there are normally two or more variables of interest. One of them is typically the time, which is a quantitative variable (in the type of experiments considered for this approach, time is usually the independent variable, however the methodology would accept as well other experimental continuous variables, such as a quantified physiological parameter). The other variables are usually qualitative variables (e.g. different treatments, strains, tissues, etc.) and represent the experimental groups for which temporal gene expression differences are sought. For clarity in the exposition, only one qualitative variable or factor will be considered here.

                          Let there be I experimental groups described by the qualitative variable evaluated at J time points for each particular condition ij (i = 1, … , I and j = 1, … , J). Assume that gene expression is measured for N genes in Rij replicated hybridizations.

                          We define I − 1 dummy variables (binary variables) to distinguish between each group and a reference group ( Table 1).

                          Definition of experimental groups with dummy variables

                          Definition of experimental groups with dummy variables

                          This model defines implicitly as many models as experimental groups. For example, the model for the first group is y 1 j r = β 0 + δ 0 T 1 j r + γ 0 T 1 j r 2 + ⋯ + λ 0 T 1 j r J − 1 + ɛ 1 j r ⁠ , since in this group all the dummies are 0 and for the second group is y 2 j r = ( β 0 + β 1 ) + ( δ 0 + δ 1 ) T 2 j r + ( γ 0 + γ 1 ) T 2 j r 2 + ⋯ + ( λ 0 + λ 1 ) T 2 j r J − 1 + ɛ 2 j r ⁠ . In this example β1, δ1, γ1, … ,λ1 measure the differences between the second and first (reference) groups related to linear, quadratic, etc. and (J − 1)-th time order effects respectively.

                          2.2 First regression model: gene selection

                          This first analysis generates N ANOVA tables as shown in Table 2, one for each gene. A gene with different profiles between the reference group and any other experimental group will show some statistically significant coefficient, and its corresponding regression model will be statistically significant. The P-value associated to the F-Statistic in the general regression model is used to select significant genes. This P-value is corrected for multiple comparisons by applying the linear step-up (BH) false discovery rate (FDR) procedure (Reiner et al., 2002). Therefore, genes with a FDR lower than a predetermined threshold will be selected.

                          ANOVA table. y ^ is the predicted expression value, y ¯ is the average expression value and p is the number of variables in the model, (polynomial order +1) I − 1 = JI − 1

                          Source . Sum of squares (SC) . Degrees of freedom . Mean square error . F-Statistic .
                          Regression SCR = ∑ i j r ( y ^ i j r − y ¯ ) 2 pSCR/p ( SCR / p ) SCE / [ ∑ i , j R i , j − ( p + 1 ) ]
                          Error SCE = ∑ i j r ( y i j r − y ^ i j r ) 2 ∑ i , j R i j − ( p + 1 ) SCE ∑ i , j R i j − ( p + 1 )
                          Total S C T = ∑ i j r ( y i j r − y ¯ ) 2 ∑ i , j R i j − 1
                          Source . Sum of squares (SC) . Degrees of freedom . Mean square error . F-Statistic .
                          Regression SCR = ∑ i j r ( y ^ i j r − y ¯ ) 2 pSCR/p ( SCR / p ) SCE / [ ∑ i , j R i , j − ( p + 1 ) ]
                          Error SCE = ∑ i j r ( y i j r − y ^ i j r ) 2 ∑ i , j R i j − ( p + 1 ) SCE ∑ i , j R i j − ( p + 1 )
                          Total S C T = ∑ i j r ( y i j r − y ¯ ) 2 ∑ i , j R i j − 1

                          ANOVA table. y ^ is the predicted expression value, y ¯ is the average expression value and p is the number of variables in the model, (polynomial order +1) I − 1 = JI − 1

                          Source . Sum of squares (SC) . Degrees of freedom . Mean square error . F-Statistic .
                          Regression SCR = ∑ i j r ( y ^ i j r − y ¯ ) 2 pSCR/p ( SCR / p ) SCE / [ ∑ i , j R i , j − ( p + 1 ) ]
                          Error SCE = ∑ i j r ( y i j r − y ^ i j r ) 2 ∑ i , j R i j − ( p + 1 ) SCE ∑ i , j R i j − ( p + 1 )
                          Total S C T = ∑ i j r ( y i j r − y ¯ ) 2 ∑ i , j R i j − 1
                          Source . Sum of squares (SC) . Degrees of freedom . Mean square error . F-Statistic .
                          Regression SCR = ∑ i j r ( y ^ i j r − y ¯ ) 2 pSCR/p ( SCR / p ) SCE / [ ∑ i , j R i , j − ( p + 1 ) ]
                          Error SCE = ∑ i j r ( y i j r − y ^ i j r ) 2 ∑ i , j R i j − ( p + 1 ) SCE ∑ i , j R i j − ( p + 1 )
                          Total S C T = ∑ i j r ( y i j r − y ¯ ) 2 ∑ i , j R i j − 1

                          2.3 Second regression step: variable selection

                          Once statistically significant gene models have been found, the regression coefficients of the models can be used to identify the conditions for which genes shows statistically significant profile changes. To do this, a new model is obtained only for selected genes, applying a variable selection strategy (stepwise regression, Draper and Smith, 1998). Stepwise regression is an iterative regression approach that selects from a pool of potential variables the ‘best’ ones (according to a specified criterion) to fit the available data. In this process, the statistical significance of the regression coefficients of the variables present in the model at each iteration is computed and only those variables with a P-value under a given threshold (type I risk) are maintained. In this case, applying FDR for multiple comparisons is not easy due to the fact that P-values associated to each coefficient vary as the model evolves. Therefore, we apply a threshold that must be fixed by the researcher. We recommend to correct the desired level of significance for the total possible number of variables in the model. The variables included in these new models will be those that indicate the differences in profiles. The maSigPro package provides different types of stepwise regression: backward, forward, stepwise backward and stepwise forward. This variable selection approach has a double effect: on one hand it provides the significant differences between experimental groups, and on the other hand, it generates an adequate regression model for the data. This implies that for each gene and experimental group, polynomial regressions of different degree (up to the maximum initially given in the formulation of the model) can be obtained. The method will therefore generate a matrix with so many rows as significant genes and so many columns as parameters in the complete regression model [ Equation (1)]. This results matrix contains information (estimated coefficient and its P-value) for those variables that remained in the model of each gene. Table 3 is an illustrating example of such a results matrix.

                          Results matrix of regression coefficients for the variable selection fit. Genes are given in rows and model parameters in columns. Regression coefficients associated to the same dummy variables are labelled with the same number. NA value for regression coefficients indicates that the variable was not statistically significant for that gene (under a given threshold, type I risk)

                          Results matrix of regression coefficients for the variable selection fit. Genes are given in rows and model parameters in columns. Regression coefficients associated to the same dummy variables are labelled with the same number. NA value for regression coefficients indicates that the variable was not statistically significant for that gene (under a given threshold, type I risk)

                          This matrix provides the framework for selecting significant genes for each variable of the complete model and for each experimental group. For example, to find genes that have significant differences in group 2 respect to the reference group, those genes having statistically significant coefficients for the variables associated to the Dummy1 (D1, Time × D1, … , Time J −1 × D1) must be selected, i.e. genes which have a significant β1, δ1, … , λ1 coefficients (columns labelled as 2 in Table 3). In addition, the study of individual model variables allows focussing on the evaluation of specific pattern differences. For example, the analysis of the regression coefficients of the variable Time × D1 allows the classification of genes for their different behaviour in the linear model component (i.e. induction or repression) of group 2 with respect to the reference group. The maSigPro package includes functions to easily perform different types of gene selection at this stage.

                          Until now, the goodness of fit (R-squared) of the new models has not been considered. This means that all significant genes are selected genes. The researcher might however be interested only in genes with clear trends as this may reflect biologically meaningful behaviours. In such case, maSigPro allows an additional gene selection step based on the R-squared value of the second regression model.

                          2.4 Visualization

                          The maSigPro package provides a number of functions for the visual analysis of the results. Individual plots of expression profiles by experimental group can easily be generated for each significant gene. Computed regression curves can also be superimposed to visualize the modelling obtained for the data. When the number of selected genes is large, cluster algorithms may be used to split the data into groups of similar expression patterns. maSigPro incorporates a number of traditional clustering algorithms to do so. These algorithms typically use gene expression data to compute clusters. In additional, maSigPro provides a clustering alternative that uses the estimated regression coefficients rather than the original data. This option will group genes on the basis of their statistically significant profile changes, discarding the noise of the data that has been removed by the estimated model. Once clusters have been obtained, maSigPro displays both the continuous expression profile along all experimental conditions and the average expression profile by experimental group for each cluster. The first representation helps to analyze the homogeneity of the clusters while the second provides a useful visualization of the between-groups differences for the genes of each cluster.


                          Discussion

                          Development of accurate prognostic gene expression signatures is a central challenge in clinical cancer research. We aimed to analyse and compare commonly employed single-gene and gene-set methods for prognostic classification alongside the more recent network-based approaches involving integration of PPI networks with gene expression information [8, 11]. To focus our research on the potential value of these newer PPI network-based methods we excluded methods requiring additional information beyond a priori PPI knowledge. For the same reason, we also limited our comparison to the gene-set approach (median expression, [30, 39]) which is a direct analog of the single-gene moderated-t statistic method [30] also considered herein. Moreover, and for the first time, we conducted an analysis at the level of individual patients as well as with respect to the classes being compared (good versus poor prognosis). We undertook our analyses using gene expression microarray information from two previously reported studies in separate cancers (melanoma [13] and ovarian [15]) and utilised PPI information from two different networks (MetaCore™ and iRefWeb [50]). Each of the methods analysed comprised a feature-selection algorithm which was followed by classification in R [53] using each of three different classifiers: RF, SVM and DLDA. All classification error rates were estimated using 100 rounds of 5-fold CV [54].

                          Single-gene, gene-set, and network-based methods achieved similar error-rates, confirming prior observations and emphasizing questions about their value

                          Our first finding - that the different approaches achieved similar error rates with respect to each other - validates prior observations by [19–21]. Specifically, none of the classifiers employing composite features derived from secondary PPI data sources (the network-based approaches) out-performed the classical single-gene/gene-set approaches. We also observed that their performance in terms of error-rates varyied with the cancer data set being analysed. For example, NetRank achieved the lowest error rates overall in the melanoma cohort, a finding that did not hold up using the ovarian cancer data set but which was consistent with the findings of [55] who noted that of the 25 datasets they analysed, NetRank outperformed a number of classical single-gene methods in 23 of them.

                          Patient-level analyses indicated there is value to be gained from network-based methods

                          Prior related reports have described and discussed the single-gene, gene-set, and network-based approaches to gene expression signature modelling to varying degrees [6, 56]. However, these works did not analyse first hand nor consider in detail the methods for translation of such feature selection methods into a general classification framework. Of the previous studies that did undertake formal evaluations, the focus has been upon network- and pathway-based classifiers for outcome prediction in breast cancers [19–21] or on comparing the effect of using different kinds of external biological information in the learning process like functional annotations, PPIs, and expression correlation among genes [57]. Further, these previous studies draw their conclusions based primarily on classification error rates, whereas a particular novelty of this study is that we considered (for the first time) our results at the more in-depth level of individual patients. Within this framework, we observed that different methods were capturing different subsets of the sample space i.e., the different approaches were correctly classifying different samples. These results would suggest that new composite classification methods are needed to capture the complementary value of network-based, gene-set and classical single-gene approaches.

                          Analyses of error rates within the good and poor survival classes further highlights the need for a composite approach to gene expression signature modelling

                          Our finding that samples from the GP class were easier to classify in melanoma, but samples from the PP class were easier to classify in ovarian cancer, highlights an underlying dataset dependency of these gene-set and network approaches. We also note some overall differences between the datasets. Performing a moderated t-statistic differential expression analysis on each data set (Additional File 9 - Supplementary results: Comparison of the ovarian cancer data set and the melanoma data set), we found that the melanoma data set contained 96 DE genes (p-value < 0.1) while the ovarian cancer data set had only 13 DE genes. This observation could offer an explanation as to why the methods focusing on identification of individual informative genes (the single-gene moderated t-statistic method and the NetRank network method) perform better on the melanoma data set than in the ovarian cancer data. Furthermore, there was no overlap between the 100 most DE genes from each data set. Using Taylor's differential correlation measure, and an arbitrary threshold of 0.5, we found that the melanoma data set contained 23 differentially correlation hub sub-networks and the ovarian cancer data set contained only 7. Thus in general it seems as though the melanoma data set contained more differential features between the PP and GP classes than the ovarian data set, which could explain the lower error rates obtained for melanoma.

                          NetRank is the most stable approach, supporting the potential value of network-based approaches to prognostication

                          Our assessment of the stability of features identified in each method highlighted NetRank [11] as the most stable approach in both cancers. This finding is consistent with observations made by [55] who found that cancer-related signatures identified by NetRank had significant overlap between the data sets they considered. The study also showed that performing classification on datasets with the aim of classification into prognostic classes (as was the goal here) was notably less accurate than when the aim was to address diagnosis or sub-typing classification problems,. This finding again confirms that incorporation of network information may lead to identification of more stable gene expression signatures.

                          Findings were validated in an independent PPI network, suggesting network invariance of the network methods

                          Our reproduction of findings obtained using the iRefWeb PPI network but instead using the MetaCore™ PPI network demonstrates that the network methods are not hugely dependent on the PPI network used. Strengthening this claim we note also that although the two networks contain many of the same proteins, they are somewhat different from one another in terms of global structure (Additional File 9 - Supplementary Results: Comparison of the MetaCore™ and iRefWeb PPI networks).

                          Ongoing network issues

                          Network-based approaches continue to have important limitations of consequence to their translational relevance. In a PPI network, nodes correspond to protein-coding genes, and the existence of an edge between two such genes implies that the proteins coded for by those genes interact in a biological setting. One of the most significant issues in PPI network analysis is the inaccuracy and the lack of reliability of the available networks. It has been noted that of the huge number of currently available networks, there is very little overlap and consistency between them [58–61]. Currently, the two main approaches for identification of protein-protein interactions are the yeast two-hybrid (Y2H) and the affinity purification of complexes followed by mass spectrometry (AP-MS). Details on these methods can be found in [62]. Moreover, despite the huge number of publicly and privately available PPI databases, there is no interaction database covering the entire human genome. However, there are ongoing projects, such as the Human Interactome Project [63], with the aim of developing a complete human PPI map. Although our study showed that the network methods are not hugely dependent on the PPI network used, full elucidation of the extent to which the a priori information impacts the findings remains an open question.


                          Results

                          Counts per million: a simple interpretable scale for assessing differential expression

                          We suppose that RNA-seq profiles (or libraries) are available for a set of n RNA samples. Each profile records the number of sequence reads from that sample that have been mapped to each one of G genomic features. A genomic feature can be any predefined subset of the transcriptome, for example a transcript, an exon or a gene. For simplicity, we will assume throughout this article that reads have been summarized by gene, so that the RNA-seq profiles give the number of reads from each sample that have been mapped to each gene. Typically G is large, in the tens of thousands or more, whereas n can be as low as three. The total number of mapped reads (library size) for each sample might vary from a few hundred thousand to hundreds of millions. This is the same context as assumed by a number of previous articles [13, 18, 20, 21, 34].

                          The number of reads observed for a given gene is proportional not just to the expression level of the gene but also to its gene transcript length and to the sequencing depth of the library. Dividing each read count by the corresponding library size (in millions) yields counts per million (cpm), a simple measure of read abundance that can be compared across libraries of different sizes. Standardizing further by transcript length (in kilobases) gives rise to reads per kilobase per million (rpkm), a well-accepted measure of gene expression [35]. In this article we will work with the simpler cpm rather than rpkm, because we are interested in relative changes in expression between conditions rather than absolute expression.

                          This article treats log-cpm as analogous to log-intensity values from a microarray experiment, with the difference that log-cpm values cannot be treated as having constant variances. Differences in log-cpm between samples can be interpreted as log-fold-changes of expression. The counts are augmented by a small positive value (a half of one read) to avoid taking the logarithm of zero. This ensures no missing log-cpm values and reduces the variability at low count values.

                          Log-cpms have stabilized variances at high counts

                          Probability distributions for counts are naturally heteroscedastic, with larger variances for larger counts. It has previously been argued that the mean-variance relationship for RNA-seq counts should be approximately quadratic [34]. This leads to the conclusion that the coefficient of variation (CV) of RNA-seq counts should be a decreasing function of count size for small to moderate counts but for larger counts should asymptote to a value that depends on biological variability. Specifically, the squared CV of the counts should be roughly

                          where λ is the expected size of the count and ϕ is a measure of biological variation [34]. The first term arises from the technical variability associated with sequencing, and gradually decreases with expected count size, while biological variation remains roughly constant. For large counts, the CV is determined mainly by biological variation.

                          A simple linearization calculation suggests that the standard deviation of the log-cpm should be approximately equal to the CV of the counts (see Materials and methods). Examination of a wide range of real datasets confirms these expectations. For studies where the replicates are entirely technical in nature, the standard deviation of the log-cpm decreases steadily as a function of the mean (Figure 1a). For studies where the replicates are genetically identical mice, the standard deviation asymptotes at a moderate level corresponding to a biological CV of about 10% (Figure 1b). Studies where the replicates are unrelated human individuals show greater biological variation. For these studies, the standard deviation asymptotes early and at a relatively high level (Figure 1d).

                          Mean-variance relationships. Gene-wise means and variances of RNA-seq data are represented by black points with a LOWESS trend. Plots are ordered by increasing levels of biological variation in datasets. (a) voom trend for HBRR and UHRR genes for Samples A, B, C and D of the SEQC project technical variation only. (b) C57BL/6J and DBA mouse experiment low-level biological variation. (c) Simulation study in the presence of 100 upregulating genes and 100 downregulating genes moderate-level biological variation. (d) Nigerian lymphoblastoid cell lines high-level biological variation. (e) Drosophila melanogaster embryonic developmental stages very high biological variation due to systematic differences between samples. (f) LOWESS voom trends for datasets (a)–(e). HBRR, Ambion’s Human Brain Reference RNA LOWESS, locally weighted regression UHRR, Stratagene’s Universal Human Reference RNA.

                          We conclude that log-cpm values generally show a smoothly decreasing mean-variance trend with count size, and that the log-cpm transformation roughly de-trends the variance of the RNA-seq counts as a function of count size for genes with larger counts.

                          Using log-cpm in a limma pipeline

                          A simple approach to analyzing RNA-seq data would be to input the log-cpm values into a well-established microarray analysis pipeline such as that provided by the limma software package [3, 9]. This would be expected to behave well if the counts were all reasonably large, but it ignores the mean-variance trend for lower counts. The microarray pipeline should behave better if modified to include a mean-variance trend as part of the variance modeling. We have therefore modified the empirical Bayes procedure of the limma package so that the gene-wise variances are squeezed towards a global mean-variance trend curve instead of towards a constant pooled variance. This is similar in principle to the procedure proposed by Sartor et al. [36] for microarray data, except that we model the trend using a regression spline and our implementation allows for the possibility of missing values or differing residual degrees of freedom between genes. We call this strategy limma-trend, whereby the log-cpm values are analyzed as for microarray data but with a trended prior variance. For comparison, the more naive approach without the mean-variance trend will be called limma-notrend.

                          Voom: variance modeling at the observation-level

                          The limma-trend pipeline models the variance at the gene level. However, in RNA-seq applications, the count sizes may vary considerably from sample to sample for the same gene. Different samples may be sequenced to different depths, so different count sizes may be quite different even if the cpm values are the same. For this reason, we wish to model the mean-variance trend of the log-cpm values at the individual observation level, instead of applying a gene-level variability estimate to all observations from the same gene.

                          Our strategy is to estimate non-parametrically the mean-variance trend of the logged read counts and to use this mean-variance relationship to predict the variance of each log-cpm value. The predicted variance is then encapsulated as an inverse weight for the log-cpm value. When the weights are incorporated into a linear modeling procedure, the mean-variance relationship in the log-cpm values is effectively eliminated.

                          A technical difficulty is that we want to predict the variances of individual observations although there is, by definition, no replication at the observational level from which variances could be estimated. We work around this inconvenience by estimating the mean-variance trend at the gene level, then interpolating this trend to predict the variances of individual observations (Figure 2).

                          voom mean-variance modeling. (a) Gene-wise square-root residual standard deviations are plotted against average log-count. (b) A functional relation between gene-wise means and variances is given by a robust LOWESS fit to the points. (c) The mean-variance trend enables each observation to map to a square-root standard deviation value using its fitted value for log-count. LOWESS, locally weighted regression.

                          The algorithm proceeds as follows. First, gene-wise linear models are fitted to the normalized log-cpm values, taking into account the experimental design, treatment conditions, replicates and so on. This generates a residual standard deviation for each gene (Figure 2a). A robust trend is then fitted to the residual standard deviations as a function of the average log-count for each gene (Figure 2b).

                          Also available from the linear models is a fitted value for each log-cpm observation. Taking the library sizes into account, the fitted log-cpm for each observation is converted into a predicted count. The standard deviation trend is then interpolated to predict the standard deviation of each individual observation based on its predicted count size (Figure 2c). Finally, the inverse squared predicted standard deviation for each observation becomes the weight for that observation.

                          The log-cpm values and associated weights are then input into limma’s standard differential expression pipeline. Most limma functions are designed to accept quantitative weights, providing the ability to perform microarray-like analyses while taking account of the mean-variance relationship of the log-cpm values at the observation level.

                          Voom and limma-trend control the type I error rate correctly

                          We have found that voom and limma-trend, especially voom, perform well and produce P values that control error rates correctly over a wide range of simulation scenarios. For illustration, we present results from simulations in which read counts were generated under the same NB model as assumed by a number of existing RNA-seq analysis methods. These simulations should represent the ideal for the NB-based methods. If the normal-based methods can give performance comparable to or better than count-based methods in these simulations, then this is strong evidence that they will be competitive across a wide range of data types.

                          Six RNA-seq count libraries were simulated with counts for 10,000 genes. The first three libraries were treated as group 1 and the others as group 2. The distribution of cpm values for each library was simulated to match the distribution that we observed for a real RNA-seq dataset from our own practice. The NB dispersion ϕ was set to decrease on average with expected count size, asymptoting to 0.2 squared for large counts. This degree of biological variation is representative of what we observe for real RNA-seq data, being larger than we typically observe between genetically identical laboratory mice but less than we typically see between unrelated human subjects (Figure 1). An individual dispersion ϕ was generated for each gene around the trend according to an inverse chi-square distribution with 40 degrees of freedom. The voom mean-variance trend for one such simulated dataset is shown in Figure 1c. It can be seen from Figure 1 that the simulated dataset is intermediate between the mouse data (Figure 1b) and the human data (Figure 1d) both in terms of the absolute size of the dispersions and in terms of heterogeneity of the dispersions between genes.

                          We found that variation in sequencing depth between libraries had a noticeable impact on some RNA-seq analysis methods. Hence all the simulations were repeated under two library size scenarios, one with the same sequencing depth for all six libraries and one with substantial variation in sequencing depth. In the equal size scenario, all libraries were simulated to contain 11 million reads. In the unequal size scenario, the odd-numbered libraries were simulated to have a sequence depth of 20 million reads while the even-numbered libraries had a sequence depth of 2 million reads. Hence the same total number of reads was simulated in this scenario but distributed unevenly between the libraries.

                          In the first set of simulations, we examined the ability of voom and limma-trend to control the type I error rate correctly in the absence of any genuine differential expression between the groups. When there are no truly differentially expressed genes, the gene-wise P values should follow an approximate uniform distribution. If the type I error rate is controlled correctly, then the expected proportion of P values below any cutoff should be less than or equal to the cutoff value. A number of popular RNA-seq analysis methods based on the NB or Poisson distributions were included for comparison. Figure 3 shows results for a P value cutoff of 0.01. Results for other cutoffs are qualitatively similar. None of the NB- or Poisson-based methods were found to control the type I error rate very accurately. When the library sizes are equal, the NB and Poisson methods were overly liberal, except for DESeq which is very conservative. When the library sizes are unequal, DSS and DESeq became extremely conservative. By contrast, all the normal-based methods were slightly conservative. voom produces results very close to the nominal type I error rate for both library size scenarios. limma-trend is similar to voom when the library sizes are equal but somewhat conservative when the library sizes are unequal.

                          Type I error rates in the absence of true differential expression. The bar plots show the proportion of genes with P<0.01 for each method (a) when the library sizes are equal and (b) when the library sizes are unequal. The red line shows the nominal type I error rate of 0.01. Results are averaged over 100 simulations. Methods that control the type I error at or below the nominal level should lie below the red line.

                          baySeq was not included in the type I error rate comparison because it does not return P values. However, the results presented in the next section show that it is relatively conservative in terms of the false discovery rate (FDR) (Figure 4).

                          Power to detect true differential expression. Bars show the total number of genes that are detected as statistically significant (FDR < 0.1) (a) with equal library sizes and (b) with unequal library sizes. The blue segments show the number of true positives while the red segments show false positives. 200 genes are genuinely differentially expressed. Results are averaged over 100 simulations. Height of the blue bars shows empirical power. The ratio of the red to blue segments shows empirical FDR. FDR, false discovery rate.

                          To check voom’s conservativeness on real data, we used a set of four replicate libraries from the SEQC Project [37]. All four libraries were Illumina HiSeq 2000 RNA-seq profiles of samples of Ambion’s Human Brain Reference RNA (HBRR) [38]. We split the four libraries into two groups in all possible ways, and tested for differential expression between the two groups for each partition. voom returned no differentially expressed (DE) genes at 5% FDR for six out of the seven possible partitions, indicating good error rate control. The voom mean-variance trend for the SEQC data, using all the libraries rather than the HBRR samples only, is shown in Figure 1a.

                          Voom has the best power of methods that control the type I error rate

                          Next we examined the power to detect true differential expression. For the following simulations, 100 randomly selected genes were twofold upregulated in the first group and another 100 were twofold upregulated in the second group. This represents a typical scenario for a functional genomics experiment in which the differential expression effects are large enough to be biologically important but nevertheless sufficiently subtle as to challenge many analysis methods. Figure 4 shows the number of true and false discoveries made by various analysis methods at significance cutoff FDR <0.1. When the library sizes are equal, voom and limma-trend have the next best power after edgeR and PoissonSeq. However, both edgeR and PoissonSeq give empirical FDRs greater than 0.1, confirming the results of the previous section that these methods are somewhat liberal. limma-trend gives an empirical FDR slightly greater than voom but still less than 0.1. With unequal library sizes, voom has the best power except for edgeR while still maintaining a low FDR. TSPM declares by far the most DE genes, but these are mostly false discoveries. DSS also gives a worryingly high rate of false discoveries when the library sizes are unequal. Figures 3 and 4 together show that voom has the best power of those methods that correctly control the type I and FDR error rates.

                          Voom has the lowest false discovery rate

                          Next we compared methods from a gene ranking point of view, comparing methods in terms of the number of false discoveries for any given number of genes selected as DE. Methods that perform well will rank the truly DE genes in the simulation ahead of non-DE genes. Genes were ranked by posterior likelihood for baySeq and by P value for the other methods. The results show that voom has the lowest FDR at any cutoff (Figure 5). When the library sizes are equal, limma-trend and PoissonSeq are very close competitors (Figure 5a). When the library sizes are unequal, limma-trend and edgeR are the closest competitors (Figure 5b).

                          False discovery rates. The number of false discoveries is plotted for each method versus the number of genes selected as differentially expressed. Results are averaged over 100 simulations (a) with equal library sizes and (b) with unequal library sizes. voom has the lowest FDR at any cutoff in either scenario. FDR, false discovery rate.

                          Next we compared FDRs using spike-in control transcripts from the SEQC project [39]. The data consists of eight RNA-seq libraries, in two groups of four. A total of 92 artificial control transcripts were spiked-in at different concentrations in such a way that three quarters of the transcripts were truly DE and the remaining quarter were not. To make the spike-ins more like a realistic dataset, we replicated the counts for each of the 23 non-DE transcripts three times. That is, we treated each non-DE transcript as three different transcripts. This resulted in a dataset of 138 transcripts with half DE and half non-DE. Figure 6 is analogous to Figure 5 but using the spike-in data instead of simulated data. voom again achieved the lowest FDR, with edgeR and the other limma methods again being the closest competitors (Figure 6).

                          False discovery rates evaluated from SEQC spike-in data. The number of false discoveries is plotted for each method versus the number of genes selected as differentially expressed. voom has the lowest false discovery rate overall.

                          Voom and limma-trend are faster than specialist RNA-seq methods

                          The different statistical methods compared varied considerably in computational time required, with DESeq, TSPM and baySeq being slow enough to limit the number of simulations that were done. voom is easily the fastest of the methods compared, with edgeR-classic next fastest (Figure 7).

                          Computing times of RNA-seq methods. Bars show time in seconds required for the analysis of one simulated dataset on a MacBook laptop. Methods are ordered from quickest to most expensive.

                          RNA-seq profiles of male and female Nigerian individuals

                          So far we have demonstrated the performance of voom on RNA-seq datasets with small numbers of replicate libraries. To demonstrate the performance of voom on a heterogeneous dataset with a relatively large number of replicates and a high level of biological variability, we compared males to females using RNA-seq profiles of lymphoblastoid cell lines from 29 male and 40 female unrelated Nigerian individuals [40]. Summarized read counts and gene annotation are provided by the Bioconductor tweeDEseqCountData package [41]. Figure 1d shows the voom mean-variance trend of this dataset.

                          voom yielded 16 genes upregulated in males and 43 upregulated in females at 5% FDR. As expected, most of the top differentially expressed genes belonged to the X or Y sex chromosomes (Table 1). The top gene is XIST, which is a key player in X inactivation and is known to be expressed at meaningful levels only in females.

                          We examined 12 particular genes that are known to belong to the male-specific region of chromosome Y [42, 43]. A ROAST gene set test confirmed that these genes collectively are significantly upregulated in males (P=0.0001). A CAMERA gene set test was even more convincing, confirming that these genes are significantly more upregulated in males than are other genes in the genome (P=2×10 -28 ).

                          We also examined 46 X chromosome genes that have been reported to escape X inactivation [43, 44]. These genes were significantly upregulated in females (ROAST P=0.0001, CAMERA P=10 -10 ). The log-fold-changes for the X and Y chromosome genes involved in the gene set tests are highlighted on an MA plot (Figure 8).

                          MA plot of male vs female comparison with male- and female-specific genes highlighted. The MA plot was produced by the limma plotMA function, and is a scatterplot of log-fold-change versus average log-cpm for each gene. Genes on the male-specific region of the Y chromosome genes are highlighted blue and are consistently upregulated in males, while genes on the X chromosome reported to escape X inactivation are highlighted red and are generally down in males. log-cpm, log-counts per million.

                          Note that these gene set testing approaches are not available for any of the count-based approaches to differential expression. If a count-based method had been used to assess differential expression, we could still have examined whether sex-linked genes were highly ranked among the differentially expressed genes, but we could not have undertaken any formal statistical test for enrichment of this signature while accounting for inter-gene correlation. On the other hand, the voom expression values and weights are suitable for input into the ROAST and CAMERA procedures without any further processing.

                          Development stages of Drosophila melanogaster

                          Like edgeR-glm, but unlike most other analysis tools, voom and limma-trend offer full-featured linear modeling for RNA-seq data, meaning that they can analyze arbitrary complex experiments. The possibilities of linear modeling are so rich that it is impossible to select a representative example. voom and limma could be used to analyze any gene-level RNA-seq differential expression experiment, including those with multiple experimental factors [34]. Here we give a novel analysis illustrating the use of quadratic regression to analyze a time-course study.

                          RNA-seq was used to explore the developmental transcriptome of Drosophila melanogaster[45]. RNA-seq libraries were formed from whole-animal samples to represent a large number of distinct developmental stages. In particular, samples were collected from embryonic animals at equi-spaced development stages from 2 hours to 24 hours in 2-hour intervals. Here we analyze the 12 RNA-seq libraries from these embryonic stages. We sought to identify those genes that are characteristic of each embryonic stage. In particular we wished to identify, for each embryonic stage, those genes that achieve their peak expression level during that stage.

                          As all the samples are from distinct stages, there are no replicate libraries in this study. To estimate variances we utilized the fact that gene expression should for most genes vary smoothly over time. A multidimensional scaling plot of log-cpm values shows the gradual change in gene expression during embryonic development, with each stage intermediate in expression profile between the stages before and after (Figure 9). We used gene-wise linear models to fit a quadratic trend with time to the log-cpm values for each gene. These quadratic trends will not match all the intricacies of gene expression changes over time but are sufficient to model the major trends. The voom mean-variance trend for this data is shown in Figure 1e.

                          Multidimensional scaling plot of Drosophila melanogaster embryonic stages. Distances are computed from the log-cpm values. The 12 successive embryonic developmental stages are labeled 1 to 12, from earliest to latest.

                          Out of 14,869 genes that were expressed during embryonic development, 8,366 showed a statistically significant trend at 5% FDR using empirical Bayes F-tests. For each differentially expressed gene, we identified the embryonic stage at which the fitted quadratic trend achieved its maximum value. This allowed us to associate each significant gene with a particular development stage (Figure 10). Most genes peaked at the first or last stage (Figure 10), indicating smoothly decreasing or increasing trends over time (Figure 11, panels 1 and 12). Genes peaking at the first embryonic stage tended to be associated with the cell cycle. Genes peaking at the last stage tended to be associated with precursor metabolites and energy, the oxidation-reduction process and with metabolic pathways.

                          Number of genes associated with each Drosophila melanogaster embryonic stage. The number of genes whose peak estimated expression occurs at each of the stages is recorded.

                          Expression trends for genes that peak at each Drosophila melanogaster embryonic stage. Panels (1) to (12) correspond to the 12 successive developmental stages. Each panel displays the fitted expression trends for the top ten genes that achieve their peak expression during that stage. In particular, panel (1) shows genes that are most highly expressed at the first stage and panel (12) shows genes most highly expressed at the last stage. Panels (7) and (8) are notable because they show genes with marked peaks at 12–14 hours and 14–16 hours respectively.

                          Genes peaking at intermediate stages have expression trends with an inverse-U shape (Figure 11, panels 2–11). There was a substantial set of genes with peak activity between 12–16 hours of embryonic development (Figure 10), suggesting some important developmental change occurring during this period requiring the action of special-purpose genes. Indeed, gene ontology analysis of the genes associated with this period showed that anatomical structure morphogenesis was the most significantly enriched biological process. Other leading terms were organ morphogenesis and neuron differentiation.

                          This analysis demonstrates a simple but effective means of identifying genes that have a particular role at each developmental stage.


                          Background

                          Detection of significant differentially expressed genes (DEGs) from DNA microarray datasets is a common routine task conducted in biomedical research [1–3]. For the detection of DEGs, numerous methods are proposed [4–7]. By such conventional methods, generally, DEGs are detected from one dataset consisting of group of control and treatment. However, some DEGs are easily to be detected in very wide or common experimental conditions. For example, "pyoverdin" genes (pvdD and pvdJ) [8] of Pseudomonas aeruginosa, which are ones of Iron transporter proteins and involved in cell division, are generally detected as DEGs in experimental conditions which are conducted to observe cell division (such as GSE24784 in GEO database) (Figure 1). Additionally, in analyses of some expression dataset of public database by commonly used statistical methods, pyoverdin genes are also detected as DEGs in many other experimental condition which are not conducted to observe cell division. Literatures suggested that this may be because of pyoverdin is involved in many other biological processes such as cell-to-cell signaling (Quorum Sensing, QS) [9] and virulence factor production [10]. In this way pyoverdin genes are prone to be detected as DEGs in any experiment condition, however, many researchers may want to these genes to be detected in the special experiments (i.e., cell division condition). For this purpose, each measurement value of gene expression levels should be compared in two dimensional ways, or both with other genes and other datasets simultaneously.

                          Expression change of pyoverdin genes. We analyze some expression data of pyoverdin genes (pvdD and pvdJ) of public database (GEO and Array- Express) by commonly used statistical methods (log-FC, RankProducts, t-rank and SAM). The threshold value of log-FC is set to 2 (4-fold) and that of RankProducts, t-rank and SAM are set to upper 300 gene. All dataset are normalized by RMA method separately. If both genes are co-expressed, corresponding box is filled in white, otherwise gray. Figure shows that pyoverdin genes are prone to be detected in any experiment condition and our method focuses on much experiment condition specific DEGs (GSE7704).

                          For the detection of such DEGs, we retrieve the gene expression data from public database as possible and construct "meta-dataset" which summarize expression change of all genes in various experiment condition (Figure 2). Although there are no 'de fact' standard definition for meta-datasets, log ratio value which are widely used to analyze DNA microarray data can be introduced to construct meta-datasets when each dataset is consist of control and treatment experiment data.

                          Meta-dataset and log-FC matrix. A meta-dataset is a set of multiple datasets. Each dataset consists of a control group and a treatment group, each of which has one or more DNA microarray data. The measured probe (gene) is common to all datasets. The element of F i,j in log-FC matrix is the log-transformed (base 2) fraction of arithmetic mean values of treatment and control group in i-th gene of j-th dataset.

                          In such meta-datasets, direct application of widely used conventional statistical methods is not suitable to detect two-dimensional DEGs because such methods are intended to find special genes among all experiments to be analyzed.

                          For example, ANOVA [11–14] is applied very widely for multi-group analysis method, but its concludes only that differences between groups (genes) are significant or not. Therefore ANOVA can not detect simultaneously specific genes in specific experiments as two-dimensional DEGs.

                          Outlier detection methods are also widely used to detect DEGs, such as Shannon entropy [15] or Sprent's non-parametric method [16]. In difference to ANOVA, these methods can also detect both special genes or special experimental conditions, but it is not simultaneously. It is one-dimensional and similar to ANOVA.

                          Multiple testing [17] (multiple comparisons, such as Bonferroni correction, Tukey-Kramer's method, and Games-Howell's method) also produce limited results as same as outlier detections. For an example of a dataset consisting of N genes and E experiments, it never means that the i-th gene of the j-th experiment is a DEG when multiple testing shows that the i-th gene (size E vector) is significantly different from other genes and the j-th experiment (size N vector) is significantly different from other experiments independently. This is because most multiple testing methods are conducted to ascertain differences between mean values of groups.

                          Herein, we propose "two-way AIC" (Akaike Information Criteria) method for simultaneous detection of significant genes and experiments on metadatasets. This method detects specific genes that are differentially expressed in specific experimental conditions. Here, we present comparison of the performance of our method to other widely used statistical methods and show that two-way AIC method has high specificity for detection of test data which tend to express in specific experiment condition.


                          Discussion

                          Building up new knowledge about biological systems is the ultimate purpose of microarray experiments, but all such insights have to be built on a solid foundation to be accurate and useful. Proper normalization of data and accurate detection of which genes are regulated are vital to the success of downstream exploration of microarray data. Even for exploratory cluster analyses, the genes that are significantly regulated must be selected beforehand. This task of detecting these genes is a difficult statistical problem a statistical hypothesis is made for each of tens of thousands of genes tested, but only a small number of replicate arrays are available to test those hypotheses. The statistical methods presented in this study attempt to draw as much information as possible out of a small number of array replicates to determine which genes are likely to be regulated.

                          It is clear that looking at the measurements of each gene in isolation can produce a test with low statistical power (e.g. using the standard t-test, Fig. 1). To improve statistical power, we can use knowledge about the relationships among the many thousands of points in the arrays. Specifically, we group together spots that have similar standard deviations and then pool together many less accurate estimates of standard deviation into a single, more accurate estimate. Our data also show that the Z-statistics are more precise than either standard or penalized t-statistics for detecting differential gene expression in microarray data. We further demonstrate that pooling standard deviations using the minimum intensity metric produces Z-statistics that are more accurate than the standard t-test, the penalized t-tests, and the average intensity-based Z-statistic.

                          Average combined logged intensity (Iavg) vs. minimum logged intensity (Imin) pooling metric

                          We evaluated two different intensity-based metrics for pooling standard deviations. There are many reports that the variance is a function of intensity, but the exact shape of this relationship could depend on many factors extrinsic to the biological experiment, such as the array technology being used, the signal-to-noise ratio of the data, the similarity between the two conditions[30], the normalization technique or the background subtraction technique. For this reason, we favor an estimation of the standard deviation using a curve-fitting technique rather than a fixed model based on previous data. Furthermore, when dealing with two-channel arrays, there are two different intensity values associated with each replicated spot. It is possible that the variation is best described as a function of the average intensities of both channels. However, our own experience and many other reports suggest that the highest variances are often seen for low intensity spots. If so, the variance may be better described as a function of the minimum intensity over all the spots.

                          The data presented here show that the mean residual errors are either equal or lower when using the Imin compared to the Iavg pooling metric, for every dataset using the Agilent Feature Extraction image processing technique. The subset of Dataset 4 for which this difference is most striking, #3 in Table 1, also has a population of spots with particularly high variance (see Fig. 4). The Iavg metric pools these spots together with other spots that have a much lower variance. In contrast, the Imin metric moves these spots to the low end of the x-axis, and the curve fit tracks the standard deviation of the spots much better. The noisiest spots on microarrays are often those where at least one channel is "blank", i.e. a noisy, low level of signal that presumably represents no expression. The Imin metric is better at grouping such spots together. For datasets with low background levels, there is a smaller difference in the performance of the two pooling metrics.

                          The trends in the mean residual errors from the unpaired reference sample analysis agree with the results from the direct comparison analyses. This similarity is to be expected, since processing each reference sample condition separately is equivalent to doing a direct comparison between each condition and reference RNA samples. Both pooling metrics generate similar mean residual error values when pooling σμ, but one dataset is not enough to make any generalizations about which pooling metric will perform best for all paired reference sample datasets. The improved performance of the Imin pooling metric is lost when using SPOT processing or combined Agilent foreground and SPOT background image processing, suggesting that these image processing techniques may be more effective at removing noise at low intensities.

                          The Iavg and Imin pooling techniques are reproducible to the same degree, since their R 2 coefficients between Z-statistics from paired datasets (see Table 1) are similar to each other. The Imin pooling technique generates slightly more accurate results, as indicated by the greater R 2 coefficients between Z(Imin) and tgold compared to those between Z(Iavg) and tgold (see Table 2). This trend holds for all six subsets of Dataset 4.

                          The higher accuracy of Z(Imin)

                          The Z-statistic calculated using the Imin pooling metric provides an improvement in accuracy over the other techniques. The t-statistic derived from datasets with 20 replicates was used as a surrogate "gold standard" since 8 or more replicates can be considered sufficient to give power to the t-statistic [17]. The t-statistic was chosen as the "gold standard" instead of the average logged ratio since the latter does not take variability into account. For each of the six permuted subsets of Datasets 4, the 90 th percentile penalized t-statistic, SAM penalized t-statistic, and Z(Iavg) had similar R 2 values when correlated with the "gold standard" t-statistic, although the SAM statistic did perform poorly for the noisiest subset of Dataset 4 (#3 in Table 2) with an R 2 value of only 0.70. Z(Imin), however, consistently produced the highest R 2 value for each of the six datasets. Since the ratios used in each of these statistics is identical, this result indicates that the standard error generated with the Imin technique produces the best correlation with the gold standard t-statistic based on 20 replicates. Although excluding spots with very low intensity could eliminate the difference in performance between the Imin and Iavg pooling metrics, this approach would make it impossible to detect low-expressed regulated genes, which may be biologically significant.

                          The Z-statistics from the Imin technique do not correlate perfectly with the "gold standard" t-statistic, however. Some disagreement can be expected because the Z(Imin) data was based on only three replicate arrays, which contain much less information than the 20 replicates used to calculate the "gold standard" t-statistic. Also the significance estimates calculated using the "gold standard" t-statistic may still contain some inaccuracies, even with 20 replicates. Kerr et al. found this to be true with 12 replicates, where accuracy is reduced if the error distribution for each gene is modeled separately instead of using a pooled estimate [15]. Analyzing the large (N = 20) replicate dataset using robust estimators of ratio and standard deviation may be able to create a more accurate "gold standard" to use for further testing of the Z-statistic or other statistics. Note that we do not employ an explicit permutation-based approach to estimate the false detection rates of the statistics investigated in this study, as in Ref. [16]. Rather than permute gene labels from a small set of arrays to estimate the distribution of expected test statistics, with the availability of the large (N = 23) replicate dataset described herein, we preferred to use this rich source of actual test statistics directly.

                          The higher reproducibility of z-statistics

                          The Z-statistic – calculated with either pooling the Imin or Iavg pooling metric – provides an appreciable improvement in reproducibility over the average logged ratio alone, the standard t-test and the 90 th percentile and SAM penalized t-statistics. Both linear (R 2 ) and non-parametric rank correlation coefficients were highest for the Z-statistic when comparing corresponding spots between three independent pairs of replicate datasets. Also, the standard t-statistic and SAM penalized t-statistic generate linear regression slope coefficients that vary greatly from pair to pair, indicating that their absolute magnitude is not as reproducible as the Z-statistics, whose linear regression slope coefficients are much closer to 1.

                          The high correlation values and near-unity slope coefficients for the Z-statistic support the hypothesis that pooling the standard deviations of spots with similar intensities provides a stable, precise estimate of the standard deviation. This assumption of a well-estimated standard deviation supports the use of the Gaussian distribution to map the Z-statistic to a p-value. Using only the measured standard deviation, one is forced to use a t-distribution with only 2 degrees of freedom to generate a p-value. This test does not have sufficient power to generate any significantly regulated points because of the very small number of degrees of freedom, not a single spot seen in Fig. 1a is found to be significant after multiple test correction. In contrast, even after a conservative multiple test correction that makes the cutoff for statistical significance much more stringent, many spots are found significant using the Z-statistic. The penalized t-statistics do not produce a stable estimate of the standard deviation with these data, perhaps because the constant added to the denominator of the test statistic showed a large variation between replicate datasets. Therefore they cannot be mapped to a p-value in a reproducible manner.

                          Outlier detection

                          One limitation of using a pooled standard deviation is that for a spot with replicate ratios that include one or more outliers, the appropriately high measured standard deviation will be replaced by an inappropriately low pooled standard deviation. This substitution could produce a false positive result. We have sought to minimize this limitation by implementing an overlying outlier detection algorithm. (For other implementations of outlier detection, see Ref. [26, 30].) The algorithm in this study uses the measured standard deviation instead of the pooled standard deviation for spots for which the pooling model may not hold. These spots are identified as ones for which residual error σ–σ' is positive and greater than twice the standard deviation of the positive residual errors.

                          The measured standard deviations for these outlier points are valid sample measurements of the variance process and should be used to calculate the pooled standard deviations for spots with similar intensities. These ratio measurements, however, are too widely varying for one to have the same confidence in the average ratio as one would have for other spots thus, it is appropriate to substitute the measured standard deviation for the pooled standard deviation in these cases. Fig. 7c,7d, which highlight outlier spots on a plot of the Z(Imin) vs. the "gold standard" t-statistic for Dataset 4b, show that this outlier detection technique correctly detects many of the presumably false positive spots that have a high Z-statistic and low tgold value. The plots also show some false positive spots that are not detected through this algorithm, as well as a few spots that become false negatives after outlier detection. Other, more complex outlier detection algorithms may perform better, and should be explored. A simple modification to the current algorithm, using local instead of global estimates of the standard deviation of the residual error, may improve outlier detection. Alternative implementations include modifying the pooling window shape to give more weight to a spot's measured standard deviation or that of its nearest neighbors by intensity. Strictly speaking, the p-values for outlier spots should be calculated using a t-distribution instead of a Gaussian distribution since the measured standard deviation is being used. We have shown, however, that with 3 replicates, no spots in our datasets can be found statistically significant using the t-test and strict multiple test correction. In order to preserve detection of spots, we continue to use the Gaussian distribution to convert outlier Z-statistics to p-values, which may slightly increase the false positive rate for spots detected as outliers. In practice, however, such spots are rarely found to be significantly regulated.

                          Unpaired vs. paired analysis for reference sample experiments

                          Finally, we have extended our algorithms to apply to data from a reference sample experimental design. This design gives one the flexibility to compare many different conditions to one another, but the trade-off is a loss in precision. In theory, using a reference sample design instead of a direct comparison design should increase the variance by a factor of 2. This increase has in fact been observed in practice [33].

                          The paired analysis method can reduce the measured variation in a reference sample design. The linear regression slope coefficients in Table 1 indicate that the Z-statistic values using the paired analysis are higher than the unpaired Z-statistic values. Thus, the paired difference of logged ratios, μ, is less variable than the independent logged ratios, MX and MY. This observation suggests that the effects of biological or analytical variation from replicate to replicate can be reduced if comparisons are made between paired samples. Whether this reduction is due to using paired biological samples or paired array processing dates [34] is still an open question, and probably will be context-dependent. Although it may not always be practical, it would be beneficial for investigators to design reference sample experiments to be performed in parallel whenever possible to take advantage of the lower standard deviations produced by paired analysis.

                          Finding the optimal statistical test

                          Several areas remain for further refinement of our implementation of pooling-based statistical analysis of microarray data. Currently, the standard deviation is pooled using a simple moving rectangular window of 501 spots, but other window sizes and shapes may improve performance slightly. More generally, we have not explicitly compared the moving average estimator with the spline-fit or loess-based techniques to estimate the standard deviation used in other studies (see Background). While we expect performance to be similar, further testing may reveal an advantage.

                          Following Ref. [35], we do not try to estimate the dye-specific bias of individual spots or genes (i.e., dye-gene interaction) in order to preserve degrees of freedom needed to estimate the variance. Informally we noted that dye bias in some spots produced high measured variances that caused those spots to be considered non-significant outliers. A post-hoc test to warn of potential dye bias of individual spots may be appropriate for small numbers of array replicates (e.g. N = 3), especially if the experimental design is unbalanced (i.e., the number of dye-swapped and unswapped arrays is not equal).

                          Note that this study only considered statistics of the general form (ratio) / (standard deviation). ANOVA models that consider the variance as intensity-dependent, as seen in Ref. [15, 25], can be seen as an extension of this concept. An ANOVA framework, however, also allows for a more complicated experimental model that can incorporate normalization and multiple biological conditions. Pooling standard deviations as a function of minimum intensity instead of average intensity may benefit such models. Permutation tests can also be used to detect regulated genes, and are known to be robust to outliers but can have low power for small N. Xu et al. found a permutation test to be equally or less accurate than parametric methods in ranking genes [36]. Bayesian analysis can also be applied to microarray data [13, 20, 21], and may be useful in this context to draw more information out of the distribution of intensities and ratios in the data.

                          In this study, data is first normalized, and then detection of regulated genes is performed in a separate step. In contrast, other approaches incorporate normalization and statistical inference into a unified model [29, 35]. Furthermore, the options for normalizing the data are numerous, including algorithms based on local regression (loess) [7], splines [37], a constant shift [15], or more exotic transforms that tend to remove the intensity dependence of the variance [38]. Increased attention to the low-level details of scanning and image processing may also improve accuracy [22, 33, 39], while at the same time potentially changing the intensity dependence of the variance. It remains to be seen how the techniques used for normalization or variance-stabilizing transforms will impact the accuracy and precision of regulated gene detection. In addition, we are concerned that some of these transforms may create a systematic bias for or against genes of low intensity (e.g., [40]).

                          Test performance can depend on data characteristics

                          Although many datasets have a variance that is intensity-dependent [12, 15, 21–26], some studies have analyzed datasets whose variance characteristics are not strongly intensity-dependent (e.g., [35]). In general, we have experienced that microarray datasets with a low background relative to signal, loess-based normalization, and conservative background subtraction (e.g. SPOT Image Processing) produce standard deviations that are not strongly intensity-dependent. In this context, the differences between the Imin and Iavg metrics disappear. In fact, for data with unusually low noise, the standard deviations is nearly constant across all spots and all of the statistical tests considered in this paper, even simply the average logged ratio, tend to converge. This observation is not unexpected as the standard deviations converge to the same value, the denominator of the test statistics will become constant, leaving the test statistics simply proportional to the ratio. We would recommend finding a normalization [7, 29, 33, 37] and background subtraction technique [22, 32, 39] that produces low, intensity-independent standard deviations. Applying variance stabilizing transforms may eliminate the intensity dependence of the standard deviation [38], but might also reduce statistical power or bias the test toward spots of certain intensities. It cannot be predicted in advance whether all intensity dependence of the variation will be removed, so we continue to use the more robust statistic Z(Imin) for all of our datasets. Furthermore, in situations where changing the background subtraction or normalization technique is not possible because the original data is not available, using a more robust statistic like Z(Imin) will be advantageous.

                          While the pooling techniques described herein can compensate for intensity-dependent variation, this intensity dependence can be minimized or exaggerated by different normalization techniques and background subtraction techniques. These techniques may have subtle effects on the power to detect regulated genes at different intensities, perhaps creating bias for or against detection of low-expressed genes. For this reason, until the most sensitive and unbiased normalization and background subtraction methods are optimized for each microarray system, we would encourage creators of microarray data archives to preserve unnormalized intensity and background data, and the original image data when possible.

                          Of the many useful tests used to detect regulated genes from a small number of microarray replicates, we see the intensity-based variance estimation and Z-statistic described in this study to be a good combination of simplicity, robustness, precision, and accuracy. This technique allows meaningful p-values to be added to a list of regulated genes. With this assessment of statistical significance, an investigator can proceed to focus on genes that are most likely to be regulated.


                          Implementation

                          BATS is a graphical user-friendly software written in MATLAB. Executable program for Windows, Linux and Mac Osx, the source code and the user manual can be freely downloaded from http://www.na.iac.cnr.it/bats.

                          Permission to use, copy, modify, and distribute BATS for any purpose without fee is granted by the BATS permissive license (derived from the MIT license). The compiled software needs to run the MATLAB component Runtime (MCR), also available on the website for the sole purpose of running BATS.

                          Current implementation of BATS is designed for a single processor, and it is fast enough for any practical purpose. Version 1.0 of BATS is composed of two main applications: A NALYSIS and S IMULATIONS it is equipped with a third option, U TILITY , which provides additional functions. Each application can be activated from the main window (see Figure 1).

                          The Main Menu of BATS.

                          A context-specific H ELP button is present in all windows, providing all necessary information as well as a short description of all the parameters required by a procedure. The A BOUT button reports the Terms of the License. A more detailed description can be found in the U SER R EFERENCE M ANUAL . The guided T UTORIAL available on the website can be used for a fast introduction to the software. In what follows, we briefly describe each application.

                          Analysis

                          The A NALYSIS application allows to apply the methodology developed in [12] to either synthetic or real data-sets. The menu of A NALYSIS application is divided into sub-windows (see Figure 2) which allow an user to define the parameters of the analysis. Obviously, A NALYSIS constitutes the most important part of BATS from biologists' point of view.

                          The Analysis window of BATS.

                          Data can be loaded into the system and analyzed on the basis of any of the three error models described in Section "Methodology" and denoted in the software as M ODEL 1, M ODEL 2 and M ODEL 3, respectively. The input data should be in the EXCEL spreadsheet or a tab-delimited text file format prepared as follows. The first row should contain a text string (i.e., G ENE N AME ) in the first column, and, in the remaining columns, the numerical values of the time measurements t (j) in ascending order and represented in the same time units (seconds, hours, days, etc.). From the second row on, the first column should contain the gene identifier, a unique string of letters or a combination of letters and numbers (numbers only are not allowed). The remaining columns should contain data, z i = ( z i 1 , 1 … z i 1 , k ( 1 ) , ⋯ , z i n , 1 , … z i n , k ( n ) ) [email protected]@[email protected]@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeCOEaO3aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpcqGGOaakcqWG6bGEdaqhaaWcbaGaemyAaKgabaGaeGymaeJaeiilaWIaeGymaedaaOGaeSOjGSKaemOEaO3aa0baaSqaaiabdMgaPbqaaiabigdaXiabcYcaSiabdUgaRnaaCaaameqabaGaeiikaGIaeGymaeJaeiykaKcaaaaakiabcYcaSiabl+UimjabcYcaSiabdQha6naaDaaaleaacqWGPbqAaeaacqWGUbGBcqGGSaalcqaIXaqmaaGccqGGSaalcqWIMaYscqWG6bGEdaqhaaWcbaGaemyAaKgabaGaemOBa4MaeiilaWI[email protected][email protected] ), in the form of log2-signal-to-reference ratios. Missing values can be entered as either empty cells or NaN. Before analyzing microarray data with BATS, the data should be pre-processed to remove systematic sources of variation. For a detailed discussion of the normalization procedures for microarray data we refer the reader to e.g., [17–19] or [20]. We recall (see also Remark 1) that BATS is particularly suitable for those experiments where at least 5–6 different time points are available. Moreover, although BATS automatically accounts for missing data, for a reliable analysis we suggest that the proportion of missing data should remain relatively small (for each gene at least 50–60% of the observations should be available). Note, if the data set to be analyzed does not meet these general requirements, a warning message will be displayed. From the A NALYSIS window, an expert user can choose prior parameters (see Step 1 of the Algorithm). We briefly discuss these choices below. A detailed description can be found in the user-manual.

                          The type of basis functions can be either Legendre or Fourier, with default choice Legendre. The global regularity ν of the gene expression profiles is a real number in [0, 1], (default value 0). The maximum degree Lmax allowed in the expansion is an integer value, default value [n/ 2] as a compromise between the goodness of fit and variance of the estimate. The parameter λ of the Poisson distribution truncated at Lmax has to be chosen in order to match the prior expected degree of the polynomial.

                          Choosing appropriate parameters for the analysis of a particular data-set with BATS usually requires some preliminary knowledge of statistics and some level of expertise. However, a user who is not an expert in statistics should not be discouraged, since for all parameters BATS provides default values that can be used in most cases, and the parameters' sub-windows are hidden by default. If necessary, hidden windows can be opened in order to change the default values.

                          After that, the user can either select a specific method for estimating global parameters π0 and σ0, or enter their values manually by choosing the CUSTOM option (see Step 2 of the Algorithm). In the current version of BATS, estimation of the global parameters is based only on the N cgenes for which the complete set of M observations is available. If the default option S TANDARD remains selected, for each array of observations at a time point t (j) , σ (j) is estimated by the sample standard deviation σ ^ ( j ) [email protected]@[email protected]@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWa[email protected][email protected] . On the other hand, if normal distribution of the data can be justified, by selecting the corresponding option MAD, the sample variance can be replaced by a more robust estimator like the Median Absolute Deviation, which is usually proposed when the majority of array components are zeros [21]. In both cases, the estimator σ ^ 2 [email protected]@[email protected]@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGa[email protected][email protected] is obtained by averaging of ( σ ^ ( j ) [email protected]@[email protected]@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWa[email protected][email protected] ) 2 , j = 1. M.

                          Given σ ^ [email protected]@[email protected]@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0x[email protected][email protected] , with the option U NIVERSAL , following [21], the global parameter π0 is estimated by averaging over the arrays the proportion of data points which fall below the universal threshold σ ^ 2 log N c [email protected]@[email protected]@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafq4WdmNbaKaadaGcaaqaaiabikdaYiG[email protected][email protected] . Note that this method tends to overestimate π0 when the error is normally distributed, but not when the error distribution has heavier tails, which is very common in microarray data.

                          Once one of the three error models has been selected in the box C HOICE OF THE P RIOR M ODEL , the model-dependent parameters are estimated automatically for M ODELS 1 or 3. If M ODEL 2 is selected, the user can further choose the way for estimating the hyperparameters b and γ. Specifically, with C HOICE 2, γ and b are estimated by using the Maximum Likelihood Estimator (MLE) on the set of values σ ^ ( j ) [email protected]@[email protected]@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWa[email protected][email protected] , j = 1. M, which are treated as a sample from the distribution of σ (note that if ( σ ^ ( j ) [email protected]@[email protected]@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWa[email protected][email protected] ) 2

                          IG(γ, b), then ( σ ^ ( j ) [email protected]@[email protected]@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWa[email protected][email protected] ) -2

                          Gamma(γ, b)). If the user selects alternative option C HOICE 1, he/she has to fix γ and then parameter b will be automatically evaluated by matching the mean of IG(γ, b) with σ ^ 2 [email protected]@[email protected]@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGa[email protected][email protected] . We observe that with selection of C HOICE 2 an user does not have to specify any parameters. With C HOICE 1, an user have to specify the positive parameter γ (default value 15). The two options produce slightly different lists of genes and allow to check the robustness of the selections.

                          An user can also choose whether to estimate the degree of the polynomial L iby the posterior mean (option M EAN ) or the posterior mode (option MAP) (Step 5 of the Algorithm) from the box E STIMATION OF THE POLYNOMIAL DEGREE , and what procedure to use for testing which of the genes are differentially expressed (Step 6 of the algorithm) from the box T EST PROCEDURE . In the latter, the default option B INOMIAL refers to the Binomial prior elicited on the number of alternative hypotheses, option TRUNCATED P OISSON (with further choices which of the stepwise approaches to use in order to decide which hypothesis to accept and which to reject, see [15] for details) is based on the truncated Poisson prior. Options S TANDARD ODDS , S TANDARD BF do not implement any multiplicity control and option F ULL R ANK only ranks the genes without providing any automatic cut off.

                          An user has an option to print out the estimated profiles (superimposed to the raw data) for the top 'nfirst' genes according to ranking, either in 'Global scale' (all gene profiles are shown on the same scale to make the figures comparable) or in 'Auto scale' (each gene profile is shown using the most appropriate scale in order to improve visualization). We note that visual inspection of the profiles can be very useful for a quick assessment of the fit.

                          Alternatively, expression profiles of individual genes can be generated later using the Utility – P LOT P ROFILES .

                          Once the necessary parameters have been defined, an user has to choose a Project name and launch the analysis. By default, for each run of the analysis, three files are generated in the folder Projects: a summary of the analysis _SR.txt (reporting all the parameters used), the ordered list of differentially expressed genes _GL.xls for Windows systems or _GL.txt for Linux or Mac Osx, and the estimated gene profiles _SH.xls for windows systems or _SH.txt for Linux or Mac Osx. The dialog window shows intermediate results and stages of the algorithm during the execution of the analysis.

                          Simulations

                          The S IMULATIONS application enables an expert user to generate, analyze and save synthetic data. This feature can be useful for planning experimental design (e.g., for finding an acceptable balance between the cost and the benefits of increasing the number of arrays, for deciding whether to employ new arrays as further replicates at existing time points or at additional time points), for preliminary verification whether BATS is a suitable tool for a given type of experiments, or for generating synthetic data which can be used for comparison of other statistical tools. This application can also be used to enhance understanding of some features of the proposed software. Simulations are indeed a typical tool for validation and comparisons of statistical procedures. They are also widely used in microarray analysis, see, for example, [9, 10] and [13]. Running an appropriate simulation study requires some basic knowledge of statistics and some experience in computing.

                          The S IMULATIONS application consists of two windows. In the first window (see Figure 3) an user provides parameters required to generate synthetic data. In the second window the user can choose how to analyze the generated data set (the second window is similar to the A NALYSIS window).