R Script: sequenceDifferentialExperssion (RNA-seq)

Approx. CPU requirement: minutes
Description:

sequenceDifferentialExpression R Documentation

RNA-Seq Differential Expression

Description

This script performs a two-sample test for RNA-seq differential expression. It is based on the work flow and assumptions of the DESeq package, using specific methods for between-sample normalization, variance stabilization, and differential expression. Details are available in the DESeq package vignette, http://bioconductor.org/packages/release/bioc/html/DESeq.html.

Usage

sequenceDifferentialExpression((dataset, condition1, condition2)

Arguments

dataset A tab delimited text file of sample and count data.

 

The first row contains information describing the condition to which each sample is classified. The first entry (corresponding to the column described below containing the region of interest) is empty. There must be exactly two unique conditions across all samples.

The second row contains information about sample names. The first cell of the second row applies to the feature name column, and is ignored. The remaining columns contain sample names, and must be distinct.

The third and subsequent rows contain a name identifiying the region to which counts apply (e.g., gene names), followed by counts of reads found in each sample. There can be no missing values. All region names must be distinct.

An example of this file is the ‘SequenceDifferentialExpressionInput.tab’ file provided.

condition1, condition2 These parameters determine the direction of the comparsion – condition 2 versus condition 1. For the sample dataset provided, set these values to "Normal" and "Treatment".

Value

A topTable summarizing count differences, and ordered according to adjusted P value. The columns of the top table are:

Id The region of interest identifier from the input data.
NormalizedMean The mean count across all samples, after normalizing so that each sample is approximately equally represented.
NormalizedMean.Condition1 The mean of the counts divided by the size factors for the counts for condition 1
NormalizedMean.Condition2 The mean of the counts divided by the size factors for the counts for condition 2
log2FoldChange The log2 fold change, condition 2 versus condition 1
PValue The P value for rejecting the null hypothesis ‘mean of condition 1 is equal to mean of condition 2’
AdjustedPValue The P values adjusted for false discovery using the Benjamini-Hochberg algorithm.

 

Author(s)

Martin Morgan <mtmorgan@fhcrc.org>

Examples

topTable <-
    sequenceDifferentialExpression(dataset, condition1, condition2)

volcano <-
    xyplot(-log10(AdjustedPValue)~Log2FoldChange, topTable,
           type=c("g", "p"),
           xlab="Log2 Fold Change, Condition2 vs Condition1",
           ylab="-Log10 Adjusted (Benjamini-Hochberg) P value")
volcano

xtab <- xtabs(~., topTable[,3:4]!=0)
xtab

Source code:


   1  version <- '0.0.6'
   2  #<crdata_text>Script version: </crdata_text>
   3  #<crdata_object>version</crdata_object>
   4  CRDATA_SOURCE_R_FILES <- TRUE
   5  if (interactive() ||
   6      (exists("CRDATA_SOURCE_R_FILES") && get("CRDATA_SOURCE_R_FILES")))
   7  {
   8      library(DESeq)
   9      library(GenomicFeatures)
  10      library(GenomicRanges)
  11      library(IRanges)
  12  }
  13  
  14  sequenceDifferentialExpression <-
  15      function(dataset, condition1, condition2)
  16  {
  17      conn <- file(dataset, "r")
  18      conditions <- scan(conn, "character", nlines=1)
  19      counts <- read.delim(conn, header=TRUE, stringsAsFactors=TRUE,
  20                           row.names=1)
  21      close(conn)
  22  
  23      cds <- newCountDataSet( counts, conditions )
  24      cds <- estimateSizeFactors( cds )
  25      cds <- estimateVarianceFunctions( cds )
  26      topTable <- nbinomTest( cds, condition1, condition2)[,c(1:4, 6:8)]
  27      names(topTable) <-
  28          c("Id", "NormalizedMean",
  29            "NormalizedMean.Condition1", "NormalizedMean.Condition2", 
  30            "Log2FoldChange", "PValue", "AdjustedPValue")
  31      data.frame(topTable[order(topTable$AdjustedPValue),], row.names=NULL)
  32  }
  33  
  34  sequenceCountsPerTranscript <- 
  35      function(bamFiles, genome, tableName,
  36               transcriptSequenceNames,
  37               bamSequenceNames)
  38  {
  39      if (missing(genome) ||
  40          (length(genome) == 1 && !nzchar(genome))) {
  41          require(rtracklayer)
  42          return(list(genome=ucscGenomes(),
  43                      supportedTables=supportedUCSCtables()))
  44      }
  45      nameMap <- transcriptSequenceNames
  46      names(nameMap) <- bamSequenceNames
  47      txdb <-
  48          makeTranscriptDbFromUCSC(genome=genome, tablename=tableName)
  49      tx <- transcripts(txdb)
  50      txLevels <- levels(seqnames(tx))
  51      if (!all(idx <- nameMap %in% txLevels))
  52          stop("'transcriptSequenceNames' do not exist in transcript list: '",
  53               paste(nameMap[!idx], collapse="' '"), "'")
  54      for (bamFile in bamFiles) {
  55          bam <- readGappedAlignments(bamFile)
  56          idx <- match(rname(bam), names(nameMap), nomatch=0L) >= 0L
  57          bam <- bam[idx,]
  58          rname(bam) <- factor(nameMap[as.character(rname(bam))],
  59                               levels=nameMap)
  60          values(tx)[[basename(bamFile)]] <- countOverlaps(tx, bam)
  61      }
  62      df <- as.data.frame(tx)
  63      df$Total <- rowSums(df[,8:ncol(df), drop=FALSE])
  64      df
  65  }
  66  #<crdata_title>Sequence-based Differential Expression</crdata_title>
  67  
  68  #<crdata_text>Supplied parameters:</crdata_text>
  69  
  70  #<crdata_text>dataset: </crdata_text>
  71  #<crdata_object>dataset</crdata_object>
  72  
  73  #<crdata_text>condition1: </crdata_text>
  74  #<crdata_object>condition1</crdata_object>
  75  
  76  #<crdata_text>condition2: </crdata_text>
  77  #<crdata_object>condition2</crdata_object>
  78  
  79  topTable <-
  80      sequenceDifferentialExpression(dataset, condition1, condition2)
  81  
  82  #<crdata_text>The figure below shows the relationship between log fold change (on the X axis) and Benjamini-Hochberg adjusted (i.e., for false discovery rate) P values </crdata_text>
  83  
  84  #volcano 
  85  library(lattice)
  86  #<crdata_image>
  87  print(xyplot(-log10(AdjustedPValue)~Log2FoldChange, topTable, type=c("g", "p"),
  88             xlab=sprintf("Log2 Fold Change, '%s' vs '%s'", condition2, condition1),
  89             ylab="-Log10 Adjusted (Benjamini-Hochberg) P value"))
  90  #</crdata_image>
  91  
  92  xtab <- xtabs(~., topTable[,3:4]!=0)
  93  
  94  #<crdata_text>The following table summarizes the number of tags unique to each sample, e.g., the cell corresponding to the row labeled 'TRUE' and the column labelled 'FALSE' represent the regions present only in condition1</crdata_text>
  95  
  96  #<crdata_object>xtab</crdata_object>
  97  
  98  #<crdata_text>Here we save the table to a file 'job-JOBID-SequenceDifferentialExpression-TopTable.txt'</crdata_text>
  99  write.table(topTable, file="SequenceDifferentialExpression-TopTable.txt")




Back