R Script: affyClassify

Approx. CPU requirement: minutes
Description:

affyClassify R Documentation

Statistical Classification of Affymetrix CEL Files

Description

This script classifies CEL files into groups. Classification can be useful in exploratory (how well do probesets discriminate samples into known groups?) or confirmatory (do the probesets correctly predict the group to which a sample belongs) analyses.

Usage

affyClassify(dataset, preprocess = c("rma", "vsnrma"),
             method = c("knn", "svm", "lda", "dlda"),
             covariate, parameters = list())

Arguments

dataset A ‘zip’ archive containing (a) Affymetrix CEL files and (b) a ‘targets.csv’ file. The ‘targets.csv’ file is a comma-separated table, as might be exported from Excel. The file must contain a column ‘FileName’ that references each CEL file in the archive to be included in the analysis.
preprocess The 'preprocess' option is one of 'rma', or 'vsnrma'. 'rma' pre-processing is implemented in the affy Bioconductor package. 'vsnrma' is implemented in the vsn Bioconductor package. In addition, only the probesets with between-sample variability in the top 20th quantile are retained.
method The 'method' for classification is one of 'knn' (k-nearest neighbor), 'svm' (support vector machine), 'lda' (linear discriminant analysis), and 'dlda' (diagonal linear discriminant analysis). All analyses are done using the MLearn function of the MLInterfaces Bioconductor package, using leave-one-out cross validation.
covariate The 'covariate' is the factor, defined in the .csv file, describing which group each sample is associated with. For the sample Cel files zip archive provided, set this value to: "mol.biol".
parameters The 'knn' algorithm requires parameters 'numberOfNeighbors' and 'minimumVote'.

Value

A classifierOutput class, as defined in the MLInterfaces package.

Author(s)

Martin Morgan <mtmorgan@fhcrc.org>

Examples

classification <-
    affyClassify(dataset, preprocess, method, covariate,
                 list(numberOfNeighbors=1, minimumVote=0))
predictions <-
    merge(as.data.frame(classification@testOutcomes),
          as.data.frame(classification@testPredictions), by=0)
names(predictions) <- c("CEL file", "Actual", "Predicted")
predictions

confusionMatrix <-
    MLInterfaces::confuMat(classification)
confusionMatrix

Source code:


   1  version <- '0.0.6'
   2  #<crdata_text>Script version: </crdata_text>
   3  #<crdata_object>version</crdata_object>
   4  CRDATA_SOURCE_R_FILES <- TRUE
   5  ## setup, if source'd in scripts
   6  
   7  if (interactive() ||
   8      (exists("CRDATA_SOURCE_R_FILES") && get("CRDATA_SOURCE_R_FILES")))
   9  {
  10      require(Biobase)
  11      require(Category)
  12      require(KEGG.db)
  13      require(MLInterfaces)
  14      require(RColorBrewer)
  15      require(affy)
  16      require(annotate)
  17      require(bioDist)
  18      require(genefilter)
  19      require(limma)
  20      require(multicore)
  21      require(vsn)
  22  }
  23  
  24  ## utilities
  25  
  26  .stopf <- function(fmt, ...)
  27      stop(sprintf(fmt, ...), call.=FALSE)
  28  
  29  ## steps
  30  
  31  .affyPhenoData <-
  32      function(dataset, covariate)
  33      ## 'covariate' provides a test allowing for early exit
  34  {
  35      pData <- read.csv(dataset)
  36      ## validation
  37      if (0L == nrow(pData))
  38          .stopf("%s has 0 rows", "dataset")
  39      requiredNames <- c("FileName", covariate)
  40      if (!all(requiredNames %in% names(pData)))
  41          .stopf("'%s' must contain names '%s'", basename(dataset),
  42                 paste(requiredNames, collapse="' '"))
  43      if (!is.factor(pData[["FileName"]]))
  44          .stopf("'%s' column '%s' must be %s", basename(dataset),
  45                 "FileName", "factor()")
  46      if (1L == length(covariate) && !is.factor(pData[[covariate]]))
  47          .stopf("'%s' column '%s' must be %s", basename(dataset),
  48                 covariate, "factor()")
  49  
  50      new("AnnotatedDataFrame", data=pData)
  51  }
  52  
  53  .affyPreprocess <-
  54      function(dataset, algorithm, covariate=character(0))
  55  {
  56      dir <- tempfile()
  57      if (!dir.create(dir))
  58          .stopf("failed to create directory '%s'", dir)
  59      on.exit(unlink(dir))
  60  
  61      tryCatch(unzip(dataset, exdir=dir),
  62               error=function(err) {
  63                   .stopf("failed to unzip dataset '%s': %s",
  64                          dataset, conditionMessage(err))
  65               })
  66      target <- list.files(dir, ".*\\.csv$", full=TRUE)
  67      if (1L != length(target))
  68          .stopf("dataset '%s' does not contain a '.csv' file",
  69                 dataset)
  70      phenoData <- .affyPhenoData(target, covariate)
  71  
  72      fnames <- as.character(phenoData[["FileName"]])
  73      if (!all(ok <- file.exists(file.path(dir, fnames))))
  74          .stopf("'%s' does not contain file(s) listed in .csv:\n    %s",
  75                 basename(dataset),
  76                 paste(basename(fnames)[!ok], collapse="\n    "))
  77      switch(algorithm,
  78             rma = {
  79                 justRMA(filenames=fnames, phenoData=phenoData,
  80                         celfile.path=dir)
  81             },
  82             vsnrma = {
  83                 vsnrma(ReadAffy(filenames=fnames, phenoData=phenoData,
  84                                 celfile.path=dir))
  85             })
  86  }
  87  
  88  .esetNonSpecificFilter <-
  89      function(eset, ...)
  90  {
  91      nsFilter(eset, ...)$eset
  92  }
  93  
  94  .esetScale <-
  95      function(eset)
  96  {
  97      iqr <- esApply(eset, 1, IQR)
  98      scale(t(exprs(eset)), rowMedians(eset), iqr[featureNames(eset)])
  99  }
 100  
 101  .esetDifferentialExpression <-
 102      function(eset, design)
 103  {
 104      efit <- eBayes(lmFit(eset, design))
 105      topTable(efit, coef=2, number=Inf)
 106  }
 107  
 108  .esetDistance <-
 109      function(eset, metric)
 110  {
 111      switch(metric,
 112             manhattan=,
 113             euclidean= dist(.esetScale(eset), method=metric),
 114             spearman= spearman.dist(eset),
 115             .stopf("'%s' metric unknown", metric))
 116  }
 117  
 118  .distHeatmap <-
 119      function(dist)
 120  {
 121      col <- rev(colorRampPalette(brewer.pal(10, "RdBu"))(256))
 122      heatmap(as.matrix(dist), sym=TRUE, col=col,
 123              distfun=function(x) as.dist(x))
 124  }
 125  
 126  .esetCluster <-
 127      function(eset, method, groups)
 128  {
 129      data <- as.data.frame(.esetScale(eset))
 130      switch(method,
 131             hclust=MLearn(~., data, hclustI(dist, list(k=groups))),
 132             kmeans=MLearn(~., data, kmeansI, centers=groups,
 133                           nstart=25L, algorithm="Hartigan-Wong"),
 134             .stopf("'%s' method unknown", method))
 135  }
 136  
 137  .esetClassify <-
 138      function(eset, method, covariate, parameters)
 139  {
 140      data <- cbind(as.data.frame(.esetScale(eset)), eset[[covariate]])
 141      names(data)[ncol(data)] <- covariate
 142      formula <- as.formula(sprintf("%s ~ .", covariate))
 143      xvalSpec <- xvalSpec("LOO")
 144      switch(method,
 145             knn={
 146                 vars <- c("numberOfNeighbors", "minimumVote")
 147                 if (!all(ok <- vars %in% names(parameters)))
 148                     .stopf("'%s' must have values for '%s'",
 149                            "parameters",
 150                            paste(vars[!ok], collapse="', '"))
 151                 MLearn(formula, data,
 152                        knnI(k=parameters[["numberOfNeighbors"]],
 153                             l=parameters[["minimumVote"]]),
 154                        xvalSpec)
 155             },
 156             svm=MLearn(formula, data, svmI, xvalSpec),
 157             lda=MLearn(formula, data, ldaI, xvalSpec),
 158             dlda=MLearn(formula, data, dldaI, xvalSpec),
 159             .stopf("'%s' method unknown", method))
 160  }
 161  
 162  .esetContinuousGeneSet <-
 163      function(eset, pathway, covariate,
 164               parameters=list(minimumGenes=10L))
 165  {
 166      ## FIXME: Permuation significance
 167      ttests <- rowttests(eset, covariate)
 168      probeIds <- findLargest(featureNames(eset), abs(ttests$statistic),
 169                            annotation(eset))
 170      havePath <- probes2Path(probeIds, annotation(eset))
 171      probeIds <- probeIds[probeIds %in% names(havePath)]
 172      eset1 <- eset[probeIds,]
 173  
 174      adjMatrix <- local({
 175          m <- t(PWAmat(annotation(eset)))
 176          m <- m[, names(probeIds), drop=FALSE]
 177          idx <- rowSums(m) >= parameters[["minimumGenes"]]
 178          m <- m[idx,, drop=FALSE]
 179      })
 180  
 181      idx <- colSums(adjMatrix) != 0L
 182      probeIds <- probeIds[idx]
 183      adjMatrix <- adjMatrix[, idx, drop=FALSE]
 184  
 185      z <- local({
 186          x <- adjMatrix %*% ttests[probeIds, "statistic"]
 187          x / sqrt(rowSums(adjMatrix))
 188      })
 189      logFC <- adjMatrix %*% ttests[probeIds, "dm"]
 190      
 191      pathwayName <-
 192          mappedRkeys(KEGGPATHID2NAME[row.names(z)])
 193      genesInPathway <- rowSums(adjMatrix)
 194      df <- data.frame(N=genesInPathway, logFC=logFC, z=z,
 195                       Pathway=pathwayName,
 196                       row.names=row.names(z))
 197      df[order(abs(df[["z"]]), decreasing=TRUE),,drop=FALSE]
 198  }
 199  
 200  ## work flows
 201  ## 'force()' used to prompt early error for missing parameters
 202  
 203  affyPreprocess <-
 204      function(dataset, preprocess=c("rma", "vsnrma"))
 205  {
 206      force(dataset)
 207      preprocess <- match.arg(preprocess)
 208      .affyPreprocess(dataset, preprocess)
 209  }
 210  
 211  affyHeatmap <-
 212      function(dataset, preprocess=c("rma", "vsnrma"),
 213               metric=c("manhattan", "euclidean", "spearman"))
 214  {
 215      force(dataset)
 216      preprocess <- match.arg(preprocess)
 217      metric <- match.arg(metric)
 218  
 219      eset <- .affyPreprocess(dataset, preprocess)
 220      eset <- .esetNonSpecificFilter(eset, var.cutoff=0.8)
 221      .distHeatmap(.esetDistance(eset, metric))
 222  }
 223  
 224  affyDifferentialExpression <-
 225      function(dataset, preprocess=c("rma", "vsnrma"),
 226               covariate)
 227  {
 228      force(dataset)
 229      preprocess <- match.arg(preprocess)
 230      force(covariate)
 231  
 232      eset <- .affyPreprocess(dataset, preprocess, covariate)
 233      eset <- .esetNonSpecificFilter(eset, var.cutoff=0.5)
 234  
 235      formula <- as.formula(sprintf("~ %s", covariate))
 236      design <- model.matrix(formula, pData(eset))
 237      de <- .esetDifferentialExpression(eset, design)
 238  
 239      eid <- getAnnMap("ENTREZID", annotation(eset))
 240      gid <- getAnnMap("GENENAME", annotation(eset))
 241      de <- merge(de,
 242                  merge(toTable(eid[de$ID]), toTable(gid[de$ID])),
 243                  by.x="ID", by.y="probe_id", sort=FALSE)
 244      idx <- match(c("ID", "gene_id", "gene_name"), names(de))
 245      names(de)[idx] <- c("ProbesetId", "EntrezId", "GeneName")
 246      de
 247  }
 248  
 249  affyCluster <-
 250      function(dataset, preprocess=c("rma", "vsnrma"),
 251               method=c("hclust", "kmeans"),
 252               groups=2L)
 253  {
 254      force(dataset)
 255      preprocess <- match.arg(preprocess)
 256      method <- match.arg(method)
 257  
 258      eset <- .affyPreprocess(dataset, preprocess)
 259      eset <- .esetNonSpecificFilter(eset, var.cutoff=0.8)
 260      .esetCluster(eset, method, groups)
 261  }
 262  
 263  affyClassify <-
 264      function(dataset, preprocess=c("rma", "vsnrma"),
 265               method=c("knn", "svm", "lda", "dlda"), covariate,
 266               parameters=list())
 267  {
 268      force(dataset)
 269      ## uses leave-one-out cross-validation
 270      preprocess <- match.arg(preprocess)
 271      method <- match.arg(method)
 272      force(covariate)
 273  
 274      eset <- .affyPreprocess(dataset, preprocess, covariate)
 275      eset <- .esetNonSpecificFilter(eset, var.cutoff=0.8)
 276      .esetClassify(eset, method, covariate, parameters)
 277  }
 278  
 279  affyContinuousGeneSet <-
 280      function(dataset, preprocess=c("rma", "vsnrma"), covariate)
 281  {
 282      force(dataset)
 283      preprocess <- match.arg(preprocess)
 284      pathway <- "KEGG"               # FIXME: support for more pathways
 285      force(covariate)
 286  
 287      eset <- .affyPreprocess(dataset, preprocess, covariate)
 288      eset <- .esetNonSpecificFilter(eset, var.cutoff=0.5)
 289      .esetContinuousGeneSet(eset, pathway, covariate)
 290  }
 291  #<crdata_title>Classification of Affymetrix CEL Files</crdata_title>
 292  
 293  #<crdata_text>This script classifies CEL files into groups. Classification can be useful in exploratory (how well do probesets discriminate samples into known groups?) or confirmatory (do the probesets correctly predict the group to which a sample belongs) analyses.</crdata_text>
 294  
 295  #<crdata_text>The 'preprocess' option is one of 'rma', or 'vsnrma'. 'rma' pre-processing is implemented in the affy Bioconductor package. 'vsnrma' is implemented in the vsn Bioconductor package. In addition, only the probesets with between-sample variability in the top 20th quantile are retained.</crdata_text>
 296  
 297  #<crdata_text>The 'method' for classification is one of 'knn' (k-nearest neighbor), 'svm' (support vector machine), 'lda' (linear discriminant analysis), and 'dlda' (diagonal linear discriminant analysis). All analyses are done using the MLearn function of the MLInterfaces Bioconductor package, using leave-one-out cross validation.</crdata_text>
 298  
 299  #<crdata_text>The 'covariate' is the factor, defined in the .csv file, describing which group each sample is associated with.</crdata>
 300  
 301  #<crdata_text>The 'knn' algorithm requires parameters 'numberOfNeighbors' and 'minimumVote'.</crdata_text>
 302  
 303  #<crdata_text>Supplied parameters:</crdata_text>
 304  
 305  #<crdata_text>celArchive: </crdata_text> 
 306  #<crdata_object>celArchive</crdata_object>
 307  
 308  #<crdata_text>preprocess: </crdata_text> 
 309  #<crdata_object>preprocess</crdata_object>
 310  
 311  #<crdata_text>method: </crdata_text> 
 312  #<crdata_object>method</crdata_object>
 313  
 314  #<crdata_text>covariate: </crdata_text> 
 315  #<crdata_object>covariate</crdata_object>
 316  
 317  parameters <-
 318      list(numberOfNeighbors=knnNumberOfNeighbors,
 319           minimumVote=knnMinimumVote)
 320  #<crdata_text>parameters: </crdata_text>
 321  #<crdata_object>parameters</crdata_object>
 322  
 323  classification <-
 324      affyClassify(celArchive, preprocess, method, covariate,
 325                   parameters)
 326  predictions <-
 327      merge(as.data.frame(classification@testOutcomes),
 328            as.data.frame(classification@testPredictions), by=0)
 329  names(predictions) <- c("CEL file", "Actual", "Predicted")
 330  #<crdata_object>predictions</crdata_object>
 331  
 332  #<crdata_text>The 'confusion matrix' summarizes how often the classifier assigns each sample to the correct group; ideally one would like a classifier that always assigned samples correctly</crdata_text>
 333  
 334  confusionMatrix <- confuMat(classification)
 335  #<crdata_object>confusionMatrix</crdata_object>



Ratings by Hamid CRdata team:

Speed of Execution:
  • Your rating: 4 out of 5
  • 4
  • 4
  • 4
  • 4
  • 4
Ease of Use:
  • Your rating: 4 out of 5
  • 4
  • 4
  • 4
  • 4
  • 4
Reliability of Results:
  • Your rating: 4 out of 5
  • 4
  • 4
  • 4
  • 4
  • 4
Quality of Docs:
  • Your rating: 3 out of 5
  • 3
  • 3
  • 3
  • 3
  • 3

Comment:
HB

testing



Back