Last data update: 2014.03.03

R: Generalised gene set testing for 450K methylation data
gsamethR Documentation

Generalised gene set testing for 450K methylation data

Description

Given a user specified list of gene sets to test, gsameth tests whether significantly differentially methylated CpG sites are enriched in these gene sets.

Usage

gsameth(sig.cpg, all.cpg = NULL, collection, plot.bias = FALSE, prior.prob = TRUE)

Arguments

sig.cpg

character vector of significant CpG sites to test for gene set enrichment

all.cpg

character vector of all CpG sites tested. Defaults to all CpG sites on the array.

collection

a list of user specified gene sets to test. Can also be a single character vector gene set. Gene identifiers must be Entrez Gene IDs.

plot.bias

logical, if true a plot showing the bias due to the differing numbers of probes per gene will be displayed

prior.prob

logical, if true will take into account the probability of significant differentially methylation due to numbers of probes per gene. If false, a hypergeometric test is performed ignoring any bias in the data.

Details

This function extends gometh, which only tests GO and KEGG pathways. gsameth can take a list of user specified gene sets and test whether the significant CpG sites are enriched in these pathways. gsameth maps the CpG sites to Entrez Gene IDs and tests for pathway enrichment using a hypergeometric test, taking into account the number of CpG sites per gene on the 450K array. Please note the gene ids for the collection of gene sets must be Entrez Gene IDs.

Geeleher et al. (2013) showed that a severe bias exists when performing gene set analysis for genome-wide methylation data that occurs due to the differing numbers of CpG sites profiled for each gene. gsameth and gometh is based on the goseq method (Young et al., 2010). If prior.prob is set to FALSE, then prior probabilities are not used and it is assumed that each gene is equally likely to have a significant CpG site associated with it.

Genes associated with each CpG site are obtained from the annotation package IlluminaHumanMethylation450kanno.ilmn12.hg19. In order to get a list which contains the mapped Entrez gene IDS, please use the getMappedEntrezIDs function.

Value

A data frame with a row for each gene set and the following columns:

N

number of genes in the gene set

DE

number of genes that are differentially methylated

P.DE

p-value for over-representation of the gene set

FDR

False discovery rate, calculated using the method of Benjamini and Hochberg (1995).

Author(s)

Belinda Phipson

References

Geeleher, P., Hartnett, L., Egan, L. J., Golden, A., Ali, R. A. R., and Seoighe, C. (2013). Gene-set analysis is severely biased when applied to genome-wide methylation data. Bioinformatics, 29(15), 1851–1857.

Young, M. D., Wakefield, M. J., Smyth, G. K., and Oshlack, A. (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology, 11, R14.

Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., and Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, gkv007.

Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series, B, 57, 289-300.

See Also

gometh,getMappedEntrezIDs

Examples

library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
library(org.Hs.eg.db)
library(limma)
ann <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)

# Randomly select 1000 CpGs to be significantly differentially methylated
sigcpgs <- sample(rownames(ann),1000,replace=FALSE)

# All CpG sites tested
allcpgs <- rownames(ann)

# Use org.Hs.eg.db to extract a GO term
GOtoID <- toTable(org.Hs.egGO2EG)
setname1 <- GOtoID$go_id[1]
setname1
keep.set1 <- GOtoID$go_id %in% setname1
set1 <- GOtoID$gene_id[keep.set1]
setname2 <- GOtoID$go_id[2]
setname2
keep.set2 <- GOtoID$go_id %in% setname2
set2 <- GOtoID$gene_id[keep.set2]

# Make the gene sets into a list
sets <- list(set1, set2)
names(sets) <- c(setname1,setname2)

# Testing with prior probabilities taken into account
# Plot of bias due to differing numbers of CpG sites per gene
gst <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, plot.bias = TRUE, prior.prob = TRUE)
topGSA(gst)

# Testing ignoring bias
gst.bias <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, prior.prob = FALSE)
topGSA(gst.bias)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(missMethyl)
Setting options('download.file.method.GEOquery'='auto')
Setting options('GEOquery.inmemory.gpl'=FALSE)

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/missMethyl/gsameth.Rd_%03d_medium.png", width=480, height=480)
> ### Name: gsameth
> ### Title: Generalised gene set testing for 450K methylation data
> ### Aliases: gsameth
> 
> ### ** Examples
> 
> library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
Loading required package: minfi
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: lattice
Loading required package: GenomicRanges
Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biostrings
Loading required package: XVector
Loading required package: bumphunter
Loading required package: foreach
Loading required package: iterators
Loading required package: locfit
locfit 1.5-9.1 	 2013-03-22
> library(org.Hs.eg.db)
Loading required package: AnnotationDbi
> library(limma)

Attaching package: 'limma'

The following object is masked from 'package:BiocGenerics':

    plotMA

> ann <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
> 
> # Randomly select 1000 CpGs to be significantly differentially methylated
> sigcpgs <- sample(rownames(ann),1000,replace=FALSE)
> 
> # All CpG sites tested
> allcpgs <- rownames(ann)
> 
> # Use org.Hs.eg.db to extract a GO term
> GOtoID <- toTable(org.Hs.egGO2EG)
> setname1 <- GOtoID$go_id[1]
> setname1
[1] "GO:0008150"
> keep.set1 <- GOtoID$go_id %in% setname1
> set1 <- GOtoID$gene_id[keep.set1]
> setname2 <- GOtoID$go_id[2]
> setname2
[1] "GO:0001869"
> keep.set2 <- GOtoID$go_id %in% setname2
> set2 <- GOtoID$gene_id[keep.set2]
> 
> # Make the gene sets into a list
> sets <- list(set1, set2)
> names(sets) <- c(setname1,setname2)
> 
> # Testing with prior probabilities taken into account
> # Plot of bias due to differing numbers of CpG sites per gene
> gst <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, plot.bias = TRUE, prior.prob = TRUE)
Warning message:
In alias2SymbolTable(flat$symbol) :
  Multiple symbols ignored for one or more aliases
> topGSA(gst)
             N DE       P.DE       FDR
GO:0001869   2  0 0.06228641 0.1245728
GO:0008150 645 23 0.49800420 0.4980042
> 
> # Testing ignoring bias
> gst.bias <- gsameth(sig.cpg = sigcpgs, all.cpg = allcpgs, collection = sets, prior.prob = FALSE)
Warning message:
In alias2SymbolTable(flat$symbol) :
  Multiple symbols ignored for one or more aliases
> topGSA(gst.bias)
             N DE       P.DE       FDR
GO:0001869   2  0 0.07636356 0.1527271
GO:0008150 645 23 0.62065630 0.6206563
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>