S4 class for storing genotypes efficiently as
column-oriented sparse matrices along with variant info
Details
This class stores genotypes as a column-oriented sparse numeric
matrix, where rows correspond to samples and columns correspond to
variants. This is accomplished by extending the
dgCMatrix class from which this class inherits
all slots. Information about variants is stored in an additional slot
named variantInfo. This slot must be of class
VariantInfo and have exactly as many elements as the
genotype matrix has columns. The variantInfo slot has a
dedicated metadata column named “MAF” that contains the minor
allele frequencies (MAFs) of the variants. For convenience,
accessor functions variantInfo and MAF are available (see
below).
Objects of this class should only be created and manipulated by the
constructors and accessors described below, as only these methods ensure the
integrity of the created objects. Direct modification of object slots
is strongly discouraged!
Constructors
See help pages genotypeMatrix and
readGenotypeMatrix.
Methods
show
signature(object="GenotypeMatrix"):
displays the matrix dimensions (i.e. the number of samples and variants)
along with some basic statistics of the minor allele frequency
(MAF).
Accessors
variantInfo
signature(object="GenotypeMatrix"):
returns variant information as a VariantInfo
object.
MAF
signature(object="GenotypeMatrix"):
returns a numeric vector with the minor allele frequencies (MAFs).
Row and column names can be set and get as usual for matrix-like
objects with rownames and
colnames, respectively.
When setting the column names of a GenotypeMatrix
object, both the names of the variant info (slot variantInfo) and the
column names of the matrix are set.
Subsetting
In the following code snippets, x is a
GenotypeMatrix object.
x[i, ]: returns a
GenotypeMatrix object that only
contains the samples selected by the index vector i
x[, j]: returns a
GenotypeMatrix
object that only contains the variants selected by the index
vector j
x[i, j]: returns a
GenotypeMatrix
object that only contains the samples selected by the index vector
i and the variants selected by the index vector j
None of these subsetting functions support a drop argument.
As soon as a drop argument is supplied, no matter whether
TRUE or FALSE, all variant information is stripped off and a
dgCMatrix object is returned.
By default, MAFs
are not altered by subsetting samples. However, if the optional
argument recomputeMAF is set to TRUE (the default is
FALSE), MAFs are recomputed for the resulting
subsetted genotype matrix as described in
genotypeMatrix. The ploidy for computing MAFs can
be controlled by the optional ploidy argument (the default is 2).
## create a toy example
A <- matrix(rbinom(50, 2, prob=0.2), 5, 10)
sA <- as(A, "dgCMatrix")
pos <- sort(sample(1:10000, ncol(A)))
seqname <- "chr1"
## variant with 'GRanges' object
gr <- GRanges(seqnames=seqname, ranges=IRanges(start=pos, width=1))
gtm <- genotypeMatrix(A, gr)
gtm
as.matrix(gtm)
variantInfo(gtm)
MAF(gtm)
## variant with 'pos' and 'seqnames' object
genotypeMatrix(sA, pos, seqname)
## variant with 'seqname:pos' strings passed through 'pos' argument
spos <- paste(seqname, pos, sep=":")
spos
genotypeMatrix(sA, spos)
## read data from VCF file using 'readVcf()' from the 'VariantAnnotation'
## package
if (require(VariantAnnotation))
{
vcfFile <- system.file("examples/example1.vcf.gz", package="podkat")
sp <- ScanVcfParam(info=NA, genome="GT", fixed=c("ALT", "FILTER"))
vcf <- readVcf(vcfFile, genome="hgA", param=sp)
rowRanges(vcf)
## call constructor for 'VCF' object
gtm <- genotypeMatrix(vcf)
gtm
variantInfo(gtm)
## alternatively, extract information from 'VCF' object and use
## variant with character matrix and 'GRanges' positions
## note that, in 'VCF' objects, rows correspond to variants and
## columns correspond to samples, therefore, we have to transpose the
## genotype
gt <- t(geno(vcf)$GT)
gt[1:5, 1:5]
gr <- rowRanges(vcf)
gtm <- genotypeMatrix(gt, gr)
as.matrix(gtm[1:20, 1:5, recomputeMAF=TRUE])
}
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(podkat)
Loading required package: Rsamtools
Loading required package: GenomeInfoDb
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: Biostrings
Loading required package: XVector
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/podkat/GenotypeMatrix-class.Rd_%03d_medium.png", width=480, height=480)
> ### Name: GenotypeMatrix-class
> ### Title: Class 'GenotypeMatrix'
> ### Aliases: GenotypeMatrix-class class:GenotypeMatrix GenotypeMatrix
> ### [,GenotypeMatrix,index,missing,missing-method
> ### [,GenotypeMatrix,missing,index,missing-method
> ### [,GenotypeMatrix,index,index,missing-method
> ### variantInfo,GenotypeMatrix-method MAF,GenotypeMatrix-method
> ### show,GenotypeMatrix-method
> ### Keywords: classes
>
> ### ** Examples
>
> ## create a toy example
> A <- matrix(rbinom(50, 2, prob=0.2), 5, 10)
> sA <- as(A, "dgCMatrix")
> pos <- sort(sample(1:10000, ncol(A)))
> seqname <- "chr1"
>
> ## variant with 'GRanges' object
> gr <- GRanges(seqnames=seqname, ranges=IRanges(start=pos, width=1))
> gtm <- genotypeMatrix(A, gr)
> gtm
Genotype matrix:
Number of samples: 5
Number of variants: 9
Mean MAF: 0.2333333
Median MAF: 0.2
Minimum MAF: 0.1
Maximum MAF: 0.4
> as.matrix(gtm)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1 0 1 0 0 0 0 0 1
[2,] 0 0 1 0 0 0 2 0 1
[3,] 0 1 0 0 1 0 1 2 1
[4,] 2 0 0 0 0 0 1 0 0
[5,] 1 1 0 1 0 1 0 1 0
> variantInfo(gtm)
VariantInfo object with 9 ranges and 1 metadata column:
seqnames ranges strand | MAF
<Rle> <IRanges> <Rle> | <numeric>
[1] chr1 [1986, 1986] * | 0.4
[2] chr1 [4278, 4278] * | 0.2
[3] chr1 [4802, 4802] * | 0.2
[4] chr1 [4932, 4932] * | 0.1
[5] chr1 [5198, 5198] * | 0.1
[6] chr1 [6467, 6467] * | 0.1
[7] chr1 [6804, 6804] * | 0.4
[8] chr1 [8908, 8908] * | 0.3
[9] chr1 [9019, 9019] * | 0.3
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> MAF(gtm)
[1] 0.4 0.2 0.2 0.1 0.1 0.1 0.4 0.3 0.3
>
> ## variant with 'pos' and 'seqnames' object
> genotypeMatrix(sA, pos, seqname)
Genotype matrix:
Number of samples: 5
Number of variants: 9
Mean MAF: 0.2333333
Median MAF: 0.2
Minimum MAF: 0.1
Maximum MAF: 0.4
>
> ## variant with 'seqname:pos' strings passed through 'pos' argument
> spos <- paste(seqname, pos, sep=":")
> spos
[1] "chr1:1986" "chr1:2409" "chr1:4278" "chr1:4802" "chr1:4932" "chr1:5198"
[7] "chr1:6467" "chr1:6804" "chr1:8908" "chr1:9019"
> genotypeMatrix(sA, spos)
Genotype matrix:
Number of samples: 5
Number of variants: 9
Mean MAF: 0.2333333
Median MAF: 0.2
Minimum MAF: 0.1
Maximum MAF: 0.4
>
> ## read data from VCF file using 'readVcf()' from the 'VariantAnnotation'
> ## package
> if (require(VariantAnnotation))
+ {
+ vcfFile <- system.file("examples/example1.vcf.gz", package="podkat")
+ sp <- ScanVcfParam(info=NA, genome="GT", fixed=c("ALT", "FILTER"))
+ vcf <- readVcf(vcfFile, genome="hgA", param=sp)
+ rowRanges(vcf)
+
+ ## call constructor for 'VCF' object
+ gtm <- genotypeMatrix(vcf)
+ gtm
+ variantInfo(gtm)
+
+ ## alternatively, extract information from 'VCF' object and use
+ ## variant with character matrix and 'GRanges' positions
+ ## note that, in 'VCF' objects, rows correspond to variants and
+ ## columns correspond to samples, therefore, we have to transpose the
+ ## genotype
+ gt <- t(geno(vcf)$GT)
+ gt[1:5, 1:5]
+ gr <- rowRanges(vcf)
+ gtm <- genotypeMatrix(gt, gr)
+ as.matrix(gtm[1:20, 1:5, recomputeMAF=TRUE])
+ }
Loading required package: VariantAnnotation
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'VariantAnnotation'
The following object is masked from 'package:base':
tabulate
snv:6 snv:7
S1 0 0
S2 0 0
S3 0 0
S4 0 0
S5 0 0
S6 0 1
S7 0 0
S8 1 0
S9 0 0
S10 0 0
S11 0 0
S12 0 0
S13 0 0
S14 0 0
S15 1 0
S16 0 0
S17 0 0
S18 0 0
S19 1 0
S20 0 1
>
>
>
>
> dev.off()
null device
1
>