R Graphical Manual

Browse All

Last data update: 2014.03.03

R: VCF class objects

VCF-class

R Documentation

VCF class objects

Description

The VCF class is a virtual class extended from RangedSummarizedExperiment. The subclasses, CompressedVCF and ExtendedVCF, are containers for holding data from Variant Call Format files.

Details

The VCF class is a virtual class with two concrete subclasses, CollapsedVCF and ExtendedVCF.

Slots unique to VCF and subclasses,

fixed: A DataFrame containing the REF, ALT, QUAL and FILTER fields from a VCF file.
info: A DataFrame containing the INFO fields from a VCF file.

Slots inherited from RangedSummarizedExperiment,

metadata: A list containing the file header or other information about the overall experiment.
rowRanges: A GRanges-class instance defining the variant ranges and associated metadata columns of REF, ALT, QUAL and FILTER.
colData: A DataFrame-class instance describing the samples and associated metadata.
geno: The assays slot from RangedSummarizedExperiment has been renamed as geno for the VCF class. This slot contains the genotype information immediately following the FORMAT field in a VCF file. Each element of the list or SimpleList is a matrix or array.

It is expected that users will not create instances of the VCF class but instead one of the concrete subclasses, CollapsedVCF or ExpandVCF. CollapsedVCF contains the ALT data as a DNAStringSetList allowing for multiple alleles per variant. The ExpandedVCF ALT data is a DNAStringSet where the ALT column has been expanded to create a flat form of the data with one row per variant-allele combination. In the case of strucutral variants, ALT will be a CompressedCharacterList or character in the collapsed or expanded forms.

Constructors

: readVcf(rowRanges = GRanges(), colData = DataFrame(), exptData = list(), fixed = DataFrame(), info = DataFrame(), geno = SimpleList(), ..., collapsed=TRUE, verbose = FALSE) Creates a CollapsedVCF class from data in a Variant Call Format file on disk.
: VCF(rowRanges = GRanges(), colData = DataFrame(), exptData = list(header = VCFHeader()), fixed = DataFrame(), info = DataFrame(), geno = SimpleList(), ..., collapsed=TRUE, verbose = FALSE) Creates CollapsedVCF when collapsed = TRUE and an ExpandedVCF when collapsed = FALSE.

Accessors

In the following code snippets x is a CollapsedVCF or ExpandedVCF object.

rowRanges(x, ..., fixed = TRUE), rowRanges(x) <- value: Gets or sets the rowRanges. The CHROM, POS, ID and REF fields are used to create a GRanges object. Ranges are created using POS as the start value and width of the reference allele (REF). The IDs become the rownames. If they are missing (i.e., ‘.’) a string of CHROM:POS_REF/ALT is used instead. The genome argument is stored in the seqinfo of the GRanges and can be accessed with genome(<VCF>).

When fixed = TRUE, REF, ALT, QUAL and FILT metadata columns are included.

One metadata column, paramRangeID, is included with the rowRanges. This ID is meaningful when multiple ranges are specified in the ScanVcfParam and distinguishes which records match each range.

The metadata columns can be accessed with mcols(x)$<variable> or the following accessors:

ref(x), ref(x) <- value: Gets or sets the reference allele (REF). value must be a DNAStringSet.
alt(x), alt(x) <- value: Gets or sets the alternate allele data (ALT). When x is a CollapsedVCF, value must be a DNAStringSetList or CompressedCharacterList. For ExpandedVCF, value must be a DNAStringSet or character.
qual(x), qual(x) <- value: Returns or sets the quality scores (QUAL). value must be an numeric(1L).
filt(x), filt(x) <- value: Returns or sets the filter data (FILT). value must be a character(1L).

fixed(x), fixed(x) <- value: Gets or sets a DataFrame of REF, ALT, QUAL and FILTER only. Note these fields are displayed as metadata columns when rowRanges(VCF) is invoked. The fields can be replaced using rowRanges<- but it is more efficient to do so with fixed<-.

info(x, ..., row.names = TRUE), info(x) <- value: Gets or sets a DataFrame of INFO variables. Row names are added if unique and row.names=TRUE.

geno(x, withDimnames=TRUE), geno(x) <- value: Gets a SimpleList of genotype data. value is a SimpleList. To replace a single variable in the SimpleList use geno(x)$variable <- value; in this case value must be a matrix or array. By default row names are returned; to override specify geno(vcf, withDimnames=FALSE).

metadata(x): Gets a list of experiment-related data. By default this list includes the ‘header’ information from the VCF file. See the use of header() for details in extracting header information.

colData(x), colData(x) <- value: Gets or sets a DataFrame of sample-specific information. Each row represents a sample in the VCF file. value must be a DataFrame with rownames representing the samples in the VCF file.

genome(x): Extract the genome information from the GRanges object returned by the rowRanges accessor.

seqlevels(x): Extract the seqlevels from the GRanges object returned by the rowRanges accessor.

strand(x): Extract the strand from the GRanges object returned by the rowRanges accessor.

header(x), header(x)<- value: Get or set the VCF header information. Replacement value must be a VCFHeader object. To modify individual elements use info<-, geno<- or meta<- on a ‘VCFHeader’ object. See ?VCFHeader man page for details.

info(header(x))
geno(header(x))
meta(header(x))
samples(header(x))

Subsetting and combining

In the following code x is a VCF object, and ... is a list of VCF objects.

x[i, j], x[i, j] <- value: Gets or sets rows and columns. i and j can be integer or logical vectors. value is a replacement VCF object.

cbind(...), rbind(...): cbind combines objects with identical ranges (rowRanges) but different samples (columns in assays). The colnames in colData must match or an error is thrown. Columns with duplicate names in fixed, info and mcols(rowRanges(VCF)) must contain the same data.

rbind combines objects with different ranges (rowRanges) and the same subjects (columns in assays). Columns with duplicate names in colData must contain the same data. The ‘Samples’ columns in colData (created by readVcf) are renamed with a numeric extension ordered as they were input to rbind e.g., “Samples.1, Samples.2, ...” etc.

metadata from all objects are combined into a list with no name checking.

expand

In the following code snippets x is a CollapsedVCF object.

expand(x, ..., row.names = FALSE): Expand (unlist) the ALT column of a CollapsedVCF object to one row per ALT value. Variables with Number='A' have one value per alternate allele and are expanded accordingly. The 'AD' genotype field is expanded into REF/ALT pairs. For all other fields, the rows are replicated to match the elementNROWS of ALT.

The output is an ExpandedVCF with ALT as a DNAStringSet or character (structural variants). By default rownames are NULL. When row.names=TRUE the expanded output has duplicated rownames corresponding to the original x.

genotypeCodesToNucleotides(vcf, ...)

This function converts the 'GT' genotype codes in a VCF object to nucleotides. See also ?readGT to read in only 'GT' data as codes or nucleotides.

SnpMatrixToVCF(from, seqSource)

This function converts the output from the read.plink function to a VCF class. from must be a list of length 3 with named elements "map", "fam" and "genotypes". seqSource can be a BSgenome or an FaFile used for reference sequence extraction.

Variant Type

Functions to identify variant type include isSNV, isInsertion, isDeletion, isIndel, isSubstitution and isTransition. See the ?isSNV man page for details.

Arguments

geno

A list or SimpleList of matrix elements, or a matrix containing the genotype information from a VCF file. If present, these data immediately follow the FORMAT field in the VCF.

Each element of the list must have the same dimensions, and dimension names (if present) must be consistent across elements and with the row names of rowRanges, colData.

info

A DataFrame of data from the INFO field of a VCF file. The number of rows must match that in the rowRanges object.

fixed

A DataFrame of REF, ALT, QUAL and FILTER fields from a VCF file. The number of rows must match that of the rowRanges object.

rowRanges

A GRanges instance describing the ranges of interest. Row names, if present, become the row names of the VCF. The length of the GRanges must equal the number of rows of the matrices in geno.

colData

A DataFrame describing the samples. Row names, if present, become the column names of the VCF.

metadata

A list describing the header of the VCF file or additional information for the overall experiment.

...

For cbind and rbind a list of VCF objects. For all other methods ... are additional arguments passed to methods.

collapsed

A logical(1) indicating whether a CollapsedVCF or ExpandedVCF should be created. The ALT in a CollapsedVCF is a DNAStringSetList while in a ExpandedVCF it is a DNAStringSet.

verbose

A logical(1) indicating whether messages about data coercion during construction should be printed.

Author(s)

Valerie Obenchain

Examples

fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, genome="hg19")

## ----------------------------------------------------------------
## Accessors 
## ----------------------------------------------------------------
## Variant locations are stored in the GRanges object returned by
## the rowRanges() accessor.
rowRanges(vcf)

## Individual fields can be extracted with ref(), alt(), qual() 
## and filt().
qual(vcf)
ref(vcf)

## All 'info' fields can be extracted together along
## with the GRanges of locations.
head(info(vcf))

## All genotype fields can be seen with geno(). Individual
## fields are accessed with '$' or '[['.
geno(vcf)
identical(geno(vcf)$GQ, geno(vcf)[[2]])

## ----------------------------------------------------------------
## Renaming seqlevels and subsetting 
## ----------------------------------------------------------------
## Overlap and matching operations require that the objects
## being compared have the same seqlevels (chromosome names).
## It is often the case that the seqlevesls in on of the objects
## needs to be modified to match the other. In this VCF, the 
## seqlevels are numbers instead of preceded by "chr" or "ch". 

seqlevels(vcf)

## Rename the seqlevels to start with 'chr'.
vcf2 <- renameSeqlevels(vcf, paste0("chr", seqlevels(vcf))) 
seqlevels(vcf2)

## The VCF can also be subset by seqlevel using 'keepSeqlevels'
## or 'dropSeqlevels'. See ?keepSeqlevels for details. 
vcf3 <- keepSeqlevels(vcf2, "chr2")
seqlevels(vcf3)

## ----------------------------------------------------------------
## Header information 
## ----------------------------------------------------------------

## Header data can be modified in the 'meta', 'info' and 'geno'
## slots of the VCFHeader object. See ?VCFHeader for details.

## Current 'info' fields.
rownames(info(header(vcf)))

## Add a new field to the header.
newInfo <- DataFrame(Number=1, Type="Integer",
                     Description="Number of Samples With Data",
                     row.names="NS")
info(header(vcf)) <- rbind(info(header(vcf)), newInfo)
rownames(info(header(vcf)))

## ----------------------------------------------------------------
## Collapsed and Expanded VCF 
## ----------------------------------------------------------------
## readVCF() produces a CollapsedVCF object.
fl <- system.file("extdata", "ex2.vcf", 
                  package="VariantAnnotation")
vcf <- readVcf(fl, genome="hg19")
vcf

## The ALT column is a DNAStringSetList to allow for more
## than one alternate allele per variant.
alt(vcf)

## For structural variants ALT is a CharacterList.
fl <- system.file("extdata", "structural.vcf", 
                  package="VariantAnnotation")
vcf <- readVcf(fl, genome="hg19")
alt(vcf)

## ExpandedVCF is the 'flattened' counterpart of CollapsedVCF.
## The ALT and all variables with Number='A' in the header are
## expanded to one row per alternate allele.
vcfLong <- expand(vcf)
alt(vcfLong)

## Also see the ?VRanges class for an alternative form of
## 'flattened' VCF data.

## ----------------------------------------------------------------
## isSNV()
## ----------------------------------------------------------------
## isSNV() returns a subset VCF containing SNVs only.

vcf <- VCF(rowRanges = GRanges("chr1", IRanges(1:4*3, width=c(1, 2, 1, 1))))
alt(vcf) <- DNAStringSetList("A", c("TT"), c("G", "A"), c("TT", "C"))
ref(vcf) <- DNAStringSet(c("G", c("AA"), "T", "G"))
fixed(vcf)[c("REF", "ALT")]

## SNVs are present in rows 1 (single ALT value), 3 (both ALT values) 
## and 4 (1 of the 2 ALT values).
vcf[isSNV(vcf, singleAltOnly=TRUE)] 
vcf[isSNV(vcf, singleAltOnly=FALSE)] ## all 3 SNVs

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(VariantAnnotation)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: GenomeInfoDb
Loading required package: stats4
Loading required package: S4Vectors

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: Rsamtools
Loading required package: Biostrings
Loading required package: XVector

Attaching package: 'VariantAnnotation'

The following object is masked from 'package:base':

    tabulate

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/VariantAnnotation/VCF-class.Rd_%03d_medium.png", width=480, height=480)
> ### Name: VCF-class
> ### Title: VCF class objects
> ### Aliases: class:VCF VCF-class CollapsedVCF class:CollapsedVCF
> ###   CollapsedVCF-class ExpandedVCF class:ExpandedVCF ExpandedVCF-class
> ###   VCF SnpMatrixToVCF ref ref,VCF-method ref<-
> ###   ref<-,VCF,DNAStringSet-method alt alt,VCF-method alt<-
> ###   alt<-,CollapsedVCF,CharacterList-method
> ###   alt<-,ExpandedVCF,character-method
> ###   alt<-,CollapsedVCF,DNAStringSetList-method
> ###   alt<-,ExpandedVCF,DNAStringSet-method qual qual,VCF-method qual<-
> ###   qual<-,VCF,numeric-method filt filt,VCF-method filt<-
> ###   filt<-,VCF,character-method fixed fixed,VCF-method fixed<-
> ###   fixed<-,VCF,DataFrame-method info info,VCF-method info<-
> ###   info<-,VCF,DataFrame-method rowRanges,VCF-method
> ###   rowRanges<-,VCF-method mcols<-,VCF,DataFrame-method
> ###   dimnames<-,VCF,list-method geno geno,VCF-method geno,VCF,ANY-method
> ###   geno,VCF,character-method geno,VCF,numeric-method
> ###   geno,VCFHeader,ANY-method geno<- geno<-,VCF,character,matrix-method
> ###   geno<-,VCF,numeric,matrix-method geno<-,VCF,missing,matrix-method
> ###   geno<-,VCF,missing,SimpleList-method strand,VCF-method
> ###   strand<-,VCF,ANY-method header,VCF-method header<-
> ###   header<-,VCF,VCFHeader-method [,VCF-method [,VCF,ANY,ANY-method
> ###   [<-,VCF,ANY,ANY,VCF-method cbind,VCF-method rbind,VCF-method
> ###   genome,VCF-method seqlevels,VCF-method expand,CollapsedVCF-method
> ###   expand,ExpandedVCF-method genotypeCodesToNucleotides show,VCF-method
> ###   show,CollapsedVCF-method show,ExpandedVCF-method
> 
> ### ** Examples
> 
> fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation")
> vcf <- readVcf(fl, genome="hg19")
> 
> ## ----------------------------------------------------------------
> ## Accessors 
> ## ----------------------------------------------------------------
> ## Variant locations are stored in the GRanges object returned by
> ## the rowRanges() accessor.
> rowRanges(vcf)
GRanges object with 7 ranges and 5 metadata columns:
                                                                                     seqnames
                                                                                        <Rle>
                                                                     1:13220_T/<DEL>        1
  1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C        1
                                                                    2:321682_T/<DEL>        2
                                                           2:14477084_C/<DEL:ME:ALU>        2
                                                             3:9425916_C/<INS:ME:L1>        3
                                                                  3:12665100_A/<DUP>        3
                                                           4:18665128_T/<DUP:TANDEM>        4
                                                                                                   ranges
                                                                                                <IRanges>
                                                                     1:13220_T/<DEL> [   13220,    13220]
  1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C [ 2827693,  2827762]
                                                                    2:321682_T/<DEL> [  321682,   321682]
                                                           2:14477084_C/<DEL:ME:ALU> [14477084, 14477084]
                                                             3:9425916_C/<INS:ME:L1> [ 9425916,  9425916]
                                                                  3:12665100_A/<DUP> [12665100, 12665100]
                                                           4:18665128_T/<DUP:TANDEM> [18665128, 18665128]
                                                                                     strand
                                                                                      <Rle>
                                                                     1:13220_T/<DEL>      *
  1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C      *
                                                                    2:321682_T/<DEL>      *
                                                           2:14477084_C/<DEL:ME:ALU>      *
                                                             3:9425916_C/<INS:ME:L1>      *
                                                                  3:12665100_A/<DUP>      *
                                                           4:18665128_T/<DUP:TANDEM>      *
                                                                                     |
                                                                                     |
                                                                     1:13220_T/<DEL> |
  1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C |
                                                                    2:321682_T/<DEL> |
                                                           2:14477084_C/<DEL:ME:ALU> |
                                                             3:9425916_C/<INS:ME:L1> |
                                                                  3:12665100_A/<DUP> |
                                                           4:18665128_T/<DUP:TANDEM> |
                                                                                     paramRangeID
                                                                                         <factor>
                                                                     1:13220_T/<DEL>         <NA>
  1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C         <NA>
                                                                    2:321682_T/<DEL>         <NA>
                                                           2:14477084_C/<DEL:ME:ALU>         <NA>
                                                             3:9425916_C/<INS:ME:L1>         <NA>
                                                                  3:12665100_A/<DUP>         <NA>
                                                           4:18665128_T/<DUP:TANDEM>         <NA>
                                                                                                         REF
                                                                                              <DNAStringSet>
                                                                     1:13220_T/<DEL>                       T
  1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C CCGTGGATGC...TCCCCTCGCA
                                                                    2:321682_T/<DEL>                       T
                                                           2:14477084_C/<DEL:ME:ALU>                       C
                                                             3:9425916_C/<INS:ME:L1>                       C
                                                                  3:12665100_A/<DUP>                       A
                                                           4:18665128_T/<DUP:TANDEM>                       T
                                                                                                 ALT
                                                                                     <CharacterList>
                                                                     1:13220_T/<DEL>           <DEL>
  1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C               C
                                                                    2:321682_T/<DEL>           <DEL>
                                                           2:14477084_C/<DEL:ME:ALU>    <DEL:ME:ALU>
                                                             3:9425916_C/<INS:ME:L1>     <INS:ME:L1>
                                                                  3:12665100_A/<DUP>           <DUP>
                                                           4:18665128_T/<DUP:TANDEM>    <DUP:TANDEM>
                                                                                          QUAL
                                                                                     <numeric>
                                                                     1:13220_T/<DEL>         6
  1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C      <NA>
                                                                    2:321682_T/<DEL>         6
                                                           2:14477084_C/<DEL:ME:ALU>        12
                                                             3:9425916_C/<INS:ME:L1>        23
                                                                  3:12665100_A/<DUP>        14
                                                           4:18665128_T/<DUP:TANDEM>        11
                                                                                          FILTER
                                                                                     <character>
                                                                     1:13220_T/<DEL>        PASS
  1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C        PASS
                                                                    2:321682_T/<DEL>        PASS
                                                           2:14477084_C/<DEL:ME:ALU>        PASS
                                                             3:9425916_C/<INS:ME:L1>        PASS
                                                                  3:12665100_A/<DUP>        PASS
                                                           4:18665128_T/<DUP:TANDEM>        PASS
  -------
  seqinfo: 4 sequences from hg19 genome; no seqlengths
> 
> ## Individual fields can be extracted with ref(), alt(), qual() 
> ## and filt().
> qual(vcf)
[1]  6 NA  6 12 23 14 11
> ref(vcf)
  A DNAStringSet instance of length 7
    width seq
[1]     1 T
[2]    70 CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA
[3]     1 T
[4]     1 C
[5]     1 C
[6]     1 A
[7]     1 T
> 
> ## All 'info' fields can be extracted together along
> ## with the GRanges of locations.
> head(info(vcf))
DataFrame with 6 rows and 10 columns
                                                                                                BKPTID
                                                                                       <CharacterList>
1:13220_T/<DEL>                                                                                       
1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C Pindel_LCS_D1099159
2:321682_T/<DEL>                                                                                      
2:14477084_C/<DEL:ME:ALU>                                                                             
3:9425916_C/<INS:ME:L1>                                                                               
3:12665100_A/<DUP>                                                                                    
                                                                                           CIEND
                                                                                   <IntegerList>
1:13220_T/<DEL>                                                                           -10,62
1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C         NA,NA
2:321682_T/<DEL>                                                                          -10,62
2:14477084_C/<DEL:ME:ALU>                                                                 -12,32
3:9425916_C/<INS:ME:L1>                                                                    NA,NA
3:12665100_A/<DUP>                                                                      -500,500
                                                                                           CIPOS
                                                                                   <IntegerList>
1:13220_T/<DEL>                                                                           -56,20
1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C         NA,NA
2:321682_T/<DEL>                                                                          -56,20
2:14477084_C/<DEL:ME:ALU>                                                                 -22,18
3:9425916_C/<INS:ME:L1>                                                                   -16,22
3:12665100_A/<DUP>                                                                      -500,500
                                                                                         END
                                                                                   <integer>
1:13220_T/<DEL>                                                                        13221
1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C   2827680
2:321682_T/<DEL>                                                                      321887
2:14477084_C/<DEL:ME:ALU>                                                           14477381
3:9425916_C/<INS:ME:L1>                                                              9425916
3:12665100_A/<DUP>                                                                  12686200
                                                                                          HOMLEN
                                                                                   <IntegerList>
1:13220_T/<DEL>                                                                                 
1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C             1
2:321682_T/<DEL>                                                                                
2:14477084_C/<DEL:ME:ALU>                                                                       
3:9425916_C/<INS:ME:L1>                                                                         
3:12665100_A/<DUP>                                                                              
                                                                                            HOMSEQ
                                                                                   <CharacterList>
1:13220_T/<DEL>                                                                                   
1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C               C
2:321682_T/<DEL>                                                                                  
2:14477084_C/<DEL:ME:ALU>                                                                         
3:9425916_C/<INS:ME:L1>                                                                           
3:12665100_A/<DUP>                                                                                
                                                                                   IMPRECISE
                                                                                   <logical>
1:13220_T/<DEL>                                                                         TRUE
1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C     FALSE
2:321682_T/<DEL>                                                                        TRUE
2:14477084_C/<DEL:ME:ALU>                                                               TRUE
3:9425916_C/<INS:ME:L1>                                                                 TRUE
3:12665100_A/<DUP>                                                                      TRUE
                                                                                             MEINFO
                                                                                    <CharacterList>
1:13220_T/<DEL>                                                                        NA,NA,NA,...
1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C     NA,NA,NA,...
2:321682_T/<DEL>                                                                       NA,NA,NA,...
2:14477084_C/<DEL:ME:ALU>                                                          AluYa5,5,307,...
3:9425916_C/<INS:ME:L1>                                                             L1HS,1,6025,...
3:12665100_A/<DUP>                                                                     NA,NA,NA,...
                                                                                           SVLEN
                                                                                   <IntegerList>
1:13220_T/<DEL>                                                                             -105
1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C           -66
2:321682_T/<DEL>                                                                            -105
2:14477084_C/<DEL:ME:ALU>                                                                   -297
3:9425916_C/<INS:ME:L1>                                                                     6027
3:12665100_A/<DUP>                                                                         21100
                                                                                        SVTYPE
                                                                                   <character>
1:13220_T/<DEL>                                                                            DEL
1:2827693_CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA/C         DEL
2:321682_T/<DEL>                                                                           DEL
2:14477084_C/<DEL:ME:ALU>                                                                  DEL
3:9425916_C/<INS:ME:L1>                                                                    INS
3:12665100_A/<DUP>                                                                         DUP
> 
> ## All genotype fields can be seen with geno(). Individual
> ## fields are accessed with '$' or '[['.
> geno(vcf)
List of length 4
names(4): GT GQ CN CNQ
> identical(geno(vcf)$GQ, geno(vcf)[[2]])
[1] TRUE
> 
> ## ----------------------------------------------------------------
> ## Renaming seqlevels and subsetting 
> ## ----------------------------------------------------------------
> ## Overlap and matching operations require that the objects
> ## being compared have the same seqlevels (chromosome names).
> ## It is often the case that the seqlevesls in on of the objects
> ## needs to be modified to match the other. In this VCF, the 
> ## seqlevels are numbers instead of preceded by "chr" or "ch". 
> 
> seqlevels(vcf)
[1] "1" "2" "3" "4"
> 
> ## Rename the seqlevels to start with 'chr'.
> vcf2 <- renameSeqlevels(vcf, paste0("chr", seqlevels(vcf))) 
> seqlevels(vcf2)
[1] "chr1" "chr2" "chr3" "chr4"
> 
> ## The VCF can also be subset by seqlevel using 'keepSeqlevels'
> ## or 'dropSeqlevels'. See ?keepSeqlevels for details. 
> vcf3 <- keepSeqlevels(vcf2, "chr2")
> seqlevels(vcf3)
[1] "chr2"
> 
> ## ----------------------------------------------------------------
> ## Header information 
> ## ----------------------------------------------------------------
> 
> ## Header data can be modified in the 'meta', 'info' and 'geno'
> ## slots of the VCFHeader object. See ?VCFHeader for details.
> 
> ## Current 'info' fields.
> rownames(info(header(vcf)))
 [1] "BKPTID"    "CIEND"     "CIPOS"     "END"       "HOMLEN"    "HOMSEQ"   
 [7] "IMPRECISE" "MEINFO"    "SVLEN"     "SVTYPE"   
> 
> ## Add a new field to the header.
> newInfo <- DataFrame(Number=1, Type="Integer",
+                      Description="Number of Samples With Data",
+                      row.names="NS")
> info(header(vcf)) <- rbind(info(header(vcf)), newInfo)
> rownames(info(header(vcf)))
 [1] "BKPTID"    "CIEND"     "CIPOS"     "END"       "HOMLEN"    "HOMSEQ"   
 [7] "IMPRECISE" "MEINFO"    "SVLEN"     "SVTYPE"    "NS"       
> 
> ## ----------------------------------------------------------------
> ## Collapsed and Expanded VCF 
> ## ----------------------------------------------------------------
> ## readVCF() produces a CollapsedVCF object.
> fl <- system.file("extdata", "ex2.vcf", 
+                   package="VariantAnnotation")
> vcf <- readVcf(fl, genome="hg19")
> vcf
class: CollapsedVCF 
dim: 5 3 
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 6 columns: NS, DP, AF, AA, DB, H2
info(header(vcf)):
      Number Type    Description                
   NS 1      Integer Number of Samples With Data
   DP 1      Integer Total Depth                
   AF A      Float   Allele Frequency           
   AA 1      String  Ancestral Allele           
   DB 0      Flag    dbSNP membership, build 129
   H2 0      Flag    HapMap2 membership         
geno(vcf):
  SimpleList of length 4: GT, GQ, DP, HQ
geno(header(vcf)):
      Number Type    Description      
   GT 1      String  Genotype         
   GQ 1      Integer Genotype Quality 
   DP 1      Integer Read Depth       
   HQ 2      Integer Haplotype Quality
> 
> ## The ALT column is a DNAStringSetList to allow for more
> ## than one alternate allele per variant.
> alt(vcf)
DNAStringSetList of length 5
[[1]] A
[[2]] A
[[3]] G T
[[4]] 
[[5]] G GTCT
> 
> ## For structural variants ALT is a CharacterList.
> fl <- system.file("extdata", "structural.vcf", 
+                   package="VariantAnnotation")
> vcf <- readVcf(fl, genome="hg19")
> alt(vcf)
CharacterList of length 7
[[1]] <DEL>
[[2]] C
[[3]] <DEL>
[[4]] <DEL:ME:ALU>
[[5]] <INS:ME:L1>
[[6]] <DUP>
[[7]] <DUP:TANDEM>
> 
> ## ExpandedVCF is the 'flattened' counterpart of CollapsedVCF.
> ## The ALT and all variables with Number='A' in the header are
> ## expanded to one row per alternate allele.
> vcfLong <- expand(vcf)
> alt(vcfLong)
[1] "<DEL>"        "C"            "<DEL>"        "<DEL:ME:ALU>" "<INS:ME:L1>" 
[6] "<DUP>"        "<DUP:TANDEM>"
> 
> ## Also see the ?VRanges class for an alternative form of
> ## 'flattened' VCF data.
> 
> ## ----------------------------------------------------------------
> ## isSNV()
> ## ----------------------------------------------------------------
> ## isSNV() returns a subset VCF containing SNVs only.
> 
> vcf <- VCF(rowRanges = GRanges("chr1", IRanges(1:4*3, width=c(1, 2, 1, 1))))
> alt(vcf) <- DNAStringSetList("A", c("TT"), c("G", "A"), c("TT", "C"))
> ref(vcf) <- DNAStringSet(c("G", c("AA"), "T", "G"))
> fixed(vcf)[c("REF", "ALT")]
DataFrame with 4 rows and 2 columns
             REF                ALT
  <DNAStringSet> <DNAStringSetList>
1              G                  A
2             AA                 TT
3              T                G,A
4              G               TT,C
> 
> ## SNVs are present in rows 1 (single ALT value), 3 (both ALT values) 
> ## and 4 (1 of the 2 ALT values).
> vcf[isSNV(vcf, singleAltOnly=TRUE)] 
class: CollapsedVCF 
dim: 1 0 
rowRanges(vcf):
  GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 0 columns: 
geno(vcf):
  SimpleList of length 0: 
> vcf[isSNV(vcf, singleAltOnly=FALSE)] ## all 3 SNVs
class: CollapsedVCF 
dim: 3 0 
rowRanges(vcf):
  GRanges with 4 metadata columns: REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 0 columns: 
geno(vcf):
  SimpleList of length 0: 
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>