A list with elements Counts (a 4d
integer array of size [1:12, 1:2, 1:k, 1:n]),
Coverage (a 3d integer array of size [1:2, 1:k, 1:n]),
Deletions (a 3d integer array of size [1:2, 1:k, 1:n]),
Reference (a 1d integer vector of size [1:n]) – see Details.
sampledata
A data.frame with k rows (one for each
sample) and columns Type, Column and (SampleGroup
or Patient). The tally file should contain this information as
a group attribute, see getSampleData for an example.
cl
A list with parameters used by the variant calling
functions. Such a list can be produced, for instance, by a call to
vcConfParams.
minStrandCov
Minimum coverage per strand in the case sample.
maxStrandCov
Maximum coverage per strand in the case sample.
minStrandCovControl
Minimum coverage per strand in the control sample.
maxStrandCovControl
Maximum coverage per strand in the control sample.
minStrandAltSupport
Minimum support for the alternative allele
per strand in the case sample. This should be 1 or higher.
maxStrandAltSupportControl
Maximum support for the alternative allele
per strand in the control sample. This should usually be 0.
minStrandDelSupport
Minimum support for the deletion
per strand in the case sample. This should be 1 or higher.
maxStrandDelSupportControl
Maximum support for the deletion
per strand in the control sample. This should usually be 0.
minStrandInsSupport
Minimum support for the insertion
per strand in the case sample. This should be 1 or higher.
maxStrandInsSupportControl
Maximum support for the insertion
per strand in the control sample. This should usually be 0.
bases
Indices for subsetting in the bases dimension of the Counts
array, 5:8 extracts only those calls made in the middle one of the
sequencing cycle bins.
returnDataPoints
Boolean flag to specify that a data.frame
with the variant calls should be returned, otherwise only position are returned as a numeric vector.
If returnDataPoints == FALSE only the variant positions are returned.
annotateWithBackground
Boolean flag to specify that the
background mismatch / deletion frequency estimated from all control
samples in the cohort should be added to the output. A simple binomial
test will be performed as well. Only usefull if returnDataPoints
== TRUE
mergeCalls
Boolean flag to specify that adjacent calls should be
merged where appropriate (used by callDeletionsPaired).
Only usefull applied if returnDataPoints == TRUE
mergeAggregator
Aggregator function for merging adjacent calls,
defaults to mean, which means that a deletion larger than 1bp
will be annotated with the means of the counts and coverages
pValueAggregator
Aggregator function for combining the p-values
of adjacent calls when merging, defaults to max. Is only
applied if annotateWithBackground == TRUE
Details
data is a list of datasets which has to at least contain the
Counts and Coverages for variant calling respectively
Deletions for deletion calling. This list will usually be
generated by a call to the h5dapply function in which the tally
file, chromosome, datasets and regions within the datasets would be
specified. See ?h5dapply for specifics. In order for callVariantsPaired
to return the correct locations of the variants there must be the h5dapplyInfo
slot present in data as well. This is itself a list (being automatically added by
h5dapply and h5readBlock respectively) and contains the slots Group
(location in the HDF5 file) and Blockstart, which are used to set the chromosome
and the genomic positions of variants.
vcConfParams is a helper function that builds a set of variant
calling parameters as a list. This list is provided to the calling
functions e.g. callVariantsPaired and influences their behavior.
callVariantsPaired implements a simple pairwise variant
callign approach applying the filters specified in cl, and
might additionally computes an estimate of the background mismatch
rate (the mean mismatch rate of all samples labeled as 'Control' in
the sampledata and annotate the calls with p-values for the
binom.test of the observed mismatch counts and coverage at each
of the samples labeled as 'Case'.
Value
The result is either a list of positions with SNVs / deletions or a
data.frame containing the calls themselves which might contain
annotations. Adjacent calls might be merged and calls might be
annotated with p-values depending on configuration parameters.
When the configuration parameter returnDataPoints is FALSE the functions return the positions of potential variants as a list containing one integer vector of positions for each sample, if no positions were found for a sample the list will contain NULL instead. In the case of returnDatapoints == TRUE the functions return either NULL if no poisitions were found or a data.frame with the following slots:
Chrom
The chromosome the potential variant / deletion is on
Start
The starting position of the variant / deletion
End
The end position of the variant / deletions (equal to Start for SNVs and single basepair deletions)
Sample
The Case sample in which the variant was observed
altAllele
The alternate allele for SNVs (skipped for deletions, would be "-")
refAllele
The reference allele for SNVs (skipped for deletions since the tally file might not contain all the information necessary to extract it)
caseCountFwd
Support for the variant in the Case sample on the forward strand
caseCountRev
Support for the variant in the Case sample on the reverse strand
caseCoverageFwd
Coverage of the variant position in the Case sample on the forward strand
caseCoverageRev
Coverage of the variant position in the Case sample on the reverse strand
controlCountFwd
Support for the variant in the Control sample on the forward strand
controlCountRev
Support for the variant in the Control sample on the reverse strand
controlCoverageFwd
Coverage of the variant position in the Control sample on the forward strand
controlCoverageRev
Coverage of the variant position in the Control sample on the reverse strand
If the annotateWithBackground option is set the following extra columns are returned
backgroundFrequencyFwd
The averaged frequency of mismatches / deletions at the position of all samples of type Control on the forward strand
backgroundFrequencyRev
The averaged frequency of mismatches / deletions at the position of all samples of type Control on the reverse strand
pValueFwd
The p.value of the test binom.test( caseCountFwd, caseCoverageFwd, p = backgroundFrequencyFwd, alternative = "greater")
pValueRev
The p.value of the test binom.test( caseCountRev, caseCoverageRev, p = backgroundFrequencyRev, alternative = "greater")
The function callDeletionsPaired merges adjacent single-base deletion calls if the option mergeCalls is set to TRUE, in that case the counts and coverages ( e.g. caseCountFwd ) are aggregated using the function supplied in the mergeAggregator option of the configuration list (defaults to mean) and the p-values pValueFwd and pValueFwd (if annotateWithBackground is TRUE), are aggregated using the function supplied in the pValueAggregator option (defaults to max).
Author(s)
Paul Pyl
Examples
library(h5vc) # loading library
tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
position <- 29979629
windowsize <- 1000
vars <- h5dapply( # Calling Variants
filename = tallyFile,
group = "/ExampleStudy/16",
blocksize = 500,
FUN = callVariantsPaired,
sampledata = sampleData,
cl = vcConfParams(returnDataPoints=TRUE),
names = c("Coverages", "Counts", "Reference", "Deletions"),
range = c(position - windowsize, position + windowsize)
)
vars <- do.call( rbind, vars ) # merge the results from all blocks by row
vars # We did find a variant
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(h5vc)
Loading required package: grid
Loading required package: gridExtra
Loading required package: ggplot2
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/h5vc/callVariants.Rd_%03d_medium.png", width=480, height=480)
> ### Name: callVariants
> ### Title: Variant calling
> ### Aliases: callVariantsPaired vcConfParams callDeletionsPaired
>
> ### ** Examples
>
> library(h5vc) # loading library
> tallyFile <- system.file( "extdata", "example.tally.hfs5", package = "h5vcData" )
> sampleData <- getSampleData( tallyFile, "/ExampleStudy/16" )
> position <- 29979629
> windowsize <- 1000
> vars <- h5dapply( # Calling Variants
+ filename = tallyFile,
+ group = "/ExampleStudy/16",
+ blocksize = 500,
+ FUN = callVariantsPaired,
+ sampledata = sampleData,
+ cl = vcConfParams(returnDataPoints=TRUE),
+ names = c("Coverages", "Counts", "Reference", "Deletions"),
+ range = c(position - windowsize, position + windowsize)
+ )
> vars <- do.call( rbind, vars ) # merge the results from all blocks by row
> vars # We did find a variant
NULL
>
>
>
>
>
> dev.off()
null device
1
>