R: Accessing the SNPs stored in SNPlocs.Hsapiens.dbSNP141.GRCh38
getSNPlocs
R Documentation
Description
Functions for accessing the SNPs stored in the
SNPlocs.Hsapiens.dbSNP141.GRCh38 package.
WARNING: All the functions described in this man page are deprecated
and will be removed at some point in the future.
See ?snplocs in the BSgenome software
package for the new preferred way to access the data stored in this
package.
Usage
## Count and load all the SNPs for a given chromosome:
getSNPcount()
getSNPlocs(seqname, as.GRanges=FALSE, caching=TRUE)
## Extract SNP information for a set of rs ids:
rsid2loc(rsids, caching=TRUE)
rsid2alleles(rsids, caching=TRUE)
rsidsToGRanges(rsids, caching=TRUE)
Arguments
seqname
The name of the sequence for which to get the SNP locations
and alleles.
If as.GRanges is FALSE, only one sequence can
be specified (i.e. seqname must be a single string).
If as.GRanges is TRUE, an arbitrary number of
sequences can be specified (i.e. seqname can be
a character vector of arbitrary length).
as.GRanges
TRUE or FALSE. If TRUE, then the SNP locations
and alleles are returned in a GRanges object.
Otherwise (the default), they are returned in a data frame (see below).
caching
TRUE or FALSE. If TRUE (the default), the loaded SNPs are cached in
memory, which speeds up subsequent retrievals at the cost of increased
memory usage.
rsids
A vector of rs ids. Can be integer or character vector, with or
without the "rs" prefix. NAs are not allowed.
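The accepted forms of rsids can be illustrated with a small normalization helper. This is a sketch only: normalize_rsids is a hypothetical name, not a function exported by this package, and the package's own input handling may differ. The ids used are placeholders.

```r
# Hypothetical helper (not part of the package): bring rs ids to a
# common form, i.e. a character vector of digits without the "rs" prefix.
normalize_rsids <- function(rsids) {
  if (anyNA(rsids))
    stop("'rsids' must not contain NAs")
  ids <- sub("^rs", "", as.character(rsids))
  if (!all(grepl("^[0-9]+$", ids)))
    stop("'rsids' must be integers or strings such as \"rs123\" or \"123\"")
  ids
}

# Placeholder ids chosen for illustration only.
normalize_rsids(c("rs123", "456"))   # character input, with and without prefix
normalize_rsids(c(123L, 456L))       # integer input
```

Both calls give the same character vector, which matches the prefix-free form used in the RefSNP_id column.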
Details
See SNPlocs.Hsapiens.dbSNP141.GRCh38 for general information
about this package.
The SNP data are split by chromosome (1-22, X, Y, MT), i.e. the
package contains one data set per chromosome. Each data set is a
serialized data frame with 1 row per SNP and the following 2 columns:
loc: The 1-based location of the SNP relative to the
first base at the 5' end of the plus strand of the reference
sequence.
alleles: A raw vector with no NAs which can be
converted into a character vector containing the alleles
for each SNP represented by an IUPAC nucleotide ambiguity
code (see ?IUPAC_CODE_MAP in the
Biostrings package for more information).
These data sets are not intended to be used directly. Use the
getSNPcount and getSNPlocs convenience wrappers to load the SNP data
instead. When used with as.GRanges=FALSE (the default), getSNPlocs
returns a data frame with 1 row per SNP and the 3 following columns:
RefSNP_id: RefSNP ID (aka "rs id") with "rs"
prefix removed. Character vector with no NAs and no duplicates.
alleles_as_ambig: A character vector with no NAs
containing the alleles for each SNP represented by an IUPAC
nucleotide ambiguity code.
loc: Same as for the 2-col serialized data frame
described previously.
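As a sketch of how the ambiguity codes in alleles_as_ambig can be read, the base R code below hard-codes the standard IUPAC nucleotide ambiguity codes (the same information as IUPAC_CODE_MAP in Biostrings) and expands each code into its possible bases. decode_alleles is a hypothetical helper name. The last two lines assume that each byte of the raw alleles column is the ASCII code of one IUPAC letter. That is an assumption about the serialized format, not something this page states.

```r
# IUPAC nucleotide ambiguity codes and the bases each one stands for.
iupac_map <- c(
  A = "A", C = "C", G = "G", T = "T",
  M = "AC", R = "AG", W = "AT", S = "CG", Y = "CT", K = "GT",
  V = "ACG", H = "ACT", D = "AGT", B = "CGT", N = "ACGT"
)

# Hypothetical helper: expand ambiguity codes into their possible bases.
decode_alleles <- function(codes) {
  stopifnot(is.character(codes), all(codes %in% names(iupac_map)))
  strsplit(unname(iupac_map[codes]), split = "")
}

decode_alleles(c("R", "Y", "K"))

# ASSUMPTION: each byte of the raw 'alleles' column is the ASCII code of
# one IUPAC letter. Under that assumption the conversion would be:
raw_alleles <- charToRaw("RYK")            # stand-in for a raw column
rawToChar(raw_alleles, multiple = TRUE)
```

For example, the code "R" stands for a SNP whose alleles are A and G.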
Value
getSNPcount returns a named integer vector containing the number
of SNPs for each sequence in the reference genome.
By default (as.GRanges=FALSE), getSNPlocs returns the
3-col data frame described above containing the SNP data for the
specified chromosome.
Otherwise (as.GRanges=TRUE), it returns a
GRanges object with extra columns
"RefSNP_id" and "alleles_as_ambig".
Note that all the elements (genomic ranges) in this
GRanges object have their strand set
to "+" and that all the sequence lengths are set to NA.
rsid2loc and rsid2alleles both return a named vector
(integer vector for the former, character vector for the latter)
where each (name, value) pair corresponds to a supplied rs id.
For both functions the name in (name, value) is the chromosome
of the rs id. The value in (name, value) is the position of the rs id
on the chromosome for rsid2loc, and a single IUPAC code
representing the associated alleles for rsid2alleles.
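Because the names of the returned vector are chromosomes rather than rs ids, it can help to keep the supplied rs ids alongside the result. The sketch below uses a made-up stand-in for the output of rsid2loc; the rs ids, sequence names and positions are placeholders, not real dbSNP data. It also assumes that the result is in the same order as the supplied rs ids.

```r
# Made-up stand-in for the result of rsid2loc(rsids): the sequence
# names and positions below are placeholders, not real dbSNP data.
rsids <- c("rs111", "rs222", "rs333")
locs  <- c(seqA = 1000L, seqB = 2500L, seqA = 42L)

# The names are chromosomes, so they can repeat; keep the rs ids alongside.
# ASSUMPTION: results are in the same order as the supplied rs ids.
snp_table <- data.frame(
  rsid    = rsids,
  seqname = names(locs),
  loc     = unname(locs),
  stringsAsFactors = FALSE
)
snp_table

# Select the positions that fall on one chromosome.
locs[names(locs) == "seqA"]
```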
rsidsToGRanges returns a GRanges object
similar to the one returned by getSNPlocs (when used with
as.GRanges=TRUE) and where each element corresponds to a
supplied rs id.
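The functions described above can be combined as in the following sketch. It runs only if the package is installed, takes the sequence name from the names of getSNPcount() rather than assuming a naming scheme, and reuses rs ids from the loaded data instead of hard-coding any. Since these functions are deprecated, deprecation warnings are to be expected.

```r
pkg <- "SNPlocs.Hsapiens.dbSNP141.GRCh38"

# Run only if the package is installed. The functions are deprecated,
# so expect deprecation warnings.
if (requireNamespace(pkg, quietly = TRUE)) {
  library(pkg, character.only = TRUE)

  # Named integer vector: number of SNPs per sequence.
  snpcount <- getSNPcount()
  print(snpcount)

  # Pick the sequence with the fewest SNPs to keep memory usage low.
  seqname <- names(snpcount)[which.min(snpcount)]

  # Default form: 3-column data frame.
  snps <- getSNPlocs(seqname, caching = FALSE)
  print(head(snps))

  # Same data as a GRanges object.
  gr <- getSNPlocs(seqname, as.GRanges = TRUE, caching = FALSE)
  print(gr)

  # Reuse ids from the loaded data rather than hard-coding rs ids.
  my_rsids <- head(snps$RefSNP_id, 3)
  print(rsid2loc(my_rsids))
  print(rsid2alleles(my_rsids))
  print(rsidsToGRanges(my_rsids))
}
```

Passing caching=FALSE here trades speed on repeated calls for lower memory usage; drop it to use the default caching behavior.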
Author(s)
H. Pages
See Also
snplocs in the BSgenome software
package for the new preferred way to access the data stored in
this package.