R: Accessing the SNPs stored in SNPlocs.Hsapiens.dbSNP.20090506
getSNPlocs
R Documentation
Accessing the SNPs stored in SNPlocs.Hsapiens.dbSNP.20090506
Description
Functions for accessing the SNPs stored in the
SNPlocs.Hsapiens.dbSNP.20090506 package.
Usage
## Count and load all the SNPs for a given chromosome:
getSNPcount()
getSNPlocs(seqname, as.GRanges=FALSE, caching=TRUE)
## Extract SNP information for a set of rs ids:
rsid2loc(rsids, caching=TRUE)
rsid2alleles(rsids, caching=TRUE)
rsidsToGRanges(rsids, caching=TRUE)
Arguments
seqname
The name of the sequence for which to get the SNP locations
and alleles.
If as.GRanges is FALSE, only one sequence can
be specified (i.e. seqname must be a single string).
If as.GRanges is TRUE, an arbitrary number of
sequences can be specified (i.e. seqname can be
a character vector of arbitrary length).
as.GRanges
TRUE or FALSE. If TRUE, then the SNP locations
and alleles are returned in a GRanges object.
Otherwise (the default), they are returned in a data frame (see below).
caching
Should the loaded SNPs be cached in memory for faster further
retrieval but at the cost of increased memory usage?
rsids
A vector of rs ids. Can be integer or character vector, with or
without the "rs" prefix. NAs are not allowed.
Details
See SNPlocs.Hsapiens.dbSNP.20090506 for general information
about this package.
The SNP data are split by chromosome (1-22, X, Y) i.e. the
package contains one data set per chromosome, each of them being a
serialized data frame with 1 row per SNP and the 2 following columns:
loc: The 1-based location of the SNP relative to the
first base at the 5' end of the plus strand of the reference
sequence.
alleles: A raw vector with no NAs which can be
converted into a character vector containing the alleles
for each SNP represented by an IUPAC nucleotide ambiguity
code (see ?IUPAC_CODE_MAP in the
Biostrings package for more information).
Note that those data sets are not intended to be used directly but
the user should instead use the getSNPcount and getSNPlocs
convenience wrappers for loading the SNP data. When used with
as.GRanges=FALSE (the default), getSNPlocs returns
a data frame with 1 row per SNP and the 3 following columns:
RefSNP_id: RefSNP ID (aka "rs id") with "rs"
prefix removed. Character vector with no NAs and no duplicates.
alleles_as_ambig: A character vector with no NAs
containing the alleles for each SNP represented by an IUPAC
nucleotide ambiguity code.
loc: Same as for the 2-col serialized data frame
described previously.
Value
getSNPcount returns a named integer vector containing the number
of SNPs for each sequence in the reference genome.
By default (as.GRanges=FALSE), getSNPlocs returns the
3-col data frame described above containing the SNP data for the
specified chromosome.
Otherwise (as.GRanges=TRUE), it returns a
GRanges object with extra columns
"RefSNP_id" and "alleles_as_ambig".
Note that all the elements (genomic ranges) in this
GRanges object have their strand set
to "+" and that all the sequence lengths are set to NA.
rsid2loc and rsid2alleles both return a named vector
(integer vector for the former, character vector for the latter)
where each (name, value) pair corresponds to a supplied rs id.
For both functions the name in (name, value) is the chromosome
of the rs id. The value in (name, value) is the position of the rs id
on the chromosome for rsid2loc, and a single IUPAC code
representing the associated alleles for rsid2alleles.
rsidsToGRanges returns a GRanges object
similar to the one returned by getSNPlocs (when used with
as.GRanges=TRUE) and where each element corresponds to a
supplied rs id.
Author(s)
H. Pages
See Also
SNPlocs.Hsapiens.dbSNP.20090506
IUPAC_CODE_MAP
GRanges-class
BSgenome-class
injectSNPs
findOverlaps
Examples
## ---------------------------------------------------------------------
## A. BASIC USAGE
## ---------------------------------------------------------------------
getSNPcount()
## Get the locations and alleles of all SNPs on chromosome 22:
chr22snps <- getSNPlocs("chr22")
dim(chr22snps)
colnames(chr22snps)
head(chr22snps)
## Get the locations and alleles of all SNPs on chromosomes 22 and X
## as a GRanges object:
getSNPlocs(c("chr22", "chrX"), as.GRanges=TRUE)
## ---------------------------------------------------------------------
## B. EXTRACT SNP INFORMATION FOR A SET OF RS IDS...
## ---------------------------------------------------------------------
## ... and return it in a GRanges object:
myrsids <- c("rs2639606", "rs73396229", "rs55871206", "rs10932221",
"rs56219727", "rs73709730", "rs55838886", "rs1516535")
rsidsToGRanges(myrsids)
## ---------------------------------------------------------------------
## C. INJECTION IN THE REFERENCE GENOME
## ---------------------------------------------------------------------
library(BSgenome.Hsapiens.UCSC.hg18)
BSgenome.Hsapiens.UCSC.hg18
## Inject the SNPs in hg18:
genome <- BSgenome.Hsapiens.UCSC.hg18
genome2 <- injectSNPs(genome, "SNPlocs.Hsapiens.dbSNP.20090506")
genome2
alphabetFrequency(unmasked(genome2$chr22))
alphabetFrequency(unmasked(genome$chr22))
## Get the number of nucleotides that were modified by this injection:
neditAt(unmasked(genome2$chr22), unmasked(genome$chr22))
## ---------------------------------------------------------------------
## D. SOME BASIC QUALITY CONTROL (WITH SURPRISING RESULTS!)
## ---------------------------------------------------------------------
## Note that dbSNP can assign distinct ids to SNPs located at the same
## position:
any(duplicated(chr22snps$RefSNP_id)) # rs ids are all distinct...
any(duplicated(chr22snps$loc)) # but some locations are repeated!
chr22snps <- chr22snps[order(chr22snps$loc), ] # sort by location
which(duplicated(chr22snps$loc))[1] # 2
chr22snps[1:3, ] # rs67522692 and rs71218724 have same locations
# and alleles
## Also note that not all SNP alleles are consistent with the hg18 genome
## i.e. the alleles reported for a given SNP are not always compatible
## with the nucleotide found at the SNP location in hg18.
## For example, to get the number of inconsistent SNPs in chr1:
chr1snps <- getSNPlocs("chr1")
all_alleles <- paste(chr1snps$alleles_as_ambig, collapse="")
nchar(all_alleles) # 920233 SNPs on chr1
neditAt(all_alleles, unmasked(genome$chr1)[chr1snps$loc], fixed=FALSE)
## ==> 1322 SNPs (0.14%) are inconsistent with hg18 chr1!
## Finally, let's check that no SNP falls in an assembly gap:
agaps <- masks(genome$chr1)$AGAPS
agaps # the assembly gaps
## Looping over the assembly gaps:
sapply(1:length(agaps),
function(i)
any(chr1snps$loc >= start(agaps)[i] &
chr1snps$loc <= end(agaps)[i]))
## Or, in a more efficient way:
stopifnot(length(findOverlaps(chr1snps$loc, agaps)) == 0)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(SNPlocs.Hsapiens.dbSNP.20090506)
Loading required package: IRanges
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: BSgenome
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
Please note that the SNPlocs.Hsapiens.dbSNP.20090506 package contains
outdated dbSNP data and will be deprecated in the near future. We
highly recommend that you use a SNPlocs package based on a more recent
dbSNP build for your analyses instead. See available.SNPs() for the
list of SNPlocs packages currently available and make sure to pick up
the most recent one.
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/SNPlocs.Hsapiens.dbSNP.20090506/getSNPlocs.Rd_%03d_medium.png", width=480, height=480)
> ### Name: getSNPlocs
> ### Title: Accessing the SNPs stored in SNPlocs.Hsapiens.dbSNP.20090506
> ### Aliases: getSNPcount .loadLoc .loadAlleles getSNPlocs rsid2loc
> ### rsid2alleles rsidsToGRanges
> ### Keywords: data
>
> ### ** Examples
>
> ## ---------------------------------------------------------------------
> ## A. BASIC USAGE
> ## ---------------------------------------------------------------------
>
> getSNPcount()
chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11
920233 933616 789121 798603 706109 760249 655873 612367 496064 583240 577300
chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
558759 427010 365742 331501 354239 316396 322866 268235 323041 160580 187392
chrX chrY
391414 6539
>
> ## Get the locations and alleles of all SNPs on chromosome 22:
> chr22snps <- getSNPlocs("chr22")
> dim(chr22snps)
[1] 187392 3
> colnames(chr22snps)
[1] "RefSNP_id" "alleles_as_ambig" "loc"
> head(chr22snps)
RefSNP_id alleles_as_ambig loc
1 67522692 M 14892137
2 71218724 M 14892137
3 9617282 R 14908060
4 8187529 Y 14917006
5 7339830 K 14917012
6 12152118 R 14917136
>
> ## Get the locations and alleles of all SNPs on chromosomes 22 and X
> ## as a GRanges object:
> getSNPlocs(c("chr22", "chrX"), as.GRanges=TRUE)
GRanges object with 578806 ranges and 2 metadata columns:
seqnames ranges strand | RefSNP_id
<Rle> <IRanges> <Rle> | <character>
[1] chr22 [14892137, 14892137] + | 67522692
[2] chr22 [14892137, 14892137] + | 71218724
[3] chr22 [14908060, 14908060] + | 9617282
[4] chr22 [14917006, 14917006] + | 8187529
[5] chr22 [14917012, 14917012] + | 7339830
... ... ... ... . ...
[578802] chrX [154582606, 154582606] + | 557132
[578803] chrX [154583424, 154583424] + | 781880
[578804] chrX [154583643, 154583643] + | 12384168
[578805] chrX [154584361, 154584361] + | 1089622
[578806] chrX [154636327, 154636327] + | 55826418
alleles_as_ambig
<character>
[1] M
[2] M
[3] R
[4] Y
[5] K
... ...
[578802] Y
[578803] R
[578804] R
[578805] R
[578806] Y
-------
seqinfo: 24 sequences from NCBI36 genome
>
> ## ---------------------------------------------------------------------
> ## B. EXTRACT SNP INFORMATION FOR A SET OF RS IDS...
> ## ---------------------------------------------------------------------
>
> ## ... and return it in a GRanges object:
> myrsids <- c("rs2639606", "rs73396229", "rs55871206", "rs10932221",
+ "rs56219727", "rs73709730", "rs55838886", "rs1516535")
> rsidsToGRanges(myrsids)
GRanges object with 8 ranges and 2 metadata columns:
seqnames ranges strand | RefSNP_id alleles_as_ambig
<Rle> <IRanges> <Rle> | <character> <character>
[1] chr9 [ 70217947, 70217947] + | 2639606 R
[2] chr11 [ 5820137, 5820137] + | 73396229 Y
[3] chr13 [ 93957883, 93957883] + | 55871206 S
[4] chr2 [208609439, 208609439] + | 10932221 Y
[5] chr4 [177885573, 177885573] + | 56219727 W
[6] chr7 [ 80833186, 80833186] + | 73709730 R
[7] chr1 [ 78850672, 78850672] + | 55838886 Y
[8] chr4 [183492683, 183492683] + | 1516535 S
-------
seqinfo: 24 sequences from NCBI36 genome
>
> ## ---------------------------------------------------------------------
> ## C. INJECTION IN THE REFERENCE GENOME
> ## ---------------------------------------------------------------------
>
> library(BSgenome.Hsapiens.UCSC.hg18)
> BSgenome.Hsapiens.UCSC.hg18
Human genome:
# organism: Homo sapiens (Human)
# provider: UCSC
# provider version: hg18
# release date: Mar. 2006
# release name: NCBI Build 36.1
# 49 sequences:
# chr1 chr2 chr3 chr4 chr5
# chr6 chr7 chr8 chr9 chr10
# chr11 chr12 chr13 chr14 chr15
# chr16 chr17 chr18 chr19 chr20
# chr21 chr22 chrX chrY chrM
# chr5_h2_hap1 chr6_cox_hap1 chr6_qbl_hap2 chr22_h2_hap1 chr1_random
# chr2_random chr3_random chr4_random chr5_random chr6_random
# chr7_random chr8_random chr9_random chr10_random chr11_random
# chr13_random chr15_random chr16_random chr17_random chr18_random
# chr19_random chr21_random chr22_random chrX_random
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
>
> ## Inject the SNPs in hg18:
> genome <- BSgenome.Hsapiens.UCSC.hg18
> genome2 <- injectSNPs(genome, "SNPlocs.Hsapiens.dbSNP.20090506")
> genome2
Human genome:
# organism: Homo sapiens (Human)
# provider: UCSC
# provider version: hg18
# release date: Mar. 2006
# release name: NCBI Build 36.1
# with SNPs injected from package: SNPlocs.Hsapiens.dbSNP.20090506
# 49 sequences:
# chr1 chr2 chr3 chr4 chr5
# chr6 chr7 chr8 chr9 chr10
# chr11 chr12 chr13 chr14 chr15
# chr16 chr17 chr18 chr19 chr20
# chr21 chr22 chrX chrY chrM
# chr5_h2_hap1 chr6_cox_hap1 chr6_qbl_hap2 chr22_h2_hap1 chr1_random
# chr2_random chr3_random chr4_random chr5_random chr6_random
# chr7_random chr8_random chr9_random chr10_random chr11_random
# chr13_random chr15_random chr16_random chr17_random chr18_random
# chr19_random chr21_random chr22_random chrX_random
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
> alphabetFrequency(unmasked(genome2$chr22))
A C G T M R W S
9048416 8309473 8299446 9006968 15428 63432 10972 18531
Y K V H D B N -
62364 15497 233 147 131 224 14840170 0
+ .
0 0
> alphabetFrequency(unmasked(genome$chr22))
A C G T M R W S
9085356 8366110 8356522 9043323 0 0 0 0
Y K V H D B N -
0 0 0 0 0 0 14840121 0
+ .
0 0
> ## Get the number of nucleotides that were modified by this injection:
> neditAt(unmasked(genome2$chr22), unmasked(genome$chr22))
[1] 187008
>
> ## ---------------------------------------------------------------------
> ## D. SOME BASIC QUALITY CONTROL (WITH SURPRISING RESULTS!)
> ## ---------------------------------------------------------------------
>
> ## Note that dbSNP can assign distinct ids to SNPs located at the same
> ## position:
> any(duplicated(chr22snps$RefSNP_id)) # rs ids are all distinct...
[1] FALSE
> any(duplicated(chr22snps$loc)) # but some locations are repeated!
[1] TRUE
> chr22snps <- chr22snps[order(chr22snps$loc), ] # sort by location
> which(duplicated(chr22snps$loc))[1] # 2
[1] 2
> chr22snps[1:3, ] # rs67522692 and rs71218724 have same locations
RefSNP_id alleles_as_ambig loc
1 67522692 M 14892137
2 71218724 M 14892137
3 9617282 R 14908060
> # and alleles
>
> ## Also note that not all SNP alleles are consistent with the hg18 genome
> ## i.e. the alleles reported for a given SNP are not always compatible
> ## with the nucleotide found at the SNP location in hg18.
> ## For example, to get the number of inconsistent SNPs in chr1:
> chr1snps <- getSNPlocs("chr1")
> all_alleles <- paste(chr1snps$alleles_as_ambig, collapse="")
> nchar(all_alleles) # 920233 SNPs on chr1
[1] 920233
> neditAt(all_alleles, unmasked(genome$chr1)[chr1snps$loc], fixed=FALSE)
[1] 1322
> ## ==> 1322 SNPs (0.14%) are inconsistent with hg18 chr1!
>
> ## Finally, let's check that no SNP falls in an assembly gap:
> agaps <- masks(genome$chr1)$AGAPS
> agaps # the assembly gaps
NULL
> ## Looping over the assembly gaps:
> sapply(1:length(agaps),
+ function(i)
+ any(chr1snps$loc >= start(agaps)[i] &
+ chr1snps$loc <= end(agaps)[i]))
Error in hasTsp(x) : attempt to set an attribute on NULL
Calls: sapply ... lapply -> FUN -> start -> start -> start.default -> hasTsp
Execution halted