R: Accessing the SNPs stored in SNPlocs.Hsapiens.dbSNP142.GRCh37
getSNPlocs
R Documentation
Accessing the SNPs stored in SNPlocs.Hsapiens.dbSNP142.GRCh37
Description
Functions for accessing the SNPs stored in the
SNPlocs.Hsapiens.dbSNP142.GRCh37 package.
Usage
## Count and load all the SNPs for a given chromosome:
getSNPcount()
getSNPlocs(seqname, as.GRanges=FALSE, caching=TRUE)
## Extract SNP information for a set of rs ids:
rsid2loc(rsids, caching=TRUE)
rsid2alleles(rsids, caching=TRUE)
rsidsToGRanges(rsids, caching=TRUE)
Arguments
seqname
The name of the sequence for which to get the SNP locations
and alleles.
If as.GRanges is FALSE, only one sequence can
be specified (i.e. seqname must be a single string).
If as.GRanges is TRUE, an arbitrary number of
sequences can be specified (i.e. seqname can be
a character vector of arbitrary length).
as.GRanges
TRUE or FALSE. If TRUE, then the SNP locations
and alleles are returned in a GRanges object.
Otherwise (the default), they are returned in a data frame (see below).
caching
Should the loaded SNPs be cached in memory for faster further
retrieval but at the cost of increased memory usage?
rsids
A vector of rs ids. Can be integer or character vector, with or
without the "rs" prefix. NAs are not allowed.
Details
See SNPlocs.Hsapiens.dbSNP142.GRCh37 for general information
about this package.
The SNP data are split by chromosome (1-22, X, Y, MT) i.e. the
package contains one data set per chromosome, each of them being a
serialized data frame with 1 row per SNP and the 2 following columns:
loc: The 1-based location of the SNP relative to the
first base at the 5' end of the plus strand of the reference
sequence.
alleles: A raw vector with no NAs which can be
converted into a character vector containing the alleles
for each SNP represented by an IUPAC nucleotide ambiguity
code (see ?IUPAC_CODE_MAP in the
Biostrings package for more information).
Note that those data sets are not intended to be used directly but
the user should instead use the getSNPcount and getSNPlocs
convenience wrappers for loading the SNP data. When used with
as.GRanges=FALSE (the default), getSNPlocs returns
a data frame with 1 row per SNP and the 3 following columns:
RefSNP_id: RefSNP ID (aka "rs id") with "rs"
prefix removed. Character vector with no NAs and no duplicates.
alleles_as_ambig: A character vector with no NAs
containing the alleles for each SNP represented by an IUPAC
nucleotide ambiguity code.
loc: Same as for the 2-col serialized data frame
described previously.
Value
getSNPcount returns a named integer vector containing the number
of SNPs for each sequence in the reference genome.
By default (as.GRanges=FALSE), getSNPlocs returns the
3-col data frame described above containing the SNP data for the
specified chromosome.
Otherwise (as.GRanges=TRUE), it returns a
GRanges object with extra columns
"RefSNP_id" and "alleles_as_ambig".
Note that all the elements (genomic ranges) in this
GRanges object have their strand set
to "+" and that all the sequence lengths are set to NA.
rsid2loc and rsid2alleles both return a named vector
(integer vector for the former, character vector for the latter)
where each (name, value) pair corresponds to a supplied rs id.
For both functions the name in (name, value) is the chromosome
of the rs id. The value in (name, value) is the position of the rs id
on the chromosome for rsid2loc, and a single IUPAC code
representing the associated alleles for rsid2alleles.
rsidsToGRanges returns a GRanges object
similar to the one returned by getSNPlocs (when used with
as.GRanges=TRUE) and where each element corresponds to a
supplied rs id.
Author(s)
H. Pages
See Also
SNPlocs.Hsapiens.dbSNP142.GRCh37
IUPAC_CODE_MAP
GRanges-class
BSgenome-class
injectSNPs
findOverlaps
Examples
## ---------------------------------------------------------------------
## A. BASIC USAGE
## ---------------------------------------------------------------------
getSNPcount()
## Get the locations and alleles of all SNPs on chromosome 22:
ch22snps <- getSNPlocs("ch22")
dim(ch22snps)
colnames(ch22snps)
head(ch22snps)
## Get the locations and alleles of all SNPs on chromosomes 22 and MT
## as a GRanges object:
getSNPlocs(c("ch22", "chMT"), as.GRanges=TRUE)
## ---------------------------------------------------------------------
## B. EXTRACT SNP INFORMATION FOR A SET OF RS IDS...
## ---------------------------------------------------------------------
## ... and return it in a GRanges object:
myrsids <- c("rs2639606", "rs75264089", "rs73396229", "rs55871206",
"rs10932221", "rs56219727", "rs73709730", "rs55838886",
"rs3734153", "rs79381275", "rs1516535")
gr <- rsidsToGRanges(myrsids)
IUPAC_CODE_MAP[mcols(gr)$alleles_as_ambig]
## ---------------------------------------------------------------------
## C. INJECTION IN THE REFERENCE GENOME
## ---------------------------------------------------------------------
library(BSgenome.Hsapiens.UCSC.hg19.masked)
BSgenome.Hsapiens.UCSC.hg19.masked
## Note that the chromosome names in BSgenome.Hsapiens.UCSC.hg19
## are those used by UCSC and they differ from those used by dbSNP.
## Inject the SNPs in hg19 (injectSNPs() "knows" how to map dbSNP
## chromosome names to UCSC names):
genome <- BSgenome.Hsapiens.UCSC.hg19.masked
genome2 <- injectSNPs(genome, "SNPlocs.Hsapiens.dbSNP142.GRCh37")
genome2
alphabetFrequency(unmasked(genome2$chr22))
alphabetFrequency(unmasked(genome$chr22))
## Get the number of nucleotides that were modified by this injection:
neditAt(unmasked(genome2$chr22), unmasked(genome$chr22))
## ---------------------------------------------------------------------
## D. SOME BASIC QUALITY CONTROL (WITH SURPRISING RESULTS!)
## ---------------------------------------------------------------------
## Note that dbSNP can assign distinct ids to SNPs located at the same
## position:
any(duplicated(ch22snps$RefSNP_id)) # rs ids are all distinct...
any(duplicated(ch22snps$loc)) # but some locations are repeated!
ch22snps <- ch22snps[order(ch22snps$loc), ] # sort by location
which(duplicated(ch22snps$loc))[1:3] # 24,50,53
ch22snps[48:54, ] # rs2495297 and rs113903952 share the same location
# (16051255) and alleles (Y, i.e. C/T).
# Same thing for rs2508057 and rs76439996.
## Also note that not all SNP alleles are consistent with the hg19 genome
## i.e. the alleles reported for a given SNP are not always compatible
## with the nucleotide found at the SNP location in hg19.
## For example, to get the number of inconsistent SNPs in chr1:
ch1snps <- getSNPlocs("ch1")
all_alleles <- paste(ch1snps$alleles_as_ambig, collapse="")
nchar(all_alleles) # 8041396 SNPs on chr1
neditAt(all_alleles, unmasked(genome$chr1)[ch1snps$loc], fixed=FALSE)
## ==> 2631 SNPs (0.000327%) are inconsistent with hg19 chr1
## Finally, let's check that no SNP falls in an assembly gap:
agaps <- masks(genome$chr1)$AGAPS
agaps # the assembly gaps
## Looping over the assembly gaps:
sapply(1:length(agaps),
function(i)
any(ch1snps$loc >= start(agaps)[i] &
ch1snps$loc <= end(agaps)[i]))
## Or, in a more efficient way:
stopifnot(length(findOverlaps(ch1snps$loc, agaps)) == 0)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(SNPlocs.Hsapiens.dbSNP142.GRCh37)
Loading required package: IRanges
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: BSgenome
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/SNPlocs.Hsapiens.dbSNP142.GRCh37/getSNPlocs.Rd_%03d_medium.png", width=480, height=480)
> ### Name: getSNPlocs
> ### Title: Accessing the SNPs stored in SNPlocs.Hsapiens.dbSNP142.GRCh37
> ### Aliases: getSNPcount .loadLoc .loadAlleles getSNPlocs rsid2loc
> ### rsid2alleles rsidsToGRanges
> ### Keywords: data
>
> ### ** Examples
>
> ## ---------------------------------------------------------------------
> ## A. BASIC USAGE
> ## ---------------------------------------------------------------------
>
> getSNPcount()
ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 ch10
8041396 8710119 7175795 7013628 6472099 6146663 5847602 5646660 4474397 4917656
ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 ch20
5012865 4739993 3487384 3258073 2994310 3400965 2917994 2773791 2324935 2313217
ch21 ch22 chX chY chMT
1397905 1401280 2091538 129866 1581
>
> ## Get the locations and alleles of all SNPs on chromosome 22:
> ch22snps <- getSNPlocs("ch22")
> dim(ch22snps)
[1] 1401280 3
> colnames(ch22snps)
[1] "RefSNP_id" "alleles_as_ambig" "loc"
> head(ch22snps)
RefSNP_id alleles_as_ambig loc
1 374742143 M 16050036
2 587697622 R 16050075
3 587755077 R 16050115
4 375383604 Y 16050159
5 587654921 Y 16050213
6 199856444 W 16050252
>
> ## Get the locations and alleles of all SNPs on chromosomes 22 and MT
> ## as a GRanges object:
> getSNPlocs(c("ch22", "chMT"), as.GRanges=TRUE)
GRanges object with 1402861 ranges and 2 metadata columns:
seqnames ranges strand | RefSNP_id alleles_as_ambig
<Rle> <IRanges> <Rle> | <character> <character>
[1] ch22 [16050036, 16050036] + | 374742143 M
[2] ch22 [16050075, 16050075] + | 587697622 R
[3] ch22 [16050115, 16050115] + | 587755077 R
[4] ch22 [16050159, 16050159] + | 375383604 Y
[5] ch22 [16050213, 16050213] + | 587654921 Y
... ... ... ... . ... ...
[1402857] chMT [16512, 16512] + | 373943637 Y
[1402858] chMT [16519, 16519] + | 3937033 Y
[1402859] chMT [16526, 16526] + | 386829315 R
[1402860] chMT [16527, 16527] + | 386829316 Y
[1402861] chMT [16529, 16529] + | 370705831 Y
-------
seqinfo: 25 sequences (1 circular) from GRCh37.p13 genome
>
> ## ---------------------------------------------------------------------
> ## B. EXTRACT SNP INFORMATION FOR A SET OF RS IDS...
> ## ---------------------------------------------------------------------
>
> ## ... and return it in a GRanges object:
> myrsids <- c("rs2639606", "rs75264089", "rs73396229", "rs55871206",
+ "rs10932221", "rs56219727", "rs73709730", "rs55838886",
+ "rs3734153", "rs79381275", "rs1516535")
> gr <- rsidsToGRanges(myrsids)
> IUPAC_CODE_MAP[mcols(gr)$alleles_as_ambig]
R S Y S Y W R Y Y Y S
"AG" "CG" "CT" "CG" "CT" "AT" "AG" "CT" "CT" "CT" "CG"
>
> ## ---------------------------------------------------------------------
> ## C. INJECTION IN THE REFERENCE GENOME
> ## ---------------------------------------------------------------------
>
> library(BSgenome.Hsapiens.UCSC.hg19.masked)
Loading required package: BSgenome.Hsapiens.UCSC.hg19
> BSgenome.Hsapiens.UCSC.hg19.masked
Human genome:
# organism: Homo sapiens (Human)
# provider: UCSC
# provider version: hg19
# release date: Feb. 2009
# release name: Genome Reference Consortium GRCh37
# 93 sequences:
# chr1 chr2 chr3
# chr4 chr5 chr6
# chr7 chr8 chr9
# chr10 chr11 chr12
# chr13 chr14 chr15
# ... ... ...
# chrUn_gl000235 chrUn_gl000236 chrUn_gl000237
# chrUn_gl000238 chrUn_gl000239 chrUn_gl000240
# chrUn_gl000241 chrUn_gl000242 chrUn_gl000243
# chrUn_gl000244 chrUn_gl000245 chrUn_gl000246
# chrUn_gl000247 chrUn_gl000248 chrUn_gl000249
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
> ## Note that the chromosome names in BSgenome.Hsapiens.UCSC.hg19
> ## are those used by UCSC and they differ from those used by dbSNP.
>
> ## Inject the SNPs in hg19 (injectSNPs() "knows" how to map dbSNP
> ## chromosome names to UCSC names):
> genome <- BSgenome.Hsapiens.UCSC.hg19.masked
> genome2 <- injectSNPs(genome, "SNPlocs.Hsapiens.dbSNP142.GRCh37")
> genome2
Human genome:
# organism: Homo sapiens (Human)
# provider: UCSC
# provider version: hg19
# release date: Feb. 2009
# release name: Genome Reference Consortium GRCh37
# with SNPs injected from package: SNPlocs.Hsapiens.dbSNP142.GRCh37
# 93 sequences:
# chr1 chr2 chr3
# chr4 chr5 chr6
# chr7 chr8 chr9
# chr10 chr11 chr12
# chr13 chr14 chr15
# ... ... ...
# chrUn_gl000235 chrUn_gl000236 chrUn_gl000237
# chrUn_gl000238 chrUn_gl000239 chrUn_gl000240
# chrUn_gl000241 chrUn_gl000242 chrUn_gl000243
# chrUn_gl000244 chrUn_gl000245 chrUn_gl000246
# chrUn_gl000247 chrUn_gl000248 chrUn_gl000249
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
> alphabetFrequency(unmasked(genome2$chr22))
A C G T M R W S
8853615 7915690 7908672 8815635 107294 483614 75852 130106
Y K V H D B N -
482196 107395 3594 3264 3251 3584 16410804 0
+ .
0 0
> alphabetFrequency(unmasked(genome$chr22))
A C G T M R W S
9094775 8375984 8369235 9054551 0 0 0 0
Y K V H D B N -
0 0 0 0 0 0 16410021 0
+ .
0 0
> ## Get the number of nucleotides that were modified by this injection:
> neditAt(unmasked(genome2$chr22), unmasked(genome$chr22))
[1] 1400933
>
> ## ---------------------------------------------------------------------
> ## D. SOME BASIC QUALITY CONTROL (WITH SURPRISING RESULTS!)
> ## ---------------------------------------------------------------------
>
> ## Note that dbSNP can assign distinct ids to SNPs located at the same
> ## position:
> any(duplicated(ch22snps$RefSNP_id)) # rs ids are all distinct...
[1] FALSE
> any(duplicated(ch22snps$loc)) # but some locations are repeated!
[1] TRUE
> ch22snps <- ch22snps[order(ch22snps$loc), ] # sort by location
> which(duplicated(ch22snps$loc))[1:3] # 24,50,53
[1] 24 50 53
> ch22snps[48:54, ] # rs2495297 and rs113903952 share the same location
RefSNP_id alleles_as_ambig loc
48 368207715 Y 16051252
49 2495297 Y 16051255
50 113903952 Y 16051255
51 587623720 K 16051269
52 2508057 S 16051295
53 76439996 S 16051295
54 11089130 S 16051347
> # (16051255) and alleles (Y, i.e. C/T).
> # Same thing for rs2508057 and rs76439996.
>
> ## Also note that not all SNP alleles are consistent with the hg19 genome
> ## i.e. the alleles reported for a given SNP are not always compatible
> ## with the nucleotide found at the SNP location in hg19.
> ## For example, to get the number of inconsistent SNPs in chr1:
> ch1snps <- getSNPlocs("ch1")
> all_alleles <- paste(ch1snps$alleles_as_ambig, collapse="")
> nchar(all_alleles) # 8041396 SNPs on chr1
[1] 8041396
> neditAt(all_alleles, unmasked(genome$chr1)[ch1snps$loc], fixed=FALSE)
[1] 2631
> ## ==> 2631 SNPs (0.000327%) are inconsistent with hg19 chr1
>
> ## Finally, let's check that no SNP falls in an assembly gap:
> agaps <- masks(genome$chr1)$AGAPS
> agaps # the assembly gaps
NormalIRanges object with 39 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 1 10000 10000
[2] 177418 227417 50000
[3] 267720 317719 50000
[4] 471369 521368 50000
[5] 2634221 2684220 50000
... ... ... ...
[35] 206332222 206482221 150000
[36] 223747847 223797846 50000
[37] 235192212 235242211 50000
[38] 248908211 249058210 150000
[39] 249240622 249250621 10000
> ## Looping over the assembly gaps:
> sapply(1:length(agaps),
+ function(i)
+ any(ch1snps$loc >= start(agaps)[i] &
+ ch1snps$loc <= end(agaps)[i]))
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE
> ## Or, in a more efficient way:
> stopifnot(length(findOverlaps(ch1snps$loc, agaps)) == 0)
>
>
>
>
>
> dev.off()
null device
1
>