This package contains SNP locations and alleles for Homo sapiens
extracted from dbSNP Build 141.
Details
SNPs from dbSNP were filtered to keep only those satisfying the 3
following criteria:
The SNP is a single-base substitution i.e. its type is "snp".
Other types used by dbSNP are: "in-del", "mixed",
"microsatellite", "named-locus", "multinucleotide-polymorphism",
etc... All those SNPs were dropped.
The SNP is marked as notwithdrawn.
A single location on the reference genome (GRCh37.p5) is
reported for the SNP, and this location is on chromosomes
1-22, X, Y, or MT.
SNPlocs packages always store the alleles corresponding to the plus
strand, whatever the strand reported by dbSNP is (which is achieved by
storing the complement of the alleles reported by dbSNP for SNPs located
on the minus strand).
In other words, in a SNPlocs package, all the SNPs are considered to be
on the plus strand and everything is reported with respect to that strand.
Note
The source data files used for this package were created by the dbSNP
Development Team at NCBI on May 1st, 2014.
The SNPs in this package can be "injected" in BSgenome.Hsapiens.NCBI.GRCh38
or BSgenome.Hsapiens.UCSC.hg38 and will land at the correct location.
See ?injectSNPs in the BSgenome software
package for more information about the SNP injection mechanism.
See http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg38
for the UCSC Genome Browser based on the hg38 assembly. Note hg38 and
GRCh38 are the same assemblies (i.e. the 455 genomic sequences in both
of them are the same), except that they use different conventions to
name the sequences (i.e. for the chromosome and scaffold names).
snplocs in the BSgenome software
package for how to access the data stored in this package.
IUPAC_CODE_MAP in the Biostrings
package.
The GRanges class in the GenomicRanges
package.
injectSNPs in the BSgenome software
package for SNP injection.
The VariantAnnotation software package to annotate variants
with respect to location and amino acid coding.
Examples
## ---------------------------------------------------------------------
## A. BASIC USAGE
## ---------------------------------------------------------------------
snps <- SNPlocs.Hsapiens.dbSNP141.GRCh38
snpcount(snps)
## Get the locations and alleles of all SNPs on chromosome 22:
ch22snps <- snplocs(snps, "ch22")
dim(ch22snps)
colnames(ch22snps)
head(ch22snps)
## Get the locations and alleles of all SNPs on chromosomes 22 and MT
## as a GRanges object:
snplocs(snps, c("ch22", "chMT"), as.GRanges=TRUE)
## ---------------------------------------------------------------------
## B. EXTRACT SNP INFORMATION FOR A SET OF RS IDS...
## ---------------------------------------------------------------------
## ... and return it in a GRanges object:
my_rsids <- c("rs2639606", "rs75264089", "rs73396229", "rs55871206",
"rs10932221", "rs56219727", "rs73709730", "rs55838886",
"rs3734153", "rs79381275", "rs1516535")
gr <- snpid2grange(snps, my_rsids)
## Translate the IUPAC ambiguity codes used to represent the alleles
## into nucleotides:
IUPAC_CODE_MAP[mcols(gr)$alleles_as_ambig]
## ---------------------------------------------------------------------
## C. INJECTION IN THE REFERENCE GENOME
## ---------------------------------------------------------------------
library(BSgenome.Hsapiens.NCBI.GRCh38)
genome <- BSgenome.Hsapiens.NCBI.GRCh38
genome
genome2 <- injectSNPs(genome, "SNPlocs.Hsapiens.dbSNP141.GRCh38")
genome2 # note the additional line "with SNPs injected from..."
alphabetFrequency(genome[["22"]])
alphabetFrequency(genome2[["22"]])
## Get the number of nucleotides that were modified by this injection:
neditAt(genome[["22"]], genome2[["22"]]) # 737750
## ---------------------------------------------------------------------
## D. SOME BASIC QUALITY CONTROL (WITH SURPRISING RESULTS!)
## ---------------------------------------------------------------------
## Note that dbSNP can assign distinct ids to SNPs located at the same
## position:
any(duplicated(ch22snps$RefSNP_id)) # rs ids are all distinct...
any(duplicated(ch22snps$loc)) # but some locations are repeated!
ch22snps <- ch22snps[order(ch22snps$loc), ] # sort by location
which(duplicated(ch22snps$loc))[1:2] # 106, 558
ch22snps[103:108, ] # rs9617549 and rs199758506 share the same location
# (10874444) and alleles (Y, i.e. C/T)
## Also note that not all SNP alleles are consistent with the GRCh38
## genomic sequences i.e. the alleles reported for a given SNP are not
## always compatible with the nucleotide found at the SNP location in
## GRCh38. For example, to get the number of inconsistent SNPs in chr1:
ch1snps <- snplocs(snps, "ch1")
all_alleles <- DNAString(paste(ch1snps$alleles_as_ambig, collapse=""))
nchar(all_alleles) # 4160510 SNPs on chr1
neditAt(genome[["1"]][ch1snps$loc], all_alleles, fixed=FALSE)
## ==> 57641 SNPs (1.385%) are inconsistent with GRCh38 chr1!
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(SNPlocs.Hsapiens.dbSNP141.GRCh38)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: BSgenome
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/SNPlocs.Hsapiens.dbSNP141.GRCh38/package.Rd_%03d_medium.png", width=480, height=480)
> ### Name: SNPlocs.Hsapiens.dbSNP141.GRCh38
> ### Title: The SNPlocs.Hsapiens.dbSNP141.GRCh38 package
> ### Aliases: SNPlocs.Hsapiens.dbSNP141.GRCh38-package
> ### SNPlocs.Hsapiens.dbSNP141.GRCh38 COMPATIBLE_BSGENOMES
> ### Keywords: package
>
> ### ** Examples
>
> ## ---------------------------------------------------------------------
> ## A. BASIC USAGE
> ## ---------------------------------------------------------------------
> snps <- SNPlocs.Hsapiens.dbSNP141.GRCh38
> snpcount(snps)
ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 ch10
4160510 4455172 3700693 3630068 3337728 3244071 3038017 2887273 2340705 2561394
ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 ch20
2613035 2496276 1801222 1690797 1553541 1750501 1514837 1434166 1234897 1243589
ch21 ch22 chX chY chMT
755647 746745 1938976 118943 2053
>
> ## Get the locations and alleles of all SNPs on chromosome 22:
> ch22snps <- snplocs(snps, "ch22")
> dim(ch22snps)
[1] 746745 3
> colnames(ch22snps)
[1] "RefSNP_id" "alleles_as_ambig" "loc"
> head(ch22snps)
RefSNP_id alleles_as_ambig loc
1 71207739 M 10510770
2 71207740 R 10510928
3 4022986 R 10511116
4 71207745 R 10511641
5 4022528 S 10516682
6 4022529 R 10516697
>
> ## Get the locations and alleles of all SNPs on chromosomes 22 and MT
> ## as a GRanges object:
> snplocs(snps, c("ch22", "chMT"), as.GRanges=TRUE)
GRanges object with 748798 ranges and 2 metadata columns:
seqnames ranges strand | RefSNP_id alleles_as_ambig
<Rle> <IRanges> <Rle> | <character> <character>
[1] ch22 [10510770, 10510770] + | 71207739 M
[2] ch22 [10510928, 10510928] + | 71207740 R
[3] ch22 [10511116, 10511116] + | 4022986 R
[4] ch22 [10511641, 10511641] + | 71207745 R
[5] ch22 [10516682, 10516682] + | 4022528 S
... ... ... ... . ... ...
[748794] chMT [16512, 16512] + | 373943637 Y
[748795] chMT [16519, 16519] + | 3937033 Y
[748796] chMT [16528, 16528] + | 386829315 R
[748797] chMT [16529, 16529] + | 370705831 Y
[748798] chMT [16529, 16529] + | 386829316 Y
-------
seqinfo: 25 sequences (1 circular) from GRCh38 genome
>
> ## ---------------------------------------------------------------------
> ## B. EXTRACT SNP INFORMATION FOR A SET OF RS IDS...
> ## ---------------------------------------------------------------------
>
> ## ... and return it in a GRanges object:
> my_rsids <- c("rs2639606", "rs75264089", "rs73396229", "rs55871206",
+ "rs10932221", "rs56219727", "rs73709730", "rs55838886",
+ "rs3734153", "rs79381275", "rs1516535")
> gr <- snpid2grange(snps, my_rsids)
>
> ## Translate the IUPAC ambiguity codes used to represent the alleles
> ## into nucleotides:
> IUPAC_CODE_MAP[mcols(gr)$alleles_as_ambig]
R S Y S Y W R Y Y Y S
"AG" "CG" "CT" "CG" "CT" "AT" "AG" "CT" "CT" "CT" "CG"
>
> ## ---------------------------------------------------------------------
> ## C. INJECTION IN THE REFERENCE GENOME
> ## ---------------------------------------------------------------------
>
> library(BSgenome.Hsapiens.NCBI.GRCh38)
> genome <- BSgenome.Hsapiens.NCBI.GRCh38
> genome
Human genome:
# organism: Homo sapiens (Human)
# provider: NCBI
# provider version: GRCh38
# release date: 2013-12-17
# release name: Genome Reference Consortium Human Build 38
# 455 sequences:
# 1 2
# 3 4
# 5 6
# 7 8
# 9 10
# ... ...
# HSCHR19KIR_FH05_B_HAP_CTG3_1 HSCHR19KIR_FH06_A_HAP_CTG3_1
# HSCHR19KIR_FH06_BA1_HAP_CTG3_1 HSCHR19KIR_FH08_A_HAP_CTG3_1
# HSCHR19KIR_FH08_BAX_HAP_CTG3_1 HSCHR19KIR_FH13_A_HAP_CTG3_1
# HSCHR19KIR_FH13_BA2_HAP_CTG3_1 HSCHR19KIR_FH15_A_HAP_CTG3_1
# HSCHR19KIR_RP5_B_HAP_CTG3_1
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
>
> genome2 <- injectSNPs(genome, "SNPlocs.Hsapiens.dbSNP141.GRCh38")
> genome2 # note the additional line "with SNPs injected from..."
Human genome:
# organism: Homo sapiens (Human)
# provider: NCBI
# provider version: GRCh38
# release date: 2013-12-17
# release name: Genome Reference Consortium Human Build 38
# with SNPs injected from package: SNPlocs.Hsapiens.dbSNP141.GRCh38
# 455 sequences:
# 1 2
# 3 4
# 5 6
# 7 8
# 9 10
# ... ...
# HSCHR19KIR_FH05_B_HAP_CTG3_1 HSCHR19KIR_FH06_A_HAP_CTG3_1
# HSCHR19KIR_FH06_BA1_HAP_CTG3_1 HSCHR19KIR_FH08_A_HAP_CTG3_1
# HSCHR19KIR_FH08_BAX_HAP_CTG3_1 HSCHR19KIR_FH13_A_HAP_CTG3_1
# HSCHR19KIR_FH13_BA2_HAP_CTG3_1 HSCHR19KIR_FH15_A_HAP_CTG3_1
# HSCHR19KIR_RP5_B_HAP_CTG3_1
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
>
> alphabetFrequency(genome[["22"]])
A C G T M R W S
10382214 9160652 9246186 10370725 0 2 1 0
Y K V H D B N -
2 0 0 0 0 0 11658686 0
+ .
0 0
> alphabetFrequency(genome2[["22"]])
A C G T M R W S
10252805 8920915 9005903 10242404 56909 251149 41963 67909
Y K V H D B N -
250153 56775 3251 2278 2192 3385 11660477 0
+ .
0 0
>
> ## Get the number of nucleotides that were modified by this injection:
> neditAt(genome[["22"]], genome2[["22"]]) # 737750
[1] 737750
>
> ## ---------------------------------------------------------------------
> ## D. SOME BASIC QUALITY CONTROL (WITH SURPRISING RESULTS!)
> ## ---------------------------------------------------------------------
>
> ## Note that dbSNP can assign distinct ids to SNPs located at the same
> ## position:
> any(duplicated(ch22snps$RefSNP_id)) # rs ids are all distinct...
[1] FALSE
> any(duplicated(ch22snps$loc)) # but some locations are repeated!
[1] TRUE
>
> ch22snps <- ch22snps[order(ch22snps$loc), ] # sort by location
> which(duplicated(ch22snps$loc))[1:2] # 106, 558
[1] 106 558
> ch22snps[103:108, ] # rs9617549 and rs199758506 share the same location
RefSNP_id alleles_as_ambig loc
103 71275350 Y 10681527
104 2465321 Y 10842661
105 9617549 Y 10874444
106 199758506 Y 10874444
107 201933133 Y 10874460
108 201352866 Y 10874503
> # (10874444) and alleles (Y, i.e. C/T)
>
> ## Also note that not all SNP alleles are consistent with the GRCh38
> ## genomic sequences i.e. the alleles reported for a given SNP are not
> ## always compatible with the nucleotide found at the SNP location in
> ## GRCh38. For example, to get the number of inconsistent SNPs in chr1:
> ch1snps <- snplocs(snps, "ch1")
> all_alleles <- DNAString(paste(ch1snps$alleles_as_ambig, collapse=""))
> nchar(all_alleles) # 4160510 SNPs on chr1
[1] 4160510
> neditAt(genome[["1"]][ch1snps$loc], all_alleles, fixed=FALSE)
[1] 57641
> ## ==> 57641 SNPs (1.385%) are inconsistent with GRCh38 chr1!
>
>
>
>
>
> dev.off()
null device
1
>