Last data update: 2014.03.03

R: The SNPlocs.Hsapiens.dbSNP141.GRCh38 package
SNPlocs.Hsapiens.dbSNP141.GRCh38R Documentation

The SNPlocs.Hsapiens.dbSNP141.GRCh38 package

Description

This package contains SNP locations and alleles for Homo sapiens extracted from dbSNP Build 141.

Details

SNPs from dbSNP were filtered to keep only those satisfying the 3 following criteria:

  • The SNP is a single-base substitution i.e. its type is "snp". Other types used by dbSNP are: "in-del", "mixed", "microsatellite", "named-locus", "multinucleotide-polymorphism", etc... All those SNPs were dropped.

  • The SNP is marked as notwithdrawn.

  • A single location on the reference genome (GRCh37.p5) is reported for the SNP, and this location is on chromosomes 1-22, X, Y, or MT.

SNPlocs packages always store the alleles corresponding to the plus strand, whatever the strand reported by dbSNP is (which is achieved by storing the complement of the alleles reported by dbSNP for SNPs located on the minus strand). In other words, in a SNPlocs package, all the SNPs are considered to be on the plus strand and everything is reported with respect to that strand.

Note

The source data files used for this package were created by the dbSNP Development Team at NCBI on May 1st, 2014.

The SNPs in this package can be "injected" in BSgenome.Hsapiens.NCBI.GRCh38 or BSgenome.Hsapiens.UCSC.hg38 and will land at the correct location.

See http://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/ for more information about the GRCh38 assembly.

See http://www.ncbi.nlm.nih.gov/snp, the SNP Home at NCBI, for more information about dbSNP.

See ?injectSNPs in the BSgenome software package for more information about the SNP injection mechanism.

See http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg38 for the UCSC Genome Browser based on the hg38 assembly. Note hg38 and GRCh38 are the same assemblies (i.e. the 455 genomic sequences in both of them are the same), except that they use different conventions to name the sequences (i.e. for the chromosome and scaffold names).

Author(s)

H. Pages

References

Genome Reference Consortium at NCBI: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/

SNP Home at NCBI: http://www.ncbi.nlm.nih.gov/snp

dbSNP Human BUILD 141 announcement: http://www.ncbi.nlm.nih.gov/mailman/pipermail/dbsnp-announce/2014q2/000139.html

UCSC Genome Browser based on the hg38 assembly: http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg38

See Also

  • snplocs in the BSgenome software package for how to access the data stored in this package.

  • IUPAC_CODE_MAP in the Biostrings package.

  • The GRanges class in the GenomicRanges package.

  • injectSNPs in the BSgenome software package for SNP injection.

  • The VariantAnnotation software package to annotate variants with respect to location and amino acid coding.

Examples

## ---------------------------------------------------------------------
## A. BASIC USAGE
## ---------------------------------------------------------------------
snps <- SNPlocs.Hsapiens.dbSNP141.GRCh38
snpcount(snps)

## Get the locations and alleles of all SNPs on chromosome 22:
ch22snps <- snplocs(snps, "ch22")
dim(ch22snps)
colnames(ch22snps)
head(ch22snps)

## Get the locations and alleles of all SNPs on chromosomes 22 and MT
## as a GRanges object:
snplocs(snps, c("ch22", "chMT"), as.GRanges=TRUE)

## ---------------------------------------------------------------------
## B. EXTRACT SNP INFORMATION FOR A SET OF RS IDS...
## ---------------------------------------------------------------------

## ... and return it in a GRanges object:
my_rsids <- c("rs2639606", "rs75264089", "rs73396229", "rs55871206",
             "rs10932221", "rs56219727", "rs73709730", "rs55838886",
             "rs3734153", "rs79381275", "rs1516535")
gr <- snpid2grange(snps, my_rsids)

## Translate the IUPAC ambiguity codes used to represent the alleles
## into nucleotides:
IUPAC_CODE_MAP[mcols(gr)$alleles_as_ambig]

## ---------------------------------------------------------------------
## C. INJECTION IN THE REFERENCE GENOME
## ---------------------------------------------------------------------

library(BSgenome.Hsapiens.NCBI.GRCh38)
genome <- BSgenome.Hsapiens.NCBI.GRCh38
genome

genome2 <- injectSNPs(genome, "SNPlocs.Hsapiens.dbSNP141.GRCh38")
genome2  # note the additional line "with SNPs injected from..."

alphabetFrequency(genome[["22"]])
alphabetFrequency(genome2[["22"]])

## Get the number of nucleotides that were modified by this injection:
neditAt(genome[["22"]], genome2[["22"]])  # 737750

## ---------------------------------------------------------------------
## D. SOME BASIC QUALITY CONTROL (WITH SURPRISING RESULTS!)
## ---------------------------------------------------------------------

## Note that dbSNP can assign distinct ids to SNPs located at the same
## position:
any(duplicated(ch22snps$RefSNP_id))  # rs ids are all distinct...
any(duplicated(ch22snps$loc))  # but some locations are repeated!

ch22snps <- ch22snps[order(ch22snps$loc), ]  # sort by location
which(duplicated(ch22snps$loc))[1:2]  # 106, 558
ch22snps[103:108, ]  # rs9617549 and rs199758506 share the same location
                     # (10874444) and alleles (Y, i.e. C/T)

## Also note that not all SNP alleles are consistent with the GRCh38
## genomic sequences i.e. the alleles reported for a given SNP are not
## always compatible with the nucleotide found at the SNP location in
## GRCh38. For example, to get the number of inconsistent SNPs in chr1:
ch1snps <- snplocs(snps, "ch1")
all_alleles <- DNAString(paste(ch1snps$alleles_as_ambig, collapse=""))
nchar(all_alleles)  # 4160510 SNPs on chr1
neditAt(genome[["1"]][ch1snps$loc], all_alleles, fixed=FALSE)
## ==> 57641 SNPs (1.385%) are inconsistent with GRCh38 chr1!

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(SNPlocs.Hsapiens.dbSNP141.GRCh38)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit


Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: BSgenome
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/SNPlocs.Hsapiens.dbSNP141.GRCh38/package.Rd_%03d_medium.png", width=480, height=480)
> ### Name: SNPlocs.Hsapiens.dbSNP141.GRCh38
> ### Title: The SNPlocs.Hsapiens.dbSNP141.GRCh38 package
> ### Aliases: SNPlocs.Hsapiens.dbSNP141.GRCh38-package
> ###   SNPlocs.Hsapiens.dbSNP141.GRCh38 COMPATIBLE_BSGENOMES
> ### Keywords: package
> 
> ### ** Examples
> 
> ## ---------------------------------------------------------------------
> ## A. BASIC USAGE
> ## ---------------------------------------------------------------------
> snps <- SNPlocs.Hsapiens.dbSNP141.GRCh38
> snpcount(snps)
    ch1     ch2     ch3     ch4     ch5     ch6     ch7     ch8     ch9    ch10 
4160510 4455172 3700693 3630068 3337728 3244071 3038017 2887273 2340705 2561394 
   ch11    ch12    ch13    ch14    ch15    ch16    ch17    ch18    ch19    ch20 
2613035 2496276 1801222 1690797 1553541 1750501 1514837 1434166 1234897 1243589 
   ch21    ch22     chX     chY    chMT 
 755647  746745 1938976  118943    2053 
> 
> ## Get the locations and alleles of all SNPs on chromosome 22:
> ch22snps <- snplocs(snps, "ch22")
> dim(ch22snps)
[1] 746745      3
> colnames(ch22snps)
[1] "RefSNP_id"        "alleles_as_ambig" "loc"             
> head(ch22snps)
  RefSNP_id alleles_as_ambig      loc
1  71207739                M 10510770
2  71207740                R 10510928
3   4022986                R 10511116
4  71207745                R 10511641
5   4022528                S 10516682
6   4022529                R 10516697
> 
> ## Get the locations and alleles of all SNPs on chromosomes 22 and MT
> ## as a GRanges object:
> snplocs(snps, c("ch22", "chMT"), as.GRanges=TRUE)
GRanges object with 748798 ranges and 2 metadata columns:
           seqnames               ranges strand |   RefSNP_id alleles_as_ambig
              <Rle>            <IRanges>  <Rle> | <character>      <character>
       [1]     ch22 [10510770, 10510770]      + |    71207739                M
       [2]     ch22 [10510928, 10510928]      + |    71207740                R
       [3]     ch22 [10511116, 10511116]      + |     4022986                R
       [4]     ch22 [10511641, 10511641]      + |    71207745                R
       [5]     ch22 [10516682, 10516682]      + |     4022528                S
       ...      ...                  ...    ... .         ...              ...
  [748794]     chMT       [16512, 16512]      + |   373943637                Y
  [748795]     chMT       [16519, 16519]      + |     3937033                Y
  [748796]     chMT       [16528, 16528]      + |   386829315                R
  [748797]     chMT       [16529, 16529]      + |   370705831                Y
  [748798]     chMT       [16529, 16529]      + |   386829316                Y
  -------
  seqinfo: 25 sequences (1 circular) from GRCh38 genome
> 
> ## ---------------------------------------------------------------------
> ## B. EXTRACT SNP INFORMATION FOR A SET OF RS IDS...
> ## ---------------------------------------------------------------------
> 
> ## ... and return it in a GRanges object:
> my_rsids <- c("rs2639606", "rs75264089", "rs73396229", "rs55871206",
+              "rs10932221", "rs56219727", "rs73709730", "rs55838886",
+              "rs3734153", "rs79381275", "rs1516535")
> gr <- snpid2grange(snps, my_rsids)
> 
> ## Translate the IUPAC ambiguity codes used to represent the alleles
> ## into nucleotides:
> IUPAC_CODE_MAP[mcols(gr)$alleles_as_ambig]
   R    S    Y    S    Y    W    R    Y    Y    Y    S 
"AG" "CG" "CT" "CG" "CT" "AT" "AG" "CT" "CT" "CT" "CG" 
> 
> ## ---------------------------------------------------------------------
> ## C. INJECTION IN THE REFERENCE GENOME
> ## ---------------------------------------------------------------------
> 
> library(BSgenome.Hsapiens.NCBI.GRCh38)
> genome <- BSgenome.Hsapiens.NCBI.GRCh38
> genome
Human genome:
# organism: Homo sapiens (Human)
# provider: NCBI
# provider version: GRCh38
# release date: 2013-12-17
# release name: Genome Reference Consortium Human Build 38
# 455 sequences:
#   1                                  2                                 
#   3                                  4                                 
#   5                                  6                                 
#   7                                  8                                 
#   9                                  10                                
#   ...                                ...                               
#   HSCHR19KIR_FH05_B_HAP_CTG3_1       HSCHR19KIR_FH06_A_HAP_CTG3_1      
#   HSCHR19KIR_FH06_BA1_HAP_CTG3_1     HSCHR19KIR_FH08_A_HAP_CTG3_1      
#   HSCHR19KIR_FH08_BAX_HAP_CTG3_1     HSCHR19KIR_FH13_A_HAP_CTG3_1      
#   HSCHR19KIR_FH13_BA2_HAP_CTG3_1     HSCHR19KIR_FH15_A_HAP_CTG3_1      
#   HSCHR19KIR_RP5_B_HAP_CTG3_1                                          
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
> 
> genome2 <- injectSNPs(genome, "SNPlocs.Hsapiens.dbSNP141.GRCh38")
> genome2  # note the additional line "with SNPs injected from..."
Human genome:
# organism: Homo sapiens (Human)
# provider: NCBI
# provider version: GRCh38
# release date: 2013-12-17
# release name: Genome Reference Consortium Human Build 38
# with SNPs injected from package: SNPlocs.Hsapiens.dbSNP141.GRCh38
# 455 sequences:
#   1                                  2                                 
#   3                                  4                                 
#   5                                  6                                 
#   7                                  8                                 
#   9                                  10                                
#   ...                                ...                               
#   HSCHR19KIR_FH05_B_HAP_CTG3_1       HSCHR19KIR_FH06_A_HAP_CTG3_1      
#   HSCHR19KIR_FH06_BA1_HAP_CTG3_1     HSCHR19KIR_FH08_A_HAP_CTG3_1      
#   HSCHR19KIR_FH08_BAX_HAP_CTG3_1     HSCHR19KIR_FH13_A_HAP_CTG3_1      
#   HSCHR19KIR_FH13_BA2_HAP_CTG3_1     HSCHR19KIR_FH15_A_HAP_CTG3_1      
#   HSCHR19KIR_RP5_B_HAP_CTG3_1                                          
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
> 
> alphabetFrequency(genome[["22"]])
       A        C        G        T        M        R        W        S 
10382214  9160652  9246186 10370725        0        2        1        0 
       Y        K        V        H        D        B        N        - 
       2        0        0        0        0        0 11658686        0 
       +        . 
       0        0 
> alphabetFrequency(genome2[["22"]])
       A        C        G        T        M        R        W        S 
10252805  8920915  9005903 10242404    56909   251149    41963    67909 
       Y        K        V        H        D        B        N        - 
  250153    56775     3251     2278     2192     3385 11660477        0 
       +        . 
       0        0 
> 
> ## Get the number of nucleotides that were modified by this injection:
> neditAt(genome[["22"]], genome2[["22"]])  # 737750
[1] 737750
> 
> ## ---------------------------------------------------------------------
> ## D. SOME BASIC QUALITY CONTROL (WITH SURPRISING RESULTS!)
> ## ---------------------------------------------------------------------
> 
> ## Note that dbSNP can assign distinct ids to SNPs located at the same
> ## position:
> any(duplicated(ch22snps$RefSNP_id))  # rs ids are all distinct...
[1] FALSE
> any(duplicated(ch22snps$loc))  # but some locations are repeated!
[1] TRUE
> 
> ch22snps <- ch22snps[order(ch22snps$loc), ]  # sort by location
> which(duplicated(ch22snps$loc))[1:2]  # 106, 558
[1] 106 558
> ch22snps[103:108, ]  # rs9617549 and rs199758506 share the same location
    RefSNP_id alleles_as_ambig      loc
103  71275350                Y 10681527
104   2465321                Y 10842661
105   9617549                Y 10874444
106 199758506                Y 10874444
107 201933133                Y 10874460
108 201352866                Y 10874503
>                      # (10874444) and alleles (Y, i.e. C/T)
> 
> ## Also note that not all SNP alleles are consistent with the GRCh38
> ## genomic sequences i.e. the alleles reported for a given SNP are not
> ## always compatible with the nucleotide found at the SNP location in
> ## GRCh38. For example, to get the number of inconsistent SNPs in chr1:
> ch1snps <- snplocs(snps, "ch1")
> all_alleles <- DNAString(paste(ch1snps$alleles_as_ambig, collapse=""))
> nchar(all_alleles)  # 4160510 SNPs on chr1
[1] 4160510
> neditAt(genome[["1"]][ch1snps$loc], all_alleles, fixed=FALSE)
[1] 57641
> ## ==> 57641 SNPs (1.385%) are inconsistent with GRCh38 chr1!
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>