Last data update: 2014.03.03

R: Accessing the SNPs stored in SNPlocs.Hsapiens.dbSNP142.GRCh37
getSNPlocsR Documentation

Accessing the SNPs stored in SNPlocs.Hsapiens.dbSNP142.GRCh37

Description

Functions for accessing the SNPs stored in the SNPlocs.Hsapiens.dbSNP142.GRCh37 package.

Usage

## Count and load all the SNPs for a given chromosome:
getSNPcount()
getSNPlocs(seqname, as.GRanges=FALSE, caching=TRUE)

## Extract SNP information for a set of rs ids:
rsid2loc(rsids, caching=TRUE)
rsid2alleles(rsids, caching=TRUE)
rsidsToGRanges(rsids, caching=TRUE)

Arguments

seqname

The name of the sequence for which to get the SNP locations and alleles.

If as.GRanges is FALSE, only one sequence can be specified (i.e. seqname must be a single string). If as.GRanges is TRUE, an arbitrary number of sequences can be specified (i.e. seqname can be a character vector of arbitrary length).

as.GRanges

TRUE or FALSE. If TRUE, then the SNP locations and alleles are returned in a GRanges object. Otherwise (the default), they are returned in a data frame (see below).

caching

Should the loaded SNPs be cached in memory for faster further retrieval but at the cost of increased memory usage?

rsids

A vector of rs ids. Can be integer or character vector, with or without the "rs" prefix. NAs are not allowed.

Details

See SNPlocs.Hsapiens.dbSNP142.GRCh37 for general information about this package.

The SNP data are split by chromosome (1-22, X, Y, MT) i.e. the package contains one data set per chromosome, each of them being a serialized data frame with 1 row per SNP and the 2 following columns:

  • loc: The 1-based location of the SNP relative to the first base at the 5' end of the plus strand of the reference sequence.

  • alleles: A raw vector with no NAs which can be converted into a character vector containing the alleles for each SNP represented by an IUPAC nucleotide ambiguity code (see ?IUPAC_CODE_MAP in the Biostrings package for more information).

Note that those data sets are not intended to be used directly but the user should instead use the getSNPcount and getSNPlocs convenience wrappers for loading the SNP data. When used with as.GRanges=FALSE (the default), getSNPlocs returns a data frame with 1 row per SNP and the 3 following columns:

  • RefSNP_id: RefSNP ID (aka "rs id") with "rs" prefix removed. Character vector with no NAs and no duplicates.

  • alleles_as_ambig: A character vector with no NAs containing the alleles for each SNP represented by an IUPAC nucleotide ambiguity code.

  • loc: Same as for the 2-col serialized data frame described previously.

Value

getSNPcount returns a named integer vector containing the number of SNPs for each sequence in the reference genome.

By default (as.GRanges=FALSE), getSNPlocs returns the 3-col data frame described above containing the SNP data for the specified chromosome. Otherwise (as.GRanges=TRUE), it returns a GRanges object with extra columns "RefSNP_id" and "alleles_as_ambig". Note that all the elements (genomic ranges) in this GRanges object have their strand set to "+" and that all the sequence lengths are set to NA.

rsid2loc and rsid2alleles both return a named vector (integer vector for the former, character vector for the latter) where each (name, value) pair corresponds to a supplied rs id. For both functions the name in (name, value) is the chromosome of the rs id. The value in (name, value) is the position of the rs id on the chromosome for rsid2loc, and a single IUPAC code representing the associated alleles for rsid2alleles.

rsidsToGRanges returns a GRanges object similar to the one returned by getSNPlocs (when used with as.GRanges=TRUE) and where each element corresponds to a supplied rs id.

Author(s)

H. Pages

See Also

  • SNPlocs.Hsapiens.dbSNP142.GRCh37

  • IUPAC_CODE_MAP

  • GRanges-class

  • BSgenome-class

  • injectSNPs

  • findOverlaps

Examples

## ---------------------------------------------------------------------
## A. BASIC USAGE
## ---------------------------------------------------------------------

getSNPcount()

## Get the locations and alleles of all SNPs on chromosome 22:
ch22snps <- getSNPlocs("ch22")
dim(ch22snps)
colnames(ch22snps)
head(ch22snps)

## Get the locations and alleles of all SNPs on chromosomes 22 and MT
## as a GRanges object:
getSNPlocs(c("ch22", "chMT"), as.GRanges=TRUE)

## ---------------------------------------------------------------------
## B. EXTRACT SNP INFORMATION FOR A SET OF RS IDS...
## ---------------------------------------------------------------------

## ... and return it in a GRanges object:
myrsids <- c("rs2639606", "rs75264089", "rs73396229", "rs55871206",
             "rs10932221", "rs56219727", "rs73709730", "rs55838886",
             "rs3734153", "rs79381275", "rs1516535")
gr <- rsidsToGRanges(myrsids)
IUPAC_CODE_MAP[mcols(gr)$alleles_as_ambig]

## ---------------------------------------------------------------------
## C. INJECTION IN THE REFERENCE GENOME
## ---------------------------------------------------------------------

library(BSgenome.Hsapiens.UCSC.hg19.masked)
BSgenome.Hsapiens.UCSC.hg19.masked
## Note that the chromosome names in BSgenome.Hsapiens.UCSC.hg19
## are those used by UCSC and they differ from those used by dbSNP.

## Inject the SNPs in hg19 (injectSNPs() "knows" how to map dbSNP
## chromosome names to UCSC names):
genome <- BSgenome.Hsapiens.UCSC.hg19.masked
genome2 <- injectSNPs(genome, "SNPlocs.Hsapiens.dbSNP142.GRCh37")
genome2
alphabetFrequency(unmasked(genome2$chr22))
alphabetFrequency(unmasked(genome$chr22))
## Get the number of nucleotides that were modified by this injection:
neditAt(unmasked(genome2$chr22), unmasked(genome$chr22))

## ---------------------------------------------------------------------
## D. SOME BASIC QUALITY CONTROL (WITH SURPRISING RESULTS!)
## ---------------------------------------------------------------------

## Note that dbSNP can assign distinct ids to SNPs located at the same
## position:
any(duplicated(ch22snps$RefSNP_id))  # rs ids are all distinct...
any(duplicated(ch22snps$loc))  # but some locations are repeated!
ch22snps <- ch22snps[order(ch22snps$loc), ]  # sort by location
which(duplicated(ch22snps$loc))[1:3]  # 24,50,53
ch22snps[48:54, ]    # rs2495297 and rs113903952 share the same location
                     # (16051255) and alleles (Y, i.e. C/T).
                     # Same thing for rs2508057 and rs76439996.

## Also note that not all SNP alleles are consistent with the hg19 genome
## i.e. the alleles reported for a given SNP are not always compatible
## with the nucleotide found at the SNP location in hg19.
## For example, to get the number of inconsistent SNPs in chr1:
ch1snps <- getSNPlocs("ch1")
all_alleles <- paste(ch1snps$alleles_as_ambig, collapse="")
nchar(all_alleles)  # 8041396 SNPs on chr1
neditAt(all_alleles, unmasked(genome$chr1)[ch1snps$loc], fixed=FALSE)
## ==> 2631 SNPs (0.000327%) are inconsistent with hg19 chr1

## Finally, let's check that no SNP falls in an assembly gap:
agaps <- masks(genome$chr1)$AGAPS
agaps  # the assembly gaps
## Looping over the assembly gaps:
sapply(1:length(agaps),
       function(i)
           any(ch1snps$loc >= start(agaps)[i] &
               ch1snps$loc <= end(agaps)[i]))
## Or, in a more efficient way:
stopifnot(length(findOverlaps(ch1snps$loc, agaps)) == 0)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(SNPlocs.Hsapiens.dbSNP142.GRCh37)
Loading required package: IRanges
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: BSgenome
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/SNPlocs.Hsapiens.dbSNP142.GRCh37/getSNPlocs.Rd_%03d_medium.png", width=480, height=480)
> ### Name: getSNPlocs
> ### Title: Accessing the SNPs stored in SNPlocs.Hsapiens.dbSNP142.GRCh37
> ### Aliases: getSNPcount .loadLoc .loadAlleles getSNPlocs rsid2loc
> ###   rsid2alleles rsidsToGRanges
> ### Keywords: data
> 
> ### ** Examples
> 
> ## ---------------------------------------------------------------------
> ## A. BASIC USAGE
> ## ---------------------------------------------------------------------
> 
> getSNPcount()
    ch1     ch2     ch3     ch4     ch5     ch6     ch7     ch8     ch9    ch10 
8041396 8710119 7175795 7013628 6472099 6146663 5847602 5646660 4474397 4917656 
   ch11    ch12    ch13    ch14    ch15    ch16    ch17    ch18    ch19    ch20 
5012865 4739993 3487384 3258073 2994310 3400965 2917994 2773791 2324935 2313217 
   ch21    ch22     chX     chY    chMT 
1397905 1401280 2091538  129866    1581 
> 
> ## Get the locations and alleles of all SNPs on chromosome 22:
> ch22snps <- getSNPlocs("ch22")
> dim(ch22snps)
[1] 1401280       3
> colnames(ch22snps)
[1] "RefSNP_id"        "alleles_as_ambig" "loc"             
> head(ch22snps)
  RefSNP_id alleles_as_ambig      loc
1 374742143                M 16050036
2 587697622                R 16050075
3 587755077                R 16050115
4 375383604                Y 16050159
5 587654921                Y 16050213
6 199856444                W 16050252
> 
> ## Get the locations and alleles of all SNPs on chromosomes 22 and MT
> ## as a GRanges object:
> getSNPlocs(c("ch22", "chMT"), as.GRanges=TRUE)
GRanges object with 1402861 ranges and 2 metadata columns:
            seqnames               ranges strand |   RefSNP_id alleles_as_ambig
               <Rle>            <IRanges>  <Rle> | <character>      <character>
        [1]     ch22 [16050036, 16050036]      + |   374742143                M
        [2]     ch22 [16050075, 16050075]      + |   587697622                R
        [3]     ch22 [16050115, 16050115]      + |   587755077                R
        [4]     ch22 [16050159, 16050159]      + |   375383604                Y
        [5]     ch22 [16050213, 16050213]      + |   587654921                Y
        ...      ...                  ...    ... .         ...              ...
  [1402857]     chMT       [16512, 16512]      + |   373943637                Y
  [1402858]     chMT       [16519, 16519]      + |     3937033                Y
  [1402859]     chMT       [16526, 16526]      + |   386829315                R
  [1402860]     chMT       [16527, 16527]      + |   386829316                Y
  [1402861]     chMT       [16529, 16529]      + |   370705831                Y
  -------
  seqinfo: 25 sequences (1 circular) from GRCh37.p13 genome
> 
> ## ---------------------------------------------------------------------
> ## B. EXTRACT SNP INFORMATION FOR A SET OF RS IDS...
> ## ---------------------------------------------------------------------
> 
> ## ... and return it in a GRanges object:
> myrsids <- c("rs2639606", "rs75264089", "rs73396229", "rs55871206",
+              "rs10932221", "rs56219727", "rs73709730", "rs55838886",
+              "rs3734153", "rs79381275", "rs1516535")
> gr <- rsidsToGRanges(myrsids)
> IUPAC_CODE_MAP[mcols(gr)$alleles_as_ambig]
   R    S    Y    S    Y    W    R    Y    Y    Y    S 
"AG" "CG" "CT" "CG" "CT" "AT" "AG" "CT" "CT" "CT" "CG" 
> 
> ## ---------------------------------------------------------------------
> ## C. INJECTION IN THE REFERENCE GENOME
> ## ---------------------------------------------------------------------
> 
> library(BSgenome.Hsapiens.UCSC.hg19.masked)
Loading required package: BSgenome.Hsapiens.UCSC.hg19
> BSgenome.Hsapiens.UCSC.hg19.masked
Human genome:
# organism: Homo sapiens (Human)
# provider: UCSC
# provider version: hg19
# release date: Feb. 2009
# release name: Genome Reference Consortium GRCh37
# 93 sequences:
#   chr1                  chr2                  chr3                 
#   chr4                  chr5                  chr6                 
#   chr7                  chr8                  chr9                 
#   chr10                 chr11                 chr12                
#   chr13                 chr14                 chr15                
#   ...                   ...                   ...                  
#   chrUn_gl000235        chrUn_gl000236        chrUn_gl000237       
#   chrUn_gl000238        chrUn_gl000239        chrUn_gl000240       
#   chrUn_gl000241        chrUn_gl000242        chrUn_gl000243       
#   chrUn_gl000244        chrUn_gl000245        chrUn_gl000246       
#   chrUn_gl000247        chrUn_gl000248        chrUn_gl000249       
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
> ## Note that the chromosome names in BSgenome.Hsapiens.UCSC.hg19
> ## are those used by UCSC and they differ from those used by dbSNP.
> 
> ## Inject the SNPs in hg19 (injectSNPs() "knows" how to map dbSNP
> ## chromosome names to UCSC names):
> genome <- BSgenome.Hsapiens.UCSC.hg19.masked
> genome2 <- injectSNPs(genome, "SNPlocs.Hsapiens.dbSNP142.GRCh37")
> genome2
Human genome:
# organism: Homo sapiens (Human)
# provider: UCSC
# provider version: hg19
# release date: Feb. 2009
# release name: Genome Reference Consortium GRCh37
# with SNPs injected from package: SNPlocs.Hsapiens.dbSNP142.GRCh37
# 93 sequences:
#   chr1                  chr2                  chr3                 
#   chr4                  chr5                  chr6                 
#   chr7                  chr8                  chr9                 
#   chr10                 chr11                 chr12                
#   chr13                 chr14                 chr15                
#   ...                   ...                   ...                  
#   chrUn_gl000235        chrUn_gl000236        chrUn_gl000237       
#   chrUn_gl000238        chrUn_gl000239        chrUn_gl000240       
#   chrUn_gl000241        chrUn_gl000242        chrUn_gl000243       
#   chrUn_gl000244        chrUn_gl000245        chrUn_gl000246       
#   chrUn_gl000247        chrUn_gl000248        chrUn_gl000249       
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
> alphabetFrequency(unmasked(genome2$chr22))
       A        C        G        T        M        R        W        S 
 8853615  7915690  7908672  8815635   107294   483614    75852   130106 
       Y        K        V        H        D        B        N        - 
  482196   107395     3594     3264     3251     3584 16410804        0 
       +        . 
       0        0 
> alphabetFrequency(unmasked(genome$chr22))
       A        C        G        T        M        R        W        S 
 9094775  8375984  8369235  9054551        0        0        0        0 
       Y        K        V        H        D        B        N        - 
       0        0        0        0        0        0 16410021        0 
       +        . 
       0        0 
> ## Get the number of nucleotides that were modified by this injection:
> neditAt(unmasked(genome2$chr22), unmasked(genome$chr22))
[1] 1400933
> 
> ## ---------------------------------------------------------------------
> ## D. SOME BASIC QUALITY CONTROL (WITH SURPRISING RESULTS!)
> ## ---------------------------------------------------------------------
> 
> ## Note that dbSNP can assign distinct ids to SNPs located at the same
> ## position:
> any(duplicated(ch22snps$RefSNP_id))  # rs ids are all distinct...
[1] FALSE
> any(duplicated(ch22snps$loc))  # but some locations are repeated!
[1] TRUE
> ch22snps <- ch22snps[order(ch22snps$loc), ]  # sort by location
> which(duplicated(ch22snps$loc))[1:3]  # 24,50,53
[1] 24 50 53
> ch22snps[48:54, ]    # rs2495297 and rs113903952 share the same location
   RefSNP_id alleles_as_ambig      loc
48 368207715                Y 16051252
49   2495297                Y 16051255
50 113903952                Y 16051255
51 587623720                K 16051269
52   2508057                S 16051295
53  76439996                S 16051295
54  11089130                S 16051347
>                      # (16051255) and alleles (Y, i.e. C/T).
>                      # Same thing for rs2508057 and rs76439996.
> 
> ## Also note that not all SNP alleles are consistent with the hg19 genome
> ## i.e. the alleles reported for a given SNP are not always compatible
> ## with the nucleotide found at the SNP location in hg19.
> ## For example, to get the number of inconsistent SNPs in chr1:
> ch1snps <- getSNPlocs("ch1")
> all_alleles <- paste(ch1snps$alleles_as_ambig, collapse="")
> nchar(all_alleles)  # 8041396 SNPs on chr1
[1] 8041396
> neditAt(all_alleles, unmasked(genome$chr1)[ch1snps$loc], fixed=FALSE)
[1] 2631
> ## ==> 2631 SNPs (0.000327%) are inconsistent with hg19 chr1
> 
> ## Finally, let's check that no SNP falls in an assembly gap:
> agaps <- masks(genome$chr1)$AGAPS
> agaps  # the assembly gaps
NormalIRanges object with 39 ranges and 0 metadata columns:
           start       end     width
       <integer> <integer> <integer>
   [1]         1     10000     10000
   [2]    177418    227417     50000
   [3]    267720    317719     50000
   [4]    471369    521368     50000
   [5]   2634221   2684220     50000
   ...       ...       ...       ...
  [35] 206332222 206482221    150000
  [36] 223747847 223797846     50000
  [37] 235192212 235242211     50000
  [38] 248908211 249058210    150000
  [39] 249240622 249250621     10000
> ## Looping over the assembly gaps:
> sapply(1:length(agaps),
+        function(i)
+            any(ch1snps$loc >= start(agaps)[i] &
+                ch1snps$loc <= end(agaps)[i]))
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE
> ## Or, in a more efficient way:
> stopifnot(length(findOverlaps(ch1snps$loc, agaps)) == 0)
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>