Last data update: 2014.03.03

R: Accessing the SNPs stored in SNPlocs.Hsapiens.dbSNP.20090506
getSNPlocsR Documentation

Accessing the SNPs stored in SNPlocs.Hsapiens.dbSNP.20090506

Description

Functions for accessing the SNPs stored in the SNPlocs.Hsapiens.dbSNP.20090506 package.

Usage

## Count and load all the SNPs for a given chromosome:
getSNPcount()
getSNPlocs(seqname, as.GRanges=FALSE, caching=TRUE)

## Extract SNP information for a set of rs ids:
rsid2loc(rsids, caching=TRUE)
rsid2alleles(rsids, caching=TRUE)
rsidsToGRanges(rsids, caching=TRUE)

Arguments

seqname

The name of the sequence for which to get the SNP locations and alleles.

If as.GRanges is FALSE, only one sequence can be specified (i.e. seqname must be a single string). If as.GRanges is TRUE, an arbitrary number of sequences can be specified (i.e. seqname can be a character vector of arbitrary length).

as.GRanges

TRUE or FALSE. If TRUE, then the SNP locations and alleles are returned in a GRanges object. Otherwise (the default), they are returned in a data frame (see below).

caching

Should the loaded SNPs be cached in memory for faster further retrieval but at the cost of increased memory usage?

rsids

A vector of rs ids. Can be integer or character vector, with or without the "rs" prefix. NAs are not allowed.

Details

See SNPlocs.Hsapiens.dbSNP.20090506 for general information about this package.

The SNP data are split by chromosome (1-22, X, Y) i.e. the package contains one data set per chromosome, each of them being a serialized data frame with 1 row per SNP and the 2 following columns:

  • loc: The 1-based location of the SNP relative to the first base at the 5' end of the plus strand of the reference sequence.

  • alleles: A raw vector with no NAs which can be converted into a character vector containing the alleles for each SNP represented by an IUPAC nucleotide ambiguity code (see ?IUPAC_CODE_MAP in the Biostrings package for more information).

Note that those data sets are not intended to be used directly but the user should instead use the getSNPcount and getSNPlocs convenience wrappers for loading the SNP data. When used with as.GRanges=FALSE (the default), getSNPlocs returns a data frame with 1 row per SNP and the 3 following columns:

  • RefSNP_id: RefSNP ID (aka "rs id") with "rs" prefix removed. Character vector with no NAs and no duplicates.

  • alleles_as_ambig: A character vector with no NAs containing the alleles for each SNP represented by an IUPAC nucleotide ambiguity code.

  • loc: Same as for the 2-col serialized data frame described previously.

Value

getSNPcount returns a named integer vector containing the number of SNPs for each sequence in the reference genome.

By default (as.GRanges=FALSE), getSNPlocs returns the 3-col data frame described above containing the SNP data for the specified chromosome. Otherwise (as.GRanges=TRUE), it returns a GRanges object with extra columns "RefSNP_id" and "alleles_as_ambig". Note that all the elements (genomic ranges) in this GRanges object have their strand set to "+" and that all the sequence lengths are set to NA.

rsid2loc and rsid2alleles both return a named vector (integer vector for the former, character vector for the latter) where each (name, value) pair corresponds to a supplied rs id. For both functions the name in (name, value) is the chromosome of the rs id. The value in (name, value) is the position of the rs id on the chromosome for rsid2loc, and a single IUPAC code representing the associated alleles for rsid2alleles.

rsidsToGRanges returns a GRanges object similar to the one returned by getSNPlocs (when used with as.GRanges=TRUE) and where each element corresponds to a supplied rs id.

Author(s)

H. Pages

See Also

  • SNPlocs.Hsapiens.dbSNP.20090506

  • IUPAC_CODE_MAP

  • GRanges-class

  • BSgenome-class

  • injectSNPs

  • findOverlaps

Examples

## ---------------------------------------------------------------------
## A. BASIC USAGE
## ---------------------------------------------------------------------

getSNPcount()

## Get the locations and alleles of all SNPs on chromosome 22:
chr22snps <- getSNPlocs("chr22")
dim(chr22snps)
colnames(chr22snps)
head(chr22snps)

## Get the locations and alleles of all SNPs on chromosomes 22 and X
## as a GRanges object:
getSNPlocs(c("chr22", "chrX"), as.GRanges=TRUE)

## ---------------------------------------------------------------------
## B. EXTRACT SNP INFORMATION FOR A SET OF RS IDS...
## ---------------------------------------------------------------------

## ... and return it in a GRanges object:
myrsids <- c("rs2639606", "rs73396229", "rs55871206", "rs10932221",
             "rs56219727", "rs73709730", "rs55838886", "rs1516535")
rsidsToGRanges(myrsids)

## ---------------------------------------------------------------------
## C. INJECTION IN THE REFERENCE GENOME
## ---------------------------------------------------------------------

library(BSgenome.Hsapiens.UCSC.hg18)
BSgenome.Hsapiens.UCSC.hg18

## Inject the SNPs in hg18:
genome <- BSgenome.Hsapiens.UCSC.hg18
genome2 <- injectSNPs(genome, "SNPlocs.Hsapiens.dbSNP.20090506")
genome2
alphabetFrequency(unmasked(genome2$chr22))
alphabetFrequency(unmasked(genome$chr22))
## Get the number of nucleotides that were modified by this injection:
neditAt(unmasked(genome2$chr22), unmasked(genome$chr22))

## ---------------------------------------------------------------------
## D. SOME BASIC QUALITY CONTROL (WITH SURPRISING RESULTS!)
## ---------------------------------------------------------------------

## Note that dbSNP can assign distinct ids to SNPs located at the same
## position:
any(duplicated(chr22snps$RefSNP_id))  # rs ids are all distinct...
any(duplicated(chr22snps$loc))  # but some locations are repeated!
chr22snps <- chr22snps[order(chr22snps$loc), ]  # sort by location
which(duplicated(chr22snps$loc))[1]  # 2
chr22snps[1:3, ]  # rs67522692 and rs71218724 have same locations
                  # and alleles

## Also note that not all SNP alleles are consistent with the hg18 genome
## i.e. the alleles reported for a given SNP are not always compatible
## with the nucleotide found at the SNP location in hg18.
## For example, to get the number of inconsistent SNPs in chr1:
chr1snps <- getSNPlocs("chr1")
all_alleles <- paste(chr1snps$alleles_as_ambig, collapse="")
nchar(all_alleles)  # 920233 SNPs on chr1
neditAt(all_alleles, unmasked(genome$chr1)[chr1snps$loc], fixed=FALSE)
## ==> 1322 SNPs (0.14%) are inconsistent with hg18 chr1!

## Finally, let's check that no SNP falls in an assembly gap:
agaps <- masks(genome$chr1)$AGAPS
agaps  # the assembly gaps
## Looping over the assembly gaps:
sapply(1:length(agaps),
       function(i)
           any(chr1snps$loc >= start(agaps)[i] &
               chr1snps$loc <= end(agaps)[i]))
## Or, in a more efficient way:
stopifnot(length(findOverlaps(chr1snps$loc, agaps)) == 0)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(SNPlocs.Hsapiens.dbSNP.20090506)
Loading required package: IRanges
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: BSgenome
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
  Please note that the SNPlocs.Hsapiens.dbSNP.20090506 package contains
  outdated dbSNP data and will be deprecated in the near future. We
  highly recommend that you use a SNPlocs package based on a more recent
  dbSNP build for your analyses instead. See available.SNPs() for the
  list of SNPlocs packages currently available and make sure to pick up
  the most recent one.
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/SNPlocs.Hsapiens.dbSNP.20090506/getSNPlocs.Rd_%03d_medium.png", width=480, height=480)
> ### Name: getSNPlocs
> ### Title: Accessing the SNPs stored in SNPlocs.Hsapiens.dbSNP.20090506
> ### Aliases: getSNPcount .loadLoc .loadAlleles getSNPlocs rsid2loc
> ###   rsid2alleles rsidsToGRanges
> ### Keywords: data
> 
> ### ** Examples
> 
> ## ---------------------------------------------------------------------
> ## A. BASIC USAGE
> ## ---------------------------------------------------------------------
> 
> getSNPcount()
  chr1   chr2   chr3   chr4   chr5   chr6   chr7   chr8   chr9  chr10  chr11 
920233 933616 789121 798603 706109 760249 655873 612367 496064 583240 577300 
 chr12  chr13  chr14  chr15  chr16  chr17  chr18  chr19  chr20  chr21  chr22 
558759 427010 365742 331501 354239 316396 322866 268235 323041 160580 187392 
  chrX   chrY 
391414   6539 
> 
> ## Get the locations and alleles of all SNPs on chromosome 22:
> chr22snps <- getSNPlocs("chr22")
> dim(chr22snps)
[1] 187392      3
> colnames(chr22snps)
[1] "RefSNP_id"        "alleles_as_ambig" "loc"             
> head(chr22snps)
  RefSNP_id alleles_as_ambig      loc
1  67522692                M 14892137
2  71218724                M 14892137
3   9617282                R 14908060
4   8187529                Y 14917006
5   7339830                K 14917012
6  12152118                R 14917136
> 
> ## Get the locations and alleles of all SNPs on chromosomes 22 and X
> ## as a GRanges object:
> getSNPlocs(c("chr22", "chrX"), as.GRanges=TRUE)
GRanges object with 578806 ranges and 2 metadata columns:
           seqnames                 ranges strand |   RefSNP_id
              <Rle>              <IRanges>  <Rle> | <character>
       [1]    chr22   [14892137, 14892137]      + |    67522692
       [2]    chr22   [14892137, 14892137]      + |    71218724
       [3]    chr22   [14908060, 14908060]      + |     9617282
       [4]    chr22   [14917006, 14917006]      + |     8187529
       [5]    chr22   [14917012, 14917012]      + |     7339830
       ...      ...                    ...    ... .         ...
  [578802]     chrX [154582606, 154582606]      + |      557132
  [578803]     chrX [154583424, 154583424]      + |      781880
  [578804]     chrX [154583643, 154583643]      + |    12384168
  [578805]     chrX [154584361, 154584361]      + |     1089622
  [578806]     chrX [154636327, 154636327]      + |    55826418
           alleles_as_ambig
                <character>
       [1]                M
       [2]                M
       [3]                R
       [4]                Y
       [5]                K
       ...              ...
  [578802]                Y
  [578803]                R
  [578804]                R
  [578805]                R
  [578806]                Y
  -------
  seqinfo: 24 sequences from NCBI36 genome
> 
> ## ---------------------------------------------------------------------
> ## B. EXTRACT SNP INFORMATION FOR A SET OF RS IDS...
> ## ---------------------------------------------------------------------
> 
> ## ... and return it in a GRanges object:
> myrsids <- c("rs2639606", "rs73396229", "rs55871206", "rs10932221",
+              "rs56219727", "rs73709730", "rs55838886", "rs1516535")
> rsidsToGRanges(myrsids)
GRanges object with 8 ranges and 2 metadata columns:
      seqnames                 ranges strand |   RefSNP_id alleles_as_ambig
         <Rle>              <IRanges>  <Rle> | <character>      <character>
  [1]     chr9 [ 70217947,  70217947]      + |     2639606                R
  [2]    chr11 [  5820137,   5820137]      + |    73396229                Y
  [3]    chr13 [ 93957883,  93957883]      + |    55871206                S
  [4]     chr2 [208609439, 208609439]      + |    10932221                Y
  [5]     chr4 [177885573, 177885573]      + |    56219727                W
  [6]     chr7 [ 80833186,  80833186]      + |    73709730                R
  [7]     chr1 [ 78850672,  78850672]      + |    55838886                Y
  [8]     chr4 [183492683, 183492683]      + |     1516535                S
  -------
  seqinfo: 24 sequences from NCBI36 genome
> 
> ## ---------------------------------------------------------------------
> ## C. INJECTION IN THE REFERENCE GENOME
> ## ---------------------------------------------------------------------
> 
> library(BSgenome.Hsapiens.UCSC.hg18)
> BSgenome.Hsapiens.UCSC.hg18
Human genome:
# organism: Homo sapiens (Human)
# provider: UCSC
# provider version: hg18
# release date: Mar. 2006
# release name: NCBI Build 36.1
# 49 sequences:
#   chr1          chr2          chr3          chr4          chr5         
#   chr6          chr7          chr8          chr9          chr10        
#   chr11         chr12         chr13         chr14         chr15        
#   chr16         chr17         chr18         chr19         chr20        
#   chr21         chr22         chrX          chrY          chrM         
#   chr5_h2_hap1  chr6_cox_hap1 chr6_qbl_hap2 chr22_h2_hap1 chr1_random  
#   chr2_random   chr3_random   chr4_random   chr5_random   chr6_random  
#   chr7_random   chr8_random   chr9_random   chr10_random  chr11_random 
#   chr13_random  chr15_random  chr16_random  chr17_random  chr18_random 
#   chr19_random  chr21_random  chr22_random  chrX_random                
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
> 
> ## Inject the SNPs in hg18:
> genome <- BSgenome.Hsapiens.UCSC.hg18
> genome2 <- injectSNPs(genome, "SNPlocs.Hsapiens.dbSNP.20090506")
> genome2
Human genome:
# organism: Homo sapiens (Human)
# provider: UCSC
# provider version: hg18
# release date: Mar. 2006
# release name: NCBI Build 36.1
# with SNPs injected from package: SNPlocs.Hsapiens.dbSNP.20090506
# 49 sequences:
#   chr1          chr2          chr3          chr4          chr5         
#   chr6          chr7          chr8          chr9          chr10        
#   chr11         chr12         chr13         chr14         chr15        
#   chr16         chr17         chr18         chr19         chr20        
#   chr21         chr22         chrX          chrY          chrM         
#   chr5_h2_hap1  chr6_cox_hap1 chr6_qbl_hap2 chr22_h2_hap1 chr1_random  
#   chr2_random   chr3_random   chr4_random   chr5_random   chr6_random  
#   chr7_random   chr8_random   chr9_random   chr10_random  chr11_random 
#   chr13_random  chr15_random  chr16_random  chr17_random  chr18_random 
#   chr19_random  chr21_random  chr22_random  chrX_random                
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
# to access a given sequence)
> alphabetFrequency(unmasked(genome2$chr22))
       A        C        G        T        M        R        W        S 
 9048416  8309473  8299446  9006968    15428    63432    10972    18531 
       Y        K        V        H        D        B        N        - 
   62364    15497      233      147      131      224 14840170        0 
       +        . 
       0        0 
> alphabetFrequency(unmasked(genome$chr22))
       A        C        G        T        M        R        W        S 
 9085356  8366110  8356522  9043323        0        0        0        0 
       Y        K        V        H        D        B        N        - 
       0        0        0        0        0        0 14840121        0 
       +        . 
       0        0 
> ## Get the number of nucleotides that were modified by this injection:
> neditAt(unmasked(genome2$chr22), unmasked(genome$chr22))
[1] 187008
> 
> ## ---------------------------------------------------------------------
> ## D. SOME BASIC QUALITY CONTROL (WITH SURPRISING RESULTS!)
> ## ---------------------------------------------------------------------
> 
> ## Note that dbSNP can assign distinct ids to SNPs located at the same
> ## position:
> any(duplicated(chr22snps$RefSNP_id))  # rs ids are all distinct...
[1] FALSE
> any(duplicated(chr22snps$loc))  # but some locations are repeated!
[1] TRUE
> chr22snps <- chr22snps[order(chr22snps$loc), ]  # sort by location
> which(duplicated(chr22snps$loc))[1]  # 2
[1] 2
> chr22snps[1:3, ]  # rs67522692 and rs71218724 have same locations
  RefSNP_id alleles_as_ambig      loc
1  67522692                M 14892137
2  71218724                M 14892137
3   9617282                R 14908060
>                   # and alleles
> 
> ## Also note that not all SNP alleles are consistent with the hg18 genome
> ## i.e. the alleles reported for a given SNP are not always compatible
> ## with the nucleotide found at the SNP location in hg18.
> ## For example, to get the number of inconsistent SNPs in chr1:
> chr1snps <- getSNPlocs("chr1")
> all_alleles <- paste(chr1snps$alleles_as_ambig, collapse="")
> nchar(all_alleles)  # 920233 SNPs on chr1
[1] 920233
> neditAt(all_alleles, unmasked(genome$chr1)[chr1snps$loc], fixed=FALSE)
[1] 1322
> ## ==> 1322 SNPs (0.14%) are inconsistent with hg18 chr1!
> 
> ## Finally, let's check that no SNP falls in an assembly gap:
> agaps <- masks(genome$chr1)$AGAPS
> agaps  # the assembly gaps
NULL
> ## Looping over the assembly gaps:
> sapply(1:length(agaps),
+        function(i)
+            any(chr1snps$loc >= start(agaps)[i] &
+                chr1snps$loc <= end(agaps)[i]))
Error in hasTsp(x) : attempt to set an attribute on NULL
Calls: sapply ... lapply -> FUN -> start -> start -> start.default -> hasTsp
Execution halted