Last data update: 2014.03.03

R: Fetching chromosomes info for some of the UCSC genomes
fetchExtendedChromInfoFromUCSCR Documentation

Fetching chromosomes info for some of the UCSC genomes

Description

Fetch the chromosomes info for some UCSC genomes. Only supports hg38, hg19, hg18, panTro4, panTro3, panTro2, bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10, mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3, gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3, and sacCer2 at the moment.

Usage

fetchExtendedChromInfoFromUCSC(genome,
        goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath",
        quiet=FALSE)

Arguments

genome

A single string specifying the UCSC genome e.g. "sacCer3".

goldenPath_url

A single string specifying the URL to the UCSC goldenPath location. This URL is used internally to build the full URL to the 'chromInfo' MySQL dump containing chromosomes information for genome. See Details section below.

quiet

TRUE or FALSE (the default). If TRUE then some warnings are suppressed. See below for the details.

Details

Chromosomes information (e.g. names and lengths) for any UCSC genome is stored in the UCSC database in the 'chromInfo' table, and is normally available as a MySQL dump at:

  goldenPath_url/<genome>/database/chromInfo.txt.gz

fetchExtendedChromInfoFromUCSC downloads and imports that table into a data frame, keeps only the UCSC_seqlevel and UCSC_seqlength columns (after renaming them), and adds the circular logical column.

Then, if this UCSC genome is based on an NCBI assembly (e.g. hg38 is based on GRCh38), the NCBI seqlevels and GenBank accession numbers are extracted from the NCBI assembly report and the UCSC seqlevels matched to them (using some guided heuristic). Finally the NCBI seqlevels and GenBank accession numbers are added to the returned data frame.

Value

A data frame with 1 row per seqlevel in the UCSC genome, and at least 3 columns:

  • UCSC_seqlevel: Character vector with no NAs. This is the chrom field of the UCSC 'chromInfo' table for the genome. See Details section above.

  • UCSC_seqlength: Integer vector with no NAs. This is the size field of the UCSC 'chromInfo' table for the genome. See Details section above.

  • circular: Logical vector with no NAs. This knowledge is stored in the GenomeInfoDb package itself for the supported genomes.

If the UCSC genome is *not* based on an NCBI assembly (e.g. gasAcu1, ce10, sacCer2), there are no additional columns and a warning is emitted (unless quiet is set to TRUE). In this case, the rows are sorted by UCSC seqlevel rank as determined by rankSeqlevels().

If the UCSC genome is based on an NCBI assembly (e.g. sacCer3), the returned data frame has 3 additional columns:

  • NCBI_seqlevel: Character vector. This information is obtained from the NCBI assembly report for the genome. Will contain NAs for UCSC seqlevels with no corresponding NCBI seqlevels (e.g. for chrM in hg18 or chrUextra in dm3), in which case fetchExtendedChromInfoFromUCSC emits a warning (unless quiet is set to TRUE).

  • SequenceRole: Factor with levels assembled-molecule, alt-scaffold, unlocalized-scaffold, unplaced-scaffold, and pseudo-scaffold. For UCSC seqlevels with corresponding NCBI seqlevels this information is obtained from the NCBI assembly report. Otherwise it is obtained from a base of knowledge included in the GenomeInfoDb package. Can contain NAs but no warning is emitted in that case.

  • GenBankAccn: Character vector. This information is obtained from the NCBI assembly report for the genome. Can contain NAs but no warning is emitted in that case.

In this case, the rows are sorted first by level in the SequenceRole column, that is, assembled-molecules first, then alt-scaffolds, etc, and NAs last. Then within each group they are sorted by UCSC seqlevel rank as determined by rankSeqlevels().

Note

fetchExtendedChromInfoFromUCSC queries the UCSC Genome Browser as well as the FTP site at NCBI and thus requires internet access.

Only supports the hg38, hg19, hg18, panTro4, panTro3, panTro2, bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10, mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3, gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3, and sacCer2 genomes at the moment. More will come...

Author(s)

H. Pages

See Also

  • The seqlevels getter and setter.

  • The rankSeqlevels function for ranking sequence names.

  • The seqlevelsStyle getter and setter.

  • The getBSgenome utility in the BSgenome package for searching the installed BSgenome data packages.

Examples

## All the examples below require internet access!

## ---------------------------------------------------------------------
## A. BASIC EXAMPLE
## ---------------------------------------------------------------------

## The sacCer3 UCSC genome is based on an NCBI assembly (RefSeq Assembly
## ID is GCF_000146045.2):
sacCer3_chrominfo <- fetchExtendedChromInfoFromUCSC("sacCer3")
sacCer3_chrominfo

## But the sacCer2 UCSC genome is not:
sacCer2_chrominfo <- fetchExtendedChromInfoFromUCSC("sacCer2")
sacCer2_chrominfo

## ---------------------------------------------------------------------
## B. USING fetchExtendedChromInfoFromUCSC() TO PUT UCSC SEQLEVELS ON
##    THE GRCh38 GENOME
## ---------------------------------------------------------------------

## Load the BSgenome.Hsapiens.NCBI.GRCh38 package:
library(BSgenome)
genome <- getBSgenome("GRCh38")  # this loads the
                                 # BSgenome.Hsapiens.NCBI.GRCh38 package

## A quick look at the GRCh38 seqlevels:
length(seqlevels(genome))
head(seqlevels(genome), n=30)

## Fetch the extended chromosomes info for the hg38 genome:
hg38_chrominfo <- fetchExtendedChromInfoFromUCSC("hg38")
dim(hg38_chrominfo)
head(hg38_chrominfo, n=30)

## 2 sanity checks:
##   1. Check the NCBI seqlevels:
stopifnot(setequal(hg38_chrominfo$NCBI_seqlevel, seqlevels(genome)))
##   2. Check that the sequence lengths in 'hg38_chrominfo' (which are
##      coming from the same 'chromInfo' table as the UCSC seqlevels)
##      are the same as in 'genome':
stopifnot(
  identical(hg38_chrominfo$UCSC_seqlength,
            unname(seqlengths(genome)[hg38_chrominfo$NCBI_seqlevel]))
)

## Extract the hg38 seqlevels and put the GRCh38 seqlevels on it as
## the names:
hg38_seqlevels <- setNames(hg38_chrominfo$UCSC_seqlevel,
                           hg38_chrominfo$NCBI_seqlevel)

## Set the hg38 seqlevels on 'genome':
seqlevels(genome) <- hg38_seqlevels[seqlevels(genome)]
head(seqlevels(genome), n=30)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(GenomeInfoDb)
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/GenomeInfoDb/fetchExtendedChromInfoFromUCSC.Rd_%03d_medium.png", width=480, height=480)
> ### Name: fetchExtendedChromInfoFromUCSC
> ### Title: Fetching chromosomes info for some of the UCSC genomes
> ### Aliases: fetchExtendedChromInfoFromUCSC
> ### Keywords: manip
> 
> ### ** Examples
> 
> ## All the examples below require internet access!
> 
> ## ---------------------------------------------------------------------
> ## A. BASIC EXAMPLE
> ## ---------------------------------------------------------------------
> 
> ## The sacCer3 UCSC genome is based on an NCBI assembly (RefSeq Assembly
> ## ID is GCF_000146045.2):
> sacCer3_chrominfo <- fetchExtendedChromInfoFromUCSC("sacCer3")
> sacCer3_chrominfo
   UCSC_seqlevel UCSC_seqlength circular NCBI_seqlevel       SequenceRole
1           chrI         230218    FALSE             I assembled-molecule
2          chrII         813184    FALSE            II assembled-molecule
3         chrIII         316620    FALSE           III assembled-molecule
4          chrIV        1531933    FALSE            IV assembled-molecule
5           chrV         576874    FALSE             V assembled-molecule
6          chrVI         270161    FALSE            VI assembled-molecule
7         chrVII        1090940    FALSE           VII assembled-molecule
8        chrVIII         562643    FALSE          VIII assembled-molecule
9          chrIX         439888    FALSE            IX assembled-molecule
10          chrX         745751    FALSE             X assembled-molecule
11         chrXI         666816    FALSE            XI assembled-molecule
12        chrXII        1078177    FALSE           XII assembled-molecule
13       chrXIII         924431    FALSE          XIII assembled-molecule
14        chrXIV         784333    FALSE           XIV assembled-molecule
15         chrXV        1091291    FALSE            XV assembled-molecule
16        chrXVI         948066    FALSE           XVI assembled-molecule
17          chrM          85779     TRUE            MT assembled-molecule
   GenBankAccn
1   BK006935.2
2   BK006936.2
3   BK006937.2
4   BK006938.2
5   BK006939.2
6   BK006940.2
7   BK006941.2
8   BK006934.2
9   BK006942.2
10  BK006943.2
11  BK006944.2
12  BK006945.2
13  BK006946.2
14  BK006947.3
15  BK006948.2
16  BK006949.2
17        <NA>
> 
> ## But the sacCer2 UCSC genome is not:
> sacCer2_chrominfo <- fetchExtendedChromInfoFromUCSC("sacCer2")
Warning message:
In FUN(genome = names(SUPPORTED_UCSC_GENOMES)[idx], circ_seqs = supported_genome$circ_seqs,  :
  sacCer2 UCSC genome is not based on an NCBI assembly
> sacCer2_chrominfo
   UCSC_seqlevel UCSC_seqlength circular
1           chrI         230208    FALSE
2          chrII         813178    FALSE
3         chrIII         316617    FALSE
4          chrIV        1531919    FALSE
5           chrV         576869    FALSE
6          chrVI         270148    FALSE
7         chrVII        1090947    FALSE
8        chrVIII         562643    FALSE
9          chrIX         439885    FALSE
10          chrX         745742    FALSE
11         chrXI         666454    FALSE
12        chrXII        1078175    FALSE
13       chrXIII         924429    FALSE
14        chrXIV         784333    FALSE
15         chrXV        1091289    FALSE
16        chrXVI         948062    FALSE
17          chrM          85779     TRUE
18       2micron           6318     TRUE
> 
> ## ---------------------------------------------------------------------
> ## B. USING fetchExtendedChromInfoFromUCSC() TO PUT UCSC SEQLEVELS ON
> ##    THE GRCh38 GENOME
> ## ---------------------------------------------------------------------
> 
> ## Load the BSgenome.Hsapiens.NCBI.GRCh38 package:
> library(BSgenome)
Loading required package: GenomicRanges
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
> genome <- getBSgenome("GRCh38")  # this loads the
>                                  # BSgenome.Hsapiens.NCBI.GRCh38 package
> 
> ## A quick look at the GRCh38 seqlevels:
> length(seqlevels(genome))
[1] 455
> head(seqlevels(genome), n=30)
 [1] "1"                       "2"                      
 [3] "3"                       "4"                      
 [5] "5"                       "6"                      
 [7] "7"                       "8"                      
 [9] "9"                       "10"                     
[11] "11"                      "12"                     
[13] "13"                      "14"                     
[15] "15"                      "16"                     
[17] "17"                      "18"                     
[19] "19"                      "20"                     
[21] "21"                      "22"                     
[23] "X"                       "Y"                      
[25] "MT"                      "HSCHR1_CTG1_UNLOCALIZED"
[27] "HSCHR1_CTG2_UNLOCALIZED" "HSCHR1_CTG3_UNLOCALIZED"
[29] "HSCHR1_CTG4_UNLOCALIZED" "HSCHR1_CTG5_UNLOCALIZED"
> 
> ## Fetch the extended chromosomes info for the hg38 genome:
> hg38_chrominfo <- fetchExtendedChromInfoFromUCSC("hg38")
> dim(hg38_chrominfo)
[1] 455   6
> head(hg38_chrominfo, n=30)
         UCSC_seqlevel UCSC_seqlength circular    NCBI_seqlevel
1                 chr1      248956422    FALSE                1
2                 chr2      242193529    FALSE                2
3                 chr3      198295559    FALSE                3
4                 chr4      190214555    FALSE                4
5                 chr5      181538259    FALSE                5
6                 chr6      170805979    FALSE                6
7                 chr7      159345973    FALSE                7
8                 chr8      145138636    FALSE                8
9                 chr9      138394717    FALSE                9
10               chr10      133797422    FALSE               10
11               chr11      135086622    FALSE               11
12               chr12      133275309    FALSE               12
13               chr13      114364328    FALSE               13
14               chr14      107043718    FALSE               14
15               chr15      101991189    FALSE               15
16               chr16       90338345    FALSE               16
17               chr17       83257441    FALSE               17
18               chr18       80373285    FALSE               18
19               chr19       58617616    FALSE               19
20               chr20       64444167    FALSE               20
21               chr21       46709983    FALSE               21
22               chr22       50818468    FALSE               22
23                chrX      156040895    FALSE                X
24                chrY       57227415    FALSE                Y
25                chrM          16569     TRUE               MT
26 chr1_GL383518v1_alt         182439    FALSE   HSCHR1_1_CTG31
27 chr1_GL383519v1_alt         110268    FALSE   HSCHR1_2_CTG31
28 chr1_GL383520v2_alt         366580    FALSE   HSCHR1_3_CTG31
29 chr1_KI270759v1_alt         425601    FALSE HSCHR1_1_CTG32_1
30 chr1_KI270760v1_alt         109528    FALSE   HSCHR1_1_CTG11
         SequenceRole GenBankAccn
1  assembled-molecule  CM000663.2
2  assembled-molecule  CM000664.2
3  assembled-molecule  CM000665.2
4  assembled-molecule  CM000666.2
5  assembled-molecule  CM000667.2
6  assembled-molecule  CM000668.2
7  assembled-molecule  CM000669.2
8  assembled-molecule  CM000670.2
9  assembled-molecule  CM000671.2
10 assembled-molecule  CM000672.2
11 assembled-molecule  CM000673.2
12 assembled-molecule  CM000674.2
13 assembled-molecule  CM000675.2
14 assembled-molecule  CM000676.2
15 assembled-molecule  CM000677.2
16 assembled-molecule  CM000678.2
17 assembled-molecule  CM000679.2
18 assembled-molecule  CM000680.2
19 assembled-molecule  CM000681.2
20 assembled-molecule  CM000682.2
21 assembled-molecule  CM000683.2
22 assembled-molecule  CM000684.2
23 assembled-molecule  CM000685.2
24 assembled-molecule  CM000686.2
25 assembled-molecule    J01415.2
26       alt-scaffold  GL383518.1
27       alt-scaffold  GL383519.1
28       alt-scaffold  GL383520.2
29       alt-scaffold  KI270759.1
30       alt-scaffold  KI270760.1
> 
> ## 2 sanity checks:
> ##   1. Check the NCBI seqlevels:
> stopifnot(setequal(hg38_chrominfo$NCBI_seqlevel, seqlevels(genome)))
> ##   2. Check that the sequence lengths in 'hg38_chrominfo' (which are
> ##      coming from the same 'chromInfo' table as the UCSC seqlevels)
> ##      are the same as in 'genome':
> stopifnot(
+   identical(hg38_chrominfo$UCSC_seqlength,
+             unname(seqlengths(genome)[hg38_chrominfo$NCBI_seqlevel]))
+ )
> 
> ## Extract the hg38 seqlevels and put the GRCh38 seqlevels on it as
> ## the names:
> hg38_seqlevels <- setNames(hg38_chrominfo$UCSC_seqlevel,
+                            hg38_chrominfo$NCBI_seqlevel)
> 
> ## Set the hg38 seqlevels on 'genome':
> seqlevels(genome) <- hg38_seqlevels[seqlevels(genome)]
> head(seqlevels(genome), n=30)
 [1] "chr1"                   "chr2"                   "chr3"                  
 [4] "chr4"                   "chr5"                   "chr6"                  
 [7] "chr7"                   "chr8"                   "chr9"                  
[10] "chr10"                  "chr11"                  "chr12"                 
[13] "chr13"                  "chr14"                  "chr15"                 
[16] "chr16"                  "chr17"                  "chr18"                 
[19] "chr19"                  "chr20"                  "chr21"                 
[22] "chr22"                  "chrX"                   "chrY"                  
[25] "chrM"                   "chr1_KI270706v1_random" "chr1_KI270707v1_random"
[28] "chr1_KI270708v1_random" "chr1_KI270709v1_random" "chr1_KI270710v1_random"
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>