A single string specifying the UCSC genome e.g. "sacCer3".
goldenPath_url
A single string specifying the URL to the UCSC goldenPath location.
This URL is used internally to build the full URL to the 'chromInfo'
MySQL dump containing chromosomes information for genome.
See Details section below.
quiet
TRUE or FALSE (the default). If TRUE then some
warnings are suppressed. See below for the details.
Details
Chromosomes information (e.g. names and lengths) for any UCSC genome
is stored in the UCSC database in the 'chromInfo' table, and is normally
available as a MySQL dump at:
goldenPath_url/<genome>/database/chromInfo.txt.gz
fetchExtendedChromInfoFromUCSC downloads and imports that table
into a data frame, keeps only the UCSC_seqlevel and
UCSC_seqlength columns (after renaming them), and adds the
circular logical column.
Then, if this UCSC genome is based on an NCBI assembly (e.g. hg38 is based
on GRCh38), the NCBI seqlevels and GenBank accession numbers are extracted
from the NCBI assembly report and the UCSC seqlevels matched to them (using
some guided heuristic). Finally the NCBI seqlevels and GenBank accession
numbers are added to the returned data frame.
Value
A data frame with 1 row per seqlevel in the UCSC genome, and at least 3
columns:
UCSC_seqlevel: Character vector with no NAs. This is the
chrom field of the UCSC 'chromInfo' table for the
genome. See Details section above.
UCSC_seqlength: Integer vector with no NAs. This is the
size field of the UCSC 'chromInfo' table for the
genome. See Details section above.
circular: Logical vector with no NAs. This knowledge is
stored in the GenomeInfoDb package itself for the supported
genomes.
If the UCSC genome is *not* based on an NCBI assembly (e.g. gasAcu1, ce10,
sacCer2), there are no additional columns and a warning is emitted (unless
quiet is set to TRUE). In this case, the rows are sorted
by UCSC seqlevel rank as determined by rankSeqlevels().
If the UCSC genome is based on an NCBI assembly (e.g. sacCer3),
the returned data frame has 3 additional columns:
NCBI_seqlevel: Character vector. This information is
obtained from the NCBI assembly report for the genome. Will contain
NAs for UCSC seqlevels with no corresponding NCBI seqlevels (e.g.
for chrM in hg18 or chrUextra in dm3), in which case
fetchExtendedChromInfoFromUCSC emits a warning (unless
quiet is set to TRUE).
SequenceRole: Factor with levels assembled-molecule,
alt-scaffold, unlocalized-scaffold,
unplaced-scaffold, and pseudo-scaffold. For
UCSC seqlevels with corresponding NCBI seqlevels this information
is obtained from the NCBI assembly report. Otherwise it is
obtained from a base of knowledge included in the GenomeInfoDb
package. Can contain NAs but no warning is emitted in that case.
GenBankAccn: Character vector. This information is obtained
from the NCBI assembly report for the genome. Can contain NAs but no
warning is emitted in that case.
In this case, the rows are sorted first by level in the SequenceRole
column, that is, assembled-molecules first, then alt-scaffolds,
etc, and NAs last. Then within each group they are sorted by UCSC seqlevel
rank as determined by rankSeqlevels().
Note
fetchExtendedChromInfoFromUCSC queries the UCSC Genome Browser as
well as the FTP site at NCBI and thus requires internet access.
Only supports the hg38, hg19, hg18, panTro4, panTro3, panTro2, bosTau8,
bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10, mm9, mm8,
susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3, gasAcu1, danRer7,
apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3, and sacCer2 genomes at
the moment. More will come...
Author(s)
H. Pages
See Also
The seqlevels getter and setter.
The rankSeqlevels function for ranking sequence names.
The seqlevelsStyle getter and setter.
The getBSgenome utility in the
BSgenome package for searching the installed BSgenome
data packages.
Examples
## All the examples below require internet access!
## ---------------------------------------------------------------------
## A. BASIC EXAMPLE
## ---------------------------------------------------------------------
## The sacCer3 UCSC genome is based on an NCBI assembly (RefSeq Assembly
## ID is GCF_000146045.2):
sacCer3_chrominfo <- fetchExtendedChromInfoFromUCSC("sacCer3")
sacCer3_chrominfo
## But the sacCer2 UCSC genome is not:
sacCer2_chrominfo <- fetchExtendedChromInfoFromUCSC("sacCer2")
sacCer2_chrominfo
## ---------------------------------------------------------------------
## B. USING fetchExtendedChromInfoFromUCSC() TO PUT UCSC SEQLEVELS ON
## THE GRCh38 GENOME
## ---------------------------------------------------------------------
## Load the BSgenome.Hsapiens.NCBI.GRCh38 package:
library(BSgenome)
genome <- getBSgenome("GRCh38") # this loads the
# BSgenome.Hsapiens.NCBI.GRCh38 package
## A quick look at the GRCh38 seqlevels:
length(seqlevels(genome))
head(seqlevels(genome), n=30)
## Fetch the extended chromosomes info for the hg38 genome:
hg38_chrominfo <- fetchExtendedChromInfoFromUCSC("hg38")
dim(hg38_chrominfo)
head(hg38_chrominfo, n=30)
## 2 sanity checks:
## 1. Check the NCBI seqlevels:
stopifnot(setequal(hg38_chrominfo$NCBI_seqlevel, seqlevels(genome)))
## 2. Check that the sequence lengths in 'hg38_chrominfo' (which are
## coming from the same 'chromInfo' table as the UCSC seqlevels)
## are the same as in 'genome':
stopifnot(
identical(hg38_chrominfo$UCSC_seqlength,
unname(seqlengths(genome)[hg38_chrominfo$NCBI_seqlevel]))
)
## Extract the hg38 seqlevels and put the GRCh38 seqlevels on it as
## the names:
hg38_seqlevels <- setNames(hg38_chrominfo$UCSC_seqlevel,
hg38_chrominfo$NCBI_seqlevel)
## Set the hg38 seqlevels on 'genome':
seqlevels(genome) <- hg38_seqlevels[seqlevels(genome)]
head(seqlevels(genome), n=30)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(GenomeInfoDb)
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/GenomeInfoDb/fetchExtendedChromInfoFromUCSC.Rd_%03d_medium.png", width=480, height=480)
> ### Name: fetchExtendedChromInfoFromUCSC
> ### Title: Fetching chromosomes info for some of the UCSC genomes
> ### Aliases: fetchExtendedChromInfoFromUCSC
> ### Keywords: manip
>
> ### ** Examples
>
> ## All the examples below require internet access!
>
> ## ---------------------------------------------------------------------
> ## A. BASIC EXAMPLE
> ## ---------------------------------------------------------------------
>
> ## The sacCer3 UCSC genome is based on an NCBI assembly (RefSeq Assembly
> ## ID is GCF_000146045.2):
> sacCer3_chrominfo <- fetchExtendedChromInfoFromUCSC("sacCer3")
> sacCer3_chrominfo
UCSC_seqlevel UCSC_seqlength circular NCBI_seqlevel SequenceRole
1 chrI 230218 FALSE I assembled-molecule
2 chrII 813184 FALSE II assembled-molecule
3 chrIII 316620 FALSE III assembled-molecule
4 chrIV 1531933 FALSE IV assembled-molecule
5 chrV 576874 FALSE V assembled-molecule
6 chrVI 270161 FALSE VI assembled-molecule
7 chrVII 1090940 FALSE VII assembled-molecule
8 chrVIII 562643 FALSE VIII assembled-molecule
9 chrIX 439888 FALSE IX assembled-molecule
10 chrX 745751 FALSE X assembled-molecule
11 chrXI 666816 FALSE XI assembled-molecule
12 chrXII 1078177 FALSE XII assembled-molecule
13 chrXIII 924431 FALSE XIII assembled-molecule
14 chrXIV 784333 FALSE XIV assembled-molecule
15 chrXV 1091291 FALSE XV assembled-molecule
16 chrXVI 948066 FALSE XVI assembled-molecule
17 chrM 85779 TRUE MT assembled-molecule
GenBankAccn
1 BK006935.2
2 BK006936.2
3 BK006937.2
4 BK006938.2
5 BK006939.2
6 BK006940.2
7 BK006941.2
8 BK006934.2
9 BK006942.2
10 BK006943.2
11 BK006944.2
12 BK006945.2
13 BK006946.2
14 BK006947.3
15 BK006948.2
16 BK006949.2
17 <NA>
>
> ## But the sacCer2 UCSC genome is not:
> sacCer2_chrominfo <- fetchExtendedChromInfoFromUCSC("sacCer2")
Warning message:
In FUN(genome = names(SUPPORTED_UCSC_GENOMES)[idx], circ_seqs = supported_genome$circ_seqs, :
sacCer2 UCSC genome is not based on an NCBI assembly
> sacCer2_chrominfo
UCSC_seqlevel UCSC_seqlength circular
1 chrI 230208 FALSE
2 chrII 813178 FALSE
3 chrIII 316617 FALSE
4 chrIV 1531919 FALSE
5 chrV 576869 FALSE
6 chrVI 270148 FALSE
7 chrVII 1090947 FALSE
8 chrVIII 562643 FALSE
9 chrIX 439885 FALSE
10 chrX 745742 FALSE
11 chrXI 666454 FALSE
12 chrXII 1078175 FALSE
13 chrXIII 924429 FALSE
14 chrXIV 784333 FALSE
15 chrXV 1091289 FALSE
16 chrXVI 948062 FALSE
17 chrM 85779 TRUE
18 2micron 6318 TRUE
>
> ## ---------------------------------------------------------------------
> ## B. USING fetchExtendedChromInfoFromUCSC() TO PUT UCSC SEQLEVELS ON
> ## THE GRCh38 GENOME
> ## ---------------------------------------------------------------------
>
> ## Load the BSgenome.Hsapiens.NCBI.GRCh38 package:
> library(BSgenome)
Loading required package: GenomicRanges
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer
> genome <- getBSgenome("GRCh38") # this loads the
> # BSgenome.Hsapiens.NCBI.GRCh38 package
>
> ## A quick look at the GRCh38 seqlevels:
> length(seqlevels(genome))
[1] 455
> head(seqlevels(genome), n=30)
[1] "1" "2"
[3] "3" "4"
[5] "5" "6"
[7] "7" "8"
[9] "9" "10"
[11] "11" "12"
[13] "13" "14"
[15] "15" "16"
[17] "17" "18"
[19] "19" "20"
[21] "21" "22"
[23] "X" "Y"
[25] "MT" "HSCHR1_CTG1_UNLOCALIZED"
[27] "HSCHR1_CTG2_UNLOCALIZED" "HSCHR1_CTG3_UNLOCALIZED"
[29] "HSCHR1_CTG4_UNLOCALIZED" "HSCHR1_CTG5_UNLOCALIZED"
>
> ## Fetch the extended chromosomes info for the hg38 genome:
> hg38_chrominfo <- fetchExtendedChromInfoFromUCSC("hg38")
> dim(hg38_chrominfo)
[1] 455 6
> head(hg38_chrominfo, n=30)
UCSC_seqlevel UCSC_seqlength circular NCBI_seqlevel
1 chr1 248956422 FALSE 1
2 chr2 242193529 FALSE 2
3 chr3 198295559 FALSE 3
4 chr4 190214555 FALSE 4
5 chr5 181538259 FALSE 5
6 chr6 170805979 FALSE 6
7 chr7 159345973 FALSE 7
8 chr8 145138636 FALSE 8
9 chr9 138394717 FALSE 9
10 chr10 133797422 FALSE 10
11 chr11 135086622 FALSE 11
12 chr12 133275309 FALSE 12
13 chr13 114364328 FALSE 13
14 chr14 107043718 FALSE 14
15 chr15 101991189 FALSE 15
16 chr16 90338345 FALSE 16
17 chr17 83257441 FALSE 17
18 chr18 80373285 FALSE 18
19 chr19 58617616 FALSE 19
20 chr20 64444167 FALSE 20
21 chr21 46709983 FALSE 21
22 chr22 50818468 FALSE 22
23 chrX 156040895 FALSE X
24 chrY 57227415 FALSE Y
25 chrM 16569 TRUE MT
26 chr1_GL383518v1_alt 182439 FALSE HSCHR1_1_CTG31
27 chr1_GL383519v1_alt 110268 FALSE HSCHR1_2_CTG31
28 chr1_GL383520v2_alt 366580 FALSE HSCHR1_3_CTG31
29 chr1_KI270759v1_alt 425601 FALSE HSCHR1_1_CTG32_1
30 chr1_KI270760v1_alt 109528 FALSE HSCHR1_1_CTG11
SequenceRole GenBankAccn
1 assembled-molecule CM000663.2
2 assembled-molecule CM000664.2
3 assembled-molecule CM000665.2
4 assembled-molecule CM000666.2
5 assembled-molecule CM000667.2
6 assembled-molecule CM000668.2
7 assembled-molecule CM000669.2
8 assembled-molecule CM000670.2
9 assembled-molecule CM000671.2
10 assembled-molecule CM000672.2
11 assembled-molecule CM000673.2
12 assembled-molecule CM000674.2
13 assembled-molecule CM000675.2
14 assembled-molecule CM000676.2
15 assembled-molecule CM000677.2
16 assembled-molecule CM000678.2
17 assembled-molecule CM000679.2
18 assembled-molecule CM000680.2
19 assembled-molecule CM000681.2
20 assembled-molecule CM000682.2
21 assembled-molecule CM000683.2
22 assembled-molecule CM000684.2
23 assembled-molecule CM000685.2
24 assembled-molecule CM000686.2
25 assembled-molecule J01415.2
26 alt-scaffold GL383518.1
27 alt-scaffold GL383519.1
28 alt-scaffold GL383520.2
29 alt-scaffold KI270759.1
30 alt-scaffold KI270760.1
>
> ## 2 sanity checks:
> ## 1. Check the NCBI seqlevels:
> stopifnot(setequal(hg38_chrominfo$NCBI_seqlevel, seqlevels(genome)))
> ## 2. Check that the sequence lengths in 'hg38_chrominfo' (which are
> ## coming from the same 'chromInfo' table as the UCSC seqlevels)
> ## are the same as in 'genome':
> stopifnot(
+ identical(hg38_chrominfo$UCSC_seqlength,
+ unname(seqlengths(genome)[hg38_chrominfo$NCBI_seqlevel]))
+ )
>
> ## Extract the hg38 seqlevels and put the GRCh38 seqlevels on it as
> ## the names:
> hg38_seqlevels <- setNames(hg38_chrominfo$UCSC_seqlevel,
+ hg38_chrominfo$NCBI_seqlevel)
>
> ## Set the hg38 seqlevels on 'genome':
> seqlevels(genome) <- hg38_seqlevels[seqlevels(genome)]
> head(seqlevels(genome), n=30)
[1] "chr1" "chr2" "chr3"
[4] "chr4" "chr5" "chr6"
[7] "chr7" "chr8" "chr9"
[10] "chr10" "chr11" "chr12"
[13] "chr13" "chr14" "chr15"
[16] "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21"
[22] "chr22" "chrX" "chrY"
[25] "chrM" "chr1_KI270706v1_random" "chr1_KI270707v1_random"
[28] "chr1_KI270708v1_random" "chr1_KI270709v1_random" "chr1_KI270710v1_random"
>
>
>
>
>
> dev.off()
null device
1
>