R: Aligned reads from RNAseq experiment: Transcription profiling...
RNAseqData.HNRNPC.bam.chr14-package
R Documentation
Aligned reads from RNAseq experiment: Transcription profiling by high
throughput sequencing of HNRNPC knockdown and control HeLa cells
Description
The package contains 8 BAM files, 1 per sequencing run. Each
BAM file was obtained by (1) aligning the reads (paired-end) to the
full hg19 genome with TopHat2, and then (2) subsetting to keep only
alignments on chr14. See accession number E-MTAB-1147 in the
ArrayExpress database for details about the experiment, including links
to the published study (by Zarnack et al., 2012) and to the
FASTQ files.
RNAseqData.HNRNPC.bam.chr14_RUNNAMES is a predefined character vector
containing the names of the 8 runs of the RNAseq experiment. The name of
each run is its accession number on the European Nucleotide Archive (ENA).
RNAseqData.HNRNPC.bam.chr14_BAMFILES is a predefined named character
vector of length 8 containing the paths to the 8 BAM files obtained by
aligning the reads from each sequencing run. The reads were aligned to the
full hg19 genome with TopHat2 and then the resulting BAM file was subset
to keep only alignments located on chr14.
The scripts/ folder in the package contains the scripts that
were used to generate the data contained in the package (note that
all the data contained in the package is located in its extdata/ folder).
The README.TXT file in the scripts/ folder describes how to use those
scripts to re-generate the data.
References
Direct Competition between hnRNP C and U2AF65 Protects the
Transcriptome from the Exonization of Alu Elements.
Kathi Zarnack, Julian Konig, Mojca Tajnik, Inigo Martincorena,
Sebastian Eustermann, Isabelle Stevant, Alejandro Reyes, Simon Anders,
Nicholas M. Luscombe, Jernej Ule.
Cell 2012, Europe PMC 23374342
TopHat2: accurate alignment of transcriptomes in the presence of
insertions, deletions and gene fusions.
Daehwan Kim, Geo Pertea, Cole Trapnell, Harold Pimentel, Ryan Kelley
and Steven L Salzberg.
Genome biology 2013, 14:R36
The readGAlignmentPairs function
in the GenomicAlignments package for more information about
how to read paired-end reads from a BAM file into Bioconductor.
The ScanBamParam function in the
Rsamtools package for how to get control over what to
import from a BAM file.
Examples
RNAseqData.HNRNPC.bam.chr14_RUNNAMES
RNAseqData.HNRNPC.bam.chr14_BAMFILES
if (require(GenomicAlignments)) {
bamfiles <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
param <- ScanBamParam(tag="NM")
galp <- readGAlignmentPairs(bamfiles[1L], use.names=TRUE, param=param)
galp
first(galp) # 1st segment in each pair
last(galp) # 2nd segment in each pair
## The alignments contain insertions, deletions, and junctions (I, D,
## and N CIGAR operations, respectively):
colSums(cigarOpTable(cigar(first(galp))))
colSums(cigarOpTable(cigar(last(galp))))
## njunc() returns the nb of junctions for each alignment:
table(njunc(first(galp)))
table(njunc(last(galp)))
table(njunc(first(galp)) + njunc(last(galp))) # up to 5 junctions
# per pair!
## The NM tag gives the edit distance of each alignment to the
## reference (ignoring the junction gaps):
table(mcols(first(galp))$NM)
table(mcols(last(galp))$NM)
table(mcols(first(galp))$NM + mcols(last(galp))$NM)
}
## See README.TXT in scripts/ folder:
README_file <- system.file("scripts", "README.TXT",
package="RNAseqData.HNRNPC.bam.chr14")
cat(readLines(README_file), sep="\n")
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(RNAseqData.HNRNPC.bam.chr14)
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/RNAseqData.HNRNPC.bam.chr14/RNAseqData.HNRNPC.bam.chr14-package.Rd_%03d_medium.png", width=480, height=480)
> ### Name: RNAseqData.HNRNPC.bam.chr14-package
> ### Title: Aligned reads from RNAseq experiment: Transcription profiling by
> ### high throughput sequencing of HNRNPC knockdown and control HeLa cells
> ### Aliases: RNAseqData.HNRNPC.bam.chr14-package
> ### RNAseqData.HNRNPC.bam.chr14 RNAseqData.HNRNPC.bam.chr14_RUNNAMES
> ### RNAseqData.HNRNPC.bam.chr14_BAMFILES
> ### Keywords: package
>
> ### ** Examples
>
> RNAseqData.HNRNPC.bam.chr14_RUNNAMES
[1] "ERR127306" "ERR127307" "ERR127308" "ERR127309" "ERR127302" "ERR127303"
[7] "ERR127304" "ERR127305"
> RNAseqData.HNRNPC.bam.chr14_BAMFILES
ERR127306
"/home/ddbj/local/lib64/R/library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127306_chr14.bam"
ERR127307
"/home/ddbj/local/lib64/R/library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127307_chr14.bam"
ERR127308
"/home/ddbj/local/lib64/R/library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127308_chr14.bam"
ERR127309
"/home/ddbj/local/lib64/R/library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127309_chr14.bam"
ERR127302
"/home/ddbj/local/lib64/R/library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127302_chr14.bam"
ERR127303
"/home/ddbj/local/lib64/R/library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127303_chr14.bam"
ERR127304
"/home/ddbj/local/lib64/R/library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127304_chr14.bam"
ERR127305
"/home/ddbj/local/lib64/R/library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127305_chr14.bam"
>
> if (require(GenomicAlignments)) {
+ bamfiles <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
+ param <- ScanBamParam(tag="NM")
+ galp <- readGAlignmentPairs(bamfiles[1L], use.names=TRUE, param=param)
+ galp
+
+ first(galp) # 1st segment in each pair
+ last(galp) # 2nd segment in each pair
+
+ ## The alignments contain insertions, deletions, and junctions (I, D,
+ ## and N CIGAR operations, respectively):
+ colSums(cigarOpTable(cigar(first(galp))))
+ colSums(cigarOpTable(cigar(last(galp))))
+
+ ## njunc() returns the nb of junctions for each alignment:
+ table(njunc(first(galp)))
+ table(njunc(last(galp)))
+ table(njunc(first(galp)) + njunc(last(galp))) # up to 5 junctions
+ # per pair!
+
+ ## The NM tag gives the edit distance of each alignment to the
+ ## reference (ignoring the junction gaps):
+ table(mcols(first(galp))$NM)
+ table(mcols(last(galp))$NM)
+ table(mcols(first(galp))$NM + mcols(last(galp))$NM)
+ }
Loading required package: GenomicAlignments
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: Biostrings
Loading required package: XVector
Loading required package: Rsamtools
0 1 2 3 4 5 6 7 8 9 10
251192 77852 33649 14143 9129 5426 4155 3112 791 289 117
11 12 13 14
82 63 31 23
Warning message:
In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, :
376 alignments with ambiguous pairing were dumped.
Use 'getDumpedAlignments()' to retrieve them from the dump environment.
>
> ## See README.TXT in scripts/ folder:
> README_file <- system.file("scripts", "README.TXT",
+ package="RNAseqData.HNRNPC.bam.chr14")
> cat(readLines(README_file), sep="\n")
This README.TXT file is located in the scripts/ folder of the
RNAseqData.HNRNPC.bam.chr14 package. This folder contains the scripts that
were used to generate the data contained in the package (note that all the
data contained in the package is located in its extdata/ folder).
This README.TXT file describes how to use those scripts to re-generate the
content of the extdata/ folder.
0. Requirements
---------------
- Good hardware. Running step "6. Align the reads" below will take a long
time (between 20h and 3 or 4 days, depending on your hardware). This step
was run on one of the very crowded rhino servers at the Fred Hutchinson
Cancer Research Center in May 2013 to generate the data contained in the
RNAseqData.HNRNPC.bam.chr14 package. Each rhino server is a 64-bit Ubuntu
system with 16 cores and 384Gb of RAM. Step "6. Align the reads" took about
3 days to complete. If no other users were competing for resources, it
would still probably take about the same amount of time on a Linux server
with only 4 cores and 16Gb of RAM.
- 50Gb of available disk space.
- Fast internet access (there is about 30Gb of data to download).
- Software: Bowtie2 + TopHat2, SAMtools, and R + Bioconductor (Rsamtools
package). A brief overview of how to install Bowtie2 + TopHat2 is given
below.
1. Set SCRIPTS_DIR
------------------
Set the SCRIPTS_DIR shell variable to the *absolute* path of the scripts/
folder of the RNAseqData.HNRNPC.bam.chr14 package. If the package is installed,
you can get that path with:
pkgname="RNAseqData.HNRNPC.bam.chr14"
rscript="cat(system.file('scripts', package='$pkgname'))"
SCRIPTS_DIR=`echo "$rscript" | R --slave`
Make sure SCRIPTS_DIR is set to something:
echo $SCRIPTS_DIR
Otherwise, download the package source tarball from the Bioconductor package
repository (http://bioconductor.org/packages/release/data/experiment/) and
extract it with 'tar zxf <path/to/tarball>'. This creates the package source
tree in your current directory. Right after extraction is complete (and
*before* you move to another directory with 'cd'), do:
SCRIPTS_DIR="`pwd`/RNAseqData.HNRNPC.bam.chr14/inst/scripts"
In either case, make sure SCRIPTS_DIR is set correctly by doing:
cat $SCRIPTS_DIR/README.TXT
It should display the content of this README.TXT file.
2. Create the "main working directory"
--------------------------------------
For example with:
mkdir ~/E-MTAB-1147
Also create the 4 following subdirs:
cd ~/E-MTAB-1147
mkdir fastq tmp_downloads tophat2_out bam_chr14
3. Download the 16 FASTQ files (2 per sequencing run)
-----------------------------------------------------
This will take a long time: there is about 30Gb of data to download!
cd ~/E-MTAB-1147/fastq
$SCRIPTS_DIR/download-fastq.sh
Alternatively, you can do this in R with:
library(RNAseqData.HNRNPC.bam.chr14)
setwd("~/E-MTAB-1147/fastq")
cmd <- system.file("scripts", "download-fastq.sh",
package="RNAseqData.HNRNPC.bam.chr14")
system(cmd)
4. Download and install Bowtie2 + TopHat2
-----------------------------------------
(a) Download the latest pre-compiled binary of Bowtie2 from
http://sourceforge.net/projects/bowtie-bio/files/bowtie2/
The version that was used in May 2013 for generating the data in this
package is the 2.1.0 version for 64-bit Linux (i.e. file
bowtie2-2.1.0-linux-x86_64.zip) downloaded from:
http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.1.0/
(b) Download the latest pre-compiled binary of TopHat2 from
http://tophat.cbcb.umd.edu/tutorial.shtml
The version that was used in May 2013 for generating the data in this
package is the 2.0.8b version for 64-bit Linux (i.e. file
tophat-2.0.8b.Linux_x86_64.tar.gz) downloaded from:
http://tophat.cbcb.umd.edu/downloads/
(c) Extract them in your home. Make sure the executables located in
~/bowtie2-2.1.0/ and ~/tophat-2.0.8b.Linux_x86_64/ are in your PATH.
Create the Bowtie indexes subdirectory with:
mkdir ~/bowtie2-2.1.0/indexes
5. Create Bowtie2 index for hg19
--------------------------------
This took about 2h on rhino02:
cd ~/E-MTAB-1147/tmp_downloads
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
tar zxvf chromFa.tar.gz
hg19_seqnames=`cat $SCRIPTS_DIR/hg19_seqnames.txt`
reference_in=""
for seqname in $hg19_seqnames; do
file=${seqname}.fa
if [ "$reference_in" == "" ]; then
reference_in="$file"
else
reference_in="$reference_in,$file"
fi
done
bowtie2-build $reference_in hg19
This produces 6 files: hg19.1.bt2, hg19.2.bt2, hg19.3.bt2, hg19.4.bt2,
hg19.rev.1.bt2, hg19.rev.2.bt2
Move them to the Bowtie indexes subdirectory with:
mv -i hg19.*.bt2 ~/bowtie2-2.1.0/indexes
Test that the index is properly installed:
export BOWTIE2_INDEXES=~/bowtie2-2.1.0/indexes
bowtie2 -c hg19 GTTTTAGTAGAGACAAGGTCTCACTGTGCTGCCCTGGTGGGTCTCAAATTCCTGA
6. Align the reads
------------------
There are 8 sequencing runs in this experiment so we're going to fire 1
instance of TopHat2 per run. Each instance of TopHat2 will write its output
to a folder under ~/E-MTAB-1147/tophat2_out
Have a look at the start-tophat2-on-all-runs.sh script and make any necessary
adjustements to it. Then run it with:
cd ~/E-MTAB-1147
$SCRIPTS_DIR/start-tophat2-on-all-runs.sh
The script will exit almost instantly after starting the tophat2 commands
in the background. You can log out and come back any time to the server to
check how things are going e.g. with:
cd ~/E-MTAB-1147
tail tophat2-*.log
Or with:
tail -f tophat2-ERR127306.log
to follow the output of 1 TopHat2 instance in real-time.
Depending on your hardware, it should take between 20h and 3 or 4 days for
the 8 TopHat2 instances to complete.
7. Extract the alignments located on chr14
------------------------------------------
We do this for the 8 BAM files produced in step 6:
cd ~/E-MTAB-1147
R --vanilla <$SCRIPTS_DIR/subset_bam_chr14.R
This will produce 16 files: 1 <run_name>_chr14.bam and 1
<run_name>_chr14.bam.bai file per sequencing run. Those are the files
located in the extdata/ folder of the RNAseqData.HNRNPC.bam.chr14 package.
>
>
>
>
>
> dev.off()
null device
1
>