Last data update: 2014.03.03

R: Aligned reads from RNAseq experiment: Transcription profiling...
RNAseqData.HNRNPC.bam.chr14-packageR Documentation

Aligned reads from RNAseq experiment: Transcription profiling by high throughput sequencing of HNRNPC knockdown and control HeLa cells

Description

The package contains 8 BAM files, 1 per sequencing run. Each BAM file was obtained by (1) aligning the reads (paired-end) to the full hg19 genome with TopHat2, and then (2) subsetting to keep only alignments on chr14. See accession number E-MTAB-1147 in the ArrayExpress database for details about the experiment, including links to the published study (by Zarnack et al., 2012) and to the FASTQ files.

RNAseqData.HNRNPC.bam.chr14_RUNNAMES is a predefined character vector containing the names of the 8 runs of the RNAseq experiment. The name of each run is its accession number on the European Nucleotide Archive (ENA).

RNAseqData.HNRNPC.bam.chr14_BAMFILES is a predefined named character vector of length 8 containing the paths to the 8 BAM files obtained by aligning the reads from each sequencing run. The reads were aligned to the full hg19 genome with TopHat2 and then the resulting BAM file was subset to keep only alignments located on chr14.

Usage

RNAseqData.HNRNPC.bam.chr14_RUNNAMES
RNAseqData.HNRNPC.bam.chr14_BAMFILES

Details

The scripts/ folder in the package contains the scripts that were used to generate the data contained in the package (note that all the data contained in the package is located in its extdata/ folder). The README.TXT file in the scripts/ folder describes how to use those scripts to re-generate the data.

References

Direct Competition between hnRNP C and U2AF65 Protects the Transcriptome from the Exonization of Alu Elements. Kathi Zarnack, Julian Konig, Mojca Tajnik, Inigo Martincorena, Sebastian Eustermann, Isabelle Stevant, Alejandro Reyes, Simon Anders, Nicholas M. Luscombe, Jernej Ule. Cell 2012, Europe PMC 23374342

TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Daehwan Kim, Geo Pertea, Cole Trapnell, Harold Pimentel, Ryan Kelley and Steven L Salzberg. Genome biology 2013, 14:R36

See Also

Examples

RNAseqData.HNRNPC.bam.chr14_RUNNAMES
RNAseqData.HNRNPC.bam.chr14_BAMFILES

if (require(GenomicAlignments)) {
    bamfiles <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
    param <- ScanBamParam(tag="NM")
    galp <- readGAlignmentPairs(bamfiles[1L], use.names=TRUE, param=param)
    galp

    first(galp)  # 1st segment in each pair
    last(galp)  # 2nd segment in each pair

    ## The alignments contain insertions, deletions, and junctions (I, D,
    ## and N CIGAR operations, respectively):
    colSums(cigarOpTable(cigar(first(galp))))
    colSums(cigarOpTable(cigar(last(galp))))

    ## njunc() returns the nb of junctions for each alignment:
    table(njunc(first(galp)))
    table(njunc(last(galp)))
    table(njunc(first(galp)) + njunc(last(galp)))  # up to 5 junctions
                                                   # per pair!

    ## The NM tag gives the edit distance of each alignment to the
    ## reference (ignoring the junction gaps):
    table(mcols(first(galp))$NM)
    table(mcols(last(galp))$NM)
    table(mcols(first(galp))$NM + mcols(last(galp))$NM)
}

## See README.TXT in scripts/ folder:
README_file <- system.file("scripts", "README.TXT",
                           package="RNAseqData.HNRNPC.bam.chr14")
cat(readLines(README_file), sep="\n")

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(RNAseqData.HNRNPC.bam.chr14)
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/RNAseqData.HNRNPC.bam.chr14/RNAseqData.HNRNPC.bam.chr14-package.Rd_%03d_medium.png", width=480, height=480)
> ### Name: RNAseqData.HNRNPC.bam.chr14-package
> ### Title: Aligned reads from RNAseq experiment: Transcription profiling by
> ###   high throughput sequencing of HNRNPC knockdown and control HeLa cells
> ### Aliases: RNAseqData.HNRNPC.bam.chr14-package
> ###   RNAseqData.HNRNPC.bam.chr14 RNAseqData.HNRNPC.bam.chr14_RUNNAMES
> ###   RNAseqData.HNRNPC.bam.chr14_BAMFILES
> ### Keywords: package
> 
> ### ** Examples
> 
> RNAseqData.HNRNPC.bam.chr14_RUNNAMES
[1] "ERR127306" "ERR127307" "ERR127308" "ERR127309" "ERR127302" "ERR127303"
[7] "ERR127304" "ERR127305"
> RNAseqData.HNRNPC.bam.chr14_BAMFILES
                                                                                 ERR127306 
"/home/ddbj/local/lib64/R/library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127306_chr14.bam" 
                                                                                 ERR127307 
"/home/ddbj/local/lib64/R/library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127307_chr14.bam" 
                                                                                 ERR127308 
"/home/ddbj/local/lib64/R/library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127308_chr14.bam" 
                                                                                 ERR127309 
"/home/ddbj/local/lib64/R/library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127309_chr14.bam" 
                                                                                 ERR127302 
"/home/ddbj/local/lib64/R/library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127302_chr14.bam" 
                                                                                 ERR127303 
"/home/ddbj/local/lib64/R/library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127303_chr14.bam" 
                                                                                 ERR127304 
"/home/ddbj/local/lib64/R/library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127304_chr14.bam" 
                                                                                 ERR127305 
"/home/ddbj/local/lib64/R/library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127305_chr14.bam" 
> 
> if (require(GenomicAlignments)) {
+     bamfiles <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
+     param <- ScanBamParam(tag="NM")
+     galp <- readGAlignmentPairs(bamfiles[1L], use.names=TRUE, param=param)
+     galp
+ 
+     first(galp)  # 1st segment in each pair
+     last(galp)  # 2nd segment in each pair
+ 
+     ## The alignments contain insertions, deletions, and junctions (I, D,
+     ## and N CIGAR operations, respectively):
+     colSums(cigarOpTable(cigar(first(galp))))
+     colSums(cigarOpTable(cigar(last(galp))))
+ 
+     ## njunc() returns the nb of junctions for each alignment:
+     table(njunc(first(galp)))
+     table(njunc(last(galp)))
+     table(njunc(first(galp)) + njunc(last(galp)))  # up to 5 junctions
+                                                    # per pair!
+ 
+     ## The NM tag gives the edit distance of each alignment to the
+     ## reference (ignoring the junction gaps):
+     table(mcols(first(galp))$NM)
+     table(mcols(last(galp))$NM)
+     table(mcols(first(galp))$NM + mcols(last(galp))$NM)
+ }
Loading required package: GenomicAlignments
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: Biostrings
Loading required package: XVector
Loading required package: Rsamtools

     0      1      2      3      4      5      6      7      8      9     10 
251192  77852  33649  14143   9129   5426   4155   3112    791    289    117 
    11     12     13     14 
    82     63     31     23 
Warning message:
In .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode,  :
    376 alignments with ambiguous pairing were dumped.
    Use 'getDumpedAlignments()' to retrieve them from the dump environment.
> 
> ## See README.TXT in scripts/ folder:
> README_file <- system.file("scripts", "README.TXT",
+                            package="RNAseqData.HNRNPC.bam.chr14")
> cat(readLines(README_file), sep="\n")
This README.TXT file is located in the scripts/ folder of the
RNAseqData.HNRNPC.bam.chr14 package. This folder contains the scripts that
were used to generate the data contained in the package (note that all the
data contained in the package is located in its extdata/ folder).

This README.TXT file describes how to use those scripts to re-generate the
content of the extdata/ folder.

0. Requirements
---------------

- Good hardware. Running step "6. Align the reads" below will take a long
  time (between 20h and 3 or 4 days, depending on your hardware). This step
  was run on one of the very crowded rhino servers at the Fred Hutchinson
  Cancer Research Center in May 2013 to generate the data contained in the
  RNAseqData.HNRNPC.bam.chr14 package. Each rhino server is a 64-bit Ubuntu
  system with 16 cores and 384Gb of RAM. Step "6. Align the reads" took about
  3 days to complete. If no other users were competing for resources, it
  would still probably take about the same amount of time on a Linux server
  with only 4 cores and 16Gb of RAM.

- 50Gb of available disk space.

- Fast internet access (there is about 30Gb of data to download). 

- Software: Bowtie2 + TopHat2, SAMtools, and R + Bioconductor (Rsamtools
  package). A brief overview of how to install Bowtie2 + TopHat2 is given
  below.

1. Set SCRIPTS_DIR
------------------

Set the SCRIPTS_DIR shell variable to the *absolute* path of the scripts/
folder of the RNAseqData.HNRNPC.bam.chr14 package. If the package is installed,
you can get that path with:

  pkgname="RNAseqData.HNRNPC.bam.chr14"
  rscript="cat(system.file('scripts', package='$pkgname'))"
  SCRIPTS_DIR=`echo "$rscript" | R --slave`

Make sure SCRIPTS_DIR is set to something:

  echo $SCRIPTS_DIR

Otherwise, download the package source tarball from the Bioconductor package
repository (http://bioconductor.org/packages/release/data/experiment/) and
extract it with 'tar zxf <path/to/tarball>'. This creates the package source
tree in your current directory. Right after extraction is complete (and
*before* you move to another directory with 'cd'), do:

  SCRIPTS_DIR="`pwd`/RNAseqData.HNRNPC.bam.chr14/inst/scripts"

In either case, make sure SCRIPTS_DIR is set correctly by doing:

  cat $SCRIPTS_DIR/README.TXT

It should display the content of this README.TXT file.

2. Create the "main working directory"
--------------------------------------

For example with:

  mkdir ~/E-MTAB-1147

Also create the 4 following subdirs:

  cd ~/E-MTAB-1147
  mkdir fastq tmp_downloads tophat2_out bam_chr14

3. Download the 16 FASTQ files (2 per sequencing run)
-----------------------------------------------------

This will take a long time: there is about 30Gb of data to download!

  cd ~/E-MTAB-1147/fastq
  $SCRIPTS_DIR/download-fastq.sh

Alternatively, you can do this in R with:

  library(RNAseqData.HNRNPC.bam.chr14)
  setwd("~/E-MTAB-1147/fastq")
  cmd <- system.file("scripts", "download-fastq.sh",
                     package="RNAseqData.HNRNPC.bam.chr14")
  system(cmd)

4. Download and install Bowtie2 + TopHat2
-----------------------------------------

(a) Download the latest pre-compiled binary of Bowtie2 from

      http://sourceforge.net/projects/bowtie-bio/files/bowtie2/

    The version that was used in May 2013 for generating the data in this
    package is the 2.1.0 version for 64-bit Linux (i.e. file
    bowtie2-2.1.0-linux-x86_64.zip) downloaded from:

      http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.1.0/

(b) Download the latest pre-compiled binary of TopHat2 from

      http://tophat.cbcb.umd.edu/tutorial.shtml

    The version that was used in May 2013 for generating the data in this
    package is the 2.0.8b version for 64-bit Linux (i.e. file
    tophat-2.0.8b.Linux_x86_64.tar.gz) downloaded from:

      http://tophat.cbcb.umd.edu/downloads/
 
(c) Extract them in your home. Make sure the executables located in
    ~/bowtie2-2.1.0/ and ~/tophat-2.0.8b.Linux_x86_64/ are in your PATH.
    Create the Bowtie indexes subdirectory with:

      mkdir ~/bowtie2-2.1.0/indexes

5. Create Bowtie2 index for hg19
--------------------------------

This took about 2h on rhino02:

  cd ~/E-MTAB-1147/tmp_downloads
  wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
  tar zxvf chromFa.tar.gz
  hg19_seqnames=`cat $SCRIPTS_DIR/hg19_seqnames.txt`
  reference_in=""
  for seqname in $hg19_seqnames; do
      file=${seqname}.fa
      if [ "$reference_in" == "" ]; then
          reference_in="$file"
      else
          reference_in="$reference_in,$file"
      fi
  done
  bowtie2-build $reference_in hg19

This produces 6 files: hg19.1.bt2, hg19.2.bt2, hg19.3.bt2, hg19.4.bt2,
                       hg19.rev.1.bt2, hg19.rev.2.bt2

Move them to the Bowtie indexes subdirectory with:

  mv -i hg19.*.bt2 ~/bowtie2-2.1.0/indexes

Test that the index is properly installed:

  export BOWTIE2_INDEXES=~/bowtie2-2.1.0/indexes
  bowtie2 -c hg19 GTTTTAGTAGAGACAAGGTCTCACTGTGCTGCCCTGGTGGGTCTCAAATTCCTGA

6. Align the reads
------------------

There are 8 sequencing runs in this experiment so we're going to fire 1
instance of TopHat2 per run. Each instance of TopHat2 will write its output
to a folder under ~/E-MTAB-1147/tophat2_out
Have a look at the start-tophat2-on-all-runs.sh script and make any necessary
adjustements to it. Then run it with:

  cd ~/E-MTAB-1147
  $SCRIPTS_DIR/start-tophat2-on-all-runs.sh

The script will exit almost instantly after starting the tophat2 commands
in the background. You can log out and come back any time to the server to
check how things are going e.g. with:

  cd ~/E-MTAB-1147
  tail tophat2-*.log

Or with:

  tail -f tophat2-ERR127306.log

to follow the output of 1 TopHat2 instance in real-time.

Depending on your hardware, it should take between 20h and 3 or 4 days for
the 8 TopHat2 instances to complete.

7. Extract the alignments located on chr14
------------------------------------------

We do this for the 8 BAM files produced in step 6:

  cd ~/E-MTAB-1147
  R --vanilla <$SCRIPTS_DIR/subset_bam_chr14.R

This will produce 16 files: 1 <run_name>_chr14.bam and 1
<run_name>_chr14.bam.bai file per sequencing run. Those are the files
located in the extdata/ folder of the RNAseqData.HNRNPC.bam.chr14 package.

> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>