R Graphical Manual

Browse All

Last data update: 2014.03.03

R: Summarize the reads assigned to a SplicingGraphs object

countReads-methods

R Documentation

Summarize the reads assigned to a SplicingGraphs object

Description

countReads counts the reads assigned to a SplicingGraphs object. The counting can be done by splicing graph edge, reduced splicing graph edge, transcript, or gene.

reportReads is similar to countReads but returns right before the final counting step, that is, the returned DataFrame contains the reads instead of their counts.

Usage

countReads(x, by=c("sgedge", "rsgedge", "tx", "gene"))
reportReads(x, by=c("sgedge", "rsgedge", "tx", "gene"))

Arguments

`x`	A SplicingGraphs object.
`by`	Can be `"sgedge"`, `"rsgedge"`, `"tx"`, or `"gene"`. Specifies the level of resolution that summarization should be performed at. See Details section below.

Details

Levels of resolution

countReads and reportReads allow summarization of the reads at different levels of resolution. The level of resolution is determined by the type of feature that one chooses via the by argument. The supported resolutions are (from highest to lowest resolution):

by="sgedge" for summarization at the splicing graph edge level (i.e. at the exons/intron level);
by="rsgedge" for summarization at the reduced splicing graph edge level;
by="tx" for summarization at the transcript level;
by="gene" for summarization at the gene level.

Relationship between levels of resolution

There is a parent-child relationship between the features corresponding to a given level of resolution (the parent features) and those corresponding to a higher level of resolution (the child features).

For example, in the case of the 2 first levels of resolution listed above, the parent-child relationship is the following: the parent features are the reduced splicing graph edges, the child features are the splicing graph edges, and each parent feature is obtained by merging one or more child features together. Similarly, transcripts can be seen as parent features of reduced splicing graph edges, and genes as parent features of transcripts. Note that, the rsgedge/sgedge and gene/tx relationships are one-to-many, but the tx/rsgedge relationship is many-to-many because a given edge can belong to more than one transcript.

Finally the parent-child relationships between 2 arbitrary levels of resolution is defined by combining the relationships between consecutive levels. All possible parent-child relationships are summarized in the following table:

                    | to: sgedge   | to: rsgedge  | to: tx
      --------------+--------------+--------------+------------
      from: rsgedge | one-to-many  |              |            
      from: tx      | many-to-many | many-to-many |            
      from: gene    | one-to-many  | one-to-many  | one-to-many

Multiple hits and ambiguous reads

An important distinction needs to be made between a read that hits a given feature multiple times and a read that hits more than one feature.

If the former, the read is counted/reported only once for that feature. For example, when summarizing at the transcript level, a read is counted/reported only once for a given transcript, even if that read hits more than one splicing graph edge (or reduced splicing graph edge) associated with that transcript.

If the latter, the read is said to be ambiguous. An ambiguous read is currently counted/reported for each feature where it has a hit. This is a temporary situation: in the near future the user will be offered options to handle ambiguous reads in different ways.

Ambiguous reads and levels of resolution

A read might be ambiguous at one level of resolution but not at the other. Also the number of ambiguous reads is typically affected by the level of resolution. However, even though higher resolution generally means more ambiguous reads, this is only true when the switch from one level of resolution to the other implies a parent-child relationship between features that is one-to-many. So, based on the above table, this is always true, except when switching from using by="tx" to using by="sgedge" or by="rsgedge". In those cases, the switch can produce more ambiguities but it can also produce less.

Value

A DataFrame object with one row per:

unique splicing graph edge, if by="sgedge";
unique reduced splicing graph edge, if by="rsgedge";
transcript if by="tx";
gene if by="gene".

And with one column per sample (containing the counts for that sample for countReads, and the reads for that sample for reportReads), plus the two following left columns:

if by="sgedge": "sgedge_id", containing the global splicing graph edge ids, and "ex_or_in", containing the type of edge (exon or intron);
if by="rsgedge": "rsgedge_id", containing the global reduced splicing graph edge ids, and "ex_or_in", containing the type of edge (exon, intron, or mixed);
if by="tx": "tx_id" and "gene_id";
if by="gene": "gene_id" and "tx_id".

For countReads, each column of counts is of type integer and is named after the corresponding sample. For reportReads, each column of reads is a CharacterList object and its name is the name of the corresponding sample with the ".hits" suffix added to it. In both cases, the name of the sample is the name that was passed to assignReads when the reads of a given sample were initially assigned. See ?assignReads for more information.

Author(s)

H. Pages

Examples

## ---------------------------------------------------------------------
## 1. Make SplicingGraphs object 'sg' from toy gene model and assign toy
##    reads to it (see '?assignReads')
## ---------------------------------------------------------------------
example(assignReads)

## ---------------------------------------------------------------------
## 2. Summarize the reads by splicing graph edge
## ---------------------------------------------------------------------
countReads(sg)
reportReads(sg) 

## ---------------------------------------------------------------------
## 3. Summarize the reads by reduced splicing graph edge
## ---------------------------------------------------------------------
countReads(sg, by="rsgedge")
reportReads(sg, by="rsgedge")

## ---------------------------------------------------------------------
## 4. Summarize the reads by transcript
## ---------------------------------------------------------------------
countReads(sg, by="tx")
reportReads(sg, by="tx")

## ---------------------------------------------------------------------
## 5. Summarize the reads by gene
## ---------------------------------------------------------------------
countReads(sg, by="gene")
reportReads(sg, by="gene")

## ---------------------------------------------------------------------
## 6. A close look at ambiguous reads
## ---------------------------------------------------------------------
resolutions <- c("sgedge", "rsgedge", "tx", "gene")

reported_reads <- lapply(resolutions,
    function(by) {
        reported_reads <- reportReads(sg, by=by)
        unlist(reported_reads$TOYREADS.hits)
    })

## The set of reported reads is the same at all levels of resolution:
unique_reported_reads <- lapply(reported_reads, unique)
stopifnot(identical(unique_reported_reads,
                    rep(unique_reported_reads[1], 4)))

## Extract ambigous reads for each level of resolution:
ambiguous_reads <- lapply(reported_reads,
                          function(x) unique(x[duplicated(x)]))
names(ambiguous_reads) <- resolutions
ambiguous_reads

## Reads that are ambiguous at the "rsgedge" level must also be
## ambiguous at the "sgedge" level:
stopifnot(all(ambiguous_reads$rsgedge %in% ambiguous_reads$sgedge))

## However, there is no reason why reads that are ambiguous at the
## "tx" level should also be ambiguous at the "sgedge" or "rsgedge"
## level!

## ---------------------------------------------------------------------
## 7. Remove the reads from 'sg'.
## ---------------------------------------------------------------------
sg <- removeReads(sg)
countReads(sg)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(SplicingGraphs)
Loading required package: GenomicFeatures
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: GenomicRanges
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: GenomicAlignments
Loading required package: SummarizedExperiment
Loading required package: Biostrings
Loading required package: XVector
Loading required package: Rsamtools
Loading required package: Rgraphviz
Loading required package: graph

Attaching package: 'graph'

The following object is masked from 'package:Biostrings':

    complement

Loading required package: grid

Attaching package: 'Rgraphviz'

The following objects are masked from 'package:IRanges':

    from, to

The following objects are masked from 'package:S4Vectors':

    from, to

Warning messages:
1: replacing previous import 'IRanges::from' by 'Rgraphviz::from' when loading 'SplicingGraphs' 
2: replacing previous import 'IRanges::to' by 'Rgraphviz::to' when loading 'SplicingGraphs' 
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/SplicingGraphs/countReads-methods.Rd_%03d_medium.png", width=480, height=480)
> ### Name: countReads-methods
> ### Title: Summarize the reads assigned to a SplicingGraphs object
> ### Aliases: countReads-methods countReads countReads,SplicingGraphs-method
> 
> ### ** Examples
> 
> ## ---------------------------------------------------------------------
> ## 1. Make SplicingGraphs object 'sg' from toy gene model and assign toy
> ##    reads to it (see '?assignReads')
> ## ---------------------------------------------------------------------
> example(assignReads)

assgnR> ## ---------------------------------------------------------------------
assgnR> ## 1. Make SplicingGraphs object 'sg' from toy gene model (see
assgnR> ##    '?SplicingGraphs')
assgnR> ## ---------------------------------------------------------------------
assgnR> example(SplicingGraphs) 

SplcnG> ## ---------------------------------------------------------------------
SplcnG> ## 1. Load a toy gene model as a TxDb object
SplcnG> ## ---------------------------------------------------------------------
SplcnG> 
SplcnG> library(GenomicFeatures)

SplcnG> suppressWarnings(
SplcnG+   toy_genes_txdb <- makeTxDbFromGFF(toy_genes_gff())
SplcnG+ )
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK

SplcnG> ## ---------------------------------------------------------------------
SplcnG> ## 2. Compute all the splicing graphs (1 graph per gene) and return them
SplcnG> ##    in a SplicingGraphs object
SplcnG> ## ---------------------------------------------------------------------
SplcnG> 
SplcnG> ## Extract the exons grouped by transcript:
SplcnG> ex_by_tx <- exonsBy(toy_genes_txdb, by="tx", use.names=TRUE)

SplcnG> ## Extract the transcripts grouped by gene:
SplcnG> tx_by_gn <- transcriptsBy(toy_genes_txdb, by="gene")

SplcnG> sg <- SplicingGraphs(ex_by_tx, tx_by_gn)

SplcnG> sg
SplicingGraphs object with 5 gene(s) and 13 transcript(s)

SplcnG> ## Alternatively 'sg' can be constructed directly from the TxDb
SplcnG> ## object:
SplcnG> sg2 <- SplicingGraphs(toy_genes_txdb)  # same as 'sg'

SplcnG> sg2
SplicingGraphs object with 5 gene(s) and 13 transcript(s)

SplcnG> ## Note that because SplicingGraphs objects have a slot that is an
SplcnG> ## environment (for caching the bubbles), they cannot be compared with
SplcnG> ## 'identical()' (will always return FALSE). 'all.equal()' should be
SplcnG> ## used instead:
SplcnG> stopifnot(isTRUE(all.equal(sg2, sg)))

SplcnG> ## 'sg' has 1 element per gene and 'names(sg)' gives the gene ids:
SplcnG> length(sg)
[1] 5

SplcnG> names(sg)
[1] "geneA" "geneB" "geneC" "geneD" "geneE"

SplcnG> ## ---------------------------------------------------------------------
SplcnG> ## 3. Basic manipulation of a SplicingGraphs object
SplcnG> ## ---------------------------------------------------------------------
SplcnG> 
SplcnG> ## Basic accessors:
SplcnG> seqnames(sg)
geneA geneB geneC geneD geneE 
 chrX  chrX  chrX  chrX  chrX 
Levels: chrX

SplcnG> strand(sg)
geneA geneB geneC geneD geneE 
    +     -     +     +     + 
Levels: + - *

SplcnG> seqinfo(sg)
Seqinfo object with 1 sequence from an unspecified genome; no seqlengths:
  seqnames seqlengths isCircular genome
  chrX             NA         NA   <NA>

SplcnG> ## Number of transcripts per gene:
SplcnG> elementNROWS(sg)
geneA geneB geneC geneD geneE 
    2     2     3     4     2 

SplcnG> ## The transcripts of a given gene can be extracted with [[. The result
SplcnG> ## is an *unnamed* GRangesList object containing the exons grouped by
SplcnG> ## transcript:
SplcnG> sg[["geneD"]]
GRangesList object of length 4:
[[1]] 
GRanges object with 2 ranges and 5 metadata columns:
      seqnames     ranges strand |   exon_id   exon_name exon_rank start_SSid
         <Rle>  <IRanges>  <Rle> | <integer> <character> <integer>  <integer>
  [1]     chrX [601, 630]      + |        10         Dx2         1          1
  [2]     chrX [666, 675]      + |        12         Dx4         2          5
       end_SSid
      <integer>
  [1]         3
  [2]         6

[[2]] 
GRanges object with 2 ranges and 5 metadata columns:
      seqnames     ranges strand | exon_id exon_name exon_rank start_SSid
  [1]     chrX [601, 620]      + |       9       Dx1         1          1
  [2]     chrX [651, 700]      + |      11       Dx3         2          4
      end_SSid
  [1]        2
  [2]        8

[[3]] 
GRanges object with 3 ranges and 5 metadata columns:
      seqnames     ranges strand | exon_id exon_name exon_rank start_SSid
  [1]     chrX [601, 620]      + |       9       Dx1         1          1
  [2]     chrX [666, 675]      + |      12       Dx4         2          5
  [3]     chrX [691, 700]      + |      13       Dx5         3          7
      end_SSid
  [1]        2
  [2]        6
  [3]        8

...
<1 more element>
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

SplcnG> ## See '?plotTranscripts' for how to plot those transcripts.
SplcnG> 
SplcnG> ## The transcripts of all the genes can be extracted with unlist(). The
SplcnG> ## result is a *named* GRangesList object containing the exons grouped
SplcnG> ## by transcript. The names on the object are the gene ids:
SplcnG> ex_by_tx <- unlist(sg)

SplcnG> ex_by_tx
GRangesList object of length 13:
$geneA 
GRanges object with 1 range and 5 metadata columns:
      seqnames    ranges strand |   exon_id   exon_name exon_rank start_SSid
         <Rle> <IRanges>  <Rle> | <integer> <character> <integer>  <integer>
  [1]     chrX  [11, 50]      + |         2         Ax2         1          1
       end_SSid
      <integer>
  [1]         3

$geneA 
GRanges object with 2 ranges and 5 metadata columns:
      seqnames    ranges strand | exon_id exon_name exon_rank start_SSid
  [1]     chrX [11,  40]      + |       1       Ax1         1          1
  [2]     chrX [71, 100]      + |       3       Ax3         2          4
      end_SSid
  [1]        2
  [2]        5

$geneB 
GRanges object with 2 ranges and 5 metadata columns:
      seqnames     ranges strand | exon_id exon_name exon_rank start_SSid
  [1]     chrX [251, 300]      - |      23       Bx1         1          3
  [2]     chrX [201, 230]      - |      20       Bx2         2          6
      end_SSid
  [1]        1
  [2]        4

...
<10 more elements>
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

assgnR> sg
SplicingGraphs object with 5 gene(s) and 13 transcript(s)

assgnR> ## 'sg' has 1 element per gene and 'names(sg)' gives the gene ids.
assgnR> names(sg)
[1] "geneA" "geneB" "geneC" "geneD" "geneE"

assgnR> ## ---------------------------------------------------------------------
assgnR> ## 2. Load toy reads
assgnR> ## ---------------------------------------------------------------------
assgnR> ## Load toy reads (single-end) from a BAM file. We filter out secondary
assgnR> ## alignments, reads not passing quality controls, and PCR or optical
assgnR> ## duplicates (see ?scanBamFlag in the Rsamtools package for more
assgnR> ## information):
assgnR> flag0 <- scanBamFlag(isSecondaryAlignment=FALSE,
assgnR+                      isNotPassingQualityControls=FALSE,
assgnR+                      isDuplicate=FALSE)

assgnR> param0 <- ScanBamParam(flag=flag0)

assgnR> gal <- readGAlignments(toy_reads_bam(), use.names=TRUE, param=param0)

assgnR> gal
GAlignments object with 42 alignments and 0 metadata columns:
       seqnames strand       cigar    qwidth     start       end     width
          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
  r001     chrX      -         10M        10        12        21        10
  r002     chrX      +     7M15N3M        10        15        39        25
  r003     chrX      +         10M        10        21        30        10
  r004     chrX      -         10M        10        26        35        10
  r005     chrX      +         10M        10        31        40        10
   ...      ...    ...         ...       ...       ...       ...       ...
  r038     chrX      -         10M        10       278       287        10
  r039     chrX      +         10M        10       280       289        10
  r040     chrX      +         10M        10       285       294        10
  r041     chrX      -     5M12N5M        10       288       309        22
  r042     chrX      +         10M        10       290       299        10
           njunc
       <integer>
  r001         0
  r002         1
  r003         0
  r004         0
  r005         0
   ...       ...
  r038         0
  r039         0
  r040         0
  r041         1
  r042         0
  -------
  seqinfo: 1 sequence from an unspecified genome

assgnR> ## ---------------------------------------------------------------------
assgnR> ## 3. Assign the reads to the exons and introns in 'sg'
assgnR> ## ---------------------------------------------------------------------
assgnR> ## The same read can be assigned to more than 1 exon or intron (e.g. a
assgnR> ## junction read with 1 gap can be assigned to 2 exons and 1 intron).
assgnR> sg <- assignReads(sg, gal, sample.name="TOYREADS")

assgnR> ## See the assignments to the splicing graph edges.
assgnR> edge_by_tx <- sgedgesByTranscript(sg, with.hits.mcols=TRUE)

assgnR> edge_data <- mcols(unlist(edge_by_tx))

assgnR> colnames(edge_data)
[1] "from"          "to"            "sgedge_id"     "ex_or_in"     
[5] "tx_id"         "TOYREADS.hits"

assgnR> head(edge_data)
DataFrame with 6 rows and 6 columns
         from          to   sgedge_id ex_or_in       tx_id      TOYREADS.hits
  <character> <character> <character> <factor> <character>    <CharacterList>
1           1           3   geneA:1,3       ex          A1 r001,r003,r004,...
2           1           2   geneA:1,2       ex          A2 r001,r003,r004,...
3           2           4   geneA:2,4       in          A2               r006
4           4           5   geneA:4,5       ex          A2 r006,r008,r009,...
5           1           3   geneB:1,3       ex          B1 r027,r028,r029,...
6           3           4   geneB:3,4       in          B1     r027,r028,r029

assgnR> edge_data[ , c("sgedge_id", "TOYREADS.hits")]
DataFrame with 51 rows and 2 columns
      sgedge_id      TOYREADS.hits
    <character>    <CharacterList>
1     geneA:1,3 r001,r003,r004,...
2     geneA:1,2 r001,r003,r004,...
3     geneA:2,4               r006
4     geneA:4,5 r006,r008,r009,...
5     geneB:1,3 r027,r028,r029,...
...         ...                ...
47    geneE:5,6                   
48    geneE:6,7                   
49    geneE:7,8                   
50   geneE:8,11                   
51  geneE:11,12                   

assgnR> edge_by_gene <- sgedgesByGene(sg, with.hits.mcols=TRUE)

assgnR> mcols(unlist(edge_by_gene))
DataFrame with 41 rows and 6 columns
           from          to   sgedge_id ex_or_in           tx_id
    <character> <character> <character> <factor> <CharacterList>
1             1           3   geneA:1,3       ex              A1
2             1           2   geneA:1,2       ex              A2
3             2           4   geneA:2,4       in              A2
4             4           5   geneA:4,5       ex              A2
5             1           3   geneB:1,3       ex              B1
...         ...         ...         ...      ...             ...
37           11          12 geneE:11,12       ex           E1,E2
38            2           5   geneE:2,5       in              E2
39            6           7   geneE:6,7       in              E2
40            7           8   geneE:7,8       ex              E2
41            8          11  geneE:8,11       in              E2
         TOYREADS.hits
       <CharacterList>
1   r001,r003,r004,...
2   r001,r003,r004,...
3                 r006
4   r006,r008,r009,...
5   r027,r028,r029,...
...                ...
37                    
38                    
39                    
40                    
41                    

assgnR> ## See the assignments to the reduced splicing graph edges.
assgnR> redge_by_gene <- rsgedgesByGene(sg, with.hits.mcols=TRUE)

assgnR> mcols(unlist(redge_by_gene))
DataFrame with 28 rows and 6 columns
           from          to      rsgedge_id ex_or_in           tx_id
    <character> <character>     <character> <factor> <CharacterList>
1             1           3       geneA:1,3       ex              A1
2             1           5   geneA:1,2,4,5    mixed              A2
3             1           3       geneB:1,3       ex              B1
4             3           4       geneB:3,4       in           B1,B2
5             4           6       geneB:4,6       ex              B1
...         ...         ...             ...      ...             ...
24            5           6       geneE:5,6       ex           E1,E2
25            6          11 geneE:6,9,10,11    mixed              E1
26           11          12     geneE:11,12       ex           E1,E2
27            2           5       geneE:2,5       in              E2
28            6          11  geneE:6,7,8,11    mixed              E2
         TOYREADS.hits
       <CharacterList>
1   r001,r003,r004,...
2   r001,r003,r004,...
3   r027,r028,r029,...
4       r027,r028,r029
5   r014,r015,r016,...
...                ...
24                    
25                    
26                    
27                    
28                    

assgnR> ## ---------------------------------------------------------------------
assgnR> ## 4. Summarize the reads assigned to 'sg' and eventually remove them
assgnR> ## ---------------------------------------------------------------------
assgnR> ## See '?countReads'.
assgnR> 
assgnR> 
assgnR> 
> 
> ## ---------------------------------------------------------------------
> ## 2. Summarize the reads by splicing graph edge
> ## ---------------------------------------------------------------------
> countReads(sg)
DataFrame with 41 rows and 3 columns
      sgedge_id ex_or_in  TOYREADS
    <character> <factor> <integer>
1     geneA:1,3       ex         5
2     geneA:1,2       ex         5
3     geneA:2,4       in         1
4     geneA:4,5       ex         5
5     geneB:1,3       ex        14
...         ...      ...       ...
37  geneE:11,12       ex         0
38    geneE:2,5       in         0
39    geneE:6,7       in         0
40    geneE:7,8       ex         0
41   geneE:8,11       in         0
> reportReads(sg) 
DataFrame with 41 rows and 3 columns
      sgedge_id ex_or_in      TOYREADS.hits
    <character> <factor>    <CharacterList>
1     geneA:1,3       ex r001,r003,r004,...
2     geneA:1,2       ex r001,r003,r004,...
3     geneA:2,4       in               r006
4     geneA:4,5       ex r006,r008,r009,...
5     geneB:1,3       ex r027,r028,r029,...
...         ...      ...                ...
37  geneE:11,12       ex                   
38    geneE:2,5       in                   
39    geneE:6,7       in                   
40    geneE:7,8       ex                   
41   geneE:8,11       in                   
> 
> ## ---------------------------------------------------------------------
> ## 3. Summarize the reads by reduced splicing graph edge
> ## ---------------------------------------------------------------------
> countReads(sg, by="rsgedge")
DataFrame with 28 rows and 3 columns
         rsgedge_id ex_or_in  TOYREADS
        <character> <factor> <integer>
1         geneA:1,3       ex         5
2     geneA:1,2,4,5    mixed         9
3         geneB:1,3       ex        14
4         geneB:3,4       in         3
5         geneB:4,6       ex        13
...             ...      ...       ...
24        geneE:5,6       ex         0
25  geneE:6,9,10,11    mixed         0
26      geneE:11,12       ex         0
27        geneE:2,5       in         0
28   geneE:6,7,8,11    mixed         0
> reportReads(sg, by="rsgedge")
DataFrame with 28 rows and 3 columns
         rsgedge_id ex_or_in      TOYREADS.hits
        <character> <factor>    <CharacterList>
1         geneA:1,3       ex r001,r003,r004,...
2     geneA:1,2,4,5    mixed r001,r003,r004,...
3         geneB:1,3       ex r027,r028,r029,...
4         geneB:3,4       in     r027,r028,r029
5         geneB:4,6       ex r014,r015,r016,...
...             ...      ...                ...
24        geneE:5,6       ex                   
25  geneE:6,9,10,11    mixed                   
26      geneE:11,12       ex                   
27        geneE:2,5       in                   
28   geneE:6,7,8,11    mixed                   
> 
> ## ---------------------------------------------------------------------
> ## 4. Summarize the reads by transcript
> ## ---------------------------------------------------------------------
> countReads(sg, by="tx")
DataFrame with 13 rows and 3 columns
          tx_id     gene_id  TOYREADS
    <character> <character> <integer>
1            A1       geneA         5
2            A2       geneA         9
3            B1       geneB        24
4            B2       geneB        11
5            C1       geneC         0
...         ...         ...       ...
9            D1       geneD         0
10           D2       geneD         0
11           D3       geneD         0
12           E1       geneE         0
13           E2       geneE         0
> reportReads(sg, by="tx")
DataFrame with 13 rows and 3 columns
          tx_id     gene_id      TOYREADS.hits
    <character> <character>    <CharacterList>
1            A1       geneA r001,r003,r004,...
2            A2       geneA r001,r003,r004,...
3            B1       geneB r027,r028,r029,...
4            B2       geneB r027,r028,r029,...
5            C1       geneC                   
...         ...         ...                ...
9            D1       geneD                   
10           D2       geneD                   
11           D3       geneD                   
12           E1       geneE                   
13           E2       geneE                   
> 
> ## ---------------------------------------------------------------------
> ## 5. Summarize the reads by gene
> ## ---------------------------------------------------------------------
> countReads(sg, by="gene")
DataFrame with 5 rows and 3 columns
          gene_id           tx_id  TOYREADS
      <character> <CharacterList> <integer>
geneA       geneA           A1,A2        10
geneB       geneB           B1,B2        24
geneC       geneC        C1,C2,C3         0
geneD       geneD    D4,D3,D2,...         0
geneE       geneE           E1,E2         0
> reportReads(sg, by="gene")
DataFrame with 5 rows and 3 columns
          gene_id           tx_id      TOYREADS.hits
      <character> <CharacterList>    <CharacterList>
geneA       geneA           A1,A2 r001,r003,r004,...
geneB       geneB           B1,B2 r027,r028,r029,...
geneC       geneC        C1,C2,C3                   
geneD       geneD    D4,D3,D2,...                   
geneE       geneE           E1,E2                   
> 
> ## ---------------------------------------------------------------------
> ## 6. A close look at ambiguous reads
> ## ---------------------------------------------------------------------
> resolutions <- c("sgedge", "rsgedge", "tx", "gene")
> 
> reported_reads <- lapply(resolutions,
+     function(by) {
+         reported_reads <- reportReads(sg, by=by)
+         unlist(reported_reads$TOYREADS.hits)
+     })
> 
> ## The set of reported reads is the same at all levels of resolution:
> unique_reported_reads <- lapply(reported_reads, unique)
> stopifnot(identical(unique_reported_reads,
+                     rep(unique_reported_reads[1], 4)))
> 
> ## Extract ambigous reads for each level of resolution:
> ambiguous_reads <- lapply(reported_reads,
+                           function(x) unique(x[duplicated(x)]))
> names(ambiguous_reads) <- resolutions
> ambiguous_reads
$sgedge
 [1] "r001" "r003" "r004" "r005" "r006" "r027" "r028" "r029" "r031" "r032"
[11] "r033" "r034" "r035" "r021" "r022" "r023"

$rsgedge
 [1] "r001" "r003" "r004" "r005" "r027" "r028" "r029" "r031" "r032" "r033"
[11] "r034" "r035" "r021" "r022" "r023"

$tx
 [1] "r001" "r003" "r004" "r005" "r027" "r028" "r029" "r031" "r032" "r033"
[11] "r034" "r035" "r021" "r022" "r023"

$gene
character(0)

> 
> ## Reads that are ambiguous at the "rsgedge" level must also be
> ## ambiguous at the "sgedge" level:
> stopifnot(all(ambiguous_reads$rsgedge %in% ambiguous_reads$sgedge))
> 
> ## However, there is no reason why reads that are ambiguous at the
> ## "tx" level should also be ambiguous at the "sgedge" or "rsgedge"
> ## level!
> 
> ## ---------------------------------------------------------------------
> ## 7. Remove the reads from 'sg'.
> ## ---------------------------------------------------------------------
> sg <- removeReads(sg)
> countReads(sg)
DataFrame with 41 rows and 2 columns
      sgedge_id ex_or_in
    <character> <factor>
1     geneA:1,3       ex
2     geneA:1,2       ex
3     geneA:2,4       in
4     geneA:4,5       ex
5     geneB:1,3       ex
...         ...      ...
37  geneE:11,12       ex
38    geneE:2,5       in
39    geneE:6,7       in
40    geneE:7,8       ex
41   geneE:8,11       in
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>