R: Functions for processing files of various formats into an...
readMethods
R Documentation
Functions for processing files of various formats into an ‘alignmentData’ object.
Description
These functions take alignment files of various formats to produce an object
(see Details) describing the alignment of sequencing tags from
different libraries. At present, BAM and text files are supported.
Directory (or directories) in which the files can be found.
replicates
A vector defining the replicate structure if the group. If and
only if the ith library is a replicate of the jth library
then @replicates[i] == @replicates[j]. This argument
may be given in any form but will be stored as a factor.
libnames
Names of the libraries defined by the file names.
chrs
A chracter vector defining (a selection of) the chromosome names
used in the alignment files.
chrlens
Lengths of the chromosomes to which the alignments were made.
cols
A named character vector which describes which column of the input
files contains which data. See Details.
countID
A (two-character) string used by the BAM file to identify the
‘counts’ of individual sequenced reads; that is, how many times a
given read appears in the sequenced library. If NULL, it is assumed
that the data are redundant (see Details).
header
Do the input files have a header line? Defaults
to TRUE. See Details.
minlen
Minimum length for reads.
maxlen
Maximum length for reads.
multireads
The functions will discard any read that aligns to
the genome in more locations than given by this value. Set to Inf to
keep everything. Defaults to 1000.
polyLength
If given, an integer value N defining the length of (approximate)
homopolymers which will be removed from the data. If a tag contains
a sequence of N+1 reads consisting of at least N identical bases, it
will be removed. If not given, all data is used.
estimationType
The estimationType that will be used by the ‘baySeq’ function
getLibsizes to infer the library sizes of the
samples.
verbose
Should processing information be displayed? Defaults to TRUE.
filterReport
If not NULL, this should be a string defining a file to which will
be written those data filtered on the basis of chromsome choices,
widths of sequences, multireads or polyBase.
...
Additional parameters to be passed to
read.table. In particular, the ‘sep’ and ‘skip’
arguments may be useful.
Details
readBAM:
This function takes a set of BAM files and generates the
'alignmentData' object from these. If a character string for
‘countID’ is given, the function assumes the data are non-redundant
and that ‘countID’ identifies the count data (i.e., how many times
each read appears in the sequenced library) in each BAM file. If
‘countID’ is NULL, then it is assumed that the data are redundant, and
the count data are inferred from the file.
readGeneric:
The purpose of this function is to take a set of plain text files
and produce an 'alignmentData' object. The function uses
read.table to read in the columns of data in the files
and so by default columns are separated by any white
space. Alternative separators can be used by passing the appropriate
value for 'sep' to read.table.
The files may contain columns with column names
'chr', 'tag', 'count', 'start',
'end', 'strand' in which case the ‘cols’ argument can be
ommitted and ‘header’ set to TRUE. If this is the case, there is no
requirement for all the files to have the same ordering of columns
(although all must have these column names).
Alternatively, the columns of data in the input files can be specified by
the ‘cols’ argument in the form of a named character vector (e.g;
'cols = c(chr = 1, tag = 2, count = 3, start = 4, end = 5,
strand = 6)' would cause the function to assume that the first column
contains the chromosome information, the second column contained the
tag information, etc. If ‘cols’ is specified then information in the
header is ignored. If ‘cols’ is missing and ‘header’ is FALSE, then it
is assumed that the data takes the form described in the example above.
The 'tag', 'count' and 'strand' columns may optionally be
omitted from either the file column headers or the ‘cols’ argument. If
the 'tag' column is omitted, then the data will not account for
duplicated sequences when estimating the number of counts in loci. If
the 'count' column is omitted, the 'readGeneric' function
will assume that the file contains the alignments of each copy of each
sequence tag, rather than an aggregated alignment of each unique
sequence. The unique alignments will be identified and the number of
sequence tags aligning to each position will be calculated. If
'strand' is omitted, the strand will simply be ignored.
Value
An alignmentData object.
Author(s)
Thomas J. Hardcastle
See Also
alignmentData
Examples
# Define the chromosome lengths for the genome of interest.
chrlens <- c(2e6, 1e6)
# Define the files containing sample information.
datadir <- system.file("extdata", package = "segmentSeq")
libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt")
# Establish the library names and replicate structure.
libnames <- c("SL9", "SL10", "SL26", "SL32")
replicates <- c(1,1,2,2)
# Process the files to produce an `alignmentData' object.
alignData <- readGeneric(file = libfiles, dir = datadir, replicates =
replicates, libnames = libnames, chrs = c(">Chr1", ">Chr2"), chrlens =
chrlens)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(segmentSeq)
Loading required package: baySeq
Loading required package: GenomicRanges
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: abind
Loading required package: perm
Loading required package: ShortRead
Loading required package: BiocParallel
Loading required package: Biostrings
Loading required package: XVector
Loading required package: Rsamtools
Loading required package: GenomicAlignments
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/segmentSeq/readMethods.Rd_%03d_medium.png", width=480, height=480)
> ### Name: readMethods
> ### Title: Functions for processing files of various formats into an
> ### 'alignmentData' object.
> ### Aliases: readGeneric readBAM
> ### Keywords: files
>
> ### ** Examples
>
>
> # Define the chromosome lengths for the genome of interest.
>
> chrlens <- c(2e6, 1e6)
>
> # Define the files containing sample information.
>
> datadir <- system.file("extdata", package = "segmentSeq")
> libfiles <- c("SL9.txt", "SL10.txt", "SL26.txt", "SL32.txt")
>
> # Establish the library names and replicate structure.
>
> libnames <- c("SL9", "SL10", "SL26", "SL32")
> replicates <- c(1,1,2,2)
>
> # Process the files to produce an `alignmentData' object.
>
> alignData <- readGeneric(file = libfiles, dir = datadir, replicates =
+ replicates, libnames = libnames, chrs = c(">Chr1", ">Chr2"), chrlens =
+ chrlens)
Reading files........done!
Analysing tags...........done!
>
>
>
>
>
>
> dev.off()
null device
1
>