R: Occurrence of motifs in a set of ordered sequences
motifScanHits
R Documentation
Occurrence of motifs in a set of ordered sequences
Description
Finds the positions of sequence motif hits scoring above a specified threshold in a set of
sequences of the same length, ordered by a provided index. The motif is specified by
a position weight matrix (PWM) that contains the estimated probability of base b at
position i and is usually constructed by a call to the PWM function.
The position of each motif hit is given as a two-dimensional coordinate:
the first coordinate is the ordinal number of the sequence and the second
coordinate is the position within the sequence where the motif occurs.
Arguments
regionsSeq
A DNAStringSet object. Set of sequences of the same length
in which to search for the motif hits.
motifPWM
A numeric matrix representing the Position Weight Matrix (PWM), such as
the one returned by the PWM function. Can contain either probabilities
or log2 probability ratios of base b at position i.
minScore
The minimum score for counting a motif hit. Can be given as a character
string containing a percentage (e.g. "85%") of the
PWM score or as a single number specifying the score threshold. If a percentage
is given, it is converted to a score value that takes into account both
the minimal and the maximal possible PWM scores, as follows:
minPWMscore + percThreshold/100 * (maxPWMscore - minPWMscore)
This differs from the formula in the matchPWM function
from the Biostrings package, which takes into account only the
maximal possible PWM score and treats the given percentage as a
percentage of that maximal score:
percThreshold/100 * maxPWMscore
seqOrder
Integer vector specifying the order of the provided input sequences.
Must have the same length as the number of sequences in
regionsSeq. The default value keeps the sequences in the order in which
they appear in the input regionsSeq object.
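The two percentage conversions described for minScore can be compared on a small example. The following is a minimal base R sketch, assuming a made-up log2-ratio PWM; the matrix values and the variable names thrScanHits and thrMatchPWM are illustrative and not part of the package.

```r
# Toy PWM of log2 probability ratios (rows: bases, columns: motif positions).
# The values are invented for illustration only.
pwm <- matrix(c( 1.0, -2.0,  0.5,
                -1.0,  1.5, -0.5,
                -2.0, -1.0,  1.0,
                 0.5, -0.5, -2.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("A", "C", "G", "T"), NULL))

# Best and worst achievable scores: sums of the per-position maxima / minima.
maxPWMscore <- sum(apply(pwm, 2, max))
minPWMscore <- sum(apply(pwm, 2, min))

percThreshold <- 85

# Conversion described above for motifScanHits (uses the full score range).
thrScanHits <- minPWMscore + percThreshold / 100 * (maxPWMscore - minPWMscore)

# Conversion described above for matchPWM (uses the maximal score only).
thrMatchPWM <- percThreshold / 100 * maxPWMscore
```

With a PWM whose minimal score is negative, as here, the same percentage yields a lower absolute threshold under the first formula than under the second.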
Details
This function uses the matchPWM function to find matches to
the given motif in a set of input sequences. Only matches above the specified
minScore are considered hits. Input sequences must all be of the
same length and are ordered according to the index provided in the
seqOrder argument, creating an n * m matrix, where n is
the number of sequences and m is the length of the sequences.
Positions of motif hits in the resulting matrix are returned as
two-dimensional coordinates.
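The procedure described above can be illustrated without Bioconductor. The following is a simplified base R sketch of the idea, not the package's implementation: the helper scoreWindows, the toy PWM and the toy sequences are hypothetical, plain character strings stand in for a DNAStringSet, and treating a score equal to the threshold as a hit is an assumption.

```r
# Hypothetical helper: score every window of width ncol(pwm) in one sequence.
scoreWindows <- function(seqString, pwm) {
  bases <- strsplit(seqString, "")[[1]]
  w <- ncol(pwm)
  starts <- seq_len(length(bases) - w + 1)
  vapply(starts, function(s) {
    window <- bases[s:(s + w - 1)]
    # Pick the PWM entry for the observed base at each motif position.
    sum(pwm[cbind(match(window, rownames(pwm)), seq_len(w))])
  }, numeric(1))
}

# Toy PWM favouring the motif "TATA": +1 for the favoured base, -1 otherwise.
pwm <- matrix(-1, nrow = 4, ncol = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))
pwm[cbind(match(c("T", "A", "T", "A"), rownames(pwm)), 1:4)] <- 1

# Toy input: three sequences of equal length and an ordering index.
regionsSeq <- c("ACGTATAAAG", "TATAAACGTC", "GGGCCCGGGC")
seqOrder <- c(2, 3, 1)
minScore <- 2.8   # 85% of the score range [-4, 4]

# Reorder the sequences, then record (sequence, position) for every hit.
ordered <- regionsSeq[seqOrder]
hits <- do.call(rbind, lapply(seq_along(ordered), function(i) {
  pos <- which(scoreWindows(ordered[i], pwm) >= minScore)
  data.frame(sequence = rep(i, length(pos)), position = pos)
}))
```

Here the sequence coordinate refers to the row in the reordered set, so the second input sequence, which contains "TATA" at its start, is reported as sequence 1.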
Value
The function returns a data.frame with the positions of the motif hits in
the set of input sequences. The input sequences of the same length are
sorted according to the index in the seqOrder argument, and the positions
of motif hits in the resulting n * m matrix (where n is the
number of sequences and m is the length of the sequences) are
provided. The sequence column in the data.frame gives the ordinal
number of the sequence in the ordered list of sequences, and the
position column gives the start position of the motif hit within
that sequence.
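A data.frame of this shape can be expanded back into the n * m matrix it describes. The following is a minimal base R sketch, assuming a hand-made hits data.frame with the sequence and position columns described above; the values of n and m and the variable names are illustrative.

```r
# Hand-made stand-in for the returned data.frame (illustrative values).
hits <- data.frame(sequence = c(1, 1, 3), position = c(2, 7, 4))

n <- 3    # assumed number of sequences
m <- 10   # assumed sequence length

# Fill an n * m indicator matrix: 1 where a motif hit starts, 0 elsewhere.
occurrence <- matrix(0L, nrow = n, ncol = m)
occurrence[cbind(hits$sequence, hits$position)] <- 1L

# Simple summaries of the hit pattern.
hitsPerSequence <- rowSums(occurrence)
hitsPerPosition <- colSums(occurrence)
```

Indexing with a two-column matrix sets one cell per row of hits, which avoids an explicit loop over the hits.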
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> library(seqPattern)
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/seqPattern/motifScanHits.Rd_%03d_medium.png", width=480, height=480)
> ### Name: motifScanHits
> ### Title: Occurrence of motifs in a set of ordered sequences
> ### Aliases: motifScanHits motifScanHits,DNAStringSet,matrix-method
>
> ### ** Examples
>
> library(GenomicRanges)
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: GenomeInfoDb
> load(system.file("data", "zebrafishPromoters.RData", package="seqPattern"))
> promoterWidth <- elementMetadata(zebrafishPromoters)$interquantileWidth
>
> load(system.file("data", "TBPpwm.RData", package="seqPattern"))
>
> motifOccurrence <- motifScanHits(regionsSeq = zebrafishPromoters,
+ motifPWM = TBPpwm, minScore = "85%",
+ seqOrder = order(promoterWidth))
There were 12 warnings (use warnings() to see them)
> head(motifOccurrence)
  sequence position value
1        1       76     1
2        1      227     1
3        1      288     1
4        1      290     1
5        1      298     1
6        1      643     1
>
> dev.off()
null device
1
>