R: Apply R/S-Plus functions to remote groups of DBMS rows...
postgresqlDBApply
R Documentation
Apply R/S-Plus functions to remote groups of DBMS rows (experimental)
Description
Applies R/S-Plus functions to groups of remote DBMS rows without
bringing an entire result set all at once. The result set
is expected to be sorted by the grouping field.
Usage
postgresqlDBApply(res, INDEX, FUN = stop("must specify FUN"),
begin = NULL,
group.begin = NULL,
new.record = NULL,
end = NULL,
batchSize = 100, maxBatch = 1e6,
..., simplify = TRUE)
Arguments
res
a result set (see dbSendQuery).
INDEX
a character or integer specifying the field name or
field number that defines the various groups.
FUN
a function to be invoked upon identifying the last
row from every group. This function will be passed
a data frame holding the records of the current group,
a character string with the group label, plus any
other arguments passed to dbApply as "...".
begin
a function of no arguments to be invoked just prior to
retrieve the first row from the result set.
end
a function of no arguments to be invoked just after retrieving
the last row from the result set.
group.begin
a function of one argument (the group label) to be
invoked upon identifying a row from a new group
.
new.record
a function to be invoked as each individual record
is fetched. The first argument to this function is a
one-row data.frame holding the new record.
batchSize
the default number of rows to bring from the remote
result set. If needed, this is automatically extended
to hold groups bigger than batchSize.
maxBatch
the absolute maximum of rows per group that may
be extracted from the result set.
...
any additional arguments to be passed to FUN.
simplify
Not yet implemented
Details
dbApply
This function is meant to handle somewhat gracefully(?) large amounts
of data from the DBMS by bringing into R manageable chunks (about
batchSize records at a time, but not more than maxBatch);
the idea is that the data from individual groups can be handled by R, but
not all the groups at the same time.
The PostgreSQL implementation postgresqlDBApply allows us to register R
functions that get invoked
when certain fetching events occur. These include the “begin” event
(no records have been yet fetched), “begin.group” (the record just
fetched belongs to a new group), “new record” (every fetched record
generates this event), “group.end” (the record just fetched was the
last row of the current group), “end” (the very last record from the
result set). Awk and perl programmers will find this paradigm very
familiar (although SAP's ABAP language is closer to what we're doing).
Value
A list with as many elements as there were groups in the
result set.
Note
This is an experimental version implemented only in R (there are
plans, time permitting, to implement it in S-Plus).
The terminology that we're using is closer to SQL than R. In R
what we're referring to “groups” are the individual levels of
a factor (grouping field in our terminology).
See Also
PostgreSQL, dbSendQuery, fetch.
Examples
## Not run:
drv <- dbDriver(RPostgreSQL)
con <- dbConnect(drv, user ="usrname", password="pword", dname="database")
res <- dbSendQuery(con,
"select Agent, ip_addr, DATA from pseudo_data order by Agent")
out <- dbApply(res, INDEX = "Agent",
FUN = function(x, grp) quantile(x$DATA, names=FALSE))
## End(Not run)