R: ff classes for representing (large) atomic data
ff
R Documentation
Description
The ff package provides atomic data structures that are stored on disk but behave (almost) as if they were in RAM by
mapping only a section (pagesize) into main memory (the effective main memory consumption per ff object).
Several access optimization techniques such as Hybrid Index Preprocessing (as.hi, update.ff) and Virtualization (virtual, vt, vw) are implemented to achieve good performance even with large datasets.
In addition to the basic access functions, the ff package also provides compatibility functions that facilitate writing code for ff and ram objects (clone, as.ff, as.ram) and very basic support for operating on ff objects (ffapply).
While the (possibly packed) raw data is stored in a flat file, meta
information about the atomic data structure, such as its dimension,
virtual storage mode (vmode), factor level encoding,
internal length etc., is stored as an ordinary R object (external
pointer plus attributes) and can be saved in the workspace.
The raw flat file data encoding is always in native machine format for
optimal performance and provides several packing schemes for different
data types such as logical, raw, integer and double (an extended version
supports more tightly packed virtual data types).
Flat file data files can be shared among ff objects in the same R process or
even from different R processes due to Memory-Mapping, although the
caching effects have not been tested extensively.
Please do read and understand the limitations and warnings in LimWarn before you do anything serious with package ff.
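The compatibility functions mentioned above can be illustrated with a small round trip. This is a minimal sketch, assuming the ff package is installed and the default temporary directory is writable; the variable names are chosen for illustration only.

```r
library(ff)

ram_x <- 1:10                  # ordinary in-memory integer vector
ff_x  <- as.ff(ram_x)          # same values, now backed by a flat file
is_ff <- is.ff(ff_x)           # class test for ff objects

back  <- as.ram(ff_x)          # bring the data back into RAM
same  <- all(back == ram_x)    # values survive the round trip

ff_y  <- clone(ff_x)           # independent copy with its own file
ff_y[1] <- 100L                # modifying the clone ...
orig_first <- ff_x[1]          # ... leaves the original untouched

delete(ff_x); delete(ff_y)     # remove the temporary files
rm(ff_x, ff_y)
```

Comparing with `==` rather than `identical` is deliberate: the ram version returned by as.ram may carry extra attributes.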
fixdiag
extended feature: non-NULL scalar requires fixed diagonal for symmetric matrix (default NULL is free diagonal)
names
NOT taken from initdata, see names
dimnames
NOT taken from initdata, see dimnames
ramclass
class attribute attached when moving all or parts of this ff into ram, see ramclass
ramattribs
additional attributes attached when moving all or parts of this ff into ram, see ramattribs
vmode
virtual storage mode (default: derive from 'initdata'), see vmode and as.vmode
update
set to FALSE to avoid updating with 'initdata' (default TRUE) (used by ffdf)
pattern
root pattern with or without path for automatic ff filename creation (default NULL translates to "ff"), see also argument 'filename'
filename
ff filename with or without path (default tempfile with 'pattern' prefix); without path the file is created in getOption("fftempdir"), with path '.' the file is created in getwd. Note that files created in getOption("fftempdir") have default finalizer "delete" while other files have default finalizer "close". See also arguments 'pattern' and 'finalizer' and physical.
overwrite
set to TRUE to allow overwriting existing files (default FALSE)
readonly
set to TRUE to forbid writing to existing files
pagesize
pagesize in bytes for the memory mapping (default from getOption("ffpagesize") initialized by getdefaultpagesize), see also physical
caching
caching scheme for the backend, currently 'mmnoflush' or 'mmeachflush' (flush mmpages at each swap), default from getOption("ffcaching") initialized with 'mmnoflush', see also physical
finalizer
name of finalizer function called when ff object is removed (default: ff files created in getOption("fftempdir") are considered temporary and have default finalizer delete, files created in other locations have default finalizer close); available finalizer generics are "close", "delete" and "deleteIfOpen", available methods are close.ff, delete.ff and deleteIfOpen.ff, see also argument 'finonexit' and finalizer
finonexit
logical scalar determining whether finalize is also called when R is closed via q (default TRUE from getOption("fffinonexit"))
FF_RETURN
logical scalar or ff object to be used. The default TRUE creates a new ff file. FALSE returns a ram object. Handing over an ff object here uses this or stops if not ffsuitable
BATCHSIZE
integer scalar limiting the number of elements to be processed in update.ff when length(initdata)>1, default from .Machine$integer.max
BATCHBYTES
integer scalar limiting the number of bytes to be processed in update.ff when length(initdata)>1, default from getOption("ffbatchbytes"), see also .rambytes
VERBOSE
set to TRUE for verbose output in update.ff when length(initdata)>1, default FALSE
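As an illustration of how several of these arguments combine, the following sketch creates a double vector in an explicitly named file and sets the finalizer by hand. The file name is hypothetical and chosen only for this example; everything else uses arguments described above.

```r
library(ff)

# Hypothetical file location, used only for this sketch
f <- file.path(tempdir(), "example_doubles.ff")

x <- ff(vmode = "double", length = 1000,
        filename = f, finalizer = "close", overwrite = TRUE)
x[1:5] <- c(1.5, 2.5, 3.5, 4.5, 5.5)   # write a few values to disk

vm <- vmode(x)                 # virtual storage mode of the object
n  <- length(x)                # number of elements

close(x)                       # release the mapping, keep the file
exists_after <- file.exists(f)

delete(x); rm(x)               # finally remove the file as well
```

Setting the finalizer explicitly avoids relying on the location-dependent default.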
Details
The atomic data is stored in filename as a native encoded raw flat file on disk; OS-specific limitations of the file system apply.
The number of elements per ff object is limited by integer indexing, i.e. to .Machine$integer.max.
Atomic objects created with ff are open (see is.open): a C++ object is ready to access the file via memory-mapping.
Currently the C++ backend provides two caching schemes: 'mmnoflush' lets the OS decide when to flush memory-mapped pages,
and 'mmeachflush' flushes memory-mapped pages at each page swap per ff file.
These minimal memory resources can be released by closing (close) or deleting (delete) the ff file.
ff objects can be saved and loaded across R sessions. If the ff file still exists in the same location,
it will be opened automatically at the first attempt to access its data. If the ff object is removed,
at the next garbage collection (see gc) the ff object's finalizer is invoked.
Raw data files can be made accessible as an ff object by explicitly giving the filename and vmode but no size information (length or dim).
The ff object will open the file and handle the data with respect to the given vmode.
The close finalizer will close the ff file, the delete finalizer will delete the ff file.
The default finalizer deleteIfOpen will delete open files and do nothing for closed files. If the default finalizer is used,
two actions are needed to protect the ff file against deletion: create the file outside the standard 'fftempdir' and close the ff object before removing it or before quitting R.
When R is exited through q, the finalizer will be invoked depending on the 'fffinonexit' option, furthermore the 'fftempdir' is unlinked.
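The open/close life cycle described above can be sketched as follows. This is a minimal illustration, assuming the ff package is installed; it relies on the documented behaviour that a closed ff object is reopened automatically on access.

```r
library(ff)

x <- ff(1:12)                    # created open, file in 'fftempdir'
open_before <- is.open(x)
fin <- finalizer(x)              # name of the finalizer attached to x

close(x)                         # release the memory mapping
open_after_close <- is.open(x)

v <- x[3]                        # access reopens the file automatically
open_again <- is.open(x)

delete(x); rm(x)                 # explicit clean-up instead of waiting for gc()
```

Calling delete explicitly makes the clean-up independent of when garbage collection happens to run the finalizer.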
Value
If (!FF_RETURN), a ram object like those generated by vector, matrix, array but with attributes 'vmode', 'physical' and 'virtual' accessible via vmode, physical and virtual.
If (FF_RETURN), an object of class 'ff', which is a list with two components:
physical
an external pointer of class 'ff_pointer' which carries attributes with copy by reference semantics: changing a physical attribute of a copy changes the original
virtual
an empty list which carries attributes with copy by value semantics: changing a virtual attribute of a copy does not change the original
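The difference between the two components can be seen when an ff object is copied by plain assignment. The sketch below assumes the ff package is installed; it changes a virtual attribute (dim) on the copy and then writes through the copy to the shared file.

```r
library(ff)

x <- ff(1:12, dim = c(3, 4))
y <- x                          # copies the R-side object, not the file

dim(y) <- c(4, 3)               # virtual attribute: affects y only
dim_x <- dim(x)                 # x keeps its 3 x 4 shape

y[1, 1] <- 99L                  # both objects point to the same file ...
x11 <- x[1, 1]                  # ... so the new value is visible through x

same_file <- filename(x) == filename(y)

delete(x); rm(x, y)
```

When an independent copy is wanted, clone creates a new file instead of sharing the existing one.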
Physical object component
The 'ff_pointer' carries the following 'physical' or readonly attributes, which are accessible via physical:
vmode
see vmode
maxlength
see maxlength
pattern
see parameter 'pattern'
filename
see filename
pagesize
see parameter 'pagesize'
caching
see parameter 'caching'
finalizer
see parameter 'finalizer'
finonexit
see parameter 'finonexit'
readonly
see is.readonly
class
The external pointer needs class 'ff_pointer' to allow method dispatch of finalizers
Virtual object component
The 'virtual' component carries the following attributes (some of which might be NULL):
Length
see length.ff
Levels
see levels.ff
Names
see names.ff
VW
see vw.ff
Dim
see dim.ff
Dimorder
see dimorder
Symmetric
see symmetric.ff
Fixdiag
see fixdiag.ff
ramclass
see ramclass
ramattribs
see ramattribs
Class
You should not rely on the internal structure of ff objects or their ram versions. Instead use the accessor functions like vmode, physical and virtual.
Still it would be wise to avoid attributes AND classes 'vmode', 'physical' and 'virtual' in any other packages.
Note that the 'ff' object's class attribute also has copy-by-value semantics ('virtual').
For the 'ff' object the following class attributes are known:
vector                   c("ff_vector","ff")
matrix                   c("ff_matrix","ff_array","ff")
array                    c("ff_array","ff")
symmetric matrix         c("ff_symm","ff")
distance matrix          c("ff_dist","ff_symm","ff")
reserved for future use  c("ff_mixed","ff")
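The class vectors listed above can be checked directly on freshly created objects. A small sketch, assuming the ff package is installed:

```r
library(ff)

v <- ff(1:12)                       # vector
m <- ff(1:12, dim = c(3, 4))        # matrix
a <- ff(1:24, dim = c(2, 3, 4))     # three-dimensional array

cls_v <- class(v)
cls_m <- class(m)
cls_a <- class(a)

# a matrix is also an ff_array, so array methods apply to it
m_is_array <- inherits(m, "ff_array")

delete(v); delete(m); delete(a)
rm(v, m, a)
```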
Methods
The following methods and functions are available for ff objects:
Type      Name          Assign  Comment

Basic functions
function  ff                    constructor for ff and ram objects
generic   update                updates one ff object with the content of another
generic   clone                 clones an ff object optionally changing some of its features
method    print                 print ff
method    str                   ff object structure

Class test and coercion
function  is.ff                 check if inherits from ff
generic   as.ff                 coerce to ff, if not yet
generic   as.ram                coerce to ram retaining some of the ff information
generic   as.bit                coerce to bit

Virtual storage mode
generic   vmode         <-      get and set virtual mode (setting only for ram, not for ff objects)
generic   as.vmode              coerce to vmode (only for ram, not for ff objects)

Physical attributes
function  physical      <-      set and get physical attributes
generic   filename      <-      get and set filename
generic   pattern       <-      get pattern and set filename path and prefix via pattern
generic   maxlength             get maxlength
generic   is.sorted     <-      set and get if is marked as sorted
generic   na.count      <-      set and get NA count, if set to non-NA only swap methods can change and na.count is maintained automatically
generic   is.readonly           get if is readonly

Virtual attributes
function  virtual       <-      set and get virtual attributes
method    length        <-      set and get length
method    dim           <-      set and get dim
generic   dimorder      <-      set and get the order of dimension interpretation
generic   vt                    virtually transpose ff_array
method    t                     create transposed clone of ff_array
generic   vw            <-      set and get virtual windows
method    names         <-      set and get names
method    dimnames      <-      set and get dimnames
generic   symmetric             get if is symmetric
generic   fixdiag       <-      set and get fixed diagonal of symmetric matrix
method    levels        <-      levels of factor
generic   recodeLevels          recode a factor to different levels
generic   sortLevels            sort the levels and recode a factor
method    is.factor             if is factor
method    is.ordered            if is ordered (factor)
generic   ramclass              get ramclass
generic   ramattribs            get ramattribs

Access functions
function  get.ff                get single ff element (currently [[ is a shortcut)
function  set.ff                set single ff element (currently [[<- is a shortcut)
function  getset.ff             set single ff element and get old value in one access operation
function  read.ff               get vector of contiguous elements
function  write.ff              set vector of contiguous elements
function  readwrite.ff          set vector of contiguous elements and get old values in one access operation
method    [                     get vector of indexed elements, uses HIP, see hi
method    [<-                   set vector of indexed elements, uses HIP, see hi
generic   swap                  set vector of indexed elements and get old values in one access operation
generic   add                   (almost) unifies '+=' operation for ff and ram objects
generic   bigsample             sample from ff object

Opening/Closing/Deleting
generic   is.open               check if ff is open
method    open                  open ff object (is done automatically on access)
method    close                 close ff object (releases C++ memory and protects against file deletion if the deleteIfOpen finalizer is used)
generic   delete                deletes ff file (unconditionally)
generic   deleteIfOpen          deletes ff file if ff object is open (finalization method)
generic   finalizer     <-      get and set finalizer
generic   finalize              force finalization

Other
function  geterror.ff           get error code
function  geterrstr.ff          get error message
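A few of the access functions listed above are shown side by side in the sketch below. It assumes the ff package is installed and uses the argument order (x, i, n) for read.ff, (x, i, value) for write.ff and (x, value, i) for swap, as documented on their respective help pages.

```r
library(ff)

x <- ff(1:10)

third  <- x[[3]]                          # single element via the [[ shortcut
block  <- read.ff(x, 4, 3)                # 3 contiguous elements starting at position 4
write.ff(x, 4, c(40L, 50L, 60L))          # overwrite the same contiguous block
picked <- x[c(1, 4, 10)]                  # indexed access
old    <- swap(x, c(0L, 0L), i = 1:2)     # set positions 1:2 and return the old values
now    <- x[1:2]

delete(x); rm(x)
```

The contiguous functions skip index preprocessing, while `[` and swap go through it; which is preferable depends on the access pattern.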
ff options
Through options or getOption one can change and query global features of the ff package:
option        description                                                default
fftempdir     default directory for creating ff files                    tempdir
fffinalizer   name of default finalizer                                  deleteIfOpen
fffinonexit   default for invoking finalizer on exit of R                TRUE
ffpagesize    default pagesize                                           getdefaultpagesize
ffcaching     caching scheme for the C++ backend                         'mmnoflush'
ffdrop        default for the drop parameter in the ff subscript methods TRUE
ffbatchbytes  default for the byte limit in batched/chunked processing
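Querying and temporarily changing these options follows the usual R pattern of saving the old values and restoring them afterwards. A minimal sketch, assuming the ff package is installed; the byte limit used here is an arbitrary illustrative value, not a recommendation.

```r
library(ff)

# query the current settings
tmpdir  <- getOption("fftempdir")
caching <- getOption("ffcaching")
before  <- getOption("ffbatchbytes")

# change one option temporarily; options() returns the previous value
old <- options(ffbatchbytes = 8 * 1024^2)   # arbitrary illustrative limit
during <- getOption("ffbatchbytes")

# restore the previous setting
options(old)
after <- getOption("ffbatchbytes")
```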
R package redesign; Hybrid Index Preprocessing; transparent object creation and finalization; vmode design; virtualization and hybrid copying; arrays with dimorder and bydim; symmetric matrices; factors and POSIXct; virtual windows and transpose; new generics update, clone, swap, add, as.ff and as.ram; ffapply and collapsing functions. R-coding, C-coding and Rd-documentation.
C++ generic file vectors, vmode implementation and low-level bit-packing/unpacking, arithmetic operations and NA handling, Memory-Mapping and backend caching. C++ coding and platform ports. R-code extensions for opening existing flat files readonly and shared.
Licence
Package under GPL-2, included C++ code released by Daniel Adler under the less restrictive ISCL
Note
Note that the standard finalizers are generic functions, their dispatch to the 'ff_pointer' method happens at finalization time, their 'ff' methods exist for direct calling.
See Also
vector, matrix, array, as.ff, as.ram
Examples
message("make sure you understand the following ff options
before you start using the ff package!!")
oldoptions <- options(fffinalizer="deleteIfOpen", fffinonexit=TRUE, fftempdir=tempdir())
message("an integer vector")
ff(1:12)
message("a double vector of length 12")
ff(0, 12)
message("a 2-bit logical vector of length 12 (vmode='boolean' has 1 bit)")
ff(vmode="logical", length=12)
message("an integer matrix 3x4 (standard colwise physical layout)")
ff(1:12, dim=c(3,4))
message("an integer matrix 3x4 (rowwise physical layout, but filled in standard colwise order)")
ff(1:12, dim=c(3,4), dimorder=c(2,1))
message("an integer matrix 3x4 (standard colwise physical layout, but filled in rowwise order
aka matrix(, byrow=TRUE))")
ff(1:12, dim=c(3,4), bydim=c(2,1))
gc()
options(oldoptions)
if (ffxtensions()){
message("a 26-dimensional boolean array using 1-bit representation
(file size 8 MB compared to 256 MB int in ram)")
a <- ff(vmode="boolean", dim=rep(2, 26))
dimnames(a) <- dummy.dimnames(a)
rm(a); gc()
}
## Not run:
message("This 2GB biglm example can take long, you might want to change
the size in order to define a size appropriate for your computer")
require(biglm)
b <- 1000
n <- 100000
k <- 3
memory.size(max = TRUE)
system.time(
  x <- ff(vmode="double", dim=c(b*n,k), dimnames=list(NULL, LETTERS[1:k]))
)
memory.size(max = TRUE)
system.time(
  ffrowapply({
    l <- i2 - i1 + 1
    z <- rnorm(l)
    x[i1:i2,] <- z + matrix(rnorm(l*k), l, k)
  }, X=x, VERBOSE=TRUE, BATCHSIZE=n)
)
memory.size(max = TRUE)
form <- A ~ B + C
first <- TRUE
system.time(
  ffrowapply({
    if (first){
      first <- FALSE
      fit <- biglm(form, as.data.frame(x[i1:i2,,drop=FALSE]))
    }else
      fit <- update(fit, as.data.frame(x[i1:i2,,drop=FALSE]))
  }, X=x, VERBOSE=TRUE, BATCHSIZE=n)
)
memory.size(max = TRUE)
first
fit
summary(fit)
rm(x); gc()
## End(Not run)