Oracle Data Mining (ODM) is an option of Oracle's Relational Database
Management System (RDBMS) Enterprise Edition (EE). It contains several data
mining and data analysis algorithms for classification, prediction, regression,
clustering, associations, feature selection, anomaly detection, feature
extraction, and specialized analytics. It provides means for the creation,
management and operational deployment of data mining models inside the
database environment.
RODM is an interface that provides a powerful environment for prototyping
data analysis and data mining methodologies. It facilitates the prototyping of
vertical applications and makes ODM and the RDBMS environment easily accessible
to statisticians and data analysts familiar with R but not experts in SQL.
In this way it provides an ideal environment for demos and proof of concept
studies. It also facilitates the benchmarking and testing of functionality
including statistics and graphics of performance metrics and enables the
scripting and control of production data mining methodologies from a high-level
environment.
Details
Package:
RODM
Type:
Package
Version:
1.0-2
Date:
2010-05-01
License:
LGPL
RODM is a package that provides access to Oracle's in-database data mining
functionality.
Requirements
RODM requires the use of an Oracle release 11g database. If you don't have an
installed Oracle database in place and need to install one from scratch we
stronly recommend you follow the guidelines in the Oracle Data Mining
Administrator's Guide. RODM requires R release > 2.10.1.
Connecting to an Oracle database:
RODM_open_dbms_connection
RODM_close_dbms_connection
The above routines are used to establish a connection to an Oracle 11g
database using ODBC. RODM uses the RODBC package as a means to manage the
database connection. A data source name must be provided, as well as a
username and password. The user that is connecting needs sufficient
privileges for performing mining operations in the database. We have tested
RODM using the Oracle ODBC driver that comes with the Oracle RDBMS. We recommend
the use of this ODBC driver instead of others.
Pushing data to the database:
RODM_create_dbms_table
RODM_drop_dbms_table
Once a valid database connection has been established, in-database mining can
begin. If the data to be mined exists within R (e.g., in a data frame), it
first needs to be pushed to the database and placed in a table. The above
routines leverage the RODBC package to push data to a database table, which
can then be accessed for mining by ODM.
The above routines are used under-the-covers when building ODM models. They
do not need to be invoked directly. They are present merely to improve
maintainability and modularity.
The above nine routines are used to build ODM models in the database. They
share many of the same arguments. All of these routines require a database
connection (as retrieved via RODM_open_dbms_connection) and a table/view in the
database (either pre-existing or created via RODM_create_dbms_table) which
will provide the training data.
All routines accept a case identifier column name. This is the name of a
column which can be used to uniquely identify a training record. Most
routines do not need a case identifier, but some may provide extra information
if one is present (e.g., cluster assignments).
All supervised algorithms require a target column name.
A model name can be specified (or defaults to an algorithm-specific model
name). When created, the model will exist in Oracle as a database schema object.
Most algorithms accept a parameter to direct ODM to enable automatic data
preparation (default TRUE). This feature will request that ODM prepare data
as befitting individual algorithm needs (e.g., outlier treatment, binning,
normalization, missing value imputation).
Many algorithms accept a number of expert settings. These expert settings
will differ from algorithm to algorithm, and ODM is designed to identify
values for these settings without user input, hence they do not need to be
specified by the user in most situations.
When the models are created in the database, information regarding the
models can be retrieved and returned to the R environment. The
retrieve_outputs_to_R parameter tells RODM whether or not this information
should be pulled back into R for further analysis in R.
As these models are database schema objects, they can be left in the database
for future use. They can be applied to new data as desired. The default
behavior is to leave the models in the database, but they can be automatically
cleaned up by changing the leave_model_in_dbms parameter. If a model with the
same name already exists in the database schema when another is being created,
the previous model will be automatically dropped.
Finally, the RODM package is envisioned as a quick proof of concept mechanism,
with the potential of deploying the resulting methodology wholly within
Oracle. As such, it is necessary to capture the SQL that would be used in the
database. The sql.log.file parameter can be used to have RODM produce a file
with the relevant SQL statements that comprise the work being performed.
Further operations involving ODM models:
RODM_list_dbms_models
RODM_apply_model
RODM_drop_model
Finally, there are a few routines involving ODM models once they are built.
The list of accessible ODM models can be retrieved, and individual models can
be dropped.
Models (other than those used for Attribute Importance and Associations) can
be applied to new data in the database. Regression models will produce the
expected value given the input variables. Classification models will produce
the probability distribution across target classes for each case, as well as
a column indicating the winning class. Clustering models will produce the
probability distribution across clusters for each case, as well as a column
indicating the most likely cluster assignment.
In all cases, additional columns from the test/apply dataset can be included
in the output via the supplemental_cols parameter. It is necessary to provide
some information here if there is a desire to link the results back to the
original data. Either a case identifier should be provided, or the list of
columns which will yield sufficient information for future analysis.
P. Tamayo, C. Berger, M. M. Campos, J. S. Yarmus, B. L.Milenova, A. Mozes, M. Taft, M. Hornick, R. Krishnan, S.Thomas, M. Kelly, D. Mukhin, R. Haberstroh, S. Stephens and J. Myczkowski. Oracle Data Mining - Data Mining in the Database Environment. In Part VII of Data Mining and Knowledge Discovery Handbook, Maimon, O.; Rokach, L. (Eds.) 2005, p315-1329, ISBN-10: 0-387-24435-2.
Oracle Data Mining: Mining Gold from Your Warehouse, (Oracle In-Focus series), by Dr. Carolyn Hamm.