Last data update: 2014.03.03

R: Small Molecule Solubility (LogS) Data
LogSR Documentation

Small Molecule Solubility (LogS) Data

Description

Aqueous solubility datas for 1,606 small molecules. PaDEL descriptors have been computed for these molecules. The data has been split into a training (70%) and a test (30%) set.

Details

This dataset comprises the aqueous solubility (S) values at a temperature of 20-25 Celsius degrees in mol/L, expressed as logS, for 1,708 small molecules reported by Wang et al. Compound structures were standardized with the function StandardiseMolecules from the R package camb using the default parameters: (i) all molecules were kept irrespective of the numbers of fluorines, iodines, chlorine, and bromines present in their strucuture, or (ii) of their molecular mass. 905 one-dimensional topological and physicochemical descriptors were calculated with the function GeneratePadelDescriptors from the R package camb which invokes the PaDEL-Descriptor Java library. Near zero variance and highly-correlated descriptors were removed with the functions (i) RemoveNearZeroVarianceFeatures (cut-off value of 30/1), and (ii) RemoveHighlyCorrelatedFeatures (cut-off value of 0.95) After applying these steps the dataset consists of 1,606 molecules encoded with 211 descriptors.

Using data(LogS) exposes 4 objects:

(i) LogSDescsTrain is a data frame with PaDEL descriptors for the datapoints in the training set (70% of the data). (ii) LogSTrain is a numeric vector containing the data solubility values for the datapoints in the training set. (iii) LogSDescsTest is a data frame with PaDEL descriptors for the datapoints in the test set (30% of the data). (iv) LogSTest is a numeric vector containing the data solubility values for the datapoints in the test set.

References

Wang et al. J. Chem. Inf. Model., 2007, 47 (4), pp 1395-1404 DOI: 10.1021/ci700096r http://pubs.acs.org/doi/abs/10.1021/ci700096r

Examples

# To use the data
data(LogS)

Results