sdfStream: Streaming through large SD files
In girke-lab/ChemmineR: Cheminformatics Toolkit for R

sdfStream

R Documentation

Streaming through large SD files

Description

Streaming function to compute descriptors for large SD Files without consuming much memory. In addition to descriptor values, it returns a line index that defines the positions of each molecule in the source SD File. This line index can be used by the read.SDFindex function to retrieve specific compounds of interest from large SD Files without reading the entire file into memory.

Usage

sdfStream(input, output, append=FALSE, fct, Nlines = 10000, startline=1, restartNlines=10000, silent = FALSE, ...)

Arguments

`input`	file name of input SD file
`output`	file name of tabular descriptor file
`append`	if `append=FALSE`, a new output file will be created, if one with the same name exists it will be overwritten; whereas `append=TRUE` will appended to this file.
`fct`	Function to select descriptor sets; any combination of descriptors, supported by `ChemmineR`, can be chosen here, as long as they can be represented in tabular format.
`Nlines`	Number of lines to read from input SD File at a time; the memory consumption will be proportional to this value.
`startline`	For restarting sdfStream at specific line assigned to `startline` argument. If assigned `startline` value does not match the first line of a molecule in the SD file then it will be reset to the start position of the next molecule in the SD file.
`restartNlines`	Number of lines to parse when `startline > 1` in order to identify proper molecule start position. The default value of 10,000 is usually a good choice.
`silent`	if `silent=FALSE`, the processing status will be printed to the screen, while `silent=TRUE` suppresses this output.
`...`	Arguments to be passed to/from other methods.

Details

...

Value

Writes a descriptor matrix to a tabular file. The first and last line number (position index) of each molecule is specified in the first two columns of the tabular output file, respectively.

Author(s)

Thomas Girke

References

SDF format definition: http://www.symyx.com/downloads/public/ctfile/ctfile.jsp

Examples

## Load sample data
library(ChemmineR)
data(sdfsample); sdfset <- sdfsample
## Not run: write.SDF(sdfset, "test.sdf")

## Define descriptor set in a simple function
desc <- function(sdfset) {
        cbind(SDFID=sdfid(sdfset), 
              # datablock2ma(datablocklist=datablock(sdfset)), 
              MW=MW(sdfset), 
              groups(sdfset), 
              # AP=sdf2ap(sdfset, type="character"),
              rings(sdfset, type="count", upper=6, arom=TRUE)
        )
}

## Run sdfStream with desc function and write results to a file called 'matrix.xls'
sdfStream(input="test.sdf", output="matrix.xls", append=FALSE, fct=desc, Nlines=1000)

## Same as before but starting in SD file at line number 950
sdfStream(input="test.sdf", output="matrix.xls", append=FALSE, fct=desc, Nlines=1000, startline=950)

## Select molecules from SD File using line index from sdfStream
indexDF <- read.delim("matrix.xls", row.names=1)[,1:4]
indexDFsub <- indexDF[indexDF$MW < 400, ] # Selects molecules with MW < 400
sdfset <- read.SDFindex(file="test.sdf", index=indexDFsub, type="SDFset")

## Write result directly to SD file without storing larger numbers of molecules in memory
read.SDFindex(file="test.sdf", index=indexDFsub, type="file", outfile="sub.sdf")

## Read atom pair string representation from file into APset
apset <- read.AP(file="matrix.xls", colid="AP")
cid(apsdf) <- as.character(indexDF$SDFID)  

## End(Not run)

girke-lab/ChemmineR documentation built on July 28, 2023, 10:36 a.m.