The depth of the network and the number of neurons per layer can be specified. The first layer can be a convolutional neural network (CNN) layer designed to capture codons.
If a path to a folder containing FASTA files is provided, batches are generated by an external generator, which is recommended for large training sets. Alternatively, a dataset holding the preprocessed batches (generated by preprocessSemiRedundant())
can be supplied and kept in RAM. Training on instances with multiple GPUs is also supported and scales linearly with the number of GPUs present.
trainNetwork(
train_type = "lm",
model_path = NULL,
model = NULL,
path = NULL,
path.val = NULL,
dataset = NULL,
checkpoint_path,
validation.split = 0.2,
run.name = "run",
batch.size = 64,
epochs = 10,
max.queue.size = 100,
lr.plateau.factor = 0.9,
patience = 5,
cooldown = 5,
steps.per.epoch = 1000,
step = 1,
randomFiles = FALSE,
initial_epoch = NULL,
vocabulary = c("a", "c", "g", "t"),
tensorboard.log,
save_best_only = TRUE,
compile = TRUE,
learning.rate = NULL,
solver = NULL,
max_iter = 1000,
seed = c(1234, 4321),
shuffleFastaEntries = FALSE,
output = list(none = FALSE, checkpoints = TRUE, tensorboard = TRUE, log = TRUE,
serialize_model = TRUE, full_model = TRUE),
format = "fasta",
fileLog = NULL,
labelVocabulary = NULL,
numberOfFiles = NULL,
reverseComplements = FALSE
)
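For example, a minimal call for language-model training from FASTA files on disk (using the external generator) might look like the following sketch; the folder names are placeholders and my_model is assumed to be a keras model built beforehand:

trainNetwork(
  train_type      = "lm",
  model           = my_model,           # keras model built beforehand (placeholder)
  path            = "train/fasta/",     # placeholder folder with training FASTA files
  path.val        = "validation/fasta/",
  checkpoint_path = "checkpoints/",
  tensorboard.log = "tensorboard/",
  run.name        = "example_run",
  batch.size      = 64,
  epochs          = 10,
  steps.per.epoch = 1000
)

# For class prediction, train_type = "label_header" or "label_folder" can be used
# together with labelVocabulary (see the argument descriptions below).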
train_type
Either "lm" for language model, "label_header" or "label_folder". A language model is trained to predict the next character in a sequence. "label_header" and "label_folder" train the model to predict a corresponding class, given a sequence as input. If "label_header", the class is read from the FASTA headers; if "label_folder", the class is read from the folder, i.e. all FASTA files in one folder must belong to the same class.
model_path
Path to a pretrained model.
model
A keras model.
path
Path to a folder where individual or multiple FASTA files are located for training.
path.val
Path to a folder where individual or multiple FASTA files are located for validation.
dataset
Data frame holding the training samples in RAM instead of using the generator.
checkpoint_path
Path to the checkpoints folder.
validation.split
Fraction of the batches used for validation (relative to the size of the training data).
run.name
Name of the run (without file ending); used to identify the output from callbacks.
batch.size
Number of samples used for one network update.
epochs
Number of epochs to train.
max.queue.size
Maximum size of the queue used by fit_generator().
lr.plateau.factor
Factor by which the learning rate is reduced when a plateau is reached.
patience
Number of epochs to wait for a decrease in loss before reducing the learning rate.
cooldown
Number of epochs without changes to the learning rate after a reduction.
steps.per.epoch
Number of batches needed to finish one epoch.
step
Frequency of sampling steps.
randomFiles
Logical; whether to go through the files sequentially (FALSE) or shuffle them beforehand (TRUE).
initial_epoch
Epoch at which to start training; set to 0 if no pretrained model is used.
vocabulary
Vector of allowed characters; characters outside the vocabulary get encoded as a zero vector.
tensorboard.log
Path to the tensorboard log directory.
save_best_only
Only save the model when it improves on the best val_loss score.
compile
Whether to compile the model after loading.
learning.rate
Learning rate for the optimizer. Only used when a pretrained model is given.
solver
Optimization method; options are "adam", "adagrad", "rmsprop" or "sgd". Only used when a pretrained model is given.
max_iter
Stop if max_iter iterations have failed to produce a new sample.
seed
Sets the seed for the set.seed() function, for reproducible results.
shuffleFastaEntries
Logical; shuffle the entries within a file.
output
List of optional outputs; no output is produced if none is TRUE.
format
File format, "fasta" or "fastq".
fileLog
Write the names of the used files to a CSV file if a path is specified.
labelVocabulary
Character vector of possible targets; targets outside labelVocabulary are ignored.
numberOfFiles
Use only the specified number of files; ignored if greater than the number of files in path.
reverseComplements
Logical; one half of the batch contains the sequences and the other half their reverse complements. The reverse complement is obtained by reversing the order of a sequence and swapping A/T and C/G (see the sketch below).
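As an illustration of the reverse-complement convention described above, here is a small base-R sketch (not a function of this package):

# Illustration only: reverse complement of a DNA string in base R
revcomp <- function(seq) {
  swapped <- chartr("acgtACGT", "tgcaTGCA", seq)           # swap a/t and c/g
  paste(rev(strsplit(swapped, "")[[1]]), collapse = "")    # reverse the order
}
revcomp("aacg")  # returns "cgtt"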