trainNetwork: Trains a (mostly) LSTM model on genomic data. Designed for...

View source: R/train.R

Description

The depth and the number of neurons per layer of the network can be specified. The first layer can be a convolutional neural network (CNN) designed to capture codons. If a path to a folder containing FASTA files is provided, batches are generated by an external generator, which is recommended for large training sets. Alternatively, a dataset holding the preprocessed batches (generated by preprocessSemiRedundant()) can be supplied and is kept in RAM. Training on instances with multiple GPUs is also supported and scales linearly with the number of GPUs present.

Usage

trainNetwork(
  train_type = "lm",
  model_path = NULL,
  model = NULL,
  path = NULL,
  path.val = NULL,
  dataset = NULL,
  checkpoint_path,
  validation.split = 0.2,
  run.name = "run",
  batch.size = 64,
  epochs = 10,
  max.queue.size = 100,
  lr.plateau.factor = 0.9,
  patience = 5,
  cooldown = 5,
  steps.per.epoch = 1000,
  step = 1,
  randomFiles = FALSE,
  initial_epoch = NULL,
  vocabulary = c("a", "c", "g", "t"),
  tensorboard.log,
  save_best_only = TRUE,
  compile = TRUE,
  learning.rate = NULL,
  solver = NULL,
  max_iter = 1000,
  seed = c(1234, 4321),
  shuffleFastaEntries = FALSE,
  output = list(none = FALSE, checkpoints = TRUE, tensorboard = TRUE, log = TRUE,
    serialize_model = TRUE, full_model = TRUE),
  format = "fasta",
  fileLog = NULL,
  labelVocabulary = NULL,
  numberOfFiles = NULL,
  reverseComplements = FALSE
)

Arguments

train_type

Either "lm" for language model, "label_header" or "label_folder". A language model is trained to predict the next character in a sequence. With "label_header" or "label_folder", the model is trained to predict a class, given a sequence as input. If "label_header", the class is read from the FASTA headers. If "label_folder", the class is read from the folder, i.e. all FASTA files in one folder must belong to the same class.

model_path

Path to a pretrained model.

model

A keras model.

path

Path to folder where individual or multiple FASTA files are located for training. If train_type is label_folder, should be a vector containing a path for each class.

path.val

Path to folder where individual or multiple FASTA files are located for validation. If train_type is label_folder, should be a vector containing a path for each class.

dataset

Dataframe holding training samples in RAM instead of using generator.

checkpoint_path

Path to checkpoints folder.

validation.split

Defines the fraction of the batches that will be used for validation (compared to size of training data).

run.name

Name of the run (without file ending). Name will be used to identify output from callbacks.

batch.size

Number of samples that are used for one network update.

epochs

Number of training epochs.

max.queue.size

Maximum size of the generator queue, passed to fit_generator().

lr.plateau.factor

Factor by which the learning rate is reduced when a plateau is reached.

patience

Number of epochs waiting for decrease in loss before reducing learning rate.

cooldown

Number of epochs to wait after a learning rate reduction before the learning rate can change again.
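How lr.plateau.factor, patience and cooldown interact can be illustrated with a small simulation. The sketch below is modelled on the logic of the Keras ReduceLROnPlateau callback; that trainNetwork uses exactly this callback is an assumption, and simulate_lr_plateau() is a hypothetical helper that is not part of the package.

```r
# Hypothetical helper (not part of the package): simulates a
# reduce-on-plateau schedule, assuming Keras ReduceLROnPlateau semantics.
simulate_lr_plateau <- function(val_loss, lr, factor = 0.9,
                                patience = 5, cooldown = 5) {
  best <- Inf          # best validation loss seen so far
  wait <- 0            # epochs without improvement
  cooldown_left <- 0   # remaining cooldown epochs
  lr_history <- numeric(length(val_loss))
  for (epoch in seq_along(val_loss)) {
    lr_history[epoch] <- lr  # learning rate used during this epoch
    if (cooldown_left > 0) {
      cooldown_left <- cooldown_left - 1
      wait <- 0
    }
    if (val_loss[epoch] < best) {
      best <- val_loss[epoch]
      wait <- 0
    } else if (cooldown_left == 0) {
      wait <- wait + 1
      if (wait >= patience) {
        lr <- lr * factor          # reduce the learning rate
        cooldown_left <- cooldown  # start the cooldown period
        wait <- 0
      }
    }
  }
  lr_history
}

# Validation loss stalls at 0.9 from the second epoch onwards.
lr_history <- simulate_lr_plateau(
  val_loss = c(1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9),
  lr = 0.001, factor = 0.5, patience = 2, cooldown = 1
)
print(lr_history)
```

With the package defaults (factor 0.9, patience 5, cooldown 5), reductions are gentler and spaced further apart than in this toy run.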

steps.per.epoch

Number of batches to finish one epoch.

step

Frequency of sampling steps.
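One plausible reading of step, sketched below, is the distance between the start positions of consecutive training samples when a window slides over a sequence. This interpretation is an assumption, as are the helper sample_windows() and its maxlen parameter (the window length), neither of which appears in the signature above.

```r
# Hypothetical helper (not part of the package): cut a sequence into
# windows of length maxlen, starting a new window every `step` characters.
# The character following each window serves as the prediction target,
# as in a next-character language model.
sample_windows <- function(sequence, maxlen, step = 1) {
  n <- nchar(sequence)
  if (n <= maxlen) {
    return(list(input = character(0), target = character(0)))
  }
  starts <- seq(1, n - maxlen, by = step)
  list(
    input  = substring(sequence, starts, starts + maxlen - 1),
    target = substring(sequence, starts + maxlen, starts + maxlen)
  )
}

windows <- sample_windows("acgtacgt", maxlen = 3, step = 2)
print(windows$input)
print(windows$target)
```

Under this reading, a larger step yields fewer and less redundant samples per sequence.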

randomFiles

Logical. Whether to go through the files sequentially (FALSE) or shuffle them beforehand (TRUE).

initial_epoch

Epoch at which to start training; set to 0 if no model_path argument is given. Note that the network will run for (epochs - initial_epoch) rounds and not epochs rounds.

vocabulary

Vector of allowed characters. Characters outside the vocabulary are encoded as a zero vector.
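The encoding described above can be sketched in base R. The helper one_hot_encode() is hypothetical and not part of the package, and lower-casing the input first is an assumption; the point is only that each in-vocabulary character becomes a row with a single 1 while any other character becomes a row of zeros.

```r
# Hypothetical helper (not part of the package): one-hot encode a sequence.
one_hot_encode <- function(sequence, vocabulary = c("a", "c", "g", "t")) {
  # Assumption: input is lower-cased before matching against the vocabulary.
  chars <- strsplit(tolower(sequence), "")[[1]]
  # match() returns NA for characters that are not in the vocabulary.
  idx <- match(chars, vocabulary)
  encoded <- matrix(0, nrow = length(chars), ncol = length(vocabulary),
                    dimnames = list(NULL, vocabulary))
  in_vocab <- which(!is.na(idx))
  # Set one entry per in-vocabulary character; other rows stay all zero.
  encoded[cbind(in_vocab, idx[in_vocab])] <- 1
  encoded
}

encoded <- one_hot_encode("acgtn")
print(encoded)
```

Here the last row, for the ambiguous base "n", consists only of zeros.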

tensorboard.log

Path to tensorboard log directory.

save_best_only

Only save a model if it improves on the best val_loss score.

compile

Whether to compile the model after loading.

learning.rate

Learning rate for optimizer. Only used when pretrained model is given (model_path is not NULL) and compile is FALSE. Otherwise learning rate is determined when model is created.

solver

Optimization method, options are "adam", "adagrad", "rmsprop" or "sgd". Only used when pretrained model is given (model_path is not NULL) and compile is FALSE. Otherwise solver is determined when model is created.

max_iter

Stop after max_iter iterations have failed to produce a new sample.

seed

Seed for the set.seed function, for reproducible results when using randomFiles or shuffleFastaEntries.

shuffleFastaEntries

Logical, shuffle entries in file.

output

List of optional outputs, no output if none is TRUE.

format

File format, "fasta" or "fastq".

fileLog

Write name of files to csv file if path is specified.

labelVocabulary

Character vector of possible targets. Targets outside labelVocabulary will get discarded.

numberOfFiles

Use only specified number of files, ignored if greater than number of files in corpus.dir.

reverseComplements

Logical. If TRUE, half of each batch contains sequences and the other half their reverse complements. The reverse complement is obtained by reversing the sequence and swapping A/T and C/G. batch.size has to be even; otherwise 1 is added to batch.size.
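The reverse complement and the rounding of batch.size can be sketched in base R. Both reverse_complement() and the batch layout (forward sequences first, reverse complements second) are illustrative assumptions and not the package's own implementation.

```r
# Hypothetical helper (not part of the package): reverse complement of a
# DNA sequence.
reverse_complement <- function(sequence) {
  # chartr() swaps a<->t and c<->g in both cases; other characters are kept.
  complemented <- chartr("acgtACGT", "tgcaTGCA", sequence)
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

# An odd batch.size is increased by 1 so the batch splits into two halves.
batch.size <- 5
if (batch.size %% 2 != 0) batch.size <- batch.size + 1

# Assumed layout: first half forward sequences, second half their
# reverse complements.
forward <- c("aacg", "ttga", "gcgc")
batch <- c(forward,
           vapply(forward, reverse_complement, character(1), USE.NAMES = FALSE))
print(batch)
```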


hiddengenome/altum documentation built on April 22, 2020, 9:33 p.m.