make_counts
make_counts uses parallel processing to generate a count table.
make_counts(
  input,
  trimming = NULL,
  threads = 1,
  save_temp = FALSE,
  plot = TRUE,
  parse = "default_illumina",
  evidence = c(experiment = 2, sample = 1),
  optimize = list(on_disk = FALSE, chunk_size = NULL, rm_wild = FALSE, seq_min = NULL)
)
input
A path to a directory containing input fastq files. The function will recursively search this directory for files with the .fastq|.fastq.gz extension.
trimming
Character indicating which type of trimming tool should be used. The examples below use trimming="seqpac"; see ?make_trim and ?make_cutadapt for the available trimming workflows and their settings.
threads
Integer stating the number of parallel jobs. Note that reading multiple fastq files drains memory fast, using up to 10 GB per fastq file. To avoid crashing the system due to memory shortage, make sure that each thread on the machine has at least 10 GB of memory available, unless your fastq files are very small. A small sketch for choosing a conservative thread count is included at the end of the examples below.
save_temp
Logical indicating whether temporary files (including trimmed fastq files) should be saved or not. Note that the function will print the path to the temporary folder in the console.
plot
Logical indicating whether evidence plots should be printed and saved in the returned list (default=TRUE).
parse
Character string defining the commands that should be parsed to the trimming function, for example "default_illumina" (the default) or "default_neb" as used in the examples below. See ?make_trim for how to specify your own trimming settings.
evidence
Named vector with two entries, 'experiment' and 'sample', that controls the low-level evidence filter. Users may already at this point reduce the level of noise in the counts table by specifying the number of independent pieces of evidence (fastq files) that a specific sequence must have to be included. By default, evidence=c(experiment=2, sample=1), meaning that a sequence must be present in at least two independent fastq files to be included. Setting the 'sample' entry above 1 additionally rescues sequences found in only one fastq file if they reach at least that many counts in that sample (see the examples below).
optimize
List that controls the behavior of make_counts, possibly enabling analysis on low-end computers at the expense of processing time. It has four items, named list(on_disk=, chunk_size=, rm_wild=, seq_min=); an illustrative call is included at the end of the examples below.
1. on_disk: Logical indicating whether some processes should be handled on-disk instead of in-memory. Important: this option will generate large temporary files. For a maximum fastq size of 1 GB (.gz compressed), keep at least 20 GB of free disk space (default=FALSE).
2. chunk_size: Integer that, if set, determines whether reading and processing should be handled in chunks and how large those chunks should be. If NULL, samples are not handled in chunks. When handling a challenging dataset on a low-end computer, a chunk_size of 50000000 is a good starting point (default=NULL).
3. rm_wild: Logical (panic filter). If the evidence filter fails while on_disk=TRUE, rm_wild=TRUE will remove all sequences containing wild-card characters (N), either across the experiment or for individual samples (default=FALSE).
4. seq_min: Integer (panic filter). If the evidence filter fails while on_disk=TRUE, seq_min will remove sequences shorter than seq_min, either across the experiment or for individual samples (default=NULL).
Note, the panic filters (3. and 4.) are applied at two places in the script as a last resort if on-disk processing fails. First, this may happen when unique sequences (UNI) fulfilling the evidence filter are extracted across the whole experiment; here the panic filters are applied to all samples. The second place is when read sequences fulfilling the evidence filter are extracted from the trimmed fastq; here the panic filter is applied to individual samples. To alert users, warnings are given in addition to notes in the panic_type column of the progress report, where "none" means no panic, "filt" means that panic filters were applied, and "chunk" means that loading of the filtered fastq was forced into chunks. The last does not involve panic filters but helps loading very large fastq files at the expense of loading time.
Given a path to fastq files, this function performs low-level evidence filtering, generates a count table of the sequences passing the filter, and plots summary statistics.
A list containing three objects:
1. counts (data frame) = count table.
2. progress_report = progress report from trimming and evidence filter.
3. evidence_plots = bar graphs showing the impact of evidence filter.
See https://github.com/Danis102 for updates on the current package.
Other PAC generation: PAC_check(), create_PAC(), make_PAC(), make_cutadapt(), make_pheno(), make_trim(), merge_lanes()
############################################################
### Seqpac fastq trimming with the make_counts function
### using default settings for NEBNext small RNA adaptor
# Seqpac includes strongly down-sampled smallRNA fastq.
sys_path = system.file("extdata", package = "seqpac", mustWork = TRUE)
input <- list.files(path = sys_path, pattern = "fastq", all.files = FALSE,
full.names = TRUE)
# Now we can run make_counts
# Notice that make_counts will generate another temp folder that will
# be emptied on finalization. By setting save_temp=TRUE you may save the
# content. You may also use your own trimming settings by using the parse
# option. See ?make_trim for more details.
counts <- make_counts(input, threads=1, parse="default_neb",
trimming="seqpac", plot=TRUE,
evidence=c(experiment=2, sample=1))
head(counts$counts)
head(counts$progress_report)
# Notice that there are fewer unique sequences than reads passing the
# filter. In normal, large fastq files we keep 95-99% of all reads while
# filtering out 30-40% of the sequences that only appear in 1 fastq file.
# Thus, the evidence filter may gain you performance in later analysis by
# avoiding nonsense sequences.
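# When plot=TRUE, the bar graphs summarizing the impact of the evidence
# filter are stored in the returned list and can be printed directly:
counts$evidence_plots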
#############################################################
### Let's change the evidence filter
###
# 2 evidence over two independent samples, saving single sample
# sequences reaching 3 counts:
test <- make_counts(input=input, trimming="seqpac",
parse="default_neb",
evidence=c(experiment=2, sample=3))
extras <- apply(test$counts, 1, function(x){sum(!x==0)})
test$counts[extras==1,] # 6 single sample sequences reached 3 counts
# 2 evidence over two independent samples, saving single sample
# sequences reaching 2 counts
test <- make_counts(input=input, trimming="seqpac",
parse="default_neb",
evidence=c(experiment=2, sample=2))
extras <- apply(test$counts, 1, function(x){sum(!x==0)})
test$counts[extras==1,] # 120 single sample sequences reached 2 counts
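#############################################################
### Choosing a conservative number of threads
###
# As noted under the 'threads' argument, each thread may need up to ~10 GB
# of memory for full-sized fastq files. A minimal sketch, assuming you know
# roughly how much memory you can spare (the 32 GB below is only an
# illustrative placeholder, not a measured value):
free_ram_gb <- 32
safe_threads <- max(1, min(parallel::detectCores() - 1,
                           floor(free_ram_gb / 10)))
safe_threads
# counts <- make_counts(input, threads=safe_threads, trimming="seqpac",
#                       parse="default_neb")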
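#############################################################
### On-disk processing on low-end computers
###
# A hedged sketch of the 'optimize' argument described above, using the
# suggested chunk_size starting point of 50000000 (on the tiny bundled
# fastq this simply means everything fits in one chunk). Not run here,
# since on real-sized data on_disk=TRUE writes large temporary files:
# test_disk <- make_counts(input=input, threads=1, trimming="seqpac",
#                          parse="default_neb",
#                          optimize=list(on_disk=TRUE, chunk_size=50000000,
#                                        rm_wild=FALSE, seq_min=NULL))
# Afterwards, inspect the panic_type column of the progress report for
# "none", "filt" or "chunk" (see the 'optimize' argument above):
# test_disk$progress_report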