qAlign: Align reads
In QuasR: Quantify and Annotate Short Reads in R

Create read alignments against reference genome and optional auxiliary targets if not yet existing. If necessary, also build target indices for the aligner.

qAlign(sampleFile, 
       genome, 
       auxiliaryFile=NULL, 
       aligner="Rbowtie", 
       maxHits=1, 
       paired=NULL, 
       splicedAlignment=FALSE, 
       snpFile=NULL, 
       bisulfite="no", 
       alignmentParameter=NULL, 
       projectName="qProject", 
       alignmentsDir=NULL, 
       lib.loc=NULL, 
       cacheDir=NULL, 
       clObj=NULL,
       checkOnly=FALSE,
       geneAnnotation=NULL)

`sampleFile`	the name of a text file listing input sequence files and sample names (see ‘Details’).
`genome`	the reference genome for primary alignments, one of: (1) a string referring to a “BSgenome” package (e.g. "BSgenome.Hsapiens.UCSC.hg19"), which will be downloaded automatically from Bioconductor if not present; or (2) the name of a fasta sequence file containing one or several sequences (chromosomes) to be used as a reference. The aligner index will be created when necessary and stored in a default location (see ‘Details’).
`auxiliaryFile`	the name of a text file listing sequences to be used as additional targets for alignment of reads not mapping to the reference genome (see ‘Details’).
`aligner`	selects the aligner program to be used for aligning the reads. Currently, only “Rbowtie” and “Rhisat2” are supported, which are R wrapper packages for ‘bowtie’ / ‘SpliceMap’ and ‘hisat2’, respectively (see `Rbowtie-package` and `Rhisat2-package` packages).
`maxHits`	sets the maximal number of allowed mapping positions per read (default: 1). If a read produces more than `maxHits` alignments, no alignments will be reported for it. For a multi-mapping read with at most `maxHits` alignments, a single alignment is randomly selected.
`paired`	defines the type of paired-end library and can be set to one of `no` (single read experiment, default), `fr` (fw/rev), `ff` (fw/fw) or `rf` (rev/fw).
`splicedAlignment`	if `TRUE`, reads will be aligned using a spliced aligner, depending on the value of `aligner` described above. With `aligner="Rhisat2"` (the recommended setting for spliced alignments), hisat2 from the `Rhisat2-package` is used; see also the `geneAnnotation` argument below for providing known exon-exon junctions. With `aligner="Rbowtie"` (not recommended and only available for legacy reasons), SpliceMap is used to produce spliced alignments (without using a database of known exon-exon junctions). Compared to the alternative alignment modes (non-spliced, or spliced using `Rhisat2` as aligner), this alignment mode is about ten-fold slower and also less sensitive. Furthermore, SpliceMap can only be used for reads with a minimal length of 50nt; SpliceMap ignores shorter reads, and these reads will not be contained in the BAM file, neither as mapped nor as unmapped reads.
`snpFile`	the name of a text file listing single nucleotide polymorphisms to be used for allele-specific alignment and quantification (see ‘Details’).
`bisulfite`	for bisulfite-converted samples (Bis-seq), the type of bisulfite library (“dir” for directional libraries, “undir” for undirectional libraries). The default, “no”, is used for samples that are not bisulfite-converted.
`alignmentParameter`	an optional string containing command line parameters to be used for the aligner, to override the default alignment parameters used by `QuasR`. Please use with caution; some alignment parameters may break assumptions made by `QuasR`. Default parameters are listed in ‘Details’.
`projectName`	an optional name for the alignment project.
`alignmentsDir`	the directory to be used for storing alignments (bam files). If set to `NULL` (default), bam files will be generated at the location of the input sequence files.
`lib.loc`	can be used to change the default library path of R. The library path is used by `QuasR` to store aligner index packages created from `BSgenome` reference genomes, or to install newly downloaded `BSgenome` packages.
`cacheDir`	specifies the location to store (potentially huge) temporary files. If set to `NULL` (default), the temporary directory of the current R session as returned by `tempdir()` will be used.
`clObj`	a cluster object, created by the package parallel, to enable parallel processing and speed up the alignment process.
`checkOnly`	if `TRUE`, prevents the automatic creation of alignments or aligner indices. This makes it possible to quickly check for missing alignment files without starting the potentially long process of their creation. In the case of missing alignments or indices, an exception is thrown.
`geneAnnotation`	Only used if `aligner` is `"Rhisat2"`. The path to either a gtf file or a sqlite database generated by exporting a `TxDb` object. This file is used to generate a splice site file for `Rhisat2`, which will be used to guide the spliced alignment.
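
The `clObj` and `checkOnly` arguments can be combined into a small workflow, sketched below. The helper names `alignments_missing()` and `align_with_cluster()` are hypothetical and not part of `QuasR`; the sketch assumes that `QuasR` and its aligner package are installed and only relies on the documented behaviour that `checkOnly=TRUE` throws an exception when alignments or indices are missing.

```r
# Hypothetical helper: TRUE if qAlign() signals an error in check-only mode.
# Note that any error (e.g. an unreadable sample file) is reported as TRUE.
alignments_missing <- function(sampleFile, genome) {
  tryCatch({
    QuasR::qAlign(sampleFile, genome, checkOnly = TRUE)
    FALSE
  }, error = function(e) TRUE)
}

# Hypothetical helper: run qAlign() on a temporary cluster of worker processes.
align_with_cluster <- function(sampleFile, genome, nodes = 2) {
  stopifnot(requireNamespace("QuasR", quietly = TRUE))
  cl <- parallel::makeCluster(nodes)
  # Make sure the workers are shut down even if the alignment fails.
  on.exit(parallel::stopCluster(cl), add = TRUE)
  QuasR::qAlign(sampleFile, genome, clObj = cl)
}

# Possible usage (not run here; the file names are placeholders):
# if (alignments_missing("samples.txt", "genome.fa")) {
#   proj <- align_with_cluster("samples.txt", "genome.fa", nodes = 4)
# }
```

Wrapping the cluster in a function with `on.exit()` is a design choice of this sketch, so that worker processes do not outlive a failed alignment run.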

Before generating new alignments, qAlign looks for previously generated alignments as well as for an aligner index. If no aligner index exists, it will be automatically created and stored in the same directory as the provided fasta file, or as an R package in the case of a BSgenome reference. The name of this R package will be the same as the BSgenome package name, with an additional suffix from the aligner (e.g. BSgenome.Hsapiens.UCSC.hg19.Rbowtie). The generated bam files contain both aligned and unaligned reads. For paired-end samples, by default no alignments will be reported for read pairs where only one of the reads could be aligned.

sampleFile is a tab-delimited text file listing all the input sequences to be included in a given analysis. The file has either two columns (single-end) or three columns (paired-end). The first row contains the column names, and additional rows contain the relative or absolute path and name of the input sequence file(s), as well as the corresponding sample name. Three input file formats are supported (fastq, fasta and bam). All input files in one sampleFile need to be in the same format, and are recognized by their extension (.fq, .fastq, .fa, .fasta, .fna, .bam), in raw or compressed form (e.g. .fastq.gz). If bam files are provided, then no alignments are generated by qAlign, and the alignments contained in the bam files will be used instead.
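
The extension rule can be illustrated with a few lines of base R. This is only a sketch of the idea, not the code used by `QuasR`: `input_format()` is a hypothetical helper, and it assumes that only the `.gz` and `.bz2` compression suffixes need to be stripped.

```r
# Hypothetical helper: classify input files by extension, ignoring an
# optional compression suffix (.gz or .bz2 are assumed here).
input_format <- function(files) {
  base <- sub("\\.(gz|bz2)$", "", files)   # drop the compression suffix
  ext <- tolower(sub(".*\\.", "", base))   # keep the last extension
  known <- c(fq = "fastq", fastq = "fastq",
             fa = "fasta", fasta = "fasta", fna = "fasta",
             bam = "bam")
  unname(known[ext])                       # NA for unrecognized extensions
}

files <- c("chip_1_1.fq.bz2", "chip_2_1.fastq.gz")
formats <- input_format(files)

# All files listed in one sampleFile must share a single, known format.
same_format <- !anyNA(formats) && length(unique(formats)) == 1
```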

The column names in sampleFile have to match those in the examples below. For a single-read experiment:

FileName	SampleName
chip_1_1.fq.bz2	Sample1
chip_2_1.fq.bz2	Sample2

and for a paired-end experiment:

FileName1	FileName2	SampleName
rna_1_1.fq.bz2	rna_1_2.fq.bz2	Sample1
rna_2_1.fq.bz2	rna_2_2.fq.bz2	Sample2

The “SampleName” column contains the human-readable name of each sample, which will be used as the sample label. Multiple sequence files may be associated with the same sample name, which instructs QuasR to combine those files.
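
A sample file in this layout can be written from R with `write.table()`. The following is a minimal sketch; `write_sample_file()` is a hypothetical helper that only checks the column names described above, and the sequence file names are the placeholders from the paired-end example.

```r
# Hypothetical helper: write a tab-delimited sample file after checking
# that the column names match the single-read or paired-end layout.
write_sample_file <- function(samples, path) {
  expected <- list(c("FileName", "SampleName"),
                   c("FileName1", "FileName2", "SampleName"))
  ok <- any(vapply(expected, identical, logical(1), y = colnames(samples)))
  if (!ok) stop("unexpected column names: ",
                paste(colnames(samples), collapse = ", "))
  write.table(samples, path, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(path)
}

paired <- data.frame(
  FileName1  = c("rna_1_1.fq.bz2", "rna_2_1.fq.bz2"),
  FileName2  = c("rna_1_2.fq.bz2", "rna_2_2.fq.bz2"),
  SampleName = c("Sample1", "Sample2"),
  stringsAsFactors = FALSE
)
path <- write_sample_file(paired, file.path(tempdir(), "samples_paired.txt"))
sample_lines <- readLines(path)
```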

auxiliaryFile is a tab-delimited text file listing one or several additional target sequence files in fasta format. Reads that do not map against the reference genome will be aligned against each of these target sequence files. The first row contains the column names, which have to match those in the example below:

FileName	AuxName
NC_001422.1.fa	phiX174

snpFile is a tab-delimited text file without a header and contains four columns with chromosome name, position, reference allele and alternative allele, as in the example below:

chr1	8596	G	A
chr1	18443	G	A
chr1	18981	C	T
chr1	19341	G	A

The reference and alternative alleles will be injected into the reference genome, resulting in two separate genomes. All reads will be aligned separately to both of these genomes, and the alignments will be combined, only retaining the best alignment for each read. In the final alignment, each read will be marked with a tag that classifies it as reference (R), alternative (A) or unknown (U, if the read maps equally well to both genomes).
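
The following base R sketch writes a header-less snpFile and illustrates the idea of injecting alleles on a toy sequence. It is an illustration under stated assumptions, not the `QuasR` implementation: `inject_alleles()`, the toy genome and the SNP positions are all made up for this example.

```r
# Toy SNP table (made-up positions on a 10 nt toy sequence).
snps <- data.frame(chr = c("chr1", "chr1"),
                   pos = c(3L, 7L),
                   ref = c("G", "C"),
                   alt = c("A", "T"),
                   stringsAsFactors = FALSE)

# snpFile has four tab-delimited columns and no header row.
snp_file <- file.path(tempdir(), "snps.txt")
write.table(snps, snp_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read it back; explicit column classes keep an allele column that
# contains only "T" from being parsed as logical TRUE.
snps_back <- read.table(snp_file, sep = "\t", col.names = names(snps),
                        colClasses = c("character", "integer",
                                       "character", "character"))

# Hypothetical helper: put the given alleles at the given positions.
inject_alleles <- function(sequence, pos, alleles) {
  for (i in seq_along(pos)) substr(sequence, pos[i], pos[i]) <- alleles[i]
  sequence
}

toy_genome <- "ACNTTANGGA"
ref_genome <- inject_alleles(toy_genome, snps$pos, snps$ref)  # "ACGTTACGGA"
alt_genome <- inject_alleles(toy_genome, snps$pos, snps$alt)  # "ACATTATGGA"
```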

If bisulfite is set to “dir” or “undir”, reads will be C-to-T converted and aligned to a similarly converted genome.
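
The effect of the in silico conversion can be seen on a toy example. This sketch only covers the C-to-T conversion mentioned above, using base R's `chartr()`; the sequences are made up, and it does not reproduce how `QuasR` handles strands or library types.

```r
# In silico C-to-T conversion of a sequence string.
c_to_t <- function(x) chartr("C", "T", x)

toy_genome <- "TTACGTCCGATT"
# A read from the region "ACGTCCGA" in which two unmethylated Cs were
# converted to T by the bisulfite treatment and one C was retained.
read <- "ATGTCTGA"

# The raw read does not occur in the unconverted genome ...
raw_hit <- as.integer(regexpr(read, toy_genome, fixed = TRUE))

# ... but after converting both, read and genome match at position 3.
converted_hit <- as.integer(regexpr(c_to_t(read), c_to_t(toy_genome),
                                    fixed = TRUE))
```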

If alignmentParameter is NULL (recommended), qAlign will select default parameters that are suitable for the experiment type. Please note that for bisulfite or allele-specific experiments, each read is aligned multiple times, and the resulting alignments need to be combined. This requires special settings for the alignment parameters, and changing them is not recommended. For ‘simple’ experiments (neither bisulfite, allele-specific, nor spliced), alignments are generated using the parameters -m maxHits --best --strata. This aligns reads with up to “maxHits” best hits in the genome and selects one of them randomly.
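
The reporting rule for ‘simple’ experiments can be mimicked in a few lines of R. This is a toy model of the behaviour described above, not aligner code: `default_parameters()` and `report_alignment()` are hypothetical helpers, and the hit positions are made up.

```r
# Hypothetical helper: parameter string for a 'simple' experiment.
default_parameters <- function(maxHits = 1) {
  sprintf("-m %d --best --strata", as.integer(maxHits))
}

# Hypothetical helper: given the positions of a read's best hits, report
# nothing (NA) if there are none or more than maxHits; otherwise pick one
# of the hits at random.
report_alignment <- function(best_hits, maxHits = 1) {
  n <- length(best_hits)
  if (n == 0 || n > maxHits) return(NA_integer_)
  best_hits[sample.int(n, 1)]  # index-based, safe for a single hit
}

default_parameters(3)                                    # "-m 3 --best --strata"
unique_read  <- report_alignment(100L, maxHits = 1)      # 100
too_many     <- report_alignment(c(10L, 20L, 30L), 2)    # NA
multi_mapper <- report_alignment(c(10L, 20L), 2)         # 10 or 20, at random
```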

qAlign returns a qProject object.

Authors: Anita Lerch, Dimos Gaidatzis, Charlotte Soneson and Michael Stadler

See also: qProject, makeCluster from package parallel, Rbowtie-package, Rhisat2-package

## Not run: 
    # see qCount, qMeth and qProfile manual pages for examples
    example(qCount)
    example(qMeth)
    example(qProfile)

## End(Not run)