Introduction

Informative and efficient visualization of genomic structural variation (SV) is an important step to evaluate structurally complex regions of the genome and help us to begin drawing biological conclusions. With advances in long-read sequencing technologies, such as HiFi (high-fidelity) PacBio [@Wenger2019-tn] and ONT (Oxford Nanopore Technologies) [@Deamer2002-hq], we are now able to fully assemble even the most complex regions of the genome. Thus, efficient and informative visualization tools are needed to evaluate and directly observe structural differences between two or more genomes. We have developed SVbyEye exactly for this purpose such that we are able to directly observe complexity of a human (or other organism) genome in question with respect to a linear genome reference. SVbyEye is inspired by the previously developed tool Miropeats [@Parsons1995-nc] and brings its visuals to the popular scripting language R and visualization paradigm using ggplot2 [@ggplot2-book].

\newpage

Generation of input alignments in PAF format

In order to create input alignments for SVbyEye visualization, we usually use minimap2 (version 2.24 or higher) [@Li2016-ha]; however, any sequence-to-sequence aligner that can export alignments in PAF format should be sufficient. We note, however, that we tested our tool only using minimap2 alignments.

When running minimap2 alignments ensure that parameter -a is not set because it would export alignment in SAM format. Also make sure that parameter -c is set in order to export CIGAR strings as 'cg' tag in the output PAF. Lastly, consider setting the parameter '--eqx' in order to output CIGAR string with '=/X' operators to define sequence match/mismatch, respectively.

Generate assembly-to-reference minimap alignments

minimap2 -x asm20 -c --eqx --secondary=no {reference.fasta} {query.fasta} > {output.alignment}

Generate sequence-to-sequence minimap alignments

minimap2 -x asm20 -c --eqx --secondary=no {target.fasta} {query.fasta} > {output.alignment}

Generate all-versus-all minimap alignments

minimap2 -x asm20 -c --eqx -D -P --dual=no {input.multi.fasta} {input.multi.fasta} > {output.ava.alignment}

Reading and filtering alignements in PAF format

The r BiocStyle::Biocpkg("SVbyEye") package expects sequence alignments in PAF format as input. Such alignments can be read using the readPaf() function. Subsequently, such alignments can be filtered using the filterPaf() function. Lastly, one can also orient PAF alignments based on the desired majority orientation (majority.strand = '+' or '-') using the flipPaf() function. With this function, the user can flip all alignments in the PAF file to opposite orientation with parameter force = TRUE.

First load the r BiocStyle::Biocpkg("SVbyEye") package.

## Load the SVbyEye package
library(SVbyEye)
## Get PAF file to read
paf.file <- system.file("extdata", "test1.paf",
    package = "SVbyEye"
)
## Read in PAF
paf.table <- readPaf(
    paf.file = paf.file,
    include.paf.tags = TRUE, restrict.paf.tags = "cg"
)
## Example PAF table, SVbyEye expects given column names
paf.table

## Filter alignment based on size
filterPaf(paf.table = paf.table, min.align.len = 100000)

## Force to flip orientation of PAF alignments
flipPaf(paf.table = paf.table, force = TRUE)

Visualization of sequence-to-sequence alignments

The main function of this package is plotMiro() and it can be used for visualization of a single sequence aligned to the reference or more than two sequences aligned to each other. SVbyEye is visualizing such alignments in a horizontal layout with the target sequence being on at the top and the query at the bottom.

To create a simple Miropeats style plot, follow thevinstructions below.

## Get PAF to plot
paf.file <- system.file("extdata", "test1.paf",
    package = "SVbyEye"
)
## Read in PAF
paf.table <- readPaf(
    paf.file = paf.file,
    include.paf.tags = TRUE, restrict.paf.tags = "cg"
)
## Make a plot colored by alignment directionality
plotMiro(paf.table = paf.table, color.by = "direction")

Users have control over a number of visual features that are documented within a plotMiro() function. Function documentation can be reported by ?plotMiro.

For instance, users can specify their own color scheme to designate alignments in plus (forward) or minus (reverse) direction.

## Use custom color palette to color alignment directionality
plotMiro(
    paf.table = paf.table,
    color.palette = c("+" = "azure3", "-" = "yellow3")
)

Next, users can decide to color alignments by their identity instead of direction.

## Color alignments by sequence identity
plotMiro(paf.table = paf.table, color.by = "identity")

We advise that coloring alignments by sequence identity is the most helpful in connection with binning the alignments into separate bins. This is useful when investigating regions of high and low sequence identity between the target and query sequences. We note that when the parameter 'binsize' is defined, the parameter 'color.by' is set to 'identity' by default.

## Plot binned alignments
plotMiro(paf.table = paf.table, binsize = 1000)

Currently there are preset breaks of sequence identity (see ?plotMiro) that can be changed by setting the parameter 'perc.identity.breaks'.

Adding annotations to the Miropeats style plot

Often certain regions need to marked between the query and target alignments such as gene position, position of segmental duplications (SDs), or other DNA functional elements. This can be done by adding extra layers of annotation ranges either on top of the target or query alignments. Annotation ranges are expected to be submitted to the addAnnotation() function as a GenomicRanges object and are visualized as either an arrowhead (can reflect range orientation) or rectangle.

## Make a plot
plt <- plotMiro(paf.table = paf.table)
## Load target annotation file
target.annot <- system.file("extdata", "test1_target_annot.txt",
    package = "SVbyEye"
)
target.annot.df <- read.table(target.annot,
    header = TRUE, sep = "\t",
    stringsAsFactors = FALSE
)
target.annot.gr <- GenomicRanges::makeGRangesFromDataFrame(target.annot.df)
## Add target annotation as arrowhead
addAnnotation(
    ggplot.obj = plt,
    annot.gr = target.annot.gr, coordinate.space = "target"
)

## Load query annotation file
query.annot <- system.file("extdata", "test1_query_annot.txt",
    package = "SVbyEye"
)
query.annot.df <- read.table(query.annot,
    header = TRUE, sep = "\t",
    stringsAsFactors = FALSE
)
query.annot.gr <- GenomicRanges::makeGRangesFromDataFrame(query.annot.df)
## Add query annotation as rectangle
addAnnotation(
    ggplot.obj = plt,
    annot.gr = query.annot.gr, shape = "rectangle",
    coordinate.space = "query"
)

SVbyEye also offers extra functionalities to lift between query and target coordinates and vice versa based on the underlying PAF alignment file.

## Lift target annotation to query and plot
target.gr <- as("target.region:19100000-19150000", "GRanges")
lifted.annot.gr <- liftRangesToAlignment(
    paf.table = paf.table,
    gr = target.gr,
    direction = "target2query"
)
plt1 <- addAnnotation(
    ggplot.obj = plt, annot.gr = target.gr, shape = "rectangle",
    coordinate.space = "target"
)
addAnnotation(
    ggplot.obj = plt1, annot.gr = lifted.annot.gr, shape = "rectangle",
    coordinate.space = "query"
)

For instance, inversions are often flanked by long segments of highly identical SDs. These can be visualized as follows. The direction of these segments can be reflected as well in case the strand is defined in the GenomicRanges object.

## Load segmental duplication annotation
sd.annot <- system.file("extdata", "test1.sd.annot.RData", package = "SVbyEye")
sd.annot.gr <- get(load(sd.annot))
## Create a custom discrete levels based on
sd.categ <- findInterval(sd.annot.gr$fracMatch, vec = c(0.95, 0.98, 0.99))
sd.categ <- dplyr::recode(sd.categ,
    "0" = "<95%", "1" = "95-98%",
    "2" = "98-99%", "3" = ">=99%"
)
sd.categ <- factor(sd.categ, levels = c("<95%", "95-98%", "98-99%", ">=99%"))
sd.annot.gr$sd.categ <- sd.categ
## Define a custom color palette
color.palette <- c(
    "<95%" = "gray72", "95-98%" = "gray47", "98-99%" = "#cccc00",
    ">=99%" = "#ff6700"
)
## Add annotation to the plot
addAnnotation(
    ggplot.obj = plt, annot.gr = sd.annot.gr, fill.by = "sd.categ",
    color.palette = color.palette, coordinate.space = "target"
)

Importantly, each annotation layer can be given its own name to distinguish different layers of information added to the plot.

## Give an annotation layer its own label
## Add target annotation as arrowhead
plt1 <- addAnnotation(
    ggplot.obj = plt,
    annot.gr = target.annot.gr, coordinate.space = "target",
    annotation.label = "Region"
)
## Add SD annotation
addAnnotation(
    ggplot.obj = plt1, annot.gr = sd.annot.gr, fill.by = "sd.categ",
    color.palette = color.palette, coordinate.space = "target",
    annotation.label = "SDs"
)

The user also has control over the level where an annotation track should appear. By default each annotation layer is displayed at the level defined by adding 0.05 fraction of the total size of y-axis in the target or query direction. This can of course be increased or decreased to zero, which means the annotation will be added at the same level as the target or query alignments.

## Add target annotation at the increased y-axis level
addAnnotation(
    ggplot.obj = plt, annot.gr = target.annot.gr,
    coordinate.space = "target", annotation.level = 0.2
)

## Add target annotation at the zero level
addAnnotation(
    ggplot.obj = plt, annot.gr = target.annot.gr,
    coordinate.space = "target", annotation.level = 0
)

Adding gene annotations

Gene annotation, such as individual exons or split genomic alignments, can be visualized as a set of genomic ranges that are connected by a horizontal line. This is defined based on a shared identifier (ID) that marks genomic ranges that are supposed to be grouped by such horizontal lines.

## Create gene-like annotation
test.gr <- GenomicRanges::GRanges(
    seqnames = "target.region",
    ranges = IRanges::IRanges(
        start = c(19000000, 19030000, 19070000),
        end = c(19010000, 19050000, 19090000)
    ),
    ID = "gene1"
)
## Add single gene annotation
addAnnotation(
    ggplot.obj = plt, annot.gr = test.gr, coordinate.space = "target",
    shape = "rectangle", annotation.group = "ID", fill.by = "ID"
)

Highlighting whole alignments

Sometimes there is a need to highlight the whole genomic alignment with a unique color. This can be done by overlaying the original alignment with an extra alignment.

## Make a plot
plt1 <- plotMiro(paf.table = paf.table, add.alignment.arrows = FALSE)
## Highlight alignment as a filled polygon
addAlignments(
    ggplot.obj = plt1, paf.table = paf.table[3, ], fill.by = "strand",
    fill.palette = c("+" = "red")
)
## Highlight alignment as outlined polygon by dashed line
addAlignments(ggplot.obj = plt1, paf.table = paf.table[4, ], linetype = "dashed")

Visualization of disjoint alignments

On many occasions alignments over the repetitive regions are redundant at positions of SDs. This results in overlapping alignments that often flank inverted regions. There is a function that allows breaking PAF alignments at these overlaps and reporting them as individual alignments.

## Disjoin PAF at target coordinates
disj.paf.table <- disjoinPafAlignments(
    paf.table = paf.table,
    coordinates = "target"
)
## Plot disjoined alignments
plotMiro(paf.table = disj.paf.table, outline.alignments = TRUE)

Disjoined alignments can also be highlighted as shown earlier.

## Plot disjoined alignments
plt <- plotMiro(
    paf.table = disj.paf.table, outline.alignments = TRUE,
    add.alignment.arrows = FALSE
)
## Highlight disjoined alignments
addAlignments(
    ggplot.obj = plt, paf.table = disj.paf.table[c(4:5), ],
    linetype = "dashed"
)

For convenience, PAF alignments can also be disjoined at user-defined region(s).

## Disjoin PAF at user defined coordinates
disj.gr <- GenomicRanges::GRanges(
    seqnames = "query.region",
    ranges = IRanges::IRanges(
        start = 16300000,
        end = 16350000
    )
)
disj.paf.table <- disjoinPafAlignments(
    paf.table = paf.table,
    coordinates = "query",
    disjoin.gr = disj.gr
)
## Plot disjoined alignments
plt <- plotMiro(paf.table = disj.paf.table, outline.alignments = TRUE)
## Highlight disjoined alignments
addAlignments(
    ggplot.obj = plt, paf.table = disj.paf.table[4, ],
    linetype = "dashed"
)

Detection and visualization of insertions and deletions

Another important feature of SVbyEye is its ability to break PAF alignments at the positions of insertions and deletions. This can be done by setting the minimum size of the insertion and deletion to be reported and setting the way they are marked within the plot, either outlined or filled. By default, deletions are colored red and insertions are blue. Deletions and insertions are defined as sequences that are either missing or are inserted within a query sequence with respect to the target sequence.

## Load the data to plot
paf.file <- system.file("extdata", "test3.paf", package = "SVbyEye")
paf.table <- readPaf(
    paf.file = paf.file, include.paf.tags = TRUE,
    restrict.paf.tags = "cg"
)

## Make plot and break alignment at the position where there are deletions >=50bp
plotMiro(paf.table = paf.table, min.deletion.size = 50, highlight.sv = "outline")

## Highlight detected deletion by filled polygons
plotMiro(paf.table = paf.table, min.deletion.size = 50, highlight.sv = "fill")

One can also opt to break the PAF alignments at insertions and/or deletions and just report them as a data table.

## Break PAF alignment at deletions >=50bp
alns <- breakPaf(paf.table = paf.table, min.deletion.size = 50)
## Print out detected deletions
alns$SVs

Narrowing alignment to a user-defined target region

There is a convenience function to narrow down a PAF alignment into a user-defined region by setting the desired target coordinates defined as a GenomicRanges object or a set of coordinates in this 'chromosome:start-end' format. With this, one can subset and cut PAF alignments at the exact target coordinates.

paf.file <- system.file("extdata", "test1.paf", package = "SVbyEye")
## Read in PAF
paf.table <- readPaf(
    paf.file = paf.file, include.paf.tags = TRUE,
    restrict.paf.tags = "cg"
)
## Subset PAF
subsetPafAlignments(
    paf.table = paf.table,
    target.region = "target.region:19050000-19200000"
)

This functionality can be directly used within the plotMiro function by setting the parameter 'target.region'.

Summarizing sequence composition as annotation heatmaps

Lastly, it is possibile to evaluate the sequence composition of either the query or target sequence. The fasta2nucleotideContent() function can calculate the frequency of a specific sequence pattern or an overall frequency of nucleotides within user-defined sequence bins. Subsequently, such nucleotide frequencies can be added to the plot as a heatmap.

## Get PAF to plot
paf.file <- system.file("extdata", "test_getFASTA.paf",
    package = "SVbyEye"
)
## Read in PAF
paf.table <- readPaf(
    paf.file = paf.file,
    include.paf.tags = TRUE, restrict.paf.tags = "cg"
)
## Make a plot colored by alignment directionality
plt <- plotMiro(paf.table = paf.table, color.by = "direction")
## Get query sequence composition
fa.file <- paf.file <- system.file("extdata", "test_getFASTA_query.fasta",
    package = "SVbyEye"
)
## Calculate GC content per 5kbp bin
gc.content <- fasta2nucleotideContent(
    fasta.file = fa.file,
    binsize = 5000, nucleotide.content = "GC"
)
## Add GC content as annotation heatmap
addAnnotation(
    ggplot.obj = plt, annot.gr = gc.content, shape = "rectangle",
    fill.by = "GC_nuc.freq", coordinate.space = "query",
    annotation.label = "GCcont"
)

Visualization of all-versus-all sequence alignments

The SVbyEye also allows visualization of alignments between more than two sequences. This can be done by aligning multiple sequences to each other using so-called all-versus-all (AVA) or stacked alignments. In this way, alignments are visualized in subsequent order with alignments of the first sequence being shown with respect to the second and then second sequence to the third, etc. Alternatively, if the order of sequence in such a progressive alignment is known, one can avoid AVA alignment and align each sequence to the next in a defined order and merge these into a single PAF file. This is useful to visualize alignments between multiple sequences of the same or different species.

## Get PAF to plot
paf.file <- system.file("extdata", "test_ava.paf", package = "SVbyEye")
## Read in PAF
paf.table <- readPaf(
    paf.file = paf.file, include.paf.tags = TRUE,
    restrict.paf.tags = "cg"
)
## Make a plot colored by alignment directionality
plotAVA(paf.table = paf.table, color.by = "direction")

Many of the same parameter settings used for plotMiro apply to plotAVA as well. Briefly, users can color alignments based on their orientation or identity, define a desired color palette, and bin the alignments into user-defined bins.

## Color by fraction of matched bases in each alignment
plotAVA(
    paf.table = paf.table, color.by = "identity",
    perc.identity.breaks = c(85, 90, 95)
)
## Use custom color palette to color alignment directionality
plotAVA(
    paf.table = paf.table, color.by = "direction",
    color.palette = c("+" = "azure3", "-" = "yellow3")
)
## Bin PAF alignments into user-defined bin and color them by sequence identity
plotAVA(paf.table = paf.table, binsize = 20000)

In addition to the above listed parameters, a user can also define the desired sequence order in which progressive alignments will be reported. Importantly, only samples defined in parameter 'seqnames.order' will be plotted.

## Define custom sample/sequence order
seqnames.order <- c("HG00438_2", "HG01358_2", "HG02630_2", "HG03453_2")
plotAVA(
    paf.table = paf.table, color.by = "direction",
    seqnames.order = seqnames.order
)
## Only samples present in custom sample order are being plotted
seqnames.order <- c("HG00438_2", "HG01358_2", "HG03453_2")
plotAVA(
    paf.table = paf.table, color.by = "direction",
    seqnames.order = seqnames.order
)

Adding annotation to AVA alignments differs slightly from what we learned with the plotMiro function. Since there are more than two sequences, annotation levels are defined based on the sequence identifier they belong to. This means users have to define an extra column that contains sequence identifiers to which each genomic range belongs to.

## Add annotation to all-versus-all alignments ##
plt <- plotAVA(paf.table = paf.table, color.by = "direction")
annot.file <- system.file("extdata", "test_annot_ava.RData", package = "SVbyEye")
annot.gr <- get(load(annot.file))
addAnnotation(
    ggplot.obj = plt, annot.gr = annot.gr, coordinate.space = "self",
    y.label.id = "ID"
)

Again, users can set the annotation level to zero in order to plot each annotation directly on the line corresponding to each unique sequence/sample.

## Add annotation to the same level as the sequence/sample alignments
plt <- plotAVA(paf.table = paf.table, color.by = "direction")
annot.file <- system.file("extdata", "test_annot_ava.RData", package = "SVbyEye")
annot.gr <- get(load(annot.file))
addAnnotation(
    ggplot.obj = plt, annot.gr = annot.gr, coordinate.space = "self",
    y.label.id = "ID", annotation.level = 0
)

When plotting the annotation users can define a custom color palette that needs to be matched to the unique meta column in annotation ranges.

## Add annotation to the same level as the sequence/sample alignments
plt <- plotAVA(paf.table = paf.table, color.by = "direction")
annot.file <- system.file("extdata", "test_annot_ava.RData", package = "SVbyEye")
annot.gr <- get(load(annot.file))
## Define color palette
colors <- setNames(annot.gr$color, annot.gr$Repeat)
## Set fill.by variable to color each range using defined color palette
addAnnotation(
    ggplot.obj = plt, annot.gr = annot.gr, fill.by = "Repeat",
    color.palette = colors, coordinate.space = "self", y.label.id = "ID",
    annotation.level = 0
)

Lastly, there is again a possibility to break PAF alignments at the insertions and deletions and highlight them as outlined or filled polygons.

## Break PAF alignments at insertions and deletions and highlight them by an outline
plotAVA(
    paf.table = paf.table, color.by = "direction", min.deletion.size = 1000,
    min.insertion.size = 1000, color.palette = c("+" = "azure3", "-" = "yellow3"),
    highlight.sv = "outline"
)

## Break PAF alignments at insertions and deletions and highlight them by filled polygons
plotAVA(
    paf.table = paf.table, color.by = "direction", min.deletion.size = 1000,
    min.insertion.size = 1000, color.palette = c("+" = "azure3", "-" = "yellow3"),
    highlight.sv = "fill"
)

Visualization of self-alignments

There are cases when one wants to visualize regions that are homologous to each other within a single DNA sequence. Such regions can be reported by the alignment of a given sequence to itself. Such self-alignments are typical for SDs that are paralogous sequences with high sequence identity.

## Get PAF to plot
paf.file <- system.file("extdata", "test2.paf", package = "SVbyEye")
## Read in PAF
paf.table <- readPaf(
    paf.file = paf.file, include.paf.tags = TRUE,
    restrict.paf.tags = "cg"
)
## Make a plot
## Plot alignment as horizontal dotplots and color by alignment directionality
plotSelf(paf.table = paf.table, color.by = "direction", shape = "segment")
## Plot alignment as arcs and color by alignment directionality
plotSelf(paf.table = paf.table, color.by = "direction", shape = "arc")
## Plot alignment as arrows and color by alignment directionality
plotSelf(paf.table = paf.table, color.by = "direction", shape = "arrow")

Again, some of the previous parameters apply to the self-alignments as well. For example, users can define the color palette and bin the alignments as well as break alignments at the insertions and deletions.

## Plot alignment as arcs and color by alignment directionality
plotSelf(
    paf.table = paf.table, color.by = "direction", shape = "arc",
    color.palette = c("+" = "azure3", "-" = "yellow3")
)
## Bin PAF alignments into user-defined bin and color them by sequence identity
plotSelf(paf.table = paf.table, binsize = 1000)

When breaking the self-alignments at insertions and deletions, proximal duplication is considered as a query and the distal duplication as a target. It means that sequence missing in the query is considered a deletion and sequence inserted in the query with respect to target sequence is considered an insertion.

## Highlight structural variants within self-alignments
plotSelf(
    paf.table = paf.table, min.deletion.size = 50, min.insertion.size = 50,
    highlight.sv = "outline", shape = "arc"
)

Visualization of whole-genome alignments

Lastly, SVbyEye allows plottin an overview of a whole-genome assembly or selected chromosomes with respect to a reference using the ‘plotGenome’ function. With this function whole-genome alignments can be visualized to observe large structural rearrangements with respect to a single reference.

## Get PAF to plot ##
paf.file <- system.file("extdata", "PTR_test.paf", package = "SVbyEye")
## Read in PAF
paf.table <- readPaf(paf.file = paf.file)
## Make a plot ##
## Color by alignment directionality
plotGenome(
    paf.table = paf.table, chromosomes = paste0("chr", c(4, 5, 9, 12)),
    chromosome.bar.width = grid::unit(2, "mm"),
    min.query.aligned.bp = 5000000
)

\newpage

References

sessionInfo()

sessionInfo()


daewoooo/SVbyEye documentation built on Oct. 15, 2024, 6:12 a.m.