The SecretSanta package provides an R interface for the integrative prediction of extracellular proteins that are secreted via classical pathways.
Secretome prediction often involves multiple steps. Typically, it starts with the identification of a short signal peptide at the N-terminal end of a protein (Figure 1a). Next, it is crucial to ensure the absence of motifs and domains that prevent the protein from being secreted or that target it to specific organelles. Such sequences include transmembrane domains, short ER-lumen retention signals and mitochondria/plastid targeting signals (Figure 1b-d). The ultimate aim of a secretome prediction pipeline is to identify secreted proteins as shown in Figure 1a and to filter out those shown in Figure 1b-d.
\newline
```r
knitr::include_graphics("motifs_and_domains.png")
```
Figure 1. Characteristic motifs, domains and their arrangements, distinguishing extracellular proteins from proteins retained inside the cell.
\newline
A number of command line tools and web interfaces are available to predict individual motifs and domains (SignalP, TargetP, TMHMM, TOPCONS, WoLF PSORT); however, an interface that allows combining their outputs into a single flexible workflow has been lacking.
\newline
The SecretSanta package attempts to bridge this gap. It provides a set of wrapper and parser functions around existing command line tools for prediction of signal peptides and protein subcellular localization. The functions are designed to work together by producing standardized output as an instance of the `CBSResult` class.
\newline
Objects of the `CBSResult` class contain an `in_fasta` slot with the initially submitted sequences and an `out_fasta` slot with the positive candidates left after a method has been applied. Within `CBSResult` objects the fasta slots are organised in an `AAStringSetList`. The `in_fasta` and `out_fasta` slots can be extracted separately with the designated accessor functions `getInfasta()` and `getOutfasta()`. Alternatively, all fasta slots can be extracted at once, organised in an `AAStringSetList` object, with the `getFastas()` method. This enables application of all the defined `AAStringSetList` methods (e.g., `elementNROWS()`) to the extracted object and ensures seamless integration into numerous workflows utilising this core Bioconductor class. The example below illustrates basic manipulations with objects of the `CBSResult` class.
```r
library(SecretSanta)
aa <- readAAStringSet(system.file("extdata",
                                  "sample_prot_100.fasta",
                                  package = "SecretSanta"))
# create a simple CBSResult object:
inp <- CBSResult(in_fasta = aa, out_fasta = aa[1:10])
class(inp)
inp
# access both fasta slots at once:
getFastas(inp)
# access in_fasta slot:
getInfasta(inp)
# access out_fasta slot:
getOutfasta(inp)
```
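As mentioned above, the extracted `AAStringSetList` can be queried with standard Bioconductor methods. A minimal sketch, assuming the `inp` object created in the previous chunk:

```r
# extract both fasta slots as an AAStringSetList
fastas <- getFastas(inp)
# number of sequences held in each fasta slot
elementNROWS(fastas)
```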
Each particular method then complements this simple class structure with relevant slots. For example, outputs of the `signalp()` function are organised in `SignalpResult` objects. Apart from the `in_fasta` and `out_fasta` slots inherited from the `CBSResult` class, `SignalpResult` objects contain three additional slots, relevant for the SignalP method:

- `sp_tibble` - parsed SignalP tabular output for positive candidates;
- `mature_fasta` - mature sequences of the candidate secreted proteins, i.e. sequences with cleaved N-terminal signal peptides;
- `sp_version` - version of SignalP used to generate this object.

The detailed organisation of the class structure is shown in Figure 2.
```r
knitr::include_graphics("class_structure.png")
```
Figure 2. SecretSanta S4 class structure.
\newline
This uniform class organisation allows the user to easily pipe results between individual predictors to create flexible custom pipelines, and also to compare predictions between similar methods: for instance, between TargetP and WoLF PSORT for subcellular localization, or between multiple versions of SignalP for signal peptide prediction.
\newline
To speed up processing of large input fasta files, the initial steps of the pipeline are automatically run in parallel when the number of input sequences exceeds a certain limit. The package also contains a convenient `ask_uniprot()` function to fetch known information about subcellular localisation from UniProtKB. This allows the user to compare and verify predictions of extracellular proteins; however, it is mostly applicable to well-annotated and curated genomes.
\newline
Taken together SecretSanta provides a platform to build automated multi-step secretome prediction pipelines that can be applied to large protein sets to facilitate comparisons of secretomes across multiple species or under various conditions. For example, an earlier draft of this package was successfully used to compare the secreted protein complement of a plant pathogen throughout an infection time-course in plant roots [@Evangelisti2017].
\newline
To avoid confusion, the names of external command line tools will be shown in bold, and the corresponding R functions from the SecretSanta package will have standard code highlighting. For example: **SignalP** refers to the external tool, while `signalp()` is the SecretSanta wrapper function around it.

The majority of SecretSanta functions can be run in 2 modes:

- `starter` - when initiating the secretome pipeline with the given function;
- `piper` - for downstream/intermediate steps, where the function expects an output from another wrapper function.

For detailed information on piper/starter behaviour please see Table 2.

SecretSanta relies on a set of existing command line tools to predict secreted proteins. Please install and configure them according to the instructions listed below. Due to limitations imposed by the external dependencies, some SecretSanta wrapper functions are fully functional only on Linux systems.
\newline
SignalP 2.0

```{sh unpack signalp-2.0, eval=FALSE}
tar -zxvf signalp-2.0.Linux.tar.Z
cd signalp-2.0
```

Configure SignalP 2.0 according to the instructions in the `signalp-2.0.readme` file. To use several versions of SignalP side by side, rename the executable:

```{sh rename signalp2, eval = FALSE}
mv signalp signalp2
```
\newline
SignalP 3.0

```{sh unpack signalp-3.0, eval=FALSE}
tar -zxvf signalp-3.0.Linux.tar.Z
cd signalp-3.0
```

Configure SignalP 3.0 according to the instructions in the `signalp-3.0.readme` file, then rename the executable:

```{sh rename signalp3, eval = FALSE}
mv signalp signalp3
```
\newline
SignalP 4.1

```{sh unpack signalp-4.1, eval = FALSE}
tar -zxvf signalp-4.1.Linux.tar.Z
cd signalp-4.1
```

Configure SignalP 4.1 according to the instructions in the `signalp-4.1.readme` file, then rename the executable:

```{sh rename signalp4, eval = FALSE}
mv signalp signalp4
```
\newline
TargetP 1.1

```{sh unpack targetp-1.1, eval = FALSE}
tar -zxvf targetp-1.1b.Linux.tar.Z
cd targetp-1.1
```
\newline
WoLF PSORT

```{sh, download WoLFPSort, eval = FALSE}
git clone https://github.com/fmaguire/WoLFPSort.git
cd WoLFPSort
```

Copy the platform-specific binaries from `./bin/binByPlatform/binary-?` to `./bin/`. For more details please see the `INSTALL` file. Finally, rename the executable:

```{sh rename wolf, eval = FALSE}
mv runWolfPsortSummary wolfpsort
```
\newline
TMHMM 2.0

```{sh, unpack tmhmm-2.0, eval = FALSE}
tar -zxvf tmhmm-2.0c.Linux.tar.gz
cd tmhmm-2.0c
```

Adjust the `bin/tmhmm` and `bin/tmhmmformat.pl` scripts as described in the `README` file.
\newline
Two options are possible:

**Option A.** Make all the tools accessible via the `$PATH` environment variable. If you follow this route, there will be no need to provide paths to the respective tools when running SecretSanta functions. For bash users this requires modification of the `$PATH` environment variable. To make the change permanent, edit `.profile`:

Open `.profile`:

```{sh edit $PATH, eval = FALSE}
gedit ~/.profile
```

Add a line with all the path exports. In this example all the dependencies are installed in the `my_tools` directory:

```{sh, eval = FALSE}
export PATH="/home/my_tools/signalp-4.1:\
/home/my_tools/signalp-2.0:\
/home/my_tools/signalp-3.0:\
/home/my_tools/targetp-1.1:\
/home/tmhmm-2.0c/bin:\
/home/my_tools/WoLFPSort/bin:\
$PATH"
```

Reload `.profile`:

```{sh, eval = FALSE}
. ~/.profile
```

Reboot to make the changes visible to R. If you are using csh or tcsh, edit `.login` instead of `.profile` and use the `setenv` command instead of `export`.

To check that all the tools are installed correctly, we will use the `manage_paths()` function; it runs small test jobs to verify that all the external dependencies are functional and will not cause downstream problems.

```r
library(SecretSanta)
```

To run checks for all the dependencies:

```r
check_all <- manage_paths(in_path = TRUE)
```

It is also possible to run just a single test for a specific dependency in question:

```r
check_small <- manage_paths(in_path = TRUE, test_tool = 'signalp2')
```
**Option B.** Provide paths to the individual tool executables. This option should be used when the `$PATH` variable cannot be modified. If you follow this route, please specify the path to the respective tool executable when running individual wrappers via the `paths` argument.

To supply all the paths at once, create a 2-column space-separated text file without headers, listing full paths for all the external dependencies. Please note that when providing a path file you can avoid renaming the executables for multiple versions of SignalP.
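Such a file can be sketched in the shell as follows. The install locations below are hypothetical, and the exact tool names expected in the first column should be checked against Table 1:

```shell
# hypothetical 2-column, space-separated paths file (no header line)
cat > santa_paths <<'EOF'
signalp2 /home/my_tools/signalp-2.0/signalp
signalp3 /home/my_tools/signalp-3.0/signalp
signalp4 /home/my_tools/signalp-4.1/signalp
targetp /home/my_tools/targetp-1.1/targetp
tmhmm /home/tmhmm-2.0c/bin/tmhmm
wolfpsort /home/my_tools/WoLFPSort/bin/runWolfPsortSummary
EOF
# sanity check: every row must contain exactly two fields
awk 'NF != 2 { exit 1 }' santa_paths && echo "paths file OK"
```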
```r
pp <- suppressMessages(read.table(
    system.file("extdata", "santa_paths", package = "SecretSanta"),
    sep = ' ',
    header = FALSE
))
names(pp) <- c('tool', 'path')
knitr::kable(pp, format = 'pandoc',
             caption = "**Table1**: Sample paths to external dependencies")
```
For external tool names please use simple character strings without version numbers. The only exception are the different versions of SignalP, which require a version number in one-digit format in the `tool` column to be distinguishable.
Now we can check the dependencies using the `manage_paths()` function. In this case we need to provide a value for the `path_file` argument and set `in_path = FALSE`. Note: `manage_paths()` is case insensitive and will convert all the tool names provided in the file to lower case.
```r
check2 <- manage_paths(
    in_path = FALSE,
    path_file = system.file("extdata", "sample_paths", package = "SecretSanta")
)
```
Here is a short summary of individual methods available via the SecretSanta package:
```r
st <- suppressMessages(read.table(
    system.file("extdata", "summary_tools.csv", package = 'SecretSanta'),
    sep = ',',
    header = TRUE
))
knitr::kable(st,
             align = c(rep('c', 2), 'l', rep('c', 3)),
             caption = "**Table2**: Summary of individual methods.")
```
SignalP is a software tool to predict classical signal peptides and cleavage sites [@Nielsen1997] in eukaryotes and bacteria. The method predicts signal peptides and cleavage sites based on a combination of artificial neural networks and hidden Markov models; the latter can distinguish between signal peptides and non-cleaved signal anchors [@Nielsen1998].
Currently, the `signalp()` function from SecretSanta provides an interface for the three most recent versions of SignalP. By providing access to the legacy versions of SignalP, 2.0 [@Nielsen1998] and 3.0 [@Bendtsen2004], we allow the user to perform comparisons with existing secretome datasets predicted with the same method versions. The most recent version, SignalP 4.1 [@Petersen2011], applied with default thresholds, might not be sensitive enough to predict certain classes of secreted oomycete and fungal effectors [@Sperschneider2015]. Similar to the web service and the command line tool, `signalp()` can be run in the sensitive mode when version 4.1 is used, by specifying `sensitive = TRUE`.
`signalp()` requires the `organism` argument; possible values include:

- `'euk'` - for eukaryotes;
- `'gram+'` - for Gram-positive bacteria;
- `'gram-'` - for Gram-negative bacteria.
When the legacy versions of SignalP (2.0 and 3.0) are used, please also provide a value for the `legacy_method` argument. There are two possible options, often producing quite different results:

- `'nn'` - predictions based on neural networks;
- `'hmm'` - predictions based on hidden Markov models.
To run a `signalp()` prediction, first read the fasta file with amino acid sequences and store its contents in a variable:

```r
aa <- readAAStringSet(system.file("extdata",
                                  "sample_prot_100.fasta",
                                  package = "SecretSanta"))
```
Initialize an object of the `CBSResult` class with `aa` in the `in_fasta` slot:

```r
inp <- CBSResult(in_fasta = aa)
inp
```
Since this is the first step in our secretome prediction pipeline, we will run `signalp()` in the starter mode. Here we select version `4` (SignalP 4.1) and assume that SignalP is accessible globally, so no path files need to be specified. If SignalP 4.1 cannot be accessed globally, please provide the relevant path via the `paths` argument; for more details please see Option B. Please note that by default `signalp()` truncates all input sequences longer than 2000 amino acids; in this case '_truncated' is appended to the original sequence ids to keep track of the changes. An alternative option is to discard long sequences completely before the analysis; to do so, set `truncate = FALSE` when running `signalp()`.
```r
step1_sp4 <- signalp(inp, version = 4, organism = 'euk', run_mode = "starter")
```
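As mentioned above, SignalP 4.1 can also be run in the sensitive mode. A minimal sketch, reusing the same `inp` object:

```r
# sensitive mode is only available with version = 4 (SignalP 4.1);
# it may recover candidates missed with the default thresholds
step1_sens <- signalp(inp, version = 4, organism = 'euk',
                      run_mode = "starter", sensitive = TRUE)
```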
Under the hood, the above code runs SignalP 4.1 and outputs the result as an instance of the `SignalpResult` class, which inherits from the parental `CBSResult` class.

```r
class(step1_sp4)
```
The `SignalpResult` object contains 5 slots:

- `in_fasta` - original set of input amino acid sequences;
- `out_fasta` - full-length amino acid sequences of positive candidates;
- `mature_fasta` - mature sequences of the candidate secreted proteins, i.e. sequences with cleaved N-terminal signal peptides;
- `sp_tibble` - parsed SignalP tabular output for positive candidates;
- `sp_version` - version of SignalP used to generate this object.
You can use accessor methods to get the contents of the individual fasta slots:

```r
getInfasta(step1_sp4)
getOutfasta(step1_sp4)
getMatfasta(step1_sp4)
```
Alternatively, all the fasta slots can be retrieved at once with the `getFastas()` method, inherited from the `CBSResult` class:

```r
getFastas(step1_sp4)
```
The remaining slots can be accessed with the designated methods:

```r
getSPtibble(step1_sp4)
getSPversion(step1_sp4)
```
Next, imagine we would like to run all the SignalP versions on the same input for comparison. We can do this by simply changing the value of the `version` argument:

```r
step1_sp3 <- signalp(inp, version = 3, organism = 'euk',
                     run_mode = "starter", legacy_method = 'hmm')
step1_sp2 <- signalp(inp, version = 2, organism = 'euk',
                     run_mode = "starter", legacy_method = 'hmm')
step1_sp4 <- signalp(inp, version = 4, organism = 'euk',
                     run_mode = "starter")
```
In this case, application of SignalP 4.1 resulted in fewer candidates. Please note that despite differences in the output formats generated by the various versions of SignalP, `signalp()` always returns `sp_tibble` in a standard format.
To pipe results between different versions of SignalP, just switch to `run_mode = 'piper'` for the second and subsequent steps. In this case `signalp()` will run the analysis using the contents of the `out_fasta` slot as an input.
Say, we want to do the following piping: signalp2 -> signalp3 -> signalp4.

We will re-use the `step1_sp2` object generated earlier in the `starter` mode for the signalp2 -> signalp3 piping operation:

```r
step2_sp3 <- signalp(step1_sp2, version = 3, organism = 'euk',
                     run_mode = 'piper', legacy_method = 'hmm')
```
Similarly, for the signalp3 -> signalp4 piping:

```r
step3_sp4 <- signalp(step2_sp3, version = 4, organism = 'euk',
                     run_mode = "piper")
```
With input fasta files containing more than 1000 sequences, `signalp()` will automatically switch to the parallel mode. It will split the input into smaller chunks and run the prediction as a massively parallel process using the specified number of CPUs (see the `cores` argument). Execution time depends on the number of CPUs available and the input file size. With 48 CPUs it takes ~2 minutes to run `signalp()` on an input fasta file with more than 40,000 sequences.
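The number of CPUs can also be requested explicitly. A sketch, assuming a large `AAStringSet` is available (here the small `aa` object stands in for it):

```r
# request 4 CPUs via the cores argument; the parallel mode only kicks in
# for inputs with more than 1000 sequences
big_inp <- CBSResult(in_fasta = aa)
step1_par <- signalp(big_inp, version = 4, organism = 'euk',
                     run_mode = 'starter', cores = 4)
```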
TMHMM predicts transmembrane α-helices and identifies integral membrane proteins based on HMMs [@Krogh2001; @Sonnhammer1998]. It is important to exclude proteins with transmembrane domains located after the signal peptide, as they will be retained in the membrane. Since TMHMM is often unable to distinguish between N-terminal signal peptides and transmembrane domains, it is recommended to run `tmhmm()` on the mature sequences obtained after running the `signalp()` function.
The `tmhmm()` function can handle input objects of the `SignalpResult` class with a non-empty `mature_fasta` slot.
Here we will use the output of `signalp()` as an input for `tmhmm()`:

```r
tm <- tmhmm(step1_sp4, TM = 0)
```
Attempts to run `tmhmm()` on a `CBSResult` object lacking the `mature_fasta` slot will produce an error:

```r
tm2 <- tmhmm(inp, TM = 0)
```
`tmhmm()` outputs instances of the `TMhmmResult` class, also inheriting from `CBSResult`.

```r
class(tm)
tm
```
A `TMhmmResult` object contains 5 slots:

- `in_fasta` and `out_fasta` - common slots of all classes inheriting from the `CBSResult` class;
- `in_mature_fasta` - mature sequences used as an input for tmhmm;
- `out_mature_fasta` - output mature sequences with no more transmembrane domains than specified by the `TM` threshold;
- `tm_tibble` - parsed tmhmm tabular output for candidates with no more transmembrane domains than specified by the `TM` threshold.

To get the contents of the individual slots you can use accessor functions:

```r
getTMtibble(tm)
getInMatfasta(tm)
getOutMatfasta(tm)
getFastas(tm)
```
SecretSanta provides a way to use an alternative method for the prediction of transmembrane domains, TOPCONS, which uses a consensus prediction of multiple tools and can more accurately discriminate between signal peptides and transmembrane regions [@Tsirigos2015]. Currently there are several options for running TOPCONS: a web server, WSDL-API scripts or a stand-alone version. Since it takes considerable time to run TOPCONS predictions for a full proteome (from a few hours to a few days), direct integration of this method into the SecretSanta wrapper framework is not optimal. Instead, we have implemented a parser function, `topcons()`, which converts precomputed TOPCONS output into a compatible `CBSResult` object that can be plugged into versatile SecretSanta pipelines.
Since the TOPCONS method itself takes quite a while to run for large protein datasets, the `topcons()` parser is restricted to `piper`-like behaviour: it requires an input from a previous analytic step organised in a `CBSResult` object, as well as the path to the directory with the generated files. In the example below, we will use precomputed TOPCONS output and convert it to a `TopconsResult` object, again inheriting from `CBSResult`.
```r
# first, predict signal peptides
sp <- signalp(inp, version = 2, organism = 'euk',
              run_mode = "starter", legacy_method = 'hmm')
# extract the out_fasta slot with positive candidates and use these
# sequences to run a TOPCONS prediction on the web server:
writeXStringSet(getOutfasta(sp), 'sp_out.fasta')
# provide the path to the archived results:
p_dir <- system.file("extdata", "rst_SVw4hG.zip", package = "SecretSanta")
# integrate TOPCONS predictions:
tpc <- topcons(input_obj = sp,
               parse_dir = p_dir,
               topcons_mode = "WEB-server",
               TM = 0,
               SP = TRUE)
tpc
```
TargetP predicts the subcellular localization of secreted eukaryotic proteins based on the presence of signal peptide (SP), chloroplast transit peptides (cTP) or mitochondrial targeting peptides (mTP) in the N-terminus [@Emanuelsson2000].
Including TargetP in the pipeline can provide additional evidence that a protein with a predicted signal peptide is not targeted to plastids or mitochondria and is indeed extracellular. The `targetp()` function accepts input objects belonging to the `CBSResult` class and uses the `out_fasta` slot as a direct input to the TargetP method, i.e. predictions are based on the full protein sequences, not their mature versions.
`targetp()` requires the `network_type` argument to run correctly. Possible values include:

- `'P'` - plant networks;
- `'N'` - non-plant networks.
It is possible to run `targetp()` in both piper and starter modes.
Imagine we want to start our pipeline by running `targetp()` using the already created `inp` object as an input:

```r
tp <- targetp(inp,
              network = 'N', # for non-plant networks
              run_mode = 'starter')
```
Alternatively, we can run `targetp()` in the piper mode on the output of other functions, for example `signalp()`:

```r
tp_pipe <- targetp(step1_sp2, network = 'N', run_mode = 'piper')
```
In both cases `targetp()` will output an object of the `TargetpResult` class with the following slots:

```r
class(tp)
tp
```
- `in_fasta` and `out_fasta` - common slots of all classes inheriting from the `CBSResult` class;
- `tp_tibble` - parsed TargetP tabular output for candidates not targeted to mitochondria or plastids, i.e. most likely to be secreted.

To access the `tp_tibble` slot use the `getTPtibble()` method:

```r
getTPtibble(tp)
```
`targetp()` will automatically switch to the parallel mode if the number of input sequences is greater than 1000.
WoLF PSORT predicts protein subcellular localization based on the PSORT principle. It converts amino acid sequences into numerical localization features based on sorting signals, amino acid composition and functional motifs. The converted data are then classified with a k-nearest neighbour algorithm [@Horton2006].
`wolfpsort()` requires the `organism` argument. Possible values include:

- `'plant'`;
- `'animal'`;
- `'fungi'`.
The `wolfpsort()` function accepts objects of the `CBSResult` class and can be run in both `starter` and `piper` modes.

```r
wlf <- wolfpsort(step1_sp2, organism = 'fungi', run_mode = 'piper')
```
`wolfpsort()` returns an object with 3 slots:

```r
class(wlf)
wlf
```
- `in_fasta` and `out_fasta`;
- `wolf_tibble` - parsed WoLF PSORT tabular output for sequences with 'extracellular' predicted as the most probable subcellular localization.

To access the contents of the `wolf_tibble` slot use the `getWOLFtibble()` method:

```r
getWOLFtibble(wlf)
```
Since `wolfpsort()` has the same purpose as `targetp()`, it might be useful to run these functions side by side and compare the obtained results (see the targetp results above).
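Such a comparison can be sketched with the accessor methods, assuming the `step1_sp2` object from the SignalP section is available:

```r
# run both subcellular-localization predictors on the same signalp output
tp_cmp <- targetp(step1_sp2, network = 'N', run_mode = 'piper')
wlf_cmp <- wolfpsort(step1_sp2, organism = 'fungi', run_mode = 'piper')
# sequence ids retained as extracellular by both methods:
intersect(names(getOutfasta(tp_cmp)), names(getOutfasta(wlf_cmp)))
```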
In addition to having signal peptides, some proteins carry an ER-retention signal in the C-terminal domain that prevents the protein from being secreted outside the cell. There are at least 2 known ER-retention motifs (KDEL and HDEL) [@Munro1987]. Currently, several definitions of the ER-retention signal exist; the `check_khdel()` function allows the user to specify one of 3 possible definitions with the `pattern` argument:

- `pattern = 'prosite'` - checks for ER-retention motifs defined as "[KRHQSA][DENQ]EL$";
- `pattern = 'elm'` - checks for ER-retention motifs defined as "[KRHQSAP][DENQT]EL$";
- `pattern = 'strict'` - strict matching for C-terminal KDEL or HDEL motifs.

The `check_khdel()` function uses pattern matching according to the specified definition to scan amino acid sequences and remove those with an ER-retention signal in the C-terminus. To run it, simply pass a `CBSResult` object as an input. `check_khdel()` is restricted to piper-like behaviour, because it makes the most sense to check for ER-retention signals in proteins with already predicted signal peptides. The function does not rely on external dependencies, so providing a path container is not required.
```r
er_result <- check_khdel(step1_sp2, pattern = 'prosite')
```
This will return an object with 3 slots:

```r
er_result
```

- `in_fasta` and `out_fasta`;
- `retained_fasta` - all the sequences with C-terminal '(K/H)DEL' motifs, in case you would like to have a look at them.

This is an experimental option. The `m_slicer()` function takes the input amino acid sequences and generates all possible subsequences starting with methionine, based on the assumption that translation start sites might be mis-predicted in the original set of proteins, which in turn would result in signal peptides also being mis-predicted. The output of this step can be used as an input for secretome prediction pipelines to rescue secreted proteins with mis-predicted start sites. (Making a list and checking it twice... just like Santa does.) Sequence ids for the newly generated slices are built by concatenating the original sequence id, the string 'slice' and the position of the methionine the sequence was sliced from. For example, `ALI_PLTG_3_slice_M86` means that the `ALI_PLTG_3` sequence was sliced starting from the methionine at position 86.
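This naming scheme makes the ids easy to take apart again with base R pattern matching. A minimal sketch (the helper calls below are illustrative, not part of the package):

```r
# recover the original id and the slice position from a generated id
id <- "ALI_PLTG_3_slice_M86"
sub("_slice_M\\d+$", "", id)          # original sequence id: "ALI_PLTG_3"
as.integer(sub(".*_slice_M", "", id)) # sliced methionine position: 86
```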
`m_slicer()` has 2 running modes:

- `slice` - simply slice the input fasta regardless of its origin:

```r
slice <- m_slicer(aa, # a set of amino acid sequences
                  run_mode = 'slice',
                  min_len = 100 # minimal length of the output slices
                  )
```

- `rescue` - given the output of any other upstream function, extract the proteins not predicted to be secreted on the initial run and generate slices for them:

```r
rescued <- m_slicer(step1_sp2, # signalp2 output
                    run_mode = 'rescue',
                    min_len = 100)
```
Results of both run modes can be used as an input for other SecretSanta predictors. Say, we want to run `signalp()` on the rescued proteins. Note that `m_slicer()` outputs `AAStringSet` objects, so in order to pass one to `signalp()` we first need to create a `CBSResult` object with `in_fasta = rescued`.

```r
r_inp <- CBSResult(in_fasta = rescued)
sp_rescue <- signalp(r_inp,
                     version = 2,
                     organism = 'euk',
                     run_mode = 'starter',
                     truncate = TRUE,
                     legacy_method = 'hmm')
```
The slicing procedure returned 24 additional potentially secreted candidates.
`ask_uniprot()` is a convenience function that retrieves information on the subcellular localisation of proteins from UniProtKB, allowing the user to compare results obtained within a SecretSanta pipeline with the available annotations. This function is mostly relevant for well-annotated and curated genomes. `ask_uniprot()` currently requires a simple list of UniProtKB ids and returns information about subcellular localisation, keywords and relevant GO terms, organised in a tibble:

```r
id_list <- c('P39864', 'D0N4E2', 'Q5BUB4', 'D0N381', 'B1NNT7', 'D0NP26')
res <- ask_uniprot(id_list)
res
```
We are now ready to build a pipeline using all of our available methods. Say, we would like to have a relatively short pipeline (Figure 3) with the following analytic steps:
```r
knitr::include_graphics("pipes_short.png")
```
Figure 3: Minimal pipeline for secretome prediction.
Now we will run this pipeline using SecretSanta functions, starting with `aa` - an `AAStringSet` object containing 100 amino acid sequences.

```r
input <- CBSResult(in_fasta = aa)
input
step1_sp4 <- signalp(input,
                     version = 4,
                     organism = 'euk',
                     run_mode = 'starter',
                     truncate = TRUE)
getSPtibble(step1_sp4)
getOutfasta(step1_sp4)
getMatfasta(step1_sp4)
```
Result: 5 candidate proteins with signal peptides.
```r
# we allow 0 TM domains in mature peptides
step2_tm <- tmhmm(step1_sp4, TM = 0)
getTMtibble(step2_tm)
getOutfasta(step2_tm)
getOutMatfasta(step2_tm)
```
Result: 5 candidate proteins with signal peptides; 0 TM domains in mature sequences.
Here we are using `targetp()`; we could have also used `wolfpsort()`.

```r
step3_tp <- targetp(step2_tm, network = 'N', run_mode = 'piper')
getTPtibble(step3_tp)
getInfasta(step3_tp)
getOutfasta(step3_tp)
```
This step allowed us to filter out 1 sequence potentially targeted to mitochondria.
Result: 4 candidate proteins with signal peptides; 0 TM domains in mature sequences; not targeted to plastids or mitochondria.
```r
step4_er <- check_khdel(step3_tp, pattern = 'elm')
getOutfasta(step4_er)
```
Final Result: 4 candidate proteins with signal peptides; 0 TM domains in mature sequences; not targeted to plastids or mitochondria; not retained in ER.
This is an example of a fairly simple pipeline, but you can create more complex ones. A more stringent pipeline might involve multi-step filtration on signal peptide prediction as well as additional predictions for 'rescued' sequences (Figure 4).
```r
knitr::include_graphics("pipes_long.png")
```
Figure 4: Example of a stringent pipeline.
To save the predicted secretome to a fasta file:

```r
writeXStringSet(getOutfasta(step4_er), 'secretome_out.fasta')
```
Now, once we have examined all the individual analytic steps, we can combine them directly in a single command using the `%>%` operator:

```r
pipeline <- signalp(input, version = 4, organism = 'euk',
                    run_mode = 'starter') %>%
    tmhmm(TM = 0) %>%
    targetp(network = 'N', run_mode = 'piper') %>%
    check_khdel(pattern = 'prosite')
```
Since `pipeline` belongs to the `CBSResult` class, we can again extract the output fasta slot and save it in a separate file:

```r
writeXStringSet(getOutfasta(pipeline), 'secretome_out.fasta')
```
In this example we will download the proteome of Phytophthora infestans from UniProtKB, construct a pipeline for secretome prediction and compare the results with the known annotations. We will use TOPCONS instead of TMHMM for the prediction of transmembrane domains, so to save time we will import precomputed results and integrate them into the pipeline. To retrieve the proteome, we will use the most recent version of the biomartr package [@Drost2017].
```r
devtools::install_github("HajkD/biomartr")
library(biomartr)

# download the P. infestans proteome from UniProtKB
Pinfestans.proteome.uniprot <- getProteome(db = "uniprot",
                                           organism = "Phytophthora infestans",
                                           path = "_uniprot_downloads/proteomes")

# create a CBSResult object and start a pipeline:
pinf <- CBSResult(in_fasta = readAAStringSet(Pinfestans.proteome.uniprot))

# precomputed TOPCONS predictions for P. infestans:
top_dir <- system.file("extdata", "rst_7gaVh2.zip", package = "SecretSanta")

# predict signal peptides, check additional targeting signals and ER-motifs,
# integrate TOPCONS predictions of TM domains:
Pinfestans.secretome <- signalp(pinf,
                                version = 4,
                                organism = 'euk',
                                sensitive = TRUE,
                                run_mode = 'starter',
                                cores = 3) %>%
    targetp(network = 'N', run_mode = 'piper') %>%
    check_khdel(pattern = 'prosite') %>%
    topcons(parse_dir = top_dir, topcons_mode = "WEB-server",
            TM = 0, SP = TRUE)

# check how many of the predicted proteins already have known extracellular
# localisation according to UniProtKB:
unip_res <- ask_uniprot(names(getOutfasta(Pinfestans.secretome)))
table(unip_res$Subcellular.Location)
```
To demonstrate how SecretSanta pipelines can be applied to large-scale analysis of proteomes, we will download all available fungal proteomes using the biomartr package [@Drost2017] and apply the same standard secretome prediction pipeline to each of them. To speed up the process, we will use 20 cores available on our local HPC. Please note that retrieval of the proteomes might take a while.
```r
library(biomartr)

# download all available fungal proteomes from RefSeq
meta.retrieval(kingdom = "fungi",
               db = "refseq",
               type = 'proteome',
               path = "fungi_refseq_proteomes")

proteomes <- list.files("fungi_refseq_proteomes/",
                        full.names = TRUE,
                        pattern = "*.fasta")

hpc_pipeline <- function(proteome) {
    input <- CBSResult(in_fasta = readAAStringSet(proteome))
    pipeline <- signalp(input,
                        version = 4,
                        organism = 'euk',
                        run_mode = 'starter',
                        cores = 20) %>%
        tmhmm(TM = 0) %>%
        targetp(network = 'N', run_mode = 'piper', cores = 20) %>%
        check_khdel(pattern = 'prosite')
    # strip the directory part of the input file name
    base_proteome <- basename(proteome)
    writeXStringSet(getOutfasta(pipeline),
                    paste(base_proteome, 'secretome.fasta', sep = '_'))
}

sapply(proteomes, hpc_pipeline)
```
```r
sessionInfo()
```