LoadSNPData: Load the SNP information and code the genome sequences around...
In chandlerzuo/atSNP: Affinity test for identifying regulatory SNPs


Load the SNP data.

LoadSNPData(
  filename = NULL,
  genome.lib = "BSgenome.Hsapiens.UCSC.hg38",
  snp.lib = "SNPlocs.Hsapiens.dbSNP144.GRCh38",
  snpids = NULL,
  half.window.size = 30,
  default.par = FALSE,
  mutation = FALSE,
  ...
)

filename

The name of a file containing a table of SNP information. The table must contain at least five columns with exactly the following names:

chr	The chromosome.
snp	The nucleotide position of the SNP.
snpid	The names of the SNPs.
a1	The nucleotide for one allele.
a2	The nucleotide for the other allele.

If this file already exists, it is used to extract the SNP information. Otherwise, the SNP information extracted using the argument 'snpids' is written to this file.
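As an illustration of the expected table layout, the following is a minimal base R sketch that writes a file with the five required column names and reads it back to check them. The SNP names, positions, and alleles are made-up placeholders rather than real records, and the whitespace-delimited layout with a header row is an assumption of this sketch.

```r
# Hypothetical SNP records: names, positions and alleles are placeholders.
snp_tbl <- data.frame(
  chr   = c("chr1", "chr2"),
  snp   = c(1000000L, 2000000L),
  snpid = c("example_snp_1", "example_snp_2"),
  a1    = c("A", "C"),
  a2    = c("G", "T"),
  stringsAsFactors = FALSE
)

# Write a whitespace-delimited table with a header row to a temporary file.
snp_file <- tempfile(fileext = ".txt")
write.table(snp_tbl, file = snp_file, row.names = FALSE, quote = FALSE)

# Read the file back and confirm that the required column names are present.
check <- read.table(snp_file, header = TRUE, stringsAsFactors = FALSE)
required <- c("chr", "snp", "snpid", "a1", "a2")
has_required <- all(required %in% colnames(check))
```

A file prepared this way could then be supplied through the filename argument, together with a genome.lib that matches the coordinates in the table.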

genome.lib

A string of the library name for the genome version. Default: "BSgenome.Hsapiens.UCSC.hg38".

snp.lib

A string of the library name to obtain the SNP information based on rs ids. Default: "SNPlocs.Hsapiens.dbSNP144.GRCh38".

snpids

A vector of rs ids for the SNPs. This argument is overridden if the file named by filename exists.

half.window.size

An integer for the half window size around the SNP within which the motifs are matched. Default: 30.

default.par

A boolean for whether to use the default Markov parameters. Default: FALSE.

mutation

A boolean for whether this is mutation data. See details for more information. Default: FALSE.

...

Other parameters passed to read.table.

This function extracts the nucleotide sequence within a window around each SNP and codes it as integers: 1-A, 2-C, 3-G, 4-T.
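The integer coding can be illustrated with a small base R sketch. The helper name `encode_nucleotides` is hypothetical and is not part of atSNP; it only mirrors the stated A, C, G, T to 1, 2, 3, 4 mapping and is not the package's internal implementation.

```r
# Hypothetical helper: map a nucleotide string to integer codes
# using the coding A = 1, C = 2, G = 3, T = 4.
encode_nucleotides <- function(seq_string) {
  bases <- strsplit(toupper(seq_string), "")[[1]]
  # match() returns NA for any character outside A, C, G, T.
  match(bases, c("A", "C", "G", "T"))
}

codes <- encode_nucleotides("ACGTTGCA")
```

In this sketch, characters other than A, C, G, and T are coded as NA, which makes sequences containing them easy to detect.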
There are two ways of obtaining the nucleotide sequences. If filename is not NULL and the file exists, it should contain the positions and alleles for each SNP. Based on this information, the sequences around the SNP positions are extracted using the Bioconductor annotation package specified by genome.lib. Users should make sure that this annotation package corresponds to the species and genome version of the actual data. Alternatively, users can provide a vector of rs ids via the argument snpids. The SNP locations and allele information are then obtained via the Bioconductor annotation package specified by snp.lib, and passed on to the package specified by genome.lib to obtain the nucleotide sequences.
If mutation=FALSE (default), this function assumes that the data is for SNP analysis, and the reference genome should be consistent with either the a1 or the a2 nucleotide. When extracting the genome sequence around each SNP position, this function compares the nucleotide at the SNP location on the reference genome with both a1 and a2 to distinguish between the reference allele and the SNP allele. If the nucleotide extracted from the reference genome matches neither a1 nor a2, the SNP is discarded. The discarded SNPs are recorded in the 'rsid.rm' field of the output.
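The comparison described for mutation=FALSE can be sketched in base R as follows. All inputs are made-up toy values, and the variable names are hypothetical; this illustrates the logic only and is not the package's code.

```r
# Toy inputs: the reference-genome base at each SNP position and the two
# alleles from the data source (all values are placeholders).
snpid    <- c("snp1", "snp2", "snp3")
ref_base <- c("A", "C", "G")
a1       <- c("A", "T", "T")
a2       <- c("G", "C", "A")

# A SNP is kept only if the reference base matches one of its alleles.
matches_a1 <- ref_base == a1
matches_a2 <- ref_base == a2
keep <- matches_a1 | matches_a2

# The matching allele becomes the reference allele, the other the SNP allele.
ref_allele <- ifelse(matches_a1, a1, a2)[keep]
snp_allele <- ifelse(matches_a1, a2, a1)[keep]

kept_ids    <- snpid[keep]
removed_ids <- snpid[!keep]  # analogous to the 'rsid.rm' field
```

Here snp3 is dropped because its reference base G is neither of its alleles, while snp2 has its alleles swapped so that the matching allele is treated as the reference.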
Alternatively, if mutation=TRUE, this function assumes that the data is for general single nucleotide mutation analysis. After extracting the genome sequence around each SNP position, it replaces the nucleotide at the SNP location with the a1 nucleotide to form the 'reference' allele sequence, and with the a2 nucleotide to form the 'snp' allele sequence. It does NOT discard the sequence even if neither a1 nor a2 matches the reference genome. When this data set is used in other functions, such as ComputeMotifScore and ComputePValues, all results (i.e. affinity scores and their p-values) reported for the reference allele are in fact for the a1 allele, and those reported for the SNP allele are in fact for the a2 allele.
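The substitution described for mutation=TRUE can be sketched in base R. The window below is a made-up toy sequence, the helper `substitute_allele` is hypothetical, and the sketch assumes that the SNP sits at the center of the extracted window.

```r
# Toy window of 5 bases around a SNP (placeholder values); the SNP is
# assumed to sit at the center position.
window <- c("A", "C", "G", "T", "A")
center <- (length(window) + 1) / 2

# Hypothetical helper: overwrite the SNP position with a given allele.
substitute_allele <- function(window, center, allele) {
  out <- window
  out[center] <- allele
  out
}

# Neither allele needs to match the genome base ("G") at the SNP position.
ref_seq <- substitute_allele(window, center, "T")  # a1 -> 'reference' sequence
snp_seq <- substitute_allele(window, center, "C")  # a2 -> 'snp' sequence
```

Both sequences differ from the extracted genome window only at the center position, which is why downstream results labeled as reference and SNP correspond to a1 and a2 respectively.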
If the input is a list of rsids, the SNP information extracted from snp.lib may contain more than two alleles for a single location. In such cases, LoadSNPData first extracts all pairs of alleles associated with those locations. If 'mutation=TRUE', all those pairs are treated as pairs of reference and SNP alleles, and their information is contained in 'sequence_matrix', 'a1', 'a2' and 'snpid'. If 'mutation=FALSE', LoadSNPData further filters these pairs based on whether one allele matches the reference genome nucleotide extracted from genome.lib. Only those pairs with one allele matching the reference genome nucleotide are treated as pairs of reference and SNP alleles, with their information contained in 'sequence_matrix', 'a1', 'a2' and 'snpid'.
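The handling of loci with more than two alleles can be illustrated with base R. The alleles and reference base are made-up values, and the sketch assumes that "all pairs" means unordered pairs; how LoadSNPData enumerates and orders the pairs internally is not stated here.

```r
# Toy locus with three alleles and a placeholder reference base.
alleles  <- c("A", "C", "T")
ref_base <- "C"

# Enumerate all unordered allele pairs, one pair per row.
pairs <- t(combn(alleles, 2))

# mutation = TRUE analogue: every pair is retained.
pairs_mutation <- pairs

# mutation = FALSE analogue: keep only pairs where one allele
# matches the reference base.
has_ref <- pairs[, 1] == ref_base | pairs[, 2] == ref_base
pairs_snp <- pairs[has_ref, , drop = FALSE]
```

With these toy values, three pairs are enumerated and the pair (A, T) is filtered out in the mutation = FALSE analogue because neither allele is the reference base.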

A list object containing the following components:

sequence_matrix	A list of integer vectors representing the nucleotide sequence around each SNP.
a1	An integer vector for the nucleotide at the SNP location on the reference genome.
a2	An integer vector for the nucleotide at the SNP location on the SNP genome.
snpid	A string vector for the SNP rsids.
rsid.missing	If the data source is a list of rsids, this field records rsids for SNPs that are discarded because they are not in the SNPlocs package.
rsid.duplicate	If the data source is a list of rsids, this field records rsids for SNPs whose locus has more than two alleles according to the SNPlocs package.
rsid.na	This field records rsids for SNPs that are discarded because the nucleotide sequences contain non-ACGT characters.
rsid.rm	If the data source is a table and `mutation=FALSE`, this field records rsids for SNPs that are discarded because the nucleotide on the reference genome matches neither 'a1' nor 'a2' in the data source.

The results are coded as: "A"-1, "C"-2, "G"-3, "T"-4.

Chandler Zuo chandler.c.zuo@gmail.com

## Not run: 
LoadSNPData(
  snpids = c("rs53576", "rs7412"),
  genome.lib = "BSgenome.Hsapiens.UCSC.hg38",
  snp.lib = "SNPlocs.Hsapiens.dbSNP144.GRCh38",
  half.window.size = 30,
  default.par = TRUE,
  mutation = FALSE
)
## End(Not run)