getChromInfoFromNCBI | R Documentation |
getChromInfoFromNCBI returns chromosome information such as sequence names, lengths, and circularity flags for a given NCBI assembly, e.g. GRCh38, ARS-UCD1.2, or R64.
Note that getChromInfoFromNCBI behaves slightly differently depending on whether the assembly is registered in the GenomeInfoDb package or not. See below for the details.
Use registered_NCBI_assemblies to list all the NCBI assemblies currently registered in the GenomeInfoDb package.
getChromInfoFromNCBI(assembly,
assembled.molecules.only=FALSE,
assembly.units=NULL,
recache=FALSE,
as.Seqinfo=FALSE)
registered_NCBI_assemblies(organism=NA)
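assembly
: A single string specifying the name of an NCBI assembly (e.g. "GRCh38"). An assembly accession (GenBank or RefSeq) can be supplied instead.
assembled.molecules.only
: TRUE or FALSE (the default). If FALSE, chromosome information is returned for all the sequences in the assembly. If TRUE, it is returned only for the assembled molecules.
assembly.units
: If NULL (the default), sequences from all the Assembly Units are returned. Otherwise, a character vector of Assembly Unit names (e.g. "non-nuclear"), in which case only the sequences that belong to these units are returned.
recache
: TRUE or FALSE (the default). If TRUE, the chromosome information for the assembly is downloaded again instead of being taken from the cache.
as.Seqinfo
: TRUE or FALSE (the default). If TRUE, a Seqinfo object is returned instead of a data frame.
organism
: When organism is specified, registered_NCBI_assemblies only returns the assemblies for that organism. It must be a single string and can be a partial, case-insensitive match (e.g. "homo").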
registered vs unregistered NCBI assemblies:
All NCBI assemblies can be looked up by assembly accession (GenBank or RefSeq) but only registered assemblies can also be looked up by assembly name.
For registered assemblies, the returned circularity flags are guaranteed to be accurate. For unregistered assemblies, a heuristic is used to determine the circular sequences.
Please contact the maintainer of the GenomeInfoDb package to request registration of additional assemblies.
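The lookup rule above can be sketched as a small pre-check that decides whether a string can be passed to getChromInfoFromNCBI. This is only an illustrative sketch, not part of the package: the helpers looks_like_accession() and can_lookup() are hypothetical, the `assembly` column name is an assumption about the data frame returned by registered_NCBI_assemblies(), and a toy table stands in for the real one so the code runs without internet access.

```r
# Toy stand-in for registered_NCBI_assemblies(). Only the 'assembly'
# column is used, and its name is an assumption.
registered <- data.frame(
  organism = c("Homo sapiens", "Homo sapiens", "Arabidopsis thaliana"),
  assembly = c("GRCh37", "GRCh38", "TAIR10.1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if 'x' looks like an assembly accession
# (GenBank accessions start with "GCA_", RefSeq ones with "GCF_").
looks_like_accession <- function(x) {
  grepl("^GC[AF]_[0-9]+\\.[0-9]+$", x)
}

# Hypothetical helper: accessions can always be looked up, whereas
# names only work when the assembly is registered.
can_lookup <- function(x, registered) {
  looks_like_accession(x) | x %in% registered$assembly
}

queries <- c("GRCh38", "GCF_000001405.40", "SomeUnregisteredName")
lookup_ok <- can_lookup(queries, registered)
```

With the real package loaded, `registered` would instead be the result of registered_NCBI_assemblies(), provided the column name assumed here matches.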
For getChromInfoFromNCBI: By default, a 10-column data frame with columns:
SequenceName: character.
SequenceRole: factor.
AssignedMolecule: factor.
GenBankAccn: character.
Relationship: factor.
RefSeqAccn: character.
AssemblyUnit: factor.
SequenceLength: integer. Note that this column can contain NAs! For example, this is the case in assembly Amel_HAv3.1, where the length of sequence MT is missing, or in assembly Release 5, where the length of sequence Un is missing.
UCSCStyleName: character.
circular: logical.
For registered_NCBI_assemblies: A data frame summarizing all the NCBI assemblies currently registered in the GenomeInfoDb package.
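Because the SequenceLength column of the getChromInfoFromNCBI result can contain NAs, downstream code should handle them explicitly. The following is a minimal sketch using a toy data frame that carries only a few of the documented columns. Its values are made up for illustration and do not describe a real assembly.

```r
# Toy data frame with a subset of the documented columns. The values
# are invented for illustration only.
chrominfo <- data.frame(
  SequenceName   = c("1", "2", "MT", "Un"),
  SequenceRole   = factor(c("assembled-molecule", "assembled-molecule",
                            "assembled-molecule", "unplaced-scaffold")),
  SequenceLength = c(1000L, 800L, NA, NA),
  circular       = c(FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Names of the sequences whose length is missing.
missing_len <- chrominfo$SequenceName[is.na(chrominfo$SequenceLength)]

# Total length, skipping the missing values instead of returning NA.
total_len <- sum(chrominfo$SequenceLength, na.rm = TRUE)

# Named vector of lengths restricted to sequences with a known length.
known <- !is.na(chrominfo$SequenceLength)
seqlens <- setNames(chrominfo$SequenceLength[known],
                    chrominfo$SequenceName[known])
```

The same checks should apply unchanged to a data frame returned by getChromInfoFromNCBI, since they rely only on the documented SequenceName and SequenceLength columns.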
H. Pagès
getChromInfoFromUCSC for getting chromosome information for a UCSC genome.
getChromInfoFromEnsembl for getting chromosome information for an Ensembl species.
Seqinfo objects.
## All registered NCBI assemblies for Triticum aestivum (bread wheat):
registered_NCBI_assemblies("tri")[1:4]
## All registered NCBI assemblies for Homo sapiens:
registered_NCBI_assemblies("homo")[1:4]
## Internet access required!
getChromInfoFromNCBI("GRCh37")
getChromInfoFromNCBI("GRCh37", as.Seqinfo=TRUE)
getChromInfoFromNCBI("GRCh37", assembled.molecules.only=TRUE)
## The GRCh38.p14 assembly only adds "patch sequences" to the GRCh38
## assembly:
GRCh38 <- getChromInfoFromNCBI("GRCh38")
table(GRCh38$SequenceRole)
GRCh38.p14 <- getChromInfoFromNCBI("GRCh38.p14")
table(GRCh38.p14$SequenceRole) # 254 patch sequences (164 fix + 90 novel)
## All registered NCBI assemblies for Arabidopsis thaliana:
registered_NCBI_assemblies("arabi")[1:4]
getChromInfoFromNCBI("TAIR10.1")
getChromInfoFromNCBI("TAIR10.1", assembly.units="non-nuclear")
## Sanity checks:
idx <- match(GRCh38$SequenceName, GRCh38.p14$SequenceName)
stopifnot(!anyNA(idx))
tmp1 <- GRCh38.p14[idx, ]
rownames(tmp1) <- NULL
tmp2 <- GRCh38.p14[-idx, ]
stopifnot(
  identical(tmp1[ , -(5:7)], GRCh38[ , -(5:7)]),
  identical(tmp2, GRCh38.p14[GRCh38.p14$AssemblyUnit == "PATCHES", ])
)