makeGRangesFromDataFrame: Make a GRanges object from a data.frame or DataFrame

View source: R/makeGRangesFromDataFrame.R

Description

makeGRangesFromDataFrame takes a data-frame-like object as input and tries to automatically find the columns that describe genomic ranges. It returns them as a GRanges object.

makeGRangesFromDataFrame is also the workhorse behind the coercion method from data.frame (or DataFrame) to GRanges.

Usage

makeGRangesFromDataFrame(df,
                         keep.extra.columns=FALSE,
                         ignore.strand=FALSE,
                         seqinfo=NULL,
                         seqnames.field=c("seqnames", "seqname",
                                          "chromosome", "chrom",
                                          "chr", "chromosome_name",
                                          "seqid"),
                         start.field="start",
                         end.field=c("end", "stop"),
                         strand.field="strand",
                         starts.in.df.are.0based=FALSE)

Arguments

df

A data.frame or DataFrame object. If df is neither, the function first tries to coerce it to a data frame with as.data.frame(df).

keep.extra.columns

TRUE or FALSE (the default). If TRUE, the columns in df that are not used to form the genomic ranges of the returned GRanges object are then returned as metadata columns on the object. Otherwise, they are ignored. If df has a width column, then it's always ignored.

ignore.strand

TRUE or FALSE (the default). If TRUE, then the strand of the returned GRanges object is set to "*".

seqinfo

Either NULL, or a Seqinfo object, or a character vector of unique sequence names (a.k.a. seqlevels), or a named numeric vector of sequence lengths. When not NULL, seqinfo must be compatible with the genomic ranges in df, that is, it must have one entry for each unique sequence name represented in df. Note that it can have additional entries i.e. entries for seqlevels not represented in df.

seqnames.field

A character vector of recognized names for the column in df that contains the chromosome name (a.k.a. sequence name) associated with each genomic range. Only the first name in seqnames.field that is found in colnames(df) is used. If none is found, then an error is raised.

start.field

A character vector of recognized names for the column in df that contains the start positions of the genomic ranges. Only the first name in start.field that is found in colnames(df) is used. If none is found, then an error is raised.

end.field

A character vector of recognized names for the column in df that contains the end positions of the genomic ranges. Only the first name in end.field that is found in colnames(df) is used. If none is found, then an error is raised.

strand.field

A character vector of recognized names for the column in df that contains the strand associated with each genomic range. Only the first name in strand.field that is found in colnames(df) is used. If none is found or if ignore.strand is TRUE, then the strand of the returned GRanges object is set to "*".

starts.in.df.are.0based

TRUE or FALSE (the default). If TRUE, then the start positions of the genomic ranges in df are considered to be 0-based and are converted to 1-based in the returned GRanges object. This feature is intended to make it more convenient to handle input obtained from resources that use the "0-based start" convention. A well-known example of such a resource is the UCSC Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables).
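
As a small offline illustration of the starts.in.df.are.0based argument, the sketch below builds a made-up BED-like table (the bed data frame and its values are invented for this example and do not come from UCSC) and checks that only the start positions are shifted, while the end positions are left as they are:

```r
library(GenomicRanges)

## Hypothetical table using the "0-based start" convention.
bed <- data.frame(chrom = "chr1",
                  chromStart = c(0L, 10L),
                  chromEnd = c(5L, 20L),
                  stringsAsFactors = FALSE)

gr <- makeGRangesFromDataFrame(bed,
                               start.field = "chromStart",
                               end.field = "chromEnd",
                               starts.in.df.are.0based = TRUE)

start(gr)  # each start is the input start plus 1
end(gr)    # ends are the same as in the input
```

Passing start.field and end.field explicitly mirrors the rtracklayer example in the Examples section, which uses the same column names.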

Value

A GRanges object with one element per row in the input.

If the seqinfo argument was supplied, the returned object will have exactly the seqlevels specified in seqinfo and in the same order. Otherwise, the seqlevels are ordered according to the output of the rankSeqlevels function (except if df contains the seqnames in the form of a factor-Rle, in which case the levels of the factor-Rle become the seqlevels of the returned object and with no re-ordering).

If df has non-automatic row names (i.e. rownames(df) is not NULL and is not seq_len(nrow(df))), then they will be used to set names on the returned GRanges object.
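
The following sketch illustrates the two points above about seqlevels and names. The data frame, its row names, and the seqinfo vector are made up for illustration:

```r
library(GenomicRanges)

## Hypothetical input with non-automatic row names.
df <- data.frame(chr = c("chr1", "chr2", "chr1"),
                 start = c(1L, 5L, 9L),
                 end = c(4L, 8L, 12L),
                 row.names = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

## seqinfo is supplied as a character vector of seqlevels; it covers
## every sequence name in df plus one extra entry ("chrM").
gr <- makeGRangesFromDataFrame(df, seqinfo = c("chr2", "chr1", "chrM"))

seqlevels(gr)  # same seqlevels as in 'seqinfo', in the same order
names(gr)      # taken from rownames(df)
```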

Note

Coercing a data.frame or DataFrame df to a GRanges object (with as(df, "GRanges")) and calling GRanges(df) are both equivalent to calling makeGRangesFromDataFrame(df, keep.extra.columns=TRUE).

Author(s)

H. Pagès, based on a proposal by Kasper Daniel Hansen

See Also

Examples

## ---------------------------------------------------------------------
## BASIC EXAMPLES
## ---------------------------------------------------------------------

df <- data.frame(chr="chr1", start=11:15, end=12:16,
                 strand=c("+","-","+","*","."), score=1:5)
df
makeGRangesFromDataFrame(df)  # strand value "." is replaced with "*"

## The strand column is optional:
df <- data.frame(chr="chr1", start=11:15, end=12:16, score=1:5)
makeGRangesFromDataFrame(df)

gr <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE)
gr2 <- as(df, "GRanges")  # equivalent to the above
stopifnot(identical(gr, gr2))
gr2 <- GRanges(df)        # equivalent to the above
stopifnot(identical(gr, gr2))

makeGRangesFromDataFrame(df, ignore.strand=TRUE)
makeGRangesFromDataFrame(df, keep.extra.columns=TRUE,
                             ignore.strand=TRUE)

makeGRangesFromDataFrame(df, seqinfo=paste0("chr", 4:1))
makeGRangesFromDataFrame(df, seqinfo=c(chrM=NA, chr1=500, chrX=100))
makeGRangesFromDataFrame(df, seqinfo=Seqinfo(paste0("chr", 4:1)))

## ---------------------------------------------------------------------
## ABOUT AUTOMATIC DETECTION OF THE seqnames/start/end/strand COLUMNS
## ---------------------------------------------------------------------

## Automatic detection of the seqnames/start/end/strand columns is
## case insensitive:
df <- data.frame(ChRoM="chr1", StarT=11:15, stoP=12:16,
                 STRAND=c("+","-","+","*","."), score=1:5)
makeGRangesFromDataFrame(df)

## It also ignores a common prefix between the start and end columns:
df <- data.frame(seqnames="chr1", tx_start=11:15, tx_end=12:16,
                 strand=c("+","-","+","*","."), score=1:5)
makeGRangesFromDataFrame(df)

## The common prefix between the start and end columns is used to
## disambiguate between more than one seqnames column:
df <- data.frame(chrom="chr1", tx_start=11:15, tx_end=12:16,
                 tx_chr="chr2", score=1:5)
makeGRangesFromDataFrame(df)

## ---------------------------------------------------------------------
## 0-BASED VS 1-BASED START POSITIONS
## ---------------------------------------------------------------------

if (require(rtracklayer)) {
  session <- browserSession()
  genome(session) <- "sacCer2"
  query <- ucscTableQuery(session, "Assembly")
  df <- getTable(query)
  head(df)

  ## A common pitfall is to forget that the UCSC Table Browser uses the
  ## "0-based start" convention:
  gr0 <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE,
                                      start.field="chromStart",
                                      end.field="chromEnd")
  head(gr0)

  ## The start positions need to be converted into 1-based positions,
  ## to adhere to the convention used in Bioconductor:
  gr1 <- makeGRangesFromDataFrame(df, keep.extra.columns=TRUE,
                                      start.field="chromStart",
                                      end.field="chromEnd",
                                      starts.in.df.are.0based=TRUE)
  head(gr1)
}

Example output

Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, basename, cbind, colMeans, colSums, colnames,
    dirname, do.call, duplicated, eval, evalq, get, grep, grepl,
    intersect, is.unsorted, lapply, lengths, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind,
    rowMeans, rowSums, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which, which.max, which.min

Loading required package: S4Vectors

Attaching package: 'S4Vectors'

The following object is masked from 'package:base':

    expand.grid

Loading required package: IRanges
Loading required package: GenomeInfoDb
   chr start end strand score
1 chr1    11  12      +     1
2 chr1    12  13      -     2
3 chr1    13  14      +     3
4 chr1    14  15      *     4
5 chr1    15  16      .     5
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     11-12      +
  [2]     chr1     12-13      -
  [3]     chr1     13-14      +
  [4]     chr1     14-15      *
  [5]     chr1     15-16      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     11-12      *
  [2]     chr1     12-13      *
  [3]     chr1     13-14      *
  [4]     chr1     14-15      *
  [5]     chr1     15-16      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     11-12      *
  [2]     chr1     12-13      *
  [3]     chr1     13-14      *
  [4]     chr1     14-15      *
  [5]     chr1     15-16      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
GRanges object with 5 ranges and 1 metadata column:
      seqnames    ranges strand |     score
         <Rle> <IRanges>  <Rle> | <integer>
  [1]     chr1     11-12      * |         1
  [2]     chr1     12-13      * |         2
  [3]     chr1     13-14      * |         3
  [4]     chr1     14-15      * |         4
  [5]     chr1     15-16      * |         5
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     11-12      *
  [2]     chr1     12-13      *
  [3]     chr1     13-14      *
  [4]     chr1     14-15      *
  [5]     chr1     15-16      *
  -------
  seqinfo: 4 sequences from an unspecified genome; no seqlengths
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     11-12      *
  [2]     chr1     12-13      *
  [3]     chr1     13-14      *
  [4]     chr1     14-15      *
  [5]     chr1     15-16      *
  -------
  seqinfo: 3 sequences from an unspecified genome
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     11-12      *
  [2]     chr1     12-13      *
  [3]     chr1     13-14      *
  [4]     chr1     14-15      *
  [5]     chr1     15-16      *
  -------
  seqinfo: 4 sequences from an unspecified genome; no seqlengths
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     11-12      +
  [2]     chr1     12-13      -
  [3]     chr1     13-14      +
  [4]     chr1     14-15      *
  [5]     chr1     15-16      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1     11-12      +
  [2]     chr1     12-13      -
  [3]     chr1     13-14      +
  [4]     chr1     14-15      *
  [5]     chr1     15-16      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2     11-12      *
  [2]     chr2     12-13      *
  [3]     chr2     13-14      *
  [4]     chr2     14-15      *
  [5]     chr2     15-16      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
Loading required package: rtracklayer
Error: package or namespace load failed for 'rtracklayer':
 objects 'DataFrame', 'RangedDataList', 'Rle', 'isSingleString', 'recycleIntegerArg', 'recycleNumericArg', 'isSingleStringOrNA', 'isTRUEorFALSE', 'isSingleNumberOrNA' are not exported by 'namespace:IRanges'

GenomicRanges documentation built on Nov. 8, 2020, 5:46 p.m.