local_chain
in
format_sumstats()
and liftover()
).drop_na_cols
in
format_sumstats()
). By default, SNP, effect columns and P/N columns are
checked. Set to NULL to check all columns or choose specific columns.
check_no_rs_snp()
check with
imputation_ind=TRUE
.get_genome_builds()
to help with RAM & CPU usage during unit
tests. No change in functionality for the end user.
make_ordered
from sort_coords()
infer_eff_direction
parameter added so the user can decide whether to run the check.
check_bp_range
ensure that the BP column is numeric.
check_no_rs_snp
the order of operations had to be reversed to ensure all
values were present before sorting column headers when imputation_ind=TRUE
and
imputing rsIDs.
rmv_chrPrefix
parameter in format_sumstats()
has been replaced with
the new chr_style
parameter, which allows users to specify their desired
chromosome name style. The supported chromosome styles are "NCBI", "UCSC", "dbSNP",
and "Ensembl", with "Ensembl" being the default.
check_chr()
now automatically removes all SNPs with nonstandard CHR entries
(anything other than 1-22, X, Y, and MT in the Ensembl naming style).
ignore_multi_trait
parameter added which will ignore any multi-trait
p-values if set to TRUE. By default it is FALSE, to maintain the current default
running conditions for MSS.
write_sumstats
:NULL
to ref_genome
. ref_genome
(only in conditions where it's used).
sort_coord
:sort_methods
,
including improved/more robust data.table
-native method.
test-index_tabular.R
.check_numeric
:sort_coord
, read_header
run_biocheck
rworkflows
. drop_indels
parameter so a user can decide to remove indels from
sumstats.
sed -E
rather than sed -r
as it's compatible with
macOS, which has issues with sed -r
log_folder
parameter in format_sumstats()
has been updated.
It still points to the directory where the log files and the log of
MungeSumstats messages are stored, and the default is still a temporary
directory. However, the names of the log files (log messages and log outputs)
are now the same as the name of the file specified in the save_path
parameter, with
the extensions '_log_msg.txt' and '_log_output.txt' respectively.
es_is_beta
). If set to FALSE, mapping removed.
compute_z
input) has
been added: BETA/SE. To use it, set compute_z = 'BETA'
; to continue to use the
P-value calculation, use compute_z = 'P'
. Note the default is still
compute_z = FALSE
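As a rough illustration of the two compute_z options described above, the base-R sketch below derives a Z-score either as BETA/SE or from a two-sided P-value. The helper name `z_from_sumstats` and the toy data are hypothetical. The P-based branch assumes a two-sided P-value and takes the sign from BETA, which may not match what MungeSumstats does internally in every detail.

```r
# Hypothetical helper (not part of MungeSumstats): a sketch of two common
# ways to derive a Z-score from summary statistics.
z_from_sumstats <- function(dat, method = c("BETA", "P")) {
  method <- match.arg(method)
  if (method == "BETA") {
    # Z as the effect size divided by its standard error
    return(dat$BETA / dat$SE)
  }
  # Assumption: P is two-sided; the magnitude comes from the normal
  # quantile and the sign is borrowed from BETA.
  sign(dat$BETA) * stats::qnorm(dat$P / 2, lower.tail = FALSE)
}

# Toy data, made up for illustration only
toy <- data.frame(
  SNP  = c("rs1", "rs2", "rs3"),
  BETA = c(0.10, -0.20, 0.05),
  SE   = c(0.05, 0.05, 0.05),
  P    = c(0.0455, 6.3e-05, 0.3173)
)

z_beta <- z_from_sumstats(toy, "BETA")
z_p    <- z_from_sumstats(toy, "P")
```

With consistent inputs the two routes should agree closely; the BETA/SE route does not lose precision when P-values are extremely small.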
.data.table::fread()
leaves NAs blank instead of including a literal NA. That's fine for CSVs and if
the output is read in by fread, but it breaks other tools for TSVs and is hard
to read. Updated that and added a message when the table is switched to
uncompressed for indexing.
read_header
: n=NULL
.seqminer
from all code (too buggy). import_sumstats
:@inheritDotParams format_sumstats
for better documentation. parse_logs
: Added new fields. format_sumstats
: Added time report at the end (minutes taken total).
Since this is a message, will be included in the logs,
and is now parsed by parse_logs
and put into the column "time".
index_tabular
: Fixed by replacing seqminer
with Rsamtools
. 1:123456789
, it will now be dealt with
appropriately.
compute_n
can't handle SNP-level N values for imputation, only population-level
values. An explanatory error message has now been added.
find_sumstats()
: vcf2df
.read_vcf
can now be parallelised: splits the query into chunks, imports them, and (optionally) converts them to data.table
before rbinding them back into one object.
mt_thresh
to avoid using parallelisation when VCFs are small,
due to the overhead outweighing the benefits in these cases.
tryCatch
to downloader
with different download.file
parameters that may work better on certain machines. file.path
to specify URL in:get_chain_file
import_sumstats
download_vcf
to pass URLs directly (without downloading the files)
when vcf_download=FALSE
. download_vcf
:load_ref_genome_data
:read_vcf_genome
: more robust way to get genome build from VCF. read_sumstats
: Speed up by using remove_empty_cols(sampled_rows=)
,
and only run for tabular file (read_vcf
already does this internally). select_vcf_field
: Got rid of "REF col doesn't exists" warning by omitting rowRanges
. vignettes/MungeSumstats.Rmd
were
surrounded by ticks.
vcf2df
: Accounted for scenarios where writeVcf
accidentally converts geno
data into redundant 3D matrices. data.table::rbindlist(fill=TRUE)
to bind chunks back together. read_vcf
upgrades:infer_vcf_sample_ids
is_vcf_parsed
check_tab_delimited
read_vcf_data
remove_nonstandard_vcf_cols
dt_to_granges
by merging functionality into to_granges
.liftover
to accommodate the slight change. is_tabix
(I had incorrectly made path
all lowercase). index_vcf
recognize all compressed vcf suffixes. BiocParallel
registered threads back to 1 after
read_vcf_parallel
finishes, to avoid potential
conflicts with downstream steps. find_sumstats
output to keep track of
search parameters.import_sumstats
: save_path
) exists
before downloading to save time. force_new
in addition to force_new_vcf
. MungeSumstats
. read_vcf
to be more robust. IRanges
to Imports. stringr
(no longer used).
is_tabix
to check whether a file is already
tabix-indexed. read_sumstats
: samples
as an arg. GenomicFiles
. read_sumstats
: now takes samples
as an arg.
By default, only uses first sample (if multiple are present in file). INFO_filter=
from ALS VCF examples in vignettes
(no longer necessary now that INFO parsing has been corrected). download_vcf
can now handle situations where vcf_url=
is actually a local file (not remote).
check_info_score
step.check_info_score
:log_files$info_filter
in these instances. check_empty_cols
was accidentally dropping more columns than it should have.
write_sumstats
when indexing VCF. read_sumstats
can read in any VCF files
(local/remote, indexed/non-indexed). test-vcf_formatting.R
test-check_impute_se_beta
setkey
on SNP
(now automatically renamed from ID by read_vcf
). test-read_sumstats
:read_sumstats
. vcf_ss
are dropped. parse_logs
: Add lines to parsing subfunctions to allow handling of logs
that don't contain certain info
(thus avoid warnings when creating the final data.table). check_pos_se
check_signed_col
Rsamtools::bgzip
does
compression in Bioc 3.15. Switched to using fread + readLines
in:read_header
read_sumstats
read_header
: wasn't reading in enough lines to get past the VCF header.
Increased to readLines(n=1000)
. read_vcf
: Would sometimes induce duplicate rows.
Now only unique rows are used (after sample and column filtering).
format_sumstats
can now import remote files (other than OpenGWAS). sumstatsColHeaders
entries:liftover
GenomeInfoDb::mapGenomeBuilds
to standardise build names.
standardise_sumstats_column_headers_crossplatform
standardise_header
while keeping the original function
name as an internal function (they call the same code).
liftover
tutorial
check_pos_se
: Remove extra message()
call around string.
check_signed_col
: Remove extra message()
call around string.
write_sumstats
tabix_index=TRUE
because this is
required for tabix.
compute_nsize
standardise_sumstats_column_headers_crossplatform
formatted_example
standardise_sumstats_column_headers_crossplatform
:
Added arg uppercase_unmapped
to allow users to specify whether they want to make the columns that could not be
mapped to a standard name uppercase (default=TRUE
for backwards compatibility).
Added arg return_list
to specify whether to return a named list
(default) or just the data.table
.formatted_example
:
Added arg formatted
to specify whether the file should have its colnames standardised.
Added arg sorted
to specify whether the data should be sorted by coordinates.
Added arg return_list
to specify whether to return a named list
(default) or just the data.table
..datatable.aware=TRUE
to .zzz as extra precaution. vcf2df
: Documented arguments. import_sumstats
: Create individual folders for each GWAS dataset,
with a respective logs
subfolder to avoid overwriting log files
when processing multiple GWAS. parse_logs
: New function to convert logs from one or more munged GWAS
into a data.table
. list_sumstats
: New function to recursively search for local
summary stats files previously munged with MungeSumstats
. inst/extdata/MungeSumstats_log_msg.txt
to test logs files. list_sumstats
and parse_logs
. gh-pages
branch automatically by new GHA workflow. convert_large_p
and
convert_neg_p
, respectively.
These are both handled by the new internal function check_range_p_val
,
which also reports the number of SNPs found meeting these criteria
to the console/logs. check_small_p_val
records which SNPs were imputed in a more robust way,
by recording which SNPs met the criteria before making the changes (as opposed to inferring this info from which columns are 0 after making the changes). This
function now only handles non-negative p-values, so that rows with negative
p-values can be recorded/reported separately in the check_range_p_val
step. check_small_p_val
now reports the number of SNPs <= 5e-324 to console/logs. check_range_p_val
and check_small_p_val
. parse_logs
can now extract information reported by check_range_p_val
and
check_small_p_val
. logs_example
provides easy access to log file stored
in inst/extdata, and includes documentation on how it was created. check_range_p_val
and check_small_p_val
now use #' @inheritParams format_sumstats
to improve consistency of documentation. suppressWarnings
where possible. validate_parameters
can now handle ref_genome=NULL
to_GRanges
/to_GRanges
functions to all-lowercase functions
(for consistency with other functions). nThread=1
in data.table
test functions.get_genome_builds
save_path
is in was
actually created (as opposed to finding out at the very end of the pipeline). read_header
and read_sumstats
now both work with .bgz files. data("sumstatsColHeaders")
for details.
format_sumstats(FRQ_filter)
added so SNPs can now be filtered by allele
frequency.
format_sumstats(frq_is_maf)
check added to infer whether FRQ column values are
minor/effect allele frequencies or not. frq_is_maf allows users to rename the
FRQ column as MAJOR_ALLELE_FRQ if some values appear to be major allele
frequencies.
get_genome_builds()
can now be called to quickly get the genome build
without running the whole reformatting.
format_sumstats(compute_n)
now has more methods to compute the effective
sample size with "ldsc", "sum", "giant" or "metal". format_sumstats(convert_ref_genome)
now implemented which can perform
liftover to GRCh38 from GRCh37 and vice versa, enabling better cohesion between
different studies' summary statistics.
check_no_rs_snp
can now handle extra information after an RS ID. So if you
have rs1234:A:G
that will be separated into two columns.
check_two_step_col
and check_four_step_col
, the two checks for when
multiple columns are in one, have been updated so if not all SNPs have multiple
columns or some have more than the expected number, this can now be handled.
FRQ
column have been added to the mapping file.
check_multi_rs_snp
can now handle all punctuation with/without spaces. So if
a row contains rs1234,rs5678
or rs1234, rs5678
or any other punctuation
character other than ,
these can be handled.
format_sumstats(path)
can now be passed a dataframe/datatable of the summary
statistics directly as well as a path to their saved location.
A0/A1
corresponding to ref/alt can now be
handled by the mapping file as well as A1/A2
corresponding to ref/alt.
import_sumstats
reads GWAS sum stats directly from Open GWAS. Now
parallelised and reports how long each dataset took to import/format in total. find_sumstats
searches Open GWAS for datasets. compute_z
computes Z-score from P. compute_n
computes N for all SNPs from user-defined sample size.
format_sumstats(ldsc_format=TRUE)
ensures sum stats can be fed directly
into LDSC without any additional munging. read_sumstats
, write_sumstats
, and download_vcf
functions now exported. format_sumstats(sort_coordinates=TRUE)
sorts results by their genomic
coordinates. format_sumstats(return_data=TRUE)
returns data directly to user. Can be
returned in either data.table
(default), GRanges
or VRanges
format using
format_sumstats(return_format="granges")
. format_sumstats(N_dropNA=TRUE)
(default) drops rows where N is missing. format_sumstats(snp_ids_are_rs_ids=TRUE)
(default) controls whether the inputted SNP IDs
are treated as RS IDs or as some arbitrary ID.
format_sumstats(write_vcf=TRUE)
writes a tabix-indexed VCF file instead of
tabular format. format_sumstats(save_path=...)
lets users decide where their results are
saved and what they're named. save_path
indicates it's in tempdir()
, message warns users that
these files will be deleted when R session ends. format_sumstats
via report_summary()
. preview_sumstats()
messages improved. format_sumstats(pos_se=TRUE,effect_columns_nonzero=TRUE)
format_sumstats(log_folder_ind=TRUE,log_folder=tempdir())
format_sumstats(imputation_ind=TRUE)
data(sumstatsColHeaders)
. See
format_sumstats(mapping_file = mapping_file)
.read_vcf
upgraded to account for more VCF formats. check_n_num
now accounts for situations where N is a character vector and converts to numeric.
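The N coercion mentioned above for check_n_num can be pictured with a small base-R sketch. The helper name `coerce_n_numeric` is hypothetical and this is not the package's implementation; it only shows the general idea of converting a character N column to numeric, with unparseable entries becoming NA.

```r
# Hypothetical helper (not part of MungeSumstats): coerce a character
# sample-size column to numeric.
coerce_n_numeric <- function(n) {
  if (is.character(n)) {
    # Non-numeric strings become NA; the coercion warning is suppressed
    n <- suppressWarnings(as.numeric(n))
  }
  n
}

# Toy input, made up for illustration only
n_raw <- c("1000", "2500", "not_available")
n_num <- coerce_n_numeric(n_raw)
```

Rows where the coerced N is NA could then be handled by whatever missing-value policy the pipeline applies.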