Because GitHub limits the size of repositories and files, the histone bigWig files are not included in the LocusExplorer/Data/EncodeBigWig/ folder. These files are public and can be downloaded from the UCSC golden path (about 2.5GB in total). Downloaded bigWig files must be saved in the LocusExplorer/Data/EncodeBigWig/ folder.
The UCSC golden path contains the publicly downloadable files associated with the ENCODE track.
wgEncodeBroadHistoneGm12878H3k27acStdSig.bigWig 27-Jan-2011 10:33 250M
wgEncodeBroadHistoneH1hescH3k27acStdSig.bigWig 27-Jan-2011 11:57 651M
wgEncodeBroadHistoneHsmmH3k27acStdSig.bigWig 27-Jan-2011 17:04 307M
wgEncodeBroadHistoneHuvecH3k27acStdSig.bigWig 27-Jan-2011 19:55 296M
wgEncodeBroadHistoneK562H3k27acStdSig.bigWig 27-Jan-2011 21:18 292M
wgEncodeBroadHistoneNhekH3k27acStdSig.bigWig 28-Jan-2011 00:47 280M
wgEncodeBroadHistoneNhlfH3k27acStdSig.bigWig 28-Jan-2011 01:51 258M
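The file listing above can be turned into a download list. The following is a minimal sketch, not part of LocusExplorer: the base URL is an assumption (this README does not give the exact golden path address, so check it against the UCSC downloads page first), and `bigwig_urls` is a hypothetical helper name. The script only prints the URLs; the download itself is left commented out because of its size.

```shell
#!/bin/sh
# ASSUMPTION: base URL of the UCSC golden path directory; verify before use.
BASE_URL="https://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone"
DEST_DIR="LocusExplorer/Data/EncodeBigWig"
# Cell lines taken from the file names in the listing above.
CELL_LINES="Gm12878 H1hesc Hsmm Huvec K562 Nhek Nhlf"

# Hypothetical helper: print one URL per H3K27ac bigWig file.
bigwig_urls() {
  for cell in $CELL_LINES; do
    printf '%s/wgEncodeBroadHistone%sH3k27acStdSig.bigWig\n' "$BASE_URL" "$cell"
  done
}

bigwig_urls

# To download, uncomment the next line:
# mkdir -p "$DEST_DIR" && bigwig_urls | wget --continue --directory-prefix="$DEST_DIR" --input-file=-
```

Printing the list first makes it easy to check the file names against the listing before committing to the download.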
Chemical modifications (e.g. methylation and acylation) to the histone proteins present in chromatin influence gene expression by changing how accessible the chromatin is to transcription. A specific modification of a specific histone protein is called a histone mark. This track shows the levels of enrichment of the H3K27Ac histone mark across the genome as determined by a ChIP-seq assay. The H3K27Ac histone mark is the acetylation of lysine 27 of the H3 histone protein, and it is thought to enhance transcription possibly by blocking the spread of the repressive histone mark H3K27Me3. Additional histone marks and other chromatin associated ChIP-seq data is available at the Broad Histone page.
This track shows data from the Bernstein Lab at the Broad Institute. The Bernstein lab is part of the ENCODE consortium.
LocusExplorer/Data/CustomDataExample
1. Association File
Association File is mandatory for plot generation. All other files are optional but enhance plot aesthetics and interpretation.
Example values:
- Chromosome, e.g. chr2, chrX
- SNP name, e.g. rs12345, chr10:104329988:D
- Base pair position, e.g. 104356185
- Typed flag: 2 for typed and 1 for imputed variants
2. LD File
LD File is not mandatory but is recommended for more informative plots. If user-supplied LD data is not available, see the Make LD file tab for instructions on how LD data relative to the index SNP(s) can be obtained from the 1000 Genomes Project Phase 3 Dataset.
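Example values:
- Chromosome (CHR_A, CHR_B), e.g. 2, 23
- Base pair position (BP_A, BP_B), e.g. 104356185, 104315667
- SNP name (SNP_A, SNP_B), e.g. rs10786679, chr10:104329988:D
- LD value (R2), e.g. 1, 0.740917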
Note: Lead SNP must be defined relative to itself for plotting purposes, e.g.:
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
2 173309618 rs13410475 2 173309618 rs13410475 1
2 173309618 rs13410475 2 172827293 rs148800555 0.0906124
When the LD file is generated with the plink or LDlink method, this row does not need to be added manually.
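For a user-supplied LD file, the missing self row can be appended automatically. The following is a rough sketch rather than part of LocusExplorer: `add_self_ld` is a hypothetical helper, and it assumes a whitespace-separated file with one header line and the column order shown in the example above. All rows are written back with single spaces between columns.

```shell
#!/bin/sh
# Hypothetical helper: print the LD file with a self row (R2 = 1) appended
# for every lead SNP in SNP_A that does not already have one.
# ASSUMPTION: whitespace-separated columns in the order
# CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2, with one header line.
add_self_ld() {
  awk '
    NR == 1  { $1 = $1; print; next }   # header, re-joined with single spaces
             { $1 = $1; print           # data row, normalised the same way
               lead[$3] = $1 " " $2 }   # remember CHR and BP of each lead SNP
    $3 == $6 { has_self[$3] = 1 }       # lead SNP already paired with itself
    END {
      for (snp in lead)
        if (!(snp in has_self))
          print lead[snp], snp, lead[snp], snp, 1
    }
  ' "$1"
}

# Demo on a temporary file holding one row from the example above.
ld_file=$(mktemp)
cat > "$ld_file" <<'EOF'
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
2 173309618 rs13410475 2 172827293 rs148800555 0.0906124
EOF
add_self_ld "$ld_file"
```

The appended rows go at the end of the file, so row order is not preserved relative to a file where the self row comes first.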
3. [! Disabled !] Custom bedGraph Track
Note: This feature is currently disabled, and will be available in version 0.8. See related GitHub issue.
The first four required bedGraph fields are:
chrom - The name of the chromosome (e.g. chr3, chrY).
chromStart - The starting position of the feature in the chromosome.
chromEnd - The ending position of the feature in the chromosome.
score - A score, any number.
See BedGraph Track Format for more details.
File is tab separated and has no header. This file will be used to create a bar chart. Score is the height, e.g.:
chr2 173292313 173371181 -100
chr2 173500000 173520000 1000
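A custom track can be checked against the four-field layout described above before it is uploaded. This is only a sketch: `check_bedgraph` is a hypothetical helper, not a LocusExplorer or UCSC tool, and the rule that chromStart must be smaller than chromEnd is an assumption of the sketch.

```shell
#!/bin/sh
# Hypothetical helper: report lines that break the four-field bedGraph layout.
# Exit status is 0 when every line passes and 1 otherwise.
check_bedgraph() {
  awk -F '\t' '
    NF != 4 {
      printf "line %d: expected 4 tab-separated fields, found %d\n", NR, NF
      bad = 1; next
    }
    $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
      printf "line %d: chromStart and chromEnd must be whole numbers\n", NR
      bad = 1; next
    }
    $2 + 0 >= $3 + 0 {
      # ASSUMPTION of this sketch: a feature covers at least one base.
      printf "line %d: chromStart must be smaller than chromEnd\n", NR
      bad = 1; next
    }
    $4 !~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$/ {
      printf "line %d: score is not a number\n", NR
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Demo on the two example rows above, written with tab separators.
track=$(mktemp)
printf 'chr2\t173292313\t173371181\t-100\n' >  "$track"
printf 'chr2\t173500000\t173520000\t1000\n' >> "$track"
check_bedgraph "$track" && echo "bedGraph looks fine"
```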
LocusExplorer should be used for illustrative purposes only. Any results provided by LocusExplorer should be used with caution.
The source code and installation instructions for LocusExplorer are available at https://github.com/oncogenetics/LocusExplorer.
LocusExplorer is made available under the MIT license.
LocusExplorer runs in the R environment but is designed to be an easy to use interface that does not require familiarity with R as a prerequisite. LocusExplorer is platform agnostic and able to run on any operating system for which R is available.
LocusExplorer requires R version 3.2.2 or later. R can be downloaded by following the instructions at https://www.r-project.org/. Some required packages are not available for earlier versions of R.
After installation of the R software, R packages used by LocusExplorer must be installed prior to use. This may take a few minutes, but is only required on the first occasion. To install packages, open the R program, copy the following code into the R console and hit Return:
# Install CRAN packages, if missing
packages <- c("shiny","dplyr","tidyr","lazyeval","data.table","ggplot2","ggrepel","knitr","markdown","DT","lattice","acepack","cluster","DBI","colourpicker","igraph","visNetwork", "devtools")
missingCran <- setdiff(packages, rownames(installed.packages()))
if (length(missingCran) > 0) {
  install.packages(missingCran, dependencies = TRUE)
} else { print("All required CRAN packages installed") }
# Install Bioconductor packages, if missing.
# biocLite() was retired with Bioconductor 3.8; BiocManager replaces it on R >= 3.5.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
bioc <- c("ggbio","GenomicRanges","TxDb.Hsapiens.UCSC.hg19.knownGene","org.Hs.eg.db","rtracklayer")
missingBioc <- setdiff(bioc, rownames(installed.packages()))
if (length(missingBioc) > 0) {
  BiocManager::install(missingBioc)
} else { print("All required Bioconductor packages installed") }
# Install GitHub packages
devtools::install_github("oncogenetics/oncofunco")
LocusExplorer runs through a web browser and uses an intuitive interface that does not require high level computational skills to operate.
Open RStudio (start a new R session), copy the following code into the console and hit Return:
library(shiny)
runGitHub("LocusExplorer", "oncogenetics", launch.browser = TRUE)
Click on the Download as ZIP button; this will download the repository locally as a zip file, LocusExplorer-master.zip. Unzip the folder. Open the ui.R file in RStudio (start a new R session) and click on the Run App button at the top right corner (please ensure the Run External option is selected for full functionality), or run the code below.
library(shiny)
runApp(launch.browser = TRUE)
LocusExplorer: a user-friendly tool for integrated visualisation of genetic association data and biological annotations
Tokhir Dadaev1, Daniel A Leongamornlert1, Edward J Saunders1, Rosalind Eeles1,2 , Zsofia Kote-Jarai1
1Department of Genetics and Epidemiology, The Institute of Cancer Research, London, UK
2Royal Marsden NHS Foundation Trust, London, UK
Bioinformatics first published online November 20, 2015 doi:10.1093/bioinformatics/btv690
Summary: In this article we present LocusExplorer, a data visualisation and exploration tool for genetic association data. LocusExplorer is written in R using the Shiny library, providing access to powerful R-based functions through a simple user interface. LocusExplorer allows users to simultaneously display genetic, statistical and biological data for humans in a single image and allows dynamic zooming and customisation of the plot features. Publication quality plots may then be produced in a variety of file formats.
Availability and implementation: LocusExplorer is open source and runs through R and a web browser. It is available at www.oncogenetics.icr.ac.uk/LocusExplorer/ or can be installed locally and the source code accessed from https://github.com/oncogenetics/LocusExplorer.
Multiple novel prostate cancer susceptibility signals identified by fine-mapping of known risk loci among Europeans. Al Olama AA, et al.
Accepted, not published yet:
See FAQ.
Questions, suggestions, and bug reports are welcome and appreciated.
Application tested on 24/10/2020 11:10
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] oncofunco_0.0.0.9000 rtracklayer_1.48.0
[3] org.Hs.eg.db_3.11.4 TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[5] GenomicFeatures_1.40.1 AnnotationDbi_1.50.3
[7] Biobase_2.48.0 GenomicRanges_1.40.0
[9] GenomeInfoDb_1.24.2 IRanges_2.22.2
[11] S4Vectors_0.26.1 ggbio_1.36.0
[13] BiocGenerics_0.34.0 visNetwork_2.0.9
[15] igraph_1.2.6 colourpicker_1.1.0
[17] DBI_1.1.0 cluster_2.1.0
[19] acepack_1.4.1 lattice_0.20-41
[21] DT_0.16 markdown_1.1
[23] knitr_1.30 ggrepel_0.8.2
[25] ggplot2_3.3.2 data.table_1.13.2
[27] lazyeval_0.2.2 tidyr_1.1.2
[29] dplyr_1.0.2 shiny_1.5.0
loaded via a namespace (and not attached):
[1] colorspace_1.4-1 ellipsis_0.3.1 biovizBase_1.36.0
[4] htmlTable_2.1.0 XVector_0.28.0 base64enc_0.1-3
[7] dichromat_2.0-0 rstudioapi_0.11 farver_2.0.3
[10] bit64_4.0.5 xml2_1.3.2 splines_4.0.2
[13] Formula_1.2-4 jsonlite_1.7.1 Rsamtools_2.4.0
[16] dbplyr_1.4.4 png_0.1-7 graph_1.66.0
[19] BiocManager_1.30.10 compiler_4.0.2 httr_1.4.2
[22] backports_1.1.10 assertthat_0.2.1 Matrix_1.2-18
[25] fastmap_1.0.1 later_1.1.0.1 htmltools_0.5.0
[28] prettyunits_1.1.1 tools_4.0.2 gtable_0.3.0
[31] glue_1.4.2 GenomeInfoDbData_1.2.3 reshape2_1.4.4
[34] rappdirs_0.3.1 Rcpp_1.0.5 vctrs_0.3.4
[37] Biostrings_2.56.0 nlme_3.1-148 crosstalk_1.1.0.1
[40] xfun_0.18 stringr_1.4.0 mime_0.9
[43] miniUI_0.1.1.1 lifecycle_0.2.0 ensembldb_2.12.1
[46] XML_3.99-0.5 zlibbioc_1.34.0 scales_1.1.1
[49] BSgenome_1.56.0 VariantAnnotation_1.34.0 ProtGenerics_1.20.0
[52] hms_0.5.3 promises_1.1.1 RBGL_1.64.0
[55] SummarizedExperiment_1.18.2 AnnotationFilter_1.12.0 RColorBrewer_1.1-2
[58] yaml_2.2.1 curl_4.3 memoise_1.1.0
[61] gridExtra_2.3 biomaRt_2.44.4 rpart_4.1-15
[64] reshape_0.8.8 latticeExtra_0.6-29 stringi_1.5.3
[67] RSQLite_2.2.1 checkmate_2.0.0 BiocParallel_1.22.0
[70] rlang_0.4.8 pkgconfig_2.0.3 matrixStats_0.57.0
[73] bitops_1.0-6 purrr_0.3.4 labeling_0.4.2
[76] GenomicAlignments_1.24.0 htmlwidgets_1.5.2 bit_4.0.4
[79] tidyselect_1.1.0 GGally_2.0.0 plyr_1.8.6
[82] magrittr_1.5 R6_2.4.1 generics_0.0.2
[85] Hmisc_4.4-1 DelayedArray_0.14.1 mgcv_1.8-31
[88] pillar_1.4.6 foreign_0.8-80 withr_2.3.0
[91] survival_3.1-12 RCurl_1.98-1.2 nnet_7.3-14
[94] tibble_3.0.4 crayon_1.3.4 OrganismDbi_1.30.0
[97] BiocFileCache_1.12.1 jpeg_0.1-8.1 progress_1.2.2
[100] grid_4.0.2 blob_1.2.1 digest_0.6.26
[103] xtable_1.8-4 httpuv_1.5.4 openssl_1.4.3
[106] munsell_0.5.0 askpass_1.1
Note:
This procedure will calculate LD relative to one top index SNP only. For regions with multiple index SNPs, the process can be performed separately for each individual index SNP and the individual processed LD files combined to make a single input LD file.
LD relative to the index SNP cannot be calculated from publicly available data for variants that are not present in the 1000 Genomes phase 3 dataset (dbSNP 142). When LD information is generated in this way, these variants will therefore always appear to be uncorrelated on the plot.
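For regions with several index SNPs, the per-SNP LD files mentioned in the note can be merged into one input file. The following is a minimal sketch; `combine_ld` is a hypothetical helper, and it assumes every file starts with the same single header line.

```shell
#!/bin/sh
# Hypothetical helper: concatenate per-index-SNP LD files, keeping the
# header line of the first file only.
combine_ld() {
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}

# Demo with two small temporary files standing in for per-SNP LD files.
ld_one=$(mktemp)
ld_two=$(mktemp)
cat > "$ld_one" <<'EOF'
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
2 173309618 rs13410475 2 173309618 rs13410475 1
EOF
cat > "$ld_two" <<'EOF'
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
17 834 rs9747082 17 828 rs62053745 0.678207
EOF
combine_ld "$ld_one" "$ld_two"
```

In practice the output would be redirected to a new file, which is then used as the single LD input.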
We need to install tabix and PLINK 1.9. Download and install the HTSlib package, which includes tabix, then download and install PLINK 1.9.
We then download the 1000 Genomes VCF file using tabix and calculate LD using PLINK.
Example:
Download the vcf for the region of interest, 16:56995835-57017756, from the 1000 Genomes ftp site using tabix:
tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 16:56995835-57017756 > genotypes.vcf
plink --vcf genotypes.vcf --ld rs9935228 rs1864163
#output
--ld rs9935228 rs1864163:
R-sq = 7.20831e-05 D' = 0.0355584
Haplotype Frequency Expectation under LE
--------- --------- --------------------
GA 0.005359 0.004853
AA 0.249013 0.249519
GG 0.013719 0.014225
AG 0.731909 0.731403
In phase alleles are GA/AG
We will need a file listing the SNPs, one SNP per row.
# Example file snplist.txt
> cat snplist.txt
# rs9935228
# rs1864163
Now we pass snplist to plink. To learn more about plink options selected below see here and here.
plink --vcf genotypes.vcf \
--r2 \
--ld-snp-list snplist.txt \
--ld-window-kb 1000 \
--ld-window 99999 \
--ld-window-r2 0 \
--out LD_rs9935228_rs1864163
Prostate summary data can be downloaded at The Prostate Cancer Association Group to Investigate Cancer Associated Alterations in the Genome (PRACTICAL) consortium website:
I don't know what uc031tcg.1 is (but I do understand PCAT1, I think that naming convention is RefSeq, are the others also?).
A: I am trying to rebuild gene symbols by collapsing transcripts into genes. When transcripts do not overlap with gene symbols, they are named by their transcript names, in this case uc031tcg.1, which is a UCSC ID. See udf_GeneSymbol for details. This part of the script is quite heavy and we are working on it.
So far I am using this sentence: “The colored lines spanning the plotting region indicate the extent of LD for the lead SNPs with the same color designation, where the height of the line represents __ and the length of the line represents ___.” Can you send me a better sentence to describe how these lines should be interpreted if I am not on the right track here?
A: Each line is a loess smoothing for the matching hit SNP. If there are 2 hit SNPs marked with red and green shape and fill, then we will have 2 matching loess lines, red and green. Smoothing uses LD values from the 1000G phase EUR subset. We can use different LD cut-offs (LD = 0, LD > 0.1, LD >= 0.2, etc.); usually LD = 0, i.e. including all SNP LD values, works best.
The Y axis runs from 0 to 1, the minimum and maximum values of LD (R2). When the wavy lines have a similar shape, we can safely assume that those SNPs are the same signal. There is also an “R2” track: the darker the lines, the higher the LD. This track also helps to see visually how hit SNPs overlap.
See Manhattan.R.
# LD smooth per hit SNP - optional
if("LDSmooth" %in% input$ShowHideTracks) {
  gg_out <- gg_out +
    geom_smooth(data = plotDatLD(), aes(x = BP, y = R2_Adj, col = LDSmoothCol),
                method = "loess", se = FALSE)
}
What is the colour interpretation for the histone panel?
A: Data from ENCODE project, see links for more info.
Histone - LocusExplorer README.md, UCSC tables
DNaseI - LocusExplorer README.md, UCSC tables
A: See RSessionInfo.md
A: Because GitHub limits the size of repositories and files, the histone bigWig files are not included in the LocusExplorer/Data/EncodeBigWig/ folder. These files are public and can be downloaded from the UCSC golden path (about 2.5GB in total). Downloaded bigWig files must be saved in the LocusExplorer/Data/EncodeBigWig/ folder.
We are working on a server version of LocusExplorer, expected to be live by December 2015. Keep an eye on the https://github.com/oncogenetics/LocusExplorer page. This will resolve the issues of different R versions and R packages, and remove the limits on annotation data such as bigWig files.
After successfully uploading data using the Input Data tab, RStudio crashes when Plot Settings is clicked. This happens even when we run RGUI.
A: It is hard to guess the cause of the crash; it could be package dependencies with different versions. We can try to reinstall the packages.
Try below steps:
Run .libPaths() to get the library paths, and choose the one where the user has write access. For example, on my machine I will choose the first of the two listed folders.
> .libPaths()
[1] "C:/Users/tdadaev/Documents/R/win-library/3.2" "C:/Program Files/R/R-3.2.2/library"
myLibraryLocation <- .libPaths()[1]
# all packages excluding base
allPackages <- installed.packages()
allPackages <- allPackages[ is.na(allPackages[,4]), 1]
# CRAN packages required by this Application.
cranPackages <- c("shiny", "shinyjs", "data.table", "dplyr", "tidyr",
"ggplot2", "knitr", "markdown", "stringr","DT","seqminer",
"lattice","cluster")
# Bioconductor packages required by this Application.
bioPackages <- c("ggbio","GenomicRanges","TxDb.Hsapiens.UCSC.hg19.knownGene",
"org.Hs.eg.db","rtracklayer")
# remove all packages
sapply(allPackages, remove.packages)
Then reinstall the packages into the myLibraryLocation folder.
#reinstall CRAN packages
sapply(cranPackages, install.packages, lib = myLibraryLocation)
#reinstall Bioconductor packages
source("http://bioconductor.org/biocLite.R")
biocLite("BiocInstaller")
biocLite(bioPackages)
When creating LD file using “LD Tutorial”, getting “Duplicate ID '.'” error.
A: Here is a guide on how to get LD files when there are duplicated IDs in the 1000 Genomes vcf files. I am using your example snplist.txt file.
We need to get the 1000 Genomes vcf for the region; I usually add a 250Kb flank around the SNPs in snplist.txt.
tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 17:1-71324000 > genotypes.vcf
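A region string with a 250Kb flank can be derived from SNP positions instead of being typed by hand. The following is a small sketch: `flank_region` is a hypothetical helper, and the two positions in the demo are used only as illustrative input.

```shell
#!/bin/sh
# Hypothetical helper: print CHR:START-END covering all given positions
# plus a flank on each side. Usage: flank_region CHR FLANK POS [POS ...]
flank_region() {
  chr=$1
  flank=$2
  shift 2
  printf '%s\n' "$@" | awk -v chr="$chr" -v flank="$flank" '
    NR == 1 || $1 < min { min = $1 }   # smallest position seen so far
    NR == 1 || $1 > max { max = $1 }   # largest position seen so far
    END {
      start = min - flank
      if (start < 1) start = 1         # do not run past the chromosome start
      printf "%s:%d-%d\n", chr, start, max + flank
    }
  '
}

# Illustrative input: chromosome 16, 250Kb flank, two positions.
flank_region 16 250000 56995835 57017756
# prints 16:56745835-57267756
```

The printed string can be passed to tabix in place of a hand-written region.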
Running plink on this file, as you described, gives an error:
plink --vcf genotypes.vcf \
--r2 \
--ld-snp-list snplist.txt \
--ld-window-kb 1000 \
--ld-window 99999 \
--ld-window-r2 0 \
--out myLD
# ...
# Error: Duplicate ID '.'.
Some SNP names are actually dots ("."), because they have no RS ID attached to them, so we need to give them unique names.
Convert to plink format:
plink --vcf genotypes.vcf \
--make-bed \
--out genotypes
Here is the cause of the problem, dots as SNP IDs:
head -n5 genotypes.bim
# 17 . 0 56 T C
# 17 . 0 78 C G
# 17 . 0 355 A G
# 17 . 0 684 T C
# 17 rs62053745 0 828 T C
Run this R script to make unique names:
Rscript makeSNPnamesUnique.R genotypes.bim
head -n5 genotypes.bim
# 17 17_56_T_C 0 56 T C
# 17 17_78_C_G 0 78 C G
# 17 17_355_A_G 0 355 A G
# 17 17_684_T_C 0 684 T C
# 17 rs62053745 0 828 T C
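The renaming step can also be approximated without R. This is a sketch of the same idea and not the makeSNPnamesUnique.R script itself: `name_missing_ids` is a hypothetical helper that builds names of the form CHR_BP_A1_A2, matching the pattern in the output above, and writes tab-separated rows.

```shell
#!/bin/sh
# Hypothetical helper: give every "." ID in a PLINK .bim file a name of the
# form CHR_BP_A1_A2. Columns: chromosome, ID, cM, position, allele 1, allele 2.
name_missing_ids() {
  awk 'BEGIN { OFS = "\t" }
       $2 == "." { $2 = $1 "_" $4 "_" $5 "_" $6 }
       { $1 = $1; print }' "$1"
}

# Demo on two rows taken from the listing above.
bim=$(mktemp)
printf '17\t.\t0\t56\tT\tC\n'          >  "$bim"
printf '17\trs62053745\t0\t828\tT\tC\n' >> "$bim"
name_missing_ids "$bim"
```

To apply it to a real file, write the output to a temporary file and then move it over the original .bim. Two unnamed variants at the same position with the same alleles would still receive identical names, so the result is worth checking for duplicates.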
Now that the SNP IDs are fixed, we run plink LD as usual:
plink --bfile genotypes \
--r2 \
--ld-snp-list snplist.txt \
--ld-window-kb 1000 \
--ld-window 99999 \
--ld-window-r2 0 \
--out myLD
Note: If the output file is too big, consider setting a higher R2 filter value, for example --ld-window-r2 0.2, so that R2 values below 0.2 are left out of the output.
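An LD file that has already been produced can also be trimmed afterwards, without rerunning plink. The following is a rough sketch: `filter_ld` is a hypothetical helper, it assumes R2 is the seventh whitespace-separated column, and keeping rows at or above the cut-off is a choice of this sketch.

```shell
#!/bin/sh
# Hypothetical helper: keep the header and the rows whose R2 (column 7)
# is at least the given cut-off. Usage: filter_ld FILE CUTOFF
filter_ld() {
  awk -v min_r2="$2" 'NR == 1 || $7 + 0 >= min_r2 + 0' "$1"
}

# Demo on a small temporary file in the plink .ld layout.
ld_out=$(mktemp)
cat > "$ld_out" <<'EOF'
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
17 834 rs9747082 17 56 17_56_T_C 0.00137027
17 834 rs9747082 17 828 rs62053745 0.678207
EOF
filter_ld "$ld_out" 0.2
```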
Check the output
wc -l myLD.ld
# 49172 myLD.ld
head -n5 myLD.ld
# CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
# 17 834 rs9747082 17 56 17_56_T_C 0.00137027
# 17 834 rs9747082 17 355 17_355_A_G 0.00151223
# 17 834 rs9747082 17 684 17_684_T_C 0.00127126
# 17 834 rs9747082 17 828 rs62053745 0.678207
Finally, we can use this file as input for LocusExplorer.