Practical 3 - GWAS in Samples with Structure & Using REGENIE

Last updated: 2024-06-13


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/Session03_practical.Rmd) and HTML (docs/Session03_practical.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	36fe58b	Joelle Mbatchou	2024-06-13	update session 3 exercises
html	b68dd42	Joelle Mbatchou	2024-06-13	update session 3 exercises
Rmd	81864cc	Joelle Mbatchou	2024-06-12	update practicals
html	81864cc	Joelle Mbatchou	2024-06-12	update practicals
Rmd	59e563a	Joelle Mbatchou	2024-06-10	update practicals 3
html	59e563a	Joelle Mbatchou	2024-06-10	update practicals 3
Rmd	76be14a	Joelle Mbatchou	2024-06-10	update practicals 1-3
html	76be14a	Joelle Mbatchou	2024-06-10	update practicals 1-3
html	3499240	Joelle Mbatchou	2024-06-05	update session pages

Before you begin:

Make sure that R is installed on your computer
For this lab, we will use the following R libraries:

library(data.table)
library(dplyr)
library(qqman)
library(ggplot2)

Introduction

We will analyze a simulated data set that contains sample structure, to better understand the impact this structure can have on GWAS results when it is not accounted for. We will perform a GWAS on a quantitative phenotype that was simulated to be highly heritable and polygenic.

The file “sim_rels_pheno.txt” contains the phenotype measurements for a set of individuals. The file “sim_rels_geno.bed” is a binary file in PLINK BED format, with accompanying BIM and FAM files, which contains the genotype data at null variants (i.e. variants not associated with the phenotype).

What should we expect the Q-Q and Manhattan plots to look like under this scenario?

Data preparation

Let’s first load the simulated data into the R session. We need to define the path to the directory containing the phenotype and genotype files (change the path below to the location of the files on your machine).

files_dir <- "/SISGM19/data/"

Also specify the paths to the PLINK2 and REGENIE binaries:

plink2_binary <- "/SISGM19/bin/plink2" 
regenie_binary <- "/SISGM19/bin/regenie"

We can now read the files (recall the PLINK BED file is a binary file):

pheno_file <- fread(sprintf("%s/sim_rels_pheno.txt", files_dir), header = TRUE) 
head(pheno_file, 3)

    FID  IID        Pheno
1: 2307 2307  0.009989201
2:  379  379 -1.452527735
3:  478  478  0.110971665

sim_bim <- fread(sprintf("%s/sim_rels_geno.bim", files_dir), header = FALSE)
head(sim_bim, 3)

   V1             V2 V3       V4 V5 V6
1:  1 1:12000011:A:C  0 12000011  A  C
2:  1 1:12000012:A:C  0 12000012  A  C
3:  1 1:12000019:T:C  0 12000019  T  C

sim_fam <- fread(sprintf("%s/sim_rels_geno.fam", files_dir), header = FALSE)
head(sim_fam, 3)

     V1   V2 V3 V4 V5 V6
1: 2307 2307  0  0  1 -9
2:  379  379  0  0  2 -9
3:  478  478  0  0  1 -9

Exercises

Here are some things to try:

Examine the dataset:

How many samples are present? Use str
How many SNPs are present? On how many chromosomes? Use str and table
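As a hint for the two questions above, the sketch below applies the same calls to a tiny made-up table laid out like a BIM file. The values and the name toy_bim are invented for illustration. In the lab you would apply the same calls to sim_bim and sim_fam, which fread returns as data.table objects that behave the same way here.

```r
# Toy stand-in for a BIM table: one row per variant, chromosome code in column V1.
# All values are made up for illustration.
toy_bim <- data.frame(
  V1 = c(1, 1, 2, 2, 2),
  V2 = c("1:100:A:C", "1:200:A:G", "2:150:T:C", "2:300:G:A", "2:450:C:T"),
  V3 = 0,
  V4 = c(100, 200, 150, 300, 450),
  V5 = c("A", "A", "T", "G", "C"),
  V6 = c("C", "G", "C", "A", "T")
)

str(toy_bim)                       # prints the dimensions and column types
n_snps <- nrow(toy_bim)            # one row per SNP
snps_per_chr <- table(toy_bim$V1)  # number of SNPs on each chromosome
n_chr <- length(snps_per_chr)      # number of distinct chromosomes
```

The same pattern applied to the FAM table gives the number of samples, since it has one row per individual.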

Examine the phenotype data:

How many individuals in the study have measurements? Use table(is.na(pheno_file$Pheno))
Plot a histogram to show the distribution of the phenotype. Use the hist() function
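The two phenotype checks can be tried on a small made-up vector first. This is only a sketch: toy_pheno is an invented example, and in the lab the calls would be applied to pheno_file$Pheno.

```r
# Toy phenotype vector with two missing values (made-up numbers).
toy_pheno <- c(0.01, -1.45, 0.11, NA, 0.87, -0.32, NA, 1.20)

# FALSE counts the individuals with a measurement, TRUE the missing ones.
measured <- table(is.na(toy_pheno))
n_measured <- sum(!is.na(toy_pheno))

# hist() drops missing values before binning.
# plot = FALSE returns the bin counts only; remove it to draw the figure.
h <- hist(toy_pheno, breaks = 5, plot = FALSE)
```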

With PLINK, perform association mapping between the phenotype and the variants in the PLINK BED genotype file. Only test SNPs that pass the following quality control filters:

minor allele frequency (MAF) > 0.01
at least a 99% genotyping call rate (less than 1% missing)
HWE p-values greater than 0.001
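To make the first two filters concrete, here is a minimal sketch of what they measure, computed by hand on a toy genotype matrix. Everything in it is an assumption for illustration: the genotypes are made up, the thresholds toy_min_maf and toy_max_missing are not the lab's values, and PLINK computes these quantities internally from the BED file. The HWE test is not shown.

```r
# Toy genotype matrix: rows are samples, columns are SNPs, entries are counts
# of the alternate allele (0, 1 or 2), with NA for a missing call.
geno <- cbind(
  snp1 = c(0, 1, 2, 1, 0, 1, 0, 2, 1, 0),
  snp2 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  snp3 = c(1, NA, 0, 2, NA, 1, 1, 0, 0, 1)
)

# Missing rate per SNP: fraction of samples without a genotype call.
missing_rate <- colMeans(is.na(geno))

# Alternate allele frequency among called genotypes, folded to the minor allele.
alt_freq <- colMeans(geno, na.rm = TRUE) / 2
maf <- pmin(alt_freq, 1 - alt_freq)

# Hypothetical thresholds, chosen only for this toy example.
toy_min_maf <- 0.05
toy_max_missing <- 0.10
keep <- maf >= toy_min_maf & missing_rate <= toy_max_missing
```

In this toy example snp2 is dropped because it is monomorphic and snp3 because too many calls are missing, so only snp1 would be tested.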

The basic command looks like this:

# first fill in the thresholds to use for each filter
filter_maf = 
filter_missing_rate = 
filter_hwe = 

cmd <- sprintf('%s --bfile "%s/sim_rels_geno" --pheno "%s/sim_rels_pheno.txt" --pheno-name Pheno --maf %g --geno %g --hwe %g --glm allow-no-covars --out gwas_plink', plink2_binary, files_dir, files_dir, filter_maf, filter_missing_rate, filter_hwe)
system(cmd, intern = TRUE)

The results of the GWAS are stored in gwas_plink.Pheno.glm.linear.

Make a Manhattan plot of the association results. Make sure to check what information is stored in the PLINK output file (using str()).

plink.gwas <- fread("gwas_plink.Pheno.glm.linear", header = TRUE)
# Plot variants in file order and alternate colours by chromosome parity
plot(
  x = seq_len(nrow(plink.gwas)),
  y = -log10(plink.gwas$P),
  col = c("orange", "purple")[1 + plink.gwas$`#CHROM` %% 2],
  xaxt = "n", xlab = "Genomic position", ylab = "Observed -log10(P)"
)

Make a Q-Q plot of the association results.

qq(plink.gwas$P)
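To see what a Q-Q plot of p-values displays, the sketch below builds one by hand in base R. It uses simulated null p-values rather than the lab results, and it illustrates the general construction rather than the exact internals of the qq() function.

```r
# Simulated p-values under the null hypothesis (uniform on [0, 1]);
# these stand in for a real GWAS p-value column.
set.seed(1)
p <- runif(1000)

# Observed quantiles: sorted p-values on the -log10 scale (smallest p first).
observed <- -log10(sort(p))
# Expected quantiles under the null: evenly spaced probabilities from ppoints().
expected <- -log10(ppoints(length(p)))

plot(expected, observed,
     xlab = "Expected -log10(P)", ylab = "Observed -log10(P)")
abline(0, 1, col = "red")  # points follow this line when there is no inflation
```

With confounding from sample structure, the observed points tend to lift above the red line across much of the range, not only in the extreme tail.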

Compute the genomic control inflation factor $\lambda_{GC}$ based on the p-values. Is there evidence of possible inflation due to confounding?

chisq.values <- qchisq(plink.gwas$P, 1, lower.tail = FALSE)
median(chisq.values)
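The median above becomes $\lambda_{GC}$ once it is divided by the median of a 1-df chi-square distribution under the null, qchisq(0.5, 1), which is about 0.455. A minimal sketch of that calculation follows. The helper name lambda_gc is hypothetical (it is not part of any package used in this lab), and the p-values are simulated rather than taken from the PLINK output.

```r
# Genomic control inflation factor from a vector of p-values.
# `lambda_gc` is a hypothetical helper name used only in this sketch.
lambda_gc <- function(p) {
  p <- p[!is.na(p)]
  # Convert p-values back to 1-df chi-square statistics
  chisq <- qchisq(p, df = 1, lower.tail = FALSE)
  # Compare the observed median with the null median (about 0.455)
  median(chisq) / qchisq(0.5, df = 1)
}

set.seed(1)
p_null <- runif(10000)  # simulated null p-values, not lab data
lambda_gc(p_null)       # close to 1 when there is no inflation
```

In the lab, lambda_gc(plink.gwas$P) would give the corresponding value for the PLINK results. Values well above 1 point to inflated test statistics.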

Now we will run REGENIE to perform a GWAS of the phenotype using a whole-genome regression model. We first want to extract a set of high-quality variants for the Step 1 null model fitting. Using PLINK, apply QC filters to remove variants with MAF below 5%, missingness above 1%, HWE p-value below 0.001, or minor allele count (MAC) below 20. We will use --write-snplist to store the list of variants passing QC without making a new BED file.

# first fill in the thresholds to use for each filter
filter_maf = 
filter_missing_rate = 
filter_hwe = 
filter_mac = 

cmd <- sprintf('%s --bfile "%s/sim_rels_geno" --pheno "%s/sim_rels_pheno.txt" --pheno-name Pheno --maf %g --geno %g --hwe %g --mac %g --write-snplist --out qc_pass', plink2_binary, files_dir, files_dir, filter_maf, filter_missing_rate, filter_hwe, filter_mac)
system(cmd, intern = TRUE)

This produces a file qc_pass.snplist containing a list of variant IDs that pass the QC filters.

If REGENIE software is installed on your machine

Run REGENIE Step 1 to fit the null model and obtain polygenic predictions using a leave-one-chromosome-out (LOCO) scheme.

cmd <- sprintf('%s --bed "%s/sim_rels_geno" --phenoFile "%s/sim_rels_pheno.txt" --phenoCol Pheno --qt --step 1 --loocv --bsize 1000 --extract qc_pass.snplist --out gwas_regenie', regenie_binary, files_dir, files_dir)
system(cmd, intern = TRUE)

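The LOCO polygenic predictions for the phenotype are stored in gwas_regenie_1.loco. Step 1 also writes gwas_regenie_pred.list, which lists the prediction file for each phenotype and is the file passed to --pred in Step 2.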
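The idea behind LOCO predictions can be illustrated with a deliberately simplified sketch: for each chromosome, the polygenic prediction is built only from SNPs on the other chromosomes, so that a SNP under test is never adjusted for its own signal. This is not REGENIE's algorithm, which uses a two-level block ridge regression. The example below swaps in a single ridge regression on simulated data, and ridge_fit, loco_pred_toy and the penalty value 50 are hypothetical choices for illustration.

```r
set.seed(1)
n <- 200                                   # samples (simulated)
chr <- rep(1:3, each = 20)                 # chromosome label for each of 60 SNPs
G <- matrix(rbinom(n * length(chr), 2, 0.3), nrow = n)
G <- scale(G)                              # centre and scale each SNP
# Polygenic phenotype: many small effects plus noise, then standardised
y <- as.vector(scale(G %*% rnorm(ncol(G), sd = 0.1) + rnorm(n)))

# Ridge regression coefficients for design matrix X and penalty lambda
ridge_fit <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# One prediction vector per chromosome, built only from SNPs on the other chromosomes
loco_pred_toy <- sapply(unique(chr), function(chr_out) {
  X <- G[, chr != chr_out, drop = FALSE]
  as.vector(X %*% ridge_fit(X, y, lambda = 50))
})
colnames(loco_pred_toy) <- paste0("chr", unique(chr))

# When testing a SNP on chromosome 1, the phenotype is adjusted with column "chr1"
resid_chr1 <- y - loco_pred_toy[, "chr1"]
```

Each column of loco_pred_toy plays the role of one row of the .loco file: it is the prediction to remove from the phenotype before testing variants on that chromosome.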

Run REGENIE Step 2 to perform association testing.

cmd <- sprintf('%s --bed "%s/sim_rels_geno" --phenoFile "%s/sim_rels_pheno.txt" --phenoCol Pheno --qt --step 2 --bsize 400 --pred gwas_regenie_pred.list --out step2_gwas_regenie', regenie_binary, files_dir, files_dir)
system(cmd, intern = TRUE)

The REGENIE summary statistics will be in step2_gwas_regenie_Pheno.regenie.

Generate Manhattan and Q-Q plots based on the REGENIE association results and compute $\lambda_{GC}$. Compare with output from Questions 4-6.
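One practical point when reusing the earlier plotting code: REGENIE reports p-values on the -log10 scale in a column named LOG10P, alongside the test statistic in CHISQ, rather than in a P column. The sketch below shows the conversion on a toy table. It assumes the column layout described in the REGENIE documentation, and the rows are made up rather than read from step2_gwas_regenie_Pheno.regenie.

```r
# Toy table mimicking a few columns of a REGENIE Step 2 output file.
# Column names follow the REGENIE documentation; the values are made up.
toy_regenie <- data.frame(
  CHROM  = c(1, 1, 2, 2),
  GENPOS = c(100, 200, 150, 300),
  ID     = c("1:100:A:C", "1:200:A:G", "2:150:T:C", "2:300:G:A"),
  CHISQ  = c(0.10, 1.20, 0.45, 3.84)
)
# Build LOG10P from CHISQ so the toy columns are consistent with each other
toy_regenie$LOG10P <- -log10(pchisq(toy_regenie$CHISQ, df = 1, lower.tail = FALSE))

# Convert back to the p-value scale, e.g. for a Q-Q plot function that expects p-values
toy_regenie$P <- 10^(-toy_regenie$LOG10P)

# The inflation factor can be computed directly from the chi-square statistics
lambda_toy <- median(toy_regenie$CHISQ) / qchisq(0.5, df = 1)
```

For the Manhattan plot, LOG10P can be used directly as the y-axis value in place of -log10(P).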

If REGENIE software does not run on your machine

We will use an implementation of REGENIE in R. Download it here and set the variable regenie_script to the path of the script on your machine.

regenie_script <- "/Users/xyz/Downloads/run_regenie.r"
source(regenie_script)

We now run REGENIE Step 1 to fit the null model and obtain polygenic predictions using a leave-one-chromosome-out (LOCO) scheme.

loco_pred <- run_regenie_step1(
  bedfile = paste0(files_dir, "/sim_rels_geno"),
  phenofile = paste0(files_dir, "/sim_rels_pheno.txt"),
  phenocol = "Pheno",
  bsize = 1000,
  extract = "qc_pass.snplist"
)

This function will return the LOCO polygenic predictions for the phenotype.

Run REGENIE Step 2 to perform association testing.

sumstats_regenie <- run_regenie_step2(
  bedfile = paste0(files_dir, "/sim_rels_geno"),
  phenofile = paste0(files_dir, "/sim_rels_pheno.txt"),
  phenocol = "Pheno",
  bsize = 200,
  loco.mat = loco_pred
) 

str(sumstats_regenie)

This function returns a data frame containing the REGENIE summary statistics.

Generate Manhattan and Q-Q plots based on the REGENIE association results and compute $\lambda_{GC}$. Compare with output from Questions 4-6.

sessionInfo()

R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.4.2     qqman_0.1.8       dplyr_1.1.2       data.table_1.14.8

loaded via a namespace (and not attached):
 [1] gtable_0.3.3     jsonlite_1.8.5   compiler_4.3.0   highr_0.10      
 [5] promises_1.2.0.1 tidyselect_1.2.0 Rcpp_1.0.10      stringr_1.5.0   
 [9] git2r_0.32.0     later_1.3.1      jquerylib_0.1.4  scales_1.2.1    
[13] yaml_2.3.7       fastmap_1.1.1    R6_2.5.1         generics_0.1.3  
[17] workflowr_1.7.0  knitr_1.43       MASS_7.3-58.4    tibble_3.2.1    
[21] munsell_0.5.0    rprojroot_2.0.3  bslib_0.5.0      pillar_1.9.0    
[25] rlang_1.1.1      utf8_1.2.3       calibrate_1.7.7  cachem_1.0.8    
[29] stringi_1.7.12   httpuv_1.6.11    xfun_0.39        fs_1.6.2        
[33] sass_0.4.6       cli_3.6.1        withr_2.5.0      magrittr_2.0.3  
[37] grid_4.3.0       digest_0.6.31    rstudioapi_0.14  lifecycle_1.0.3 
[41] vctrs_0.6.2      evaluate_0.21    glue_1.6.2       whisker_0.4.1   
[45] colorspace_2.1-0 fansi_1.0.4      rmarkdown_2.22   tools_4.3.0     
[49] pkgconfig_2.0.3  htmltools_0.5.5