Practical 3 Key - GWAS in Samples with Structure & Using REGENIE

Last updated: 2024-06-13


Repository version: 87fbbbc


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/Session03_practical_Key.Rmd) and HTML (docs/Session03_practical_Key.html) files:

File	Version	Author	Date	Message
Rmd	87fbbbc	Joelle Mbatchou	2024-06-13	update practical
html	36fe58b	Joelle Mbatchou	2024-06-13	update session 3 exercises
html	b68dd42	Joelle Mbatchou	2024-06-13	update session 3 exercises
html	6000be1	Joelle Mbatchou	2024-06-12	update practicals
html	a2c3f7a	Joelle Mbatchou	2024-06-05	remove keys
html	de4ef79	Joelle Mbatchou	2024-06-05	add docs dir

Before you begin:

Make sure that R is installed on your computer
For this lab, we will use the following R libraries:

library(data.table)
library(dplyr)
library(qqman)
library(ggplot2)

Introduction

We will analyze a simulated data set that contains sample structure, to better understand the impact such structure can have on a GWAS when it is not accounted for. We will perform a GWAS on a quantitative phenotype that was simulated to be polygenic with high heritability.

The file “sim_rels_pheno.txt” contains the phenotype measurements for a set of individuals. The file “sim_rels_geno.bed” is a binary file in PLINK BED format, with accompanying BIM and FAM files, and contains genotype data at null variants (i.e. variants not associated with the phenotype).

What should we expect the Q-Q and Manhattan plots to look like under this scenario?
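Because every variant is null, a well-calibrated test should produce p-values that are roughly uniform on (0, 1): the Q-Q plot follows the diagonal and the Manhattan plot shows no peaks. The sketch below illustrates this with simulated uniform p-values rather than the practical's data; the seed and the number of variants are arbitrary choices.

```r
# Illustration only: p-values from a calibrated test at null variants.
set.seed(1)
n_variants <- 1e5
p_null <- runif(n_variants)

# Expected vs. observed -log10(p), both ordered from largest to smallest.
expected <- -log10(ppoints(n_variants))
observed <- -log10(sort(p_null))

# Under the null the two track each other closely (points on the diagonal),
# and about 5% of the p-values fall below 0.05.
qq_cor <- cor(expected, observed)
frac_below_005 <- mean(p_null < 0.05)
print(c(qq_cor = qq_cor, frac_below_005 = frac_below_005))
```

Departures from this picture at null variants point to a miscalibrated test, for example one that ignores sample structure.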

Data preparation

Let’s first load the simulated data into the R session. We need to define the path to the directory containing the phenotype and genotype files (change the path to match the location of the files on your machine).

files_dir <- "/SISGM19/data/"

Also specify the paths to the PLINK2 and REGENIE binaries:

plink2_binary <- "/SISGM19/bin/plink2" 
regenie_binary <- "/SISGM19/bin/regenie"

We can now read the files (recall the PLINK BED file is a binary file):

pheno_file <- fread(sprintf("%s/sim_rels_pheno.txt", files_dir), header = TRUE) 
head(pheno_file, 3)

    FID  IID        Pheno
1: 2307 2307  0.009989201
2:  379  379 -1.452527735
3:  478  478  0.110971665

sim_bim <- fread(sprintf("%s/sim_rels_geno.bim", files_dir), header = FALSE)
head(sim_bim, 3)

   V1             V2 V3       V4 V5 V6
1:  1 1:12000011:A:C  0 12000011  A  C
2:  1 1:12000012:A:C  0 12000012  A  C
3:  1 1:12000019:T:C  0 12000019  T  C

sim_fam <- fread(sprintf("%s/sim_rels_geno.fam", files_dir), header = FALSE)
head(sim_fam, 3)

     V1   V2 V3 V4 V5 V6
1: 2307 2307  0  0  1 -9
2:  379  379  0  0  2 -9
3:  478  478  0  0  1 -9
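The BIM and FAM files have no header, so fread() names the columns V1 to V6. The sketch below attaches the standard PLINK 1 column meanings to toy copies of the rows printed above and checks that every genotyped sample has a phenotype row. The object names ending in _head and the short column labels are illustrative choices, not part of the practical.

```r
# Toy copies of the first rows printed above (not the full data set).
sim_fam_head <- data.frame(V1 = c(2307, 379, 478), V2 = c(2307, 379, 478),
                           V3 = 0, V4 = 0, V5 = c(1, 2, 1), V6 = -9)
sim_bim_head <- data.frame(
  V1 = 1,
  V2 = c("1:12000011:A:C", "1:12000012:A:C", "1:12000019:T:C"),
  V3 = 0, V4 = c(12000011, 12000012, 12000019),
  V5 = c("A", "A", "T"), V6 = c("C", "C", "C")
)
pheno_head <- data.frame(FID = c(2307, 379, 478), IID = c(2307, 379, 478),
                         Pheno = c(0.009989201, -1.452527735, 0.110971665))

# FAM: family ID, individual ID, father, mother, sex (1 = male, 2 = female),
# phenotype (-9 = missing).
names(sim_fam_head) <- c("FID", "IID", "PAT", "MAT", "SEX", "PHENO")
# BIM: chromosome, variant ID, genetic position, base-pair position, two alleles.
names(sim_bim_head) <- c("CHR", "ID", "CM", "POS", "A1", "A2")

# Every genotyped sample should have a phenotype row.
missing_pheno <- setdiff(sim_fam_head$IID, pheno_head$IID)
print(length(missing_pheno))
```

The FAM phenotype column is -9 (missing) for every sample here, which is why the phenotype is supplied separately through sim_rels_pheno.txt.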

Exercises

Here are some things to try:

Examine the dataset:

How many samples are present? Use str().

str(sim_fam)

Classes 'data.table' and 'data.frame':  2400 obs. of  6 variables:
 $ V1: int  2307 379 478 1545 990 1907 369 1694 2137 2314 ...
 $ V2: int  2307 379 478 1545 990 1907 369 1694 2137 2314 ...
 $ V3: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V4: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V5: int  1 2 1 1 1 2 2 1 2 1 ...
 $ V6: int  -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 ...
 - attr(*, ".internal.selfref")=<externalptr>

How many SNPs are there, and on how many chromosomes? Use str() and table().

str(sim_bim)

Classes 'data.table' and 'data.frame':  106134 obs. of  6 variables:
 $ V1: int  1 1 1 1 1 1 1 1 1 1 ...
 $ V2: chr  "1:12000011:A:C" "1:12000012:A:C" "1:12000019:T:C" "1:12000027:C:T" ...
 $ V3: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V4: int  12000011 12000012 12000019 12000027 12000036 12000061 12000073 12000074 12000117 12000136 ...
 $ V5: chr  "A" "A" "T" "C" ...
 $ V6: chr  "C" "C" "C" "T" ...
 - attr(*, ".internal.selfref")=<externalptr>

table(sim_bim$V1)


   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
4918 4857 4813 4772 4810 4914 4840 4696 4790 4906 4782 4756 4803 4671 4814 4869 
  17   18   19   20   21   22 
4632 4834 4908 4942 4947 4860

Examine the phenotype data:

How many individuals in the study have measurements?

table(is.na(pheno_file$Pheno))


FALSE 
 2400

Plot a histogram to show the distribution of the phenotype. Use the hist() function

hist(pheno_file$Pheno)

[Figure: histogram of pheno_file$Pheno]

With PLINK, perform association mapping between the phenotype and the variants in the PLINK BED genotype file. Only test SNPs that pass the following quality control filters:

minor allele frequency (MAF) > 0.01
at least a 99% genotyping call rate (less than 1% missing)
HWE p-values greater than 0.001

# first fill in the thresholds to use for each filter
filter_maf = 0.01
filter_missing_rate = 0.01
filter_hwe = 0.001

cmd <- sprintf('%s --bfile "%s/sim_rels_geno" --pheno "%s/sim_rels_pheno.txt" --pheno-name Pheno --maf %g --geno %g --hwe %g --glm allow-no-covars --out gwas_plink', plink2_binary, files_dir, files_dir, filter_maf, filter_missing_rate, filter_hwe)
system(cmd)

The results of the GWAS are stored in gwas_plink.Pheno.glm.linear.
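To make the three QC filters concrete, the sketch below computes the minor allele frequency, the missing rate and a Hardy-Weinberg p-value for a single toy SNP. The helper snp_qc() is hypothetical, and it uses a 1-df chi-square test as a simple stand-in for the exact test that PLINK applies for --hwe; how values exactly at a threshold are treated is also an assumption here.

```r
# Hypothetical helper: QC metrics for one SNP coded as 0/1/2 copies of an
# allele, with NA for missing calls.
snp_qc <- function(g) {
  miss_rate <- mean(is.na(g))
  g <- g[!is.na(g)]
  n <- length(g)
  freq <- mean(g) / 2                      # frequency of the counted allele
  maf <- min(freq, 1 - freq)
  # 1-df chi-square test of Hardy-Weinberg proportions (stand-in for exact test).
  obs <- c(sum(g == 0), sum(g == 1), sum(g == 2))
  exp_ct <- n * c((1 - freq)^2, 2 * freq * (1 - freq), freq^2)
  hwe_p <- pchisq(sum((obs - exp_ct)^2 / exp_ct), df = 1, lower.tail = FALSE)
  c(maf = maf, miss_rate = miss_rate, hwe_p = hwe_p)
}

# Toy SNP: 1000 called genotypes in exact Hardy-Weinberg proportions, 2 missing.
g <- c(rep(0, 640), rep(1, 320), rep(2, 40), NA, NA)
qc <- snp_qc(g)
print(qc)

# Apply the thresholds used in the PLINK command above.
pass <- qc[["maf"]] > 0.01 && qc[["miss_rate"]] <= 0.01 && qc[["hwe_p"]] > 0.001
print(pass)
```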

Make a Manhattan plot of the association results. Make sure to check what information is stored in the PLINK output file (using str()).

plink.gwas <- fread("gwas_plink.Pheno.glm.linear", header = TRUE)
str(plink.gwas)

Classes 'data.table' and 'data.frame':  105886 obs. of  16 variables:
 $ #CHROM          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ POS             : int  12000011 12000012 12000019 12000027 12000036 12000061 12000073 12000074 12000117 12000136 ...
 $ ID              : chr  "1:12000011:A:C" "1:12000012:A:C" "1:12000019:T:C" "1:12000027:C:T" ...
 $ REF             : chr  "C" "C" "C" "T" ...
 $ ALT             : chr  "A" "A" "T" "C" ...
 $ PROVISIONAL_REF?: chr  "Y" "Y" "Y" "Y" ...
 $ A1              : chr  "A" "A" "T" "C" ...
 $ OMITTED         : chr  "C" "C" "C" "T" ...
 $ A1_FREQ         : num  0.12 0.187 0.402 0.12 0.415 ...
 $ TEST            : chr  "ADD" "ADD" "ADD" "ADD" ...
 $ OBS_CT          : int  2400 2400 2400 2400 2400 2400 2400 2400 2400 2400 ...
 $ BETA            : num  0.0122 -0.018 -0.0849 0.0125 0.0111 ...
 $ SE              : num  0.0438 0.0362 0.0284 0.0435 0.0288 ...
 $ T_STAT          : num  0.279 -0.497 -2.992 0.288 0.387 ...
 $ P               : num  0.78 0.6192 0.0028 0.7731 0.6991 ...
 $ ERRCODE         : chr  "." "." "." "." ...
 - attr(*, ".internal.selfref")=<externalptr>

plot(
  x = 1:nrow(plink.gwas),
  y = -log10(plink.gwas$P),
  col = c("orange", "purple")[1 + plink.gwas$`#CHROM` %% 2],
  xaxt = "n", xlab = "Genomic position", ylab = "Observed -log10(P)"
)
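The plot above uses the row index as the x coordinate and suppresses the axis, so chromosomes are only distinguishable by colour. One optional refinement is to label each chromosome at the middle of its run of rows. The two helpers below are hypothetical and assume the rows are sorted by chromosome, as they are in the PLINK output shown above.

```r
# Hypothetical helper: x-axis tick positions for an index-based Manhattan plot,
# placed at the middle of each chromosome's run of row indices.
chrom_ticks <- function(chrom) {
  tapply(seq_along(chrom), chrom, function(i) mean(range(i)))
}

# Hypothetical helper: the same kind of plot as above, plus chromosome labels.
plot_manhattan_indexed <- function(chrom, p) {
  plot(seq_along(p), -log10(p),
       col = c("orange", "purple")[1 + chrom %% 2],
       xaxt = "n", xlab = "Chromosome", ylab = "Observed -log10(P)")
  ticks <- chrom_ticks(chrom)
  axis(1, at = ticks, labels = names(ticks))
}

# Toy input: three chromosomes with 4, 2 and 3 variants.
toy_chrom <- rep(1:3, times = c(4, 2, 3))
print(chrom_ticks(toy_chrom))
```

With the PLINK results loaded, the call would be plot_manhattan_indexed(plink.gwas$`#CHROM`, plink.gwas$P).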

[Figure: Manhattan plot of the PLINK association results]

Make a Q-Q plot of the association results.

qq(plink.gwas$P)

[Figure: Q-Q plot of the PLINK p-values]

Compute the genomic control inflation factor \(\lambda_{GC}\) based on the p-values. Is there evidence of possible inflation due to confounding?

chisq.values <- qchisq(plink.gwas$P, 1, lower.tail = FALSE)
median(chisq.values)/0.456

[1] 1.145772
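The inflation factor is the median of the observed 1-df chi-square statistics divided by the median of the 1-df chi-square distribution under the null. The constant 0.456 used above is a rounded version of that null median; the sketch below wraps the same calculation in a hypothetical helper, lambda_gc(), that takes the denominator from qchisq() instead.

```r
# Hypothetical helper: genomic control inflation factor from p-values.
lambda_gc <- function(p) {
  chisq <- qchisq(p, df = 1, lower.tail = FALSE)
  median(chisq, na.rm = TRUE) / qchisq(0.5, df = 1)
}

# The denominator: median of a 1-df chi-square distribution.
null_median <- qchisq(0.5, df = 1)
print(null_median)

# Sanity check: p-values spread evenly over (0, 1) give a factor of 1.
lambda_even <- lambda_gc(ppoints(1001))
print(lambda_even)
```

Since all variants in this data set are null, a value near 1 is expected from a calibrated test. The value of about 1.15 obtained above is therefore evidence of inflation, consistent with the sample structure not being accounted for in the PLINK linear regression.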

Now we will run REGENIE to perform a GWAS of the phenotype using a whole genome regression model. We first want to extract a set of high quality variants for the Step 1 null model fitting. Using PLINK, apply QC filters to remove variants with MAF below 5%, missingness above 1%, HWE p-value below 0.001, or minor allele count (MAC) below 20. We will use --write-snplist to store the list of variants passing QC without making a new BED file.

# first fill in the thresholds to use for each filter
filter_maf = 0.05
filter_missing_rate = 0.01
filter_hwe = 0.001
filter_mac = 20

cmd <- sprintf('%s --bfile "%s/sim_rels_geno" --pheno "%s/sim_rels_pheno.txt" --pheno-name Pheno --maf %g --geno %g --hwe %g --mac %g --write-snplist --out qc_pass', plink2_binary, files_dir, files_dir, filter_maf, filter_missing_rate, filter_hwe, filter_mac)
system(cmd)

This produces a file qc_pass.snplist containing a list of variant IDs that pass the QC filters.
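A quick check on the QC step is to count how many variants passed and confirm that every passing ID appears in the BIM file. The sketch below does this on toy stand-ins written to a temporary file; the IDs are taken from the rows printed earlier, and which of them "pass" is made up for illustration.

```r
# Toy stand-in for qc_pass.snplist: one variant ID per line, no header.
snplist_path <- tempfile(fileext = ".snplist")
writeLines(c("1:12000011:A:C", "1:12000019:T:C"), snplist_path)

# Toy stand-in for the ID column (V2) of the BIM file.
bim_ids <- c("1:12000011:A:C", "1:12000012:A:C", "1:12000019:T:C")

qc_ids <- readLines(snplist_path)
n_pass <- length(qc_ids)
n_removed <- length(setdiff(bim_ids, qc_ids))

# Passing IDs must all exist in the BIM file.
stopifnot(all(qc_ids %in% bim_ids))
print(c(pass = n_pass, removed = n_removed))
```

In the practical, the same check would read "qc_pass.snplist" and compare against sim_bim$V2.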

If REGENIE software is installed on your machine

Run REGENIE Step 1 to fit the null model and obtain polygenic predictions using a leave-one-chromosome-out (LOCO) scheme.

cmd <- sprintf('%s --bed "%s/sim_rels_geno" --phenoFile "%s/sim_rels_pheno.txt" --phenoCol Pheno --qt --step 1 --loocv --bsize 1000 --extract qc_pass.snplist --out gwas_regenie', regenie_binary, files_dir, files_dir)
system(cmd)

The LOCO polygenic predictions for the phenotype are stored in gwas_regenie_1.loco.
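The idea behind a LOCO prediction is that the polygenic score used when testing a variant on chromosome k is built only from variants on the other chromosomes, so the tested variant does not contribute to its own adjustment. The sketch below is a deliberately simplified illustration on simulated data: it uses a single ridge regression with an arbitrary penalty, whereas REGENIE fits block-wise ridge regressions in two levels with cross-validation. All names and values are illustrative.

```r
set.seed(2)
n <- 200
m <- 30
chrom <- rep(1:3, each = 10)                  # toy chromosome label per SNP
G <- scale(matrix(rbinom(n * m, 2, 0.3), n, m))
y <- as.numeric(scale(G %*% rnorm(m, sd = 0.2) + rnorm(n)))

# Single ridge fit on all SNPs (lambda_ridge is an arbitrary illustrative value).
lambda_ridge <- 50
beta <- solve(crossprod(G) + lambda_ridge * diag(m), crossprod(G, y))

# LOCO prediction for chromosome k: use effects from every other chromosome.
loco <- sapply(sort(unique(chrom)), function(k) {
  keep <- chrom != k
  G[, keep, drop = FALSE] %*% beta[keep]
})

# One column of predictions per left-out chromosome.
print(dim(loco))
```

Adding back the contribution of the left-out chromosome recovers the full-genome prediction, which is a convenient check that the bookkeeping is right.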

Run REGENIE Step 2 to perform association testing.

cmd <- sprintf('%s --bed "%s/sim_rels_geno" --phenoFile "%s/sim_rels_pheno.txt" --phenoCol Pheno --qt --step 2 --bsize 400 --pred gwas_regenie_pred.list --out step2_gwas_regenie', regenie_binary, files_dir, files_dir)
system(cmd)

The REGENIE summary statistics will be in step2_gwas_regenie_Pheno.regenie.
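Conceptually, Step 2 tests each variant after the polygenic prediction for its chromosome has been removed from the phenotype. The sketch below mimics that for one simulated SNP with ordinary linear regression; it is a simplification of what REGENIE computes, and every object in it is a toy stand-in. It also shows how a CHISQ value relates to LOG10P and how LOG10P converts back to a p-value.

```r
set.seed(3)
n <- 500
g <- rbinom(n, 2, 0.3)            # toy genotype for the SNP being tested
loco_pred <- rnorm(n, sd = 0.5)   # stand-in for the LOCO prediction of its chromosome
y <- loco_pred + rnorm(n)         # toy phenotype; the SNP has no effect

# Remove the polygenic prediction, then test the SNP on what is left.
y_resid <- y - loco_pred
fit <- summary(lm(y_resid ~ g))$coefficients
chisq <- (fit["g", "Estimate"] / fit["g", "Std. Error"])^2
log10p <- -log10(pchisq(chisq, df = 1, lower.tail = FALSE))

# LOG10P converts back to a p-value as 10^-LOG10P.
p_value <- 10^-log10p
print(c(chisq = chisq, log10p = log10p, p_value = p_value))
```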

Generate Manhattan and Q-Q plots based on the REGENIE association results and compute \(\lambda_{GC}\). Compare with output from Questions 4-6.

regenie.gwas <- fread("step2_gwas_regenie_Pheno.regenie", header = TRUE)
str(regenie.gwas)

Classes 'data.table' and 'data.frame':  106134 obs. of  13 variables:
 $ CHROM  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ GENPOS : int  12000011 12000012 12000019 12000027 12000036 12000061 12000073 12000074 12000117 12000136 ...
 $ ID     : chr  "1:12000011:A:C" "1:12000012:A:C" "1:12000019:T:C" "1:12000027:C:T" ...
 $ ALLELE0: chr  "C" "C" "C" "T" ...
 $ ALLELE1: chr  "A" "A" "T" "C" ...
 $ A1FREQ : num  0.12 0.187 0.402 0.12 0.415 ...
 $ N      : int  2400 2400 2400 2400 2400 2400 2400 2400 2400 2400 ...
 $ TEST   : chr  "ADD" "ADD" "ADD" "ADD" ...
 $ BETA   : num  0.00851 -0.01943 -0.0747 -0.023 0.01463 ...
 $ SE     : num  0.0419 0.0346 0.0272 0.0416 0.0275 ...
 $ CHISQ  : num  0.0413 0.3153 7.5548 0.3058 0.2823 ...
 $ LOG10P : num  0.0762 0.2407 2.2229 0.2364 0.2254 ...
 $ EXTRA  : logi  NA NA NA NA NA NA ...
 - attr(*, ".internal.selfref")=<externalptr>

plot(
  x = 1:nrow(regenie.gwas),
  y = regenie.gwas$LOG10P,
  col = c("orange", "purple")[1 + regenie.gwas$CHROM %% 2],
  xaxt = "n", xlab = "Genomic position", ylab = "Observed -log10(P)"
)

[Figure: Manhattan plot of the REGENIE association results]

qq(10^-regenie.gwas$LOG10P)

[Figure: Q-Q plot of the REGENIE p-values]

chisq.values <- qchisq(10^-regenie.gwas$LOG10P, 1, lower.tail = FALSE)
median(chisq.values)/0.456

[1] 0.9940042
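To compare the two analyses variant by variant, note that the PLINK output has fewer rows (105886) than the REGENIE output (106134), because QC filters were applied in the PLINK run. The results therefore need to be joined on the variant ID rather than compared row by row. The sketch below shows the join on toy data frames that reuse the relevant column names; the IDs and values are made up.

```r
# Toy stand-ins using the column names of the two result tables.
plink_toy <- data.frame(ID = c("v1", "v2", "v3"), P = c(0.5, 0.001, 0.2))
regenie_toy <- data.frame(ID = c("v1", "v2", "v3", "v4"),
                          LOG10P = c(0.25, 2.4, 0.6, 1.0))

# Keep variants present in both outputs, then put both on the -log10 scale.
both <- merge(plink_toy, regenie_toy, by = "ID")
both$plink_log10p <- -log10(both$P)
both$diff <- both$plink_log10p - both$LOG10P
print(both[, c("ID", "plink_log10p", "LOG10P", "diff")])
```

In the practical, the same join would use plink.gwas and regenie.gwas, and a scatter plot of the two -log10(P) columns shows where the two methods disagree.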

If REGENIE software does not run on your machine

We will use an implementation of REGENIE in R. Download it here and set the variable regenie_script to the path of the script on your machine.

regenie_script <- "data/run_regenie.r"
source(regenie_script)

We now run REGENIE Step 1 to fit the null model and obtain polygenic predictions using a leave-one-chromosome-out (LOCO) scheme.

loco_pred <- run_regenie_step1(
  bedfile = paste0(files_dir, "/sim_rels_geno"),
  phenofile = paste0(files_dir, "/sim_rels_pheno.txt"),
  phenocol = "Pheno",
  bsize = 1000,
  extract = "qc_pass.snplist"
)

Analyzing 2400 samples for phenotype Pheno 
Using genotype file prefix: /Users/joelle.mbatchou/SISG/2024/SISG2024_Association_Mapping/data//sim_rels_geno 
#Variants included = 97387 
Running REGENIE level 0 into 110 blocks
   user  system elapsed 
495.311   6.865  71.089 
Running REGENIE level 1 with 550 ridge predictors
   user  system elapsed 
  1.764   0.022   1.802 
Rsq for each ridge parameter at level 1: 0.07788307 0.08313518 0.09104799 0.09699941 0.09707784 
MSE for each ridge parameter at level 1: 0.932345 0.9199879 0.9144755 0.9133895 0.9315208 
Computing LOCO predictions
   user  system elapsed 
  1.550   0.009   1.565

This function will return the LOCO polygenic predictions for the phenotype.

Run REGENIE Step 2 to perform association testing.

sumstats_regenie <- run_regenie_step2(
  bedfile = paste0(files_dir, "/sim_rels_geno"),
  phenofile = paste0(files_dir, "/sim_rels_pheno.txt"),
  phenocol = "Pheno",
  bsize = 200,
  loco.mat = loco_pred
)

Analyzing 2400 samples for phenotype Pheno 
Using genotype file prefix: /Users/joelle.mbatchou/SISG/2024/SISG2024_Association_Mapping/data//sim_rels_geno 
#Variants tested for association = 106134 
Conditioning on LOCO polygenic predictions from REGENIE step 1
Running association tests...

str(sumstats_regenie)

Classes 'data.table' and 'data.frame':  106134 obs. of  12 variables:
 $ CHROM  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ GENPOS : int  12000011 12000012 12000019 12000027 12000036 12000061 12000073 12000074 12000117 12000136 ...
 $ ID     : chr  "1:12000011:A:C" "1:12000012:A:C" "1:12000019:T:C" "1:12000027:C:T" ...
 $ ALLELE0: chr  "C" "C" "C" "T" ...
 $ ALLELE1: chr  "A" "A" "T" "C" ...
 $ A1FREQ : num  0.12 0.187 0.402 0.12 0.415 ...
 $ N      : int  2400 2400 2400 2400 2400 2400 2400 2400 2400 2400 ...
 $ TEST   : chr  "ADD" "ADD" "ADD" "ADD" ...
 $ BETA   : num  0.00851 -0.01943 -0.0747 -0.023 0.01463 ...
 $ SE     : num  0.0419 0.0346 0.0272 0.0416 0.0275 ...
 $ CHISQ  : num  0.0413 0.3153 7.5548 0.3058 0.2823 ...
 $ LOG10P : num  0.0762 0.2407 2.2229 0.2364 0.2254 ...
 - attr(*, ".internal.selfref")=<externalptr>

This function returns a data frame containing the REGENIE summary statistics.

Generate Manhattan and Q-Q plots based on the REGENIE association results and compute \(\lambda_{GC}\). Compare with output from Questions 4-6.

plot(
  x = 1:nrow(sumstats_regenie),
  y = sumstats_regenie$LOG10P,
  col = c("orange", "purple")[1 + sumstats_regenie$CHROM %% 2],
  xaxt = "n", xlab = "Genomic position", ylab = "Observed -log10(P)"
)

[Figure: Manhattan plot of the REGENIE (R implementation) association results]

qq(10^-sumstats_regenie$LOG10P)

[Figure: Q-Q plot of the REGENIE (R implementation) p-values]

chisq.values <- qchisq(10^-sumstats_regenie$LOG10P, 1, lower.tail = FALSE)
median(chisq.values)/0.456

[1] 0.9940056

sessionInfo()

R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] ggplot2_3.4.2     qqman_0.1.8       dplyr_1.1.2       data.table_1.14.8

loaded via a namespace (and not attached):
 [1] sass_0.4.6       utf8_1.2.3       generics_0.1.3   stringi_1.7.12  
 [5] digest_0.6.31    magrittr_2.0.3   evaluate_0.21    grid_4.3.0      
 [9] calibrate_1.7.7  fastmap_1.1.1    rprojroot_2.0.3  workflowr_1.7.0 
[13] jsonlite_1.8.5   whisker_0.4.1    promises_1.2.0.1 fansi_1.0.4     
[17] scales_1.2.1     jquerylib_0.1.4  cli_3.6.1        rlang_1.1.1     
[21] munsell_0.5.0    withr_2.5.0      cachem_1.0.8     yaml_2.3.7      
[25] tools_4.3.0      colorspace_2.1-0 httpuv_1.6.11    crochet_2.3.0   
[29] vctrs_0.6.2      R6_2.5.1         lifecycle_1.0.3  git2r_0.32.0    
[33] stringr_1.5.0    fs_1.6.2         MASS_7.3-58.4    pkgconfig_2.0.3 
[37] pillar_1.9.0     bslib_0.5.0      later_1.3.1      gtable_0.3.3    
[41] glue_1.6.2       Rcpp_1.0.10      xfun_0.39        tibble_3.2.1    
[45] tidyselect_1.2.0 highr_0.10       rstudioapi_0.14  knitr_1.43      
[49] htmltools_0.5.5  rmarkdown_2.22   compiler_4.3.0   BEDMatrix_2.0.3