Overview of the All by All tables available on the All of Us Researcher Workbench

The All by All tables were created by leveraging the extensive genomic and phenotypic data available through the All of Us Researcher Workbench.

Using the available short-read whole genome sequence and phenotypic data, genome-wide association studies (GWAS) and rare variant association studies (RVAS) were conducted across a wide range of complex human traits. GWAS and RVAS analyses identify genes or genetic variants that are associated with phenotypes, such as height, or with health conditions, such as diabetes.

The All by All browser maps known and novel associations between genotypes and phenotypes using data contributed by All of Us Research Program participants as of October 1, 2023. You can explore the All by All browser here.  

Note: The All by All browser was built using Hail Tables and SQLite. When comparing results between the All by All browser and the All by All tables in the Researcher Workbench, you should use the Hail Tables available on the Researcher Workbench, rather than the Matrix Tables. 

Overview

The All by All tables, which are available to Controlled Tier users, include the GWAS and RVAS results for thousands of phenotypes from almost 400,000 participants who have whole genome sequence data available.

Researcher Workbench users do not need to have prior experience conducting GWAS analysis to benefit from the All by All tables and can use the data directly to explore genes or genetic variants which contribute to their phenotype of interest. Users will also benefit from time and cost savings, since researchers will not need to perform the analysis themselves and can query the results directly.

All by All is a large genetic association result database across different ancestries and a wide range of important human traits and diseases. While similar analyses have been performed using other genomic databases, the All of Us Research Program has specific advantages.

The All of Us Research Program enrolls and collects data from participants of different ancestries. Additionally, the large sample size of the All of Us dataset provides the statistical power needed to identify associations with rare genetic variants.

Data Details and Quality Control

All by All utilizes the genomic and phenotypic data of the almost 400,000 participants who have short-read whole genome sequencing data available in the All of Us dataset.

All by All includes the results of association testing for over 3,500 phenotypes in seven categories: physical measurements, lab measurements, random phenotypes, phecodeX, personal and family health history (PFHH), Mental Health and Wellbeing (MHWB) surveys, and electronic health record (EHR) sourced drugs and medications.

Sample and variant quality control was performed such that only high quality samples, genotypes, and variants were included in downstream analyses. To ensure sufficient statistical power for downstream analyses, only phenotypes with greater than 200 cases within each genetic ancestry group were included. 
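
The case-count threshold can be illustrated with a minimal sketch. It assumes the filter is applied to each ancestry group and phenotype pair separately, which is consistent with the per-group counts in Figure 1. The function name, phenotype labels, and case counts below are hypothetical and are not taken from the All by All data.

```python
# Threshold from the text: a pair is kept only with MORE than 200 cases.
MIN_CASES = 200


def eligible_pairs(case_counts, min_cases=MIN_CASES):
    """Return the (ancestry, phenotype) pairs whose case count exceeds min_cases.

    case_counts maps (ancestry, phenotype) -> number of cases.
    """
    return sorted(pair for pair, n in case_counts.items() if n > min_cases)


# Hypothetical toy counts, for illustration only.
toy_counts = {
    ("EUR", "pheno_A"): 5400,
    ("EAS", "pheno_A"): 201,
    ("MID", "pheno_A"): 200,  # exactly 200 is not "greater than 200"
    ("AFR", "pheno_B"): 37,
}

print(eligible_pairs(toy_counts))
# [('EAS', 'pheno_A'), ('EUR', 'pheno_A')]
```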

In total, 11,550 high-quality ancestry group and phenotype pairs were included in the final data. Find more information under Genetic Ancestry.

For each of the 11,550 phenotype-ancestry combinations (Figure 1), three types of association tests were performed:

  1. GWAS results for common variants from the Allele Count / Allele Frequency (ACAF) callset
  2. GWAS results for exonic variants from the exome callset
  3. RVAS gene-level results for variants from the exome callset

Group          Lab measurement  Physical measurement  Random phenotype  PhecodeX  Prescription  MHWB  PFHH   Total
AFR                         81                    10                45     1,118         1,518     8   100   2,880
AMR                         79                    10                45     1,009         1,493     9   103   2,748
EAS                         73                    10                45       160           814     2    34   1,135
EUR                         85                    10                45     1,709         1,600    10   128   3,587
MID                         57                    10                45         9           205     0     1     322
SAS                         69                    10                45        85           651     0    21     878
Total                      444                    60               270     4,090         6,281    29   387  11,550
META (unique)               85                    10                45     1,721         1,600    10   129   3,600

Figure 1 | Phenotypes included in the All by All tables. For EAS, MID, and SAS, the value in the Total column is lower than the sum of the category counts. The difference represents rare random phenotypes that lacked defined cases in these three smallest similarity groups and were therefore excluded due to null model non-convergence in SAIGE step 1.
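
As a quick arithmetic check of Figure 1, the sketch below compares the sum of the category counts in each row with the reported total. The numbers are copied from Figure 1; the variable and function names are illustrative only.

```python
# Category counts (Lab, Physical, Random, PhecodeX, Prescription, MHWB, PFHH)
# and the reported total for each group, copied from Figure 1.
FIGURE_1 = {
    "AFR": ([81, 10, 45, 1118, 1518, 8, 100], 2880),
    "AMR": ([79, 10, 45, 1009, 1493, 9, 103], 2748),
    "EAS": ([73, 10, 45, 160, 814, 2, 34], 1135),
    "EUR": ([85, 10, 45, 1709, 1600, 10, 128], 3587),
    "MID": ([57, 10, 45, 9, 205, 0, 1], 322),
    "SAS": ([69, 10, 45, 85, 651, 0, 21], 878),
}


def excluded_counts(table):
    """Return {group: row sum minus reported total} for rows where the two differ."""
    return {
        group: sum(counts) - total
        for group, (counts, total) in table.items()
        if sum(counts) != total
    }


excluded = excluded_counts(FIGURE_1)
grand_total = sum(total for _, total in FIGURE_1.values())

print(excluded)     # {'EAS': 3, 'MID': 5, 'SAS': 3}
print(grand_total)  # 11550
```

The reported totals sum to the 11,550 ancestry group and phenotype pairs cited in the text, and only the three smallest groups show a difference between the row sum and the reported total.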

The results of each association test by ancestry or meta-analysis are available as Hail Matrix Tables. Find more information under Data Format. The pipeline is available at https://github.com/atgu/aou_gwas

Genetic Ancestry

The genetic ancestry categories correspond to definitions used within gnomAD, the Human Genome Diversity Project, and 1000 Genomes. Read about how All of Us computes genetic ancestry.

All by All includes data from participants from six major ancestry groups: AFR (African), AMR (Admixed American), EAS (East Asian), EUR (European), MID (Middle Eastern), and SAS (South Asian) to understand the genetic basis of human traits and health conditions in various backgrounds.

Data Format

The All by All data is available as 21 Hail Matrix Tables (one for each combination of the seven ancestry or meta-analysis groupings and the three test types) with the summary statistics of ACAF, exome, and gene-based association testing (Figure 2). In addition, the results of association testing by phenotype are available as 45,474 individual Hail Tables (Figure 2).

Figure 2 | Total number of association test results produced by All by All.

 

The results Hail Matrix Tables all share the same structure (Figure 3).

  • For gene-based results, the rows are the gene groups keyed by gene_id, gene_symbol, annotation, and max_MAF.
  • For single-variant results (ACAF & exome), the rows are keyed by locus, alleles and phenoname. 
  • The columns are the phenotype meta information fields keyed by phenoname and include participant case and control counts.
  • The entries are the summary statistics for association testing for each phenotype - genotype combination including Pvalue and Beta.
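
The row, column, and entry layout described above can be explored with a short Hail sketch. This is an illustration under stated assumptions rather than the method used in the tutorial notebook: it assumes phenoname is a string column key and Pvalue is an entry field, as described above. The function name is hypothetical, and the default threshold of 5e-8 is the conventional genome-wide significance level, not an All by All setting.

```python
# Conventional genome-wide significance threshold (an assumption, not an All by All setting).
GENOME_WIDE_SIGNIFICANCE = 5e-8


def phenotype_hits(mt_path, phenoname, p_threshold=GENOME_WIDE_SIGNIFICANCE):
    """Return a Hail Table of results for one phenotype with Pvalue below p_threshold."""
    import hail as hl  # imported here so the helper can be defined before Hail is available

    mt = hl.read_matrix_table(mt_path)
    # Columns are phenotypes, so selecting one phenotype is a column filter.
    mt = mt.filter_cols(mt.phenoname == phenoname)
    # Flatten to one record per row key and phenotype, then keep small p-values.
    ht = mt.entries()
    return ht.filter(ht.Pvalue < p_threshold)
```

In a notebook, calling describe() on the Matrix Table returned by hl.read_matrix_table prints the actual row, column, and entry fields, which is a useful first step because the fields in the data differ from the example in Figure 3. The table returned by phenotype_hits can then be previewed with show().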

Figure 3 | All by All results are available as Hail Matrix Tables. Figure 3 is an illustrative example only: the full results table is too large to present as a figure, so the actual fields in the data will differ.

 

The individual phenotype Hail Tables all have the same structure (Figure 4).

  • For gene-based results, the tables are the gene groups keyed by phenoname, gene_id, gene_symbol, annotation, and max_MAF.
  • For single-variant results (ACAF & exome), the tables are keyed by phenoname, locus, and alleles. 

Figure 4 | All by All results are available as per phenotype Hail Tables.

 

The All by All files are available for Controlled Tier users in the Researcher Workbench 2.0. The files are located in the All by All folder in the Controlled Tier v8 Data Collection with the naming scheme below:

  • Hail Matrix Tables: /mt/${POP}_${TYPE}_results.mt
  • Hail Tables: /ht/${TYPE}/${POP}/phenotype_${PHENONAME}_${TYPE}_results.ht
  • POP specifies either single ancestry results or meta-analysis results (AFR, AMR, EAS, EUR, MID, SAS, META).
  • TYPE specifies the type of association test (ACAF, exome, gene).
  • PHENONAME specifies the phenotype of interest by concept ID.
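
The naming scheme above can be turned into small path-building helpers. The sketch below is illustrative: the function names are hypothetical, ALL_BY_ALL_ROOT is a placeholder for the location of the All by All folder in your workspace, and the phenotype value in the usage note is a made-up concept ID.

```python
from itertools import product

# Allowed values, taken from the naming scheme above.
POPULATIONS = ("AFR", "AMR", "EAS", "EUR", "MID", "SAS", "META")
TEST_TYPES = ("ACAF", "exome", "gene")

# Placeholder (assumption): replace with the location of the All by All folder.
ALL_BY_ALL_ROOT = "<ALL_BY_ALL_FOLDER>"


def _check(pop, test_type):
    """Reject values that are not part of the documented naming scheme."""
    if pop not in POPULATIONS:
        raise ValueError(f"Unknown POP {pop!r}; expected one of {POPULATIONS}")
    if test_type not in TEST_TYPES:
        raise ValueError(f"Unknown TYPE {test_type!r}; expected one of {TEST_TYPES}")


def matrix_table_path(pop, test_type, root=ALL_BY_ALL_ROOT):
    """Build the path of a Hail Matrix Table: /mt/${POP}_${TYPE}_results.mt"""
    _check(pop, test_type)
    return f"{root}/mt/{pop}_{test_type}_results.mt"


def hail_table_path(pop, test_type, phenoname, root=ALL_BY_ALL_ROOT):
    """Build the path of a per-phenotype Hail Table."""
    _check(pop, test_type)
    return f"{root}/ht/{test_type}/{pop}/phenotype_{phenoname}_{test_type}_results.ht"


# Every POP and TYPE combination yields one Matrix Table path.
all_mt_paths = [matrix_table_path(p, t) for p, t in product(POPULATIONS, TEST_TYPES)]
print(len(all_mt_paths))  # 21
```

For example, hail_table_path("META", "gene", "12345") returns <ALL_BY_ALL_FOLDER>/ht/gene/META/phenotype_12345_gene_results.ht, where 12345 stands in for a real concept ID.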

To see a full list of phenotypes used in the All by All tables, please see Appendix A.

How to query All by All tables

The All by All data is available as Hail Tables and Hail Matrix Tables. To query the data effectively, researchers load it into a Jupyter Notebook and analyze it using Python.

To learn how to query the All by All v8 data, please see the "How to query All by All results and analysis details" notebook in the All of Us Tutorial Workspace: Getting Started with Controlled Tier Data (v8) Featured Workspace. 

Individual phenotypes (Hail Table):

The All by All result Hail Tables for individual phenotypes are located in the All by All folder in the CDRv8 Controlled Tier data collection, with the naming scheme ${TYPE}/${POP}/phenotype_${PHENONAME}_${TYPE}_results.ht

POP specifies either single ancestry results or meta-analysis results (AFR, AMR, EAS, EUR, MID, SAS, META).

TYPE specifies the type of association test (ACAF, exome, gene).

PHENONAME specifies the phenotype of interest by concept ID.
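
As a minimal sketch of querying one per-phenotype Hail Table, the example below restricts single-variant (ACAF or exome) results to a genomic region. It is not the code from the tutorial notebook. It assumes the table has a locus field on the GRCh38 reference genome; the function names are hypothetical, and the table path must be supplied by the researcher.

```python
def interval_string(contig, start, end):
    """Format a region as 'contig:start-end', the form Hail parses as a locus interval."""
    if start > end:
        raise ValueError("start must not be greater than end")
    return f"{contig}:{start}-{end}"


def variants_in_region(ht_path, contig, start, end):
    """Return the rows of a single-variant Hail Table whose locus lies in the region."""
    import hail as hl  # imported here so the helpers can be defined before Hail is available

    ht = hl.read_table(ht_path)
    region = hl.parse_locus_interval(
        interval_string(contig, start, end),
        reference_genome="GRCh38",  # assumption: loci are on GRCh38
    )
    # A plain filter on the locus field works regardless of the table's key order.
    return ht.filter(region.contains(ht.locus))


print(interval_string("chr1", 1000, 2000))  # chr1:1000-2000
```

Calling describe() on the table returned by hl.read_table lists the available fields. Gene-based tables are keyed by gene rather than locus, so this region filter applies only to the ACAF and exome results.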

Appendix A - All by All Phenotypes Index

All by All v7 Phenotype Index

All by All Phenotype v7 Featured Workspace Index can be found here. This index details the phenotypes included in All by All available in 6 Featured Workspaces grouped by phenotype category (physical measurements, drugs, labs, personal and family health history, Phecode, and PhecodeX), where a graphical summary of each phenotype is available as its own individual notebook. For more details about how to use the Featured Workspaces, please see the ReadMe notebook for each Featured Workspace.

All by All v7 phenotypes Google Sheet here. For a downloadable Excel sheet, please see the link at the bottom of this article.

All by All v8 Phenotype Index

All by All v8 phenotypes Google Sheet here. For a downloadable Excel sheet, please see the link at the bottom of this article.

The updated Featured Workspaces using All by All v8 data are in progress and will be released soon.

 

 

Appendix B - Past CDR Versions

All by All tables were previously generated using the CDR v7 dataset and are available at their original file path, gs://fc-aou-datasets-controlled/AllxAll/v1, in Researcher Workbench 1.0. All by All v7 data will be available in Researcher Workbench 2.0 soon.

Please see Appendix C for Known Issues with the All by All CDR v7 data.

Appendix C - Known Issues

Known issue: Adding All by All v8 data in Researcher Workbench 2.0

For Researcher Workbench 2.0 workspaces created before May 20, 2026 that include both CDRv8 versions (C2024Q3R8 and C2024Q3R9), there is a temporary issue preventing new data resources from being added through the data collection UI. In the meantime, the All by All v8 data remains accessible by querying it directly within a notebook.

Known issue: Removal of PFHH and Phecode Hail Matrix Tables (January 2025)

Some All by All tables for PFHH and Phecode results were deprecated in January 2025. Researchers can still access the results at the updated file paths; however, we do not recommend using these results for further analyses. See Incremental Update for CDRv7 All by All tables for more information. 

Known issue: Drug Phenotype Case Counts for All by All CDR v7

Some broad-category ATC codes have fewer case counts than expected. This may impact result interpretation and will be addressed in future releases.

 


References

The All of Us Research Program Genomics Investigators. Genomic data in the All of Us Research Program. Nature 627, 340–346 (2024). https://doi.org/10.1038/s41586-023-06957-x

Karczewski, K. J., Solomonson, M., Chao, K. R., Goodrich, J. K., Tiao, G., Lu, W., Riley-Gillis, B. M., Tsai, E. A., Kim, H. I., Zheng, X., Rahimov, F., Esmaeeli, S., Jason Grundstad, A., Reppell, M., Waring, J., Jacob, H., Sexton, D., Bronson, P. G., Chen, X., Hu, X., Goldstein, J. I., King, D., Vittal, C., Poterba, T., Palmer, D. S., Churchhouse, C., Howrigan, D. P., Zhou, W., Watts, N. A., Nguyen, K., Nguyen, H., Mason, C., Farnham, C., Tolonen, C., Gauthier, L. D., Gupta, N., MacArthur, D. G., Rehm, H. L., Seed, C., Philippakis, A. A., Daly, M. J., Wade Davis, J., Runz, H., Miller, M. R. & Neale, B. M. Systematic single-variant and gene-based association testing of thousands of phenotypes in 394,841 UK Biobank exomes. Cell Genomics, (2022). https://doi.org/10.1016/j.xgen.2022.100168

Denny J. C., Ritchie M. D., Basford M. A., Pulley J. M., Bastarache L., Brown-Gentry K., Wang D., Masys D. R., Roden D. M., Crawford D. C., PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations, Bioinformatics, Volume 26, Issue 9, May 2010, Pages 1205–1210, https://doi.org/10.1093/bioinformatics/btq126

Shuey M. M., Stead W. W., Aka I., Barnado A. L., Bastarache J. A., Brokamp E., Campbell M., Carroll R. J., Goldstein J. A., Lewis A, Malow B. A., Mosley J. D., Osterman T., Padovani-Claudio D. A., Ramirez A., Roden D. M., Schuler B. A., Siew E., Sucre J., Thomsen I., Tinker R. J., Van Driest S., Walsh C., Warner J. L., Wells Q. S., Wheless L., Bastarache L., Next-generation phenotyping: introducing phecodeX for enhanced discovery research in medical phenomics, Bioinformatics, Volume 39, Issue 11, November 2023, https://doi.org/10.1093/bioinformatics/btad655
