How the All of Us Genomic data are organized (ARCHIVED C2021Q3R6 CDR CT Dataset v5)

The All of Us genomic dataset contains whole genome sequencing (WGS) data and microarray genotype data (Array). The genomic data are accessible only through the Researcher Workbench Controlled Tier dataset (i.e., genomic data are not available through the Registered Tier). Bucket locations for accessing the data in analysis notebooks can be found in the Controlled CDR Directory. We provide variant data in Variant Call Format (VCF), Hail MatrixTables (MT), and PLINK 1.9 bed/bim/fam triplets. PLINK files are only available for the array variant data. We provide auxiliary tabular data, such as the joint callset QC flagged samples or related pairs, as tab-separated values (tsv), with the column headers in the first row. Please see Appendix A for a detailed comparison of the WGS data and the Array data. Please note that all variants are called against the hg38/GRCh38 reference.

Variant Call Format (VCF)

The Variant Call Format (VCF) is a text file format that stores genomic variant data in a tabular form (genomic position by sample ID) with a descriptive header. All of Us VCF files are based on the VCF version 4.2 specification. Most genomic tools that handle variant data analyses (e.g. Hail, PLINK, Variant Effect Predictor (VEP), Genome Analysis Toolkit (GATK)) support the VCF format.

The current WGS dataset includes 20,039 joint VCF files and corresponding index files (.tbi) for 98,622 participants. Each file covers a separate region of the genome, with no overlap between files; interval lists for each VCF shard are available. The current Array data includes 165,208 single sample VCFs and corresponding index files (.tbi) for 165,208 participants.

The information within a VCF is broken into two basic categories: site-level (INFO field) and per genotype (FORMAT field).  Additionally, each site has a list of failed filters (FILTER field).  If the FILTER field is empty (“.”) or “PASS,” then researchers should assume that there is a call at this site (though see Genotype Filter for WGS data, below).
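As a rough illustration of the FILTER rule above, the following sketch checks whether a VCF data line has an empty (“.”) or “PASS” FILTER value. The function name and the two example lines are hypothetical; the only thing taken from the VCF specification is that FILTER is the seventh tab-separated column.

```python
def site_passes(vcf_line: str) -> bool:
    """Return True when the FILTER column is "." or "PASS"."""
    columns = vcf_line.rstrip("\n").split("\t")
    # VCF fixed columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, ...
    filter_value = columns[6]
    return filter_value in (".", "PASS")


# Hypothetical data lines, for illustration only.
passing_line = "chr1\t10001\t.\tA\tG\t.\tPASS\tAC=2;AN=4;AF=0.5"
failing_line = "chr1\t10002\t.\tC\tT\t.\tLowQual\tAC=1;AN=4;AF=0.25"
print(site_passes(passing_line), site_passes(failing_line))
```

For WGS data, a passing site can still carry filtered genotypes, so this check is only the site-level half of the picture.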

In the current genomic dataset, VCFs only include single-nucleotide variants (SNVs) and short insertion-deletion (Indel) information.

Please note that individual VCFs may contain substantially different information. For example, a VCF generated from WGS samples will typically have different fields than a VCF from array data. The header of a VCF file lists which fields are present, along with each field's data type (e.g. number vs string) and description.

WGS VCFs

The WGS VCFs are sharded by genomic region, with no shard spanning multiple contigs. We also provide interval lists (.interval_list) that describe the genomic region covered by each shard and a summary file that details the extent of each shard (see Appendix B). Note that the genotype filter tag (FT) is populated, and analyses should take it into account by treating filtered genotypes as “no calls” (“./.”).
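One way to apply the FT rule outside of Hail is sketched below in plain Python. It takes the FORMAT column and one sample column from a VCF line and returns “./.” whenever FT names a failed filter. The function name is hypothetical, and the sketch assumes that a missing FT, “.”, or “PASS” all mean the genotype was not filtered.

```python
def mask_filtered_genotype(format_field: str, sample_field: str) -> str:
    """Return the sample's GT, or "./." when its FT value names a failed filter."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    entry = dict(zip(keys, values))
    ft = entry.get("FT", ".")
    # Assumption: a missing FT, ".", or "PASS" means the genotype is unfiltered.
    if ft in (".", "PASS"):
        return entry["GT"]
    return "./."


print(mask_filtered_genotype("GT:GQ:FT", "0/1:35:low_VQSLOD_SNP"))
print(mask_filtered_genotype("GT:GQ:FT", "0/1:35:PASS"))
```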

Please note that several fields in the WGS VCF (and corresponding Hail MatrixTable) should be ignored.  Please see Known Issues #6 in the All of Us Research Program Genomic Research Data Quality Report.  Those fields are not included in the descriptions below.

WGS VCFs in the All of Us genomic dataset contain the following:

FORMAT fields (per sample-site):  

  • Genotype (GT) -- The GT field specifies the alleles carried by the sample, encoded by a 0 for the reference (REF) allele, 1 for the first alternative (ALT) allele, 2 for the second ALT allele, etc. Since humans are diploid organisms, we expect two alleles (e.g. “0/1”).  Please note that the GT calls on sex chromosomes will have two alleles, even in the case of chrY and chrX in males.

  • Genotype Quality (GQ) -- The phred-scaled confidence that the called genotype is correct.  A higher score indicates a higher confidence.  For more information on GQ, please see the GQ documentation.  For more information on interpreting phred-scaled values, please see Phred-scaled quality scores.

  • Reference Genotype Quality (RGQ) -- The phred-scaled confidence that the reference genotypes are correct. A higher score indicates a higher confidence. For more information on RGQ, please see the GQ documentation, but note that RGQ applies to the reference, not the variant. For more information on interpreting phred-scaled values, please see Phred-scaled quality scores.

  • Genotype Filter (FT) --  The WGS information has additional genotype-level filtering information.  As part of our joint callset quality control processing, we run Allele-Specific Variant Quality Score Recalibration, and use the results to populate the genotype filter (FT) field.  An example code snippet for filtering genotypes, in Hail, can be found in the Manipulating Hail Matrix Table tutorial notebook.

    • Possible values:

      • low_VQSLOD_INDEL -- The site did not pass the indel model cutoff for the recalibrated variant quality score. The cutoff corresponds to a target sensitivity of 0.990. 

      • low_VQSLOD_SNP -- The site did not pass the SNP model cutoff for the recalibrated variant quality score.  The cutoff corresponds to a target sensitivity of 0.997.
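For readers unfamiliar with phred scaling, which GQ and RGQ above both use, the standard conversion is Q = -10 * log10(P), where P is the probability that the call is wrong. The helper names below are illustrative only.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a phred-scaled score to the probability that the call is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability to a phred-scaled score."""
    return -10 * math.log10(p)


# A GQ of 20 corresponds to a 1 in 100 chance that the genotype is wrong,
# and a GQ of 30 to a 1 in 1000 chance.
for gq in (10, 20, 30):
    print(gq, phred_to_error_probability(gq))
```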

INFO fields (per site):

Descriptions of the INFO fields can also be found in the header of the VCF.

  • Allele Count (AC) -- the number of times we see each alternate allele across the samples in the callset. For example, a “1/1” genotype would count as 2 observations of the first alternate allele.

  • Allele Number (AN) -- the total number of alleles seen.  Usually, this will be the number of samples times two, since humans are diploid organisms. No-call genotypes (“./.”) are not counted towards AN.

  • Allele Frequency (AF) -- the frequency of the alternate allele in the population that is the callset cohort. This is equivalent to AC/AN.

  • QUAL approximation (QUALapprox) -- the sum of the phred-scaled homozygous reference probability values across all samples, which is a proxy for the site-level QUAL score, but without the SNP or indel heterozygosity applied as a per-site prior probability of variation.  For more information on the QUAL score, see the VCF specification.

  • Allele-specific QUAL approximations (AS_QUALapprox) -- a per-allele, phred-scaled quality score derived from the sum of homozygous reference probability values across samples when each allele is considered in isolation. This is an approximation of the QUAL score for each allele.  For more information on the QUAL score, see the VCF specification.

  • Allele-specific variant quality score log-odds (AS_VQSLOD) -- for each alt allele, the log odds of being a true variant versus being false under the trained Gaussian mixture model. Please see the GATK VQSR documentation or Genomics in the Cloud for more information on VQSR.
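The relationship between AC, AN, and AF described above can be sketched from a list of GT strings. This is an illustration rather than the pipeline's implementation; the function name and the example genotypes are hypothetical.

```python
from collections import Counter


def allele_counts(genotypes, n_alt):
    """Compute AC (one value per ALT allele), AN, and AF from GT strings."""
    counts = Counter()
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":  # no-call alleles are not counted towards AN
                counts[int(allele)] += 1
    an = sum(counts.values())
    ac = [counts[i] for i in range(1, n_alt + 1)]
    af = [count / an if an else float("nan") for count in ac]
    return ac, an, af


# Four samples at a biallelic site; one sample is a no-call.
ac, an, af = allele_counts(["0/1", "1/1", "./.", "0/0"], n_alt=1)
print(ac, an, af)
```

Here the “1/1” genotype contributes 2 to AC, the no-call contributes nothing to AN, and AF is AC/AN.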

FILTER values (per site):

  • QUAL score does not meet threshold (LowQual) -- sites with this filter have a posterior probability of being variant that is equal to or below the probability of being variant by chance, represented by the expected heterozygosity for humans (QUALapprox lower than 60 for SNPs; 69 for Indels).

    • QUAL tells you how confident we are that there is some kind of variation at a given site. The variation may be present in one or more samples.

  • No high-quality genotypes (NO_HQ_GENOTYPES) -- sites with this filter do not have any genotypes that are considered high quality (GQ>=20, DP>=10, and AB>=0.2 for heterozygotes)

    • Allele Balance (AB) is calculated for each heterozygous variant as the number of reads supporting the least-represented allele over the total number of read observations.  In other words, min(allele depth)/(total depth) for diploid GTs.

  • Excess Heterozygosity (ExcessHet) -- sites with this filter have more heterozygote genotypes than expected by chance under Hardy-Weinberg equilibrium. ExcessHet is a phred-scaled p-value. We cut off anything more extreme than a z-score of -4.5 (p-value of 3.4e-06), which phred-scaled is 54.69.
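Two of the quantities in the FILTER descriptions above are simple to reproduce: the allele balance, min(allele depth)/(total depth), and the phred-scaled value of the ExcessHet cutoff. The sketch below assumes the quoted p-value is the one-sided lower tail of a standard normal distribution at z = -4.5; the function names are illustrative.

```python
import math


def allele_balance(allele_depths):
    """min(allele depth) / (total depth) for a diploid heterozygous genotype."""
    total = sum(allele_depths)
    return min(allele_depths) / total if total else float("nan")


def z_score_to_phred(z: float) -> float:
    """Phred-scale the one-sided lower-tail p-value of a standard normal z-score."""
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))
    return -10 * math.log10(p_value)


print(allele_balance([12, 4]))            # 4 supporting reads out of 16
print(round(z_score_to_phred(-4.5), 2))   # close to the 54.69 cutoff quoted above
```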

Array VCFs

Array VCFs in the All of Us genomic dataset will contain the following:

Header

The header field of the VCF contains many attributes which generally describe the processing of the sample in the array.  Many of these are specific to a single sample.

  • arrayType - This contains the name of the genotyping array that was processed.

  • autocallDate - The date that the genotyping array was processed by ‘autocall’ (aka gencall), the Illumina genotype calling software.

  • autocallGender - The gender (sex) that autocall determined for the sample processed.

  • autocallVersion - The version of the autocall/gencall software used.

  • chipWellBarcode - The chip well barcode (a unique identifier for sample as processed on a specific location on the Illumina genotyping array).

  • clusterFile - The cluster file used.

  • extendedIlluminaManifestVersion - The version of the ‘extended Illumina manifest’ used by the VCF generation software.

  • extendedManifestFile - The filename of the ‘extended Illumina manifest’ used by the VCF generation software.

  • fingerprintGender - The gender (sex) determined using an orthogonal fingerprinting technology.  This is populated by an optional parameter used by the VCF generation software.

  • gtcCallRate - The gtc call rate of the sample processed.  This value is generated by the autocall/gencall software and represents the fraction of callable loci that had valid calls.

  • imagingDate - The date that the idats (raw image scans) for the chip well barcode were created.

  • manifestFile - The name of the Illumina manifest (.bpm) file used by the VCF generation software.

  • sampleAlias - The name of the sample.

Note that there are many other attributes in the header (Biotin*, DNP*, Extension*, Hyb*, NP*, NSB*, Restore, String*, TargetRemoval) that are populated with Illumina control values.  They are not described here.

Filtered Sites (FILTER)

There are several filters specific to genotyping array content.  These are:

  • DUPE - This filter is applied if there are multiple rows in the VCF for the same locus and alleles. That is, if there are two or more rows that share the same chromosome, position, ref allele and alternate alleles, all but one of them will have the ‘DUPE’ filter set.

  • TRIALLELIC - This filter is applied if there is a site at which there are two alternate alleles and neither of them is the same as the reference allele.

  • ZEROED_OUT_ASSAY - This filter is applied if the variant at the site was ‘zeroed out’ in the Illumina cluster file. This is typically done when the calls at the site are intentionally marked as unusual. Genotypes called at sites that are ‘zeroed out’ will always be no-calls.
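The DUPE rule above amounts to grouping rows by chromosome, position, ref allele, and alternate alleles. The sketch below marks every repeat after the first occurrence; which row in a duplicate group is left unflagged is not stated here, so keeping the first is an assumption, and the function name and records are hypothetical.

```python
def flag_duplicates(records):
    """Return one boolean per record: True if the site key was already seen."""
    seen = set()
    flags = []
    for chrom, pos, ref, alts in records:
        key = (chrom, pos, ref, tuple(alts))
        flags.append(key in seen)  # assumption: the first occurrence stays unflagged
        seen.add(key)
    return flags


# Hypothetical rows: the second and third share chromosome, position, and alleles.
rows = [
    ("chr1", 10001, "A", ["G"]),
    ("chr1", 20002, "C", ["T"]),
    ("chr1", 20002, "C", ["T"]),
    ("chr1", 20002, "C", ["G"]),
]
print(flag_duplicates(rows))
```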

Genotype (sample level fields)

These fields describe attributes specific to the sample genotyped on the array.  The FORMAT specifier in the VCF header describes these fields.  They are:

  • GT - GENOTYPE.  This field describes the genotype.  It is a standard field, described in the VCF specification.

  • IGC - Illumina GenCall Confidence Score.  A measure of the call confidence.

  • X - Raw X intensity as scanned from the original genotyping array

  • Y - Raw Y intensity as scanned from the original genotyping array

  • NORMX - Normalized X intensity

  • NORMY - Normalized Y intensity

  • R - Normalized R Value (one of the polar coordinates after the transformation of NORMX and NORMY)

  • THETA - Normalized Theta value (one of the polar coordinates after the transformation of NORMX and NORMY)

  • LRR - Log R Ratio

  • BAF - B Allele Frequency
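The transformation behind R and THETA is not spelled out above. The sketch below assumes the convention commonly described for Illumina arrays, R = NORMX + NORMY and THETA = (2/π) * arctan(NORMY / NORMX), so that THETA runs from 0 (all signal on X) to 1 (all signal on Y). Please treat both formulas and the function name as assumptions, and check them against the values in the VCF before relying on them.

```python
import math


def polar_from_normalized(norm_x: float, norm_y: float):
    """Return (R, THETA) under the assumed convention described in the lead-in."""
    r = norm_x + norm_y
    # atan2 avoids a division by zero when norm_x is 0.
    theta = (2 / math.pi) * math.atan2(norm_y, norm_x)
    return r, theta


print(polar_from_normalized(1.0, 1.0))  # equal signal on both channels
print(polar_from_normalized(1.5, 0.0))  # signal on the X channel only
```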

INFO (site level fields)

These fields describe attributes specific to the probe on an array.  The INFO specifier in the VCF header describes these fields.  They are:

  • AC - Allele Count in genotypes, for each ALT allele.  A standard field, described in the VCF specification

  • AF - Allele Frequency.  A standard field, described in the VCF specification

  • AN - Allele Number.  A standard field, described in the VCF specification

  • ALLELE_A - The A Allele, as annotated in the Illumina manifest (a * suffix indicates this is the reference allele)

  • ALLELE_B - The B Allele, as annotated in the Illumina manifest (a * suffix indicates this is the reference allele)

  • BEADSET_ID - The BeadSet ID.  An Illumina identifier.  Used for normalization.

  • GC_SCORE - The Illumina GenTrain Score.  A quality score describing the probe design

  • ILLUMINA_BUILD - The Genome Build for the design probe sequence, as annotated in the Illumina manifest

  • ILLUMINA_CHR - The chromosome of the design probe sequence, as annotated in the Illumina manifest.

  • ILLUMINA_POS - The position of the design probe sequence (on ILLUMINA_CHR), as annotated in the Illumina manifest.

  • ILLUMINA_STRAND - The strand for the design probe sequence, as annotated in the Illumina manifest.

  • PROBE_A - The allele A probe sequence as annotated in the Illumina manifest.

  • PROBE_B - The allele B probe sequence as annotated in the Illumina manifest.  Note that this is only present on strand ambiguous SNPs.

  • SOURCE - The probe source as annotated in the Illumina manifest.

  • refSNP - The dbSNP rsId for this probe

Hail MatrixTable

A Hail MatrixTable (MT) is a set of binary files describing a two-dimensional matrix of entry fields where each entry is indexed by row key (variants) and column key (samples).  We provide Hail MatrixTables for both WGS and Array data.  Please refer to the published notebooks on how to use Hail MatrixTables.

Array Hail MatrixTable

We have merged the Array VCFs into a dense Hail MatrixTable (MT) (i.e. no additional processing across samples).  Each column corresponds to the research ID of the sample and each row corresponds to the variant.  Since the single sample array VCFs have identical sites and FILTER values, the FILTER field is populated with the value from a single sample VCF.

In conversion, we have dropped all of the 548 variants from alternate, unlocalized, and unplaced contigs (453 variants from ALT contigs (e.g. chr19_KI270866v1_alt), 82 from random contigs (e.g.  chr1_KI270706v1_random), and 13 from chrUn).  Please refer to the published notebooks on how we generated the Hail MatrixTable from the VCFs.
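As an illustration of the three contig classes named above, the sketch below classifies contig names by the GRCh38 naming pattern seen in the examples (an “_alt” suffix, a “_random” suffix, or a “chrUn” prefix). It is not the code used for the conversion; the function name is hypothetical, and the chrUn contig name in the example is included only to show the pattern.

```python
def contig_category(contig: str) -> str:
    """Classify a GRCh38 contig name by its naming pattern."""
    if contig.endswith("_alt"):
        return "alt"
    if contig.endswith("_random"):
        return "random"
    if contig.startswith("chrUn"):
        return "unplaced"
    return "other"  # anything not matching the three dropped classes


for name in ("chr19_KI270866v1_alt", "chr1_KI270706v1_random", "chrUn_KI270302v1", "chr1"):
    print(name, contig_category(name))

# The per-class counts quoted above add up to the total number of dropped variants.
print(453 + 82 + 13)
```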

WGS Hail MatrixTable

The WGS MT contains the result of the WGS joint callset. The WGS MT does not split multi-allelic variants, and every “PASS” value in the FILTER field is written as “.” (i.e. missing) in the Hail MatrixTable. The WGS Hail MatrixTable contains all the information presented in the VCFs. Please refer to the published notebooks on how we generated the Hail MatrixTable from the VCFs and on how to filter WGS variants.

Array PLINK 1.9 data

We provide binary PLINK 1.9 data (.bed / .bim / .fam) for Array data. The PLINK files are converted from the Hail MatrixTable using the export_plink command in Hail and contain all information in the Hail MatrixTable. The .fam file contains information on the participants, the .bim file contains information on the genetic markers, and the binary .bed file contains the genotypes. Please refer to the published notebooks on how to use the PLINK 1.9 data.
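For orientation, the two text files in a PLINK 1.9 triplet can be read with a few lines of plain Python. In the PLINK 1.9 format, each .bim row describes one variant and each .fam row describes one sample, both with six whitespace-delimited columns and no header. The column labels, function name, and example rows below are illustrative choices, not names defined by PLINK or by this dataset.

```python
import io

# Descriptive labels for the six headerless columns of each file.
BIM_COLUMNS = ["chrom", "variant_id", "cm_position", "bp_position", "allele_1", "allele_2"]
FAM_COLUMNS = ["family_id", "individual_id", "father_id", "mother_id", "sex", "phenotype"]


def read_plink_table(handle, columns):
    """Parse a whitespace-delimited PLINK text file into a list of dicts."""
    return [dict(zip(columns, line.split())) for line in handle if line.strip()]


# Hypothetical file contents, for illustration only.
bim_text = io.StringIO("1\texample_variant\t0\t10001\tA\tG\n")
fam_text = io.StringIO("0 1000055 0 0 0 -9\n")

variants = read_plink_table(bim_text, BIM_COLUMNS)
samples = read_plink_table(fam_text, FAM_COLUMNS)
print(variants[0]["bp_position"], samples[0]["individual_id"])
```

The binary .bed file is best left to PLINK or Hail rather than parsed by hand.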

Please note that we will provide pgen files, instead of bed/bim/fam, in future callset releases.

Relatedness

We report the kinship score of any pair with a score over 0.1.  

The kinship score is half of the fraction of the genetic material shared.

  • Parent-child or siblings: 0.25

  • Identical twins: 0.5

Please see the Hail pc_relate function documentation for more information, including interpretation.

We provide two tables:

  • Pairwise samples with kinship scores above 0.1.  We will not provide identity kinship scores (i.e. kinship of a sample with itself).
    • Note that a pair will only appear once.  In other words, {sample1, sample2, 0.25} is equivalent to {sample2, sample1, 0.25}.

Field name | Type | Key? | Notes
i.s | string | yes | Sample ID of a sample in the pair
j.s | string | yes | Sample ID of the other sample in the pair
kin | float | no | Kinship score (0-0.5)

Column Explanations

    • Field name -- The name of the field.  In tsvs, this will appear on the first row of the file.
    • Type -- Data type.  Arrays are possible.
    • Key? -- Whether this field makes up a unique key for the row.  Note that all key fields together make a unique key for the row.
    • Notes -- Any other relevant information.
  • A list of samples to flag. Pruning these samples leaves the maximal independent set of unrelated samples; the list is the minimal set of samples to prune to remove related samples from the cohort.

Field name | Type | Key? | Notes
sample_id.s | string | yes | Research ID of the sample

Column Explanations

    • Field name -- The name of the field.  In tsvs, this will appear on the first row of the file.
    • Type -- Data type.  Arrays are possible.
    • Key? -- Whether this field makes up a unique key for the row.  Note that all key fields together make a unique key for the row.
    • Notes -- Any other relevant information.
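A minimal sketch of how the two relatedness tables might be used together: read the pairwise tsv (columns i.s, j.s, kin), treat each pair as unordered, and drop the flagged samples from a cohort. The function names, sample IDs, and kinship values are hypothetical.

```python
import csv
import io


def load_related_pairs(handle):
    """Read the pairwise tsv into {frozenset({sample1, sample2}): kinship}."""
    reader = csv.DictReader(handle, delimiter="\t")
    # A frozenset makes {sample1, sample2} and {sample2, sample1} the same key.
    return {frozenset((row["i.s"], row["j.s"])): float(row["kin"]) for row in reader}


def prune_flagged(cohort, flagged):
    """Return the cohort without the samples listed in the flagged table."""
    return [sample for sample in cohort if sample not in flagged]


# Hypothetical file contents, for illustration only.
pairs_tsv = "i.s\tj.s\tkin\n1000001\t1000002\t0.25\n1000003\t1000004\t0.5\n"
pairs = load_related_pairs(io.StringIO(pairs_tsv))
flagged = {"1000002", "1000004"}
cohort = ["1000001", "1000002", "1000003", "1000004", "1000005"]

print(pairs[frozenset(("1000002", "1000001"))])
print(prune_flagged(cohort, flagged))
```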

Genetic predicted ancestry

We provide categorical genetic ancestry for all participants as a .tsv file. The ancestry categories correspond directly to the categorical ancestry definitions used within gnomAD, the Human Genome Diversity Project, and 1000 Genomes:

African/African American (afr), American Admixed/Latino (amr), East Asian (eas), European (eur), Middle Eastern (mid), South Asian (sas), and Other (oth; not belonging to one of the other ancestries or is a balanced admixture).

The report is a table, saved as a tsv, sorted by research ID.

Field Name | Key? | Type | Nullable? | Example Value | Notes
research_id | yes | String | No | 1000055 | This comes from sample metadata.
ancestry_pred | no | String | No | mid | The predicted ancestry for the sample, not including “other”.
probabilities | no | Array[number] | No | [0.10, 0.99, 0.001, … 0.0] | Confidence of each output class (i.e. computed ancestry). Each will have a length equal to the number of possible computed ancestry labels minus one (6). The ancestry “Other” is computed separately based on the confidence of the other classes.
pca_features | no | Array[number] | No | [8.1232, 0.01234, 3.1123, …, 0.00132] | The principal components of the projection for the sample. Each value is an array with a length of 16.
ancestry_pred_other | no | String | No | oth | The predicted ancestry for the sample, including “other”.

Column Explanations

    • Field name -- The name of the field.  In tsvs, this will appear on the first row of the file.
    • Type -- Data type.  Arrays are possible.
    • Key? -- Whether this field makes up a unique key for the row.  Note that all key fields together make a unique key for the row.
    • Notes -- Any other relevant information.
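As a small usage illustration, the ancestry tsv can be tallied by predicted label with the standard library. The sketch reads only the string columns, since the way the array columns are serialized in the file is not described here. The function name and example rows are hypothetical.

```python
import csv
import io
from collections import Counter


def tally_ancestry(handle, column="ancestry_pred_other"):
    """Count samples per predicted ancestry label in the ancestry tsv."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[column] for row in reader)


# Hypothetical rows showing only the string columns.
ancestry_tsv = (
    "research_id\tancestry_pred\tancestry_pred_other\n"
    "1000055\tmid\tmid\n"
    "1000056\teur\toth\n"
    "1000057\teur\teur\n"
)
print(tally_ancestry(io.StringIO(ancestry_tsv)))
print(tally_ancestry(io.StringIO(ancestry_tsv), column="ancestry_pred"))
```

Choosing ancestry_pred_other keeps the “oth” category; choosing ancestry_pred forces each sample into one of the six named groups.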

Flagged WGS samples

We provide a table listing samples that are flagged as part of the sample outlier QC for the WGS joint callset. This includes the specific residual tests that were failed. The schema is described in the table below. The table will be released as a tsv.

Flagged sample tsv schema

  • No fields can have a null value.
  • Count fields do not include filtered variants.
  • For all of the fail_* fields, a value of true indicates that the sample is an outlier and should be flagged.

Field Name | Type | Key? | Example Value | Notes
s | int | yes | 1000000 | Research ID
ancestry_pred | string | no | eur | The predicted ancestry for the sample, not including “other”.
probabilities | array<float> | no | [0.10, 0.99, 0.001, … 0.0] | Confidence of each output class (i.e. computed ancestry). Each will have a length equal to the number of possible computed ancestry labels minus one (6). The ancestry “Other” is computed separately based on the confidence of the other classes.
pca_features | array<float> | no | [8.1232, 0.01234, 3.1123, …, 0.00132] | Each will have a length of 16.
ancestry_pred_other | string | no | oth | The predicted ancestry for the sample, including “other”.
snp_count | int | no | 3910035 | Number of SNPs called in this sample.
ins_del_ratio | float | no | 0.98814 | Ratio of insertion to deletion counts.
del_count | int | no | 427102 |
ins_count | float | no | 456515 |
snp_het_homvar_ratio | float | no | 2.1119 |
indel_het_homvar_ratio | float | no | 2.3994 |
ti_tv_ratio | float | no | 1.9967 |
singleton | int | no | 15819 | IMPORTANT: This is not the number of singletons in a sample. This field is a count of the number of variants not appearing in gnomAD 3.1.
fail_snp_count_residual | boolean | no | true |
fail_ins_del_ratio_residual | boolean | no | false |
fail_del_count_residual | boolean | no | true |
fail_ins_count_residual | boolean | no | false |
fail_snp_het_homvar_ratio_residual | boolean | no | true |
fail_indel_het_homvar_ratio_residual | boolean | no | false |
fail_ti_tv_ratio_residual | boolean | no | true |
fail_singleton_residual | boolean | no | false |
qc_metrics_filters | array<string> | no | ["indel_het_homvar_ratio_residual", "snp_count_residual"] | A list of each failed test. These will correspond to all fail_* fields with a value of “true”.
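The correspondence between the fail_* fields and qc_metrics_filters can be sketched as follows. The example assumes that booleans appear as the text “true”/“false” in the tsv and that each test name is the field name without its “fail_” prefix, as the example values in the schema suggest. The function name and the row are hypothetical.

```python
FAIL_PREFIX = "fail_"


def failed_tests(row):
    """List the residual tests a sample failed, based on its fail_* fields."""
    return sorted(
        name[len(FAIL_PREFIX):]
        for name, value in row.items()
        # Assumption: booleans are serialized as the text "true" / "false".
        if name.startswith(FAIL_PREFIX) and str(value).lower() == "true"
    )


# Hypothetical row showing only a few of the fields.
row = {
    "s": "1000000",
    "fail_snp_count_residual": "true",
    "fail_ins_del_ratio_residual": "false",
    "fail_indel_het_homvar_ratio_residual": "true",
}
print(failed_tests(row))
```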

 

Variant Annotation Table

We provide 102 functional annotations through the Variant Annotation Table. We provide the annotations as compressed tab-separated value text files (“.tsv.gz”) that match each shard of the VCFs. Additionally, we provide a single, merged tsv file (“.tsv.bgz”).

 

Appendix A: WGS and Array data comparison for beta

Table A.1 -- Comparison between WGS and Array genomic data

Deliverables | WGS | Array | Notes
VCFs | Joint called VCFs sharded by genomic region | Single sample VCFs | For arrays, VCFs all have the same variants. All VCFs are sorted and block compressed (.vcf.gz) with a local tabix index (.vcf.gz.tbi).
Hail Matrix Table | Yes | Yes |
PLINK files | No | Yes | Generated from the Hail MatrixTable.
AC, AN, and AF in the VCFs | Yes | No | Not meaningful in single sample VCFs (i.e. array VCFs).
AC, AN, and AF in the Hail MatrixTable | Yes | Yes | For arrays, we have created one Hail MT based on all of the array VCFs. We re-calculate the AC, AN, and AF.
Ancestry | Yes | No | WGS samples are a subset of array samples, so the intersecting array samples will have ancestry information.
Relatedness | Yes | No | WGS samples are a subset of array samples, so the intersecting array samples will have relatedness information.
Joint callset QC | Yes | No | Please see the Joint Callset section of the All of Us Beta Release Genomic Quality Report.
Functional Annotations | Yes | No |

 

Appendix B: WGS VCF shard interval extent file format

We include a single tsv file that details the genomic extent of each VCF shard interval file. This allows researchers to prune the considered intervals up front in their analyses. The format of the file is given in Table B.1:

Table B.1 -- Interval list extent tsv file

  • No fields can have a null value.

Field Name | Type | Key? | Example Value | Notes
filename | string | yes | 0000000000-scattered.interval_list | The filename of the corresponding interval list file. Note that the shard number is in the filename.
start_contig | string | no | chr1 | This will always be the same as the end_contig, since the interval lists do not span multiple contigs.
start_position | int | no | 10001 | Position on the contig for the start of the interval list file.
end_contig | string | no | chr1 | This will always be the same as the start_contig, since the interval lists do not span multiple contigs.
end_position | int | no | 124997 | This will always be greater than the start_position. Position on the contig for the end of the last interval in the interval list file.
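To show how the extent file can be used to prune shards up front, the sketch below returns the interval list filenames whose extent overlaps a region of interest. The first example row uses the example values from Table B.1; the second row, the function name, and the query regions are hypothetical.

```python
import csv
import io


def shards_overlapping(handle, contig, start, end):
    """Return filenames of shards whose extent overlaps contig:start-end."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        if row["start_contig"] != contig:
            continue  # shards never span contigs, so one contig check is enough
        if int(row["start_position"]) <= end and int(row["end_position"]) >= start:
            hits.append(row["filename"])
    return hits


extent_tsv = (
    "filename\tstart_contig\tstart_position\tend_contig\tend_position\n"
    "0000000000-scattered.interval_list\tchr1\t10001\tchr1\t124997\n"
    # Hypothetical second shard, for illustration only.
    "0000000001-scattered.interval_list\tchr1\t124998\tchr1\t250000\n"
)
print(shards_overlapping(io.StringIO(extent_tsv), "chr1", 120000, 130000))
print(shards_overlapping(io.StringIO(extent_tsv), "chr1", 1, 50000))
```

Because the shard number is in the filename, the returned names can be mapped back to the matching VCF shards.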
