How the All of Us Genomic data are organized

The All of Us genomic data includes short read whole genome sequencing (srWGS) data, long read whole genome sequencing (lrWGS) data, and microarray genotyping (“array”) data. Researchers access this genomic data through the Researcher Workbench (RW) Controlled Tier dataset (i.e. genomic data is not available through the Registered Tier). Bucket locations for accessing the data in analysis notebooks can be found in the Controlled CDR Directory.

Short variants - Single Nucleotide Polymorphisms (SNPs) and Insertions & Deletions (Indels) - are available for srWGS data, lrWGS data, and arrays. Structural variants (SVs) are available for srWGS and lrWGS data. We provide variant data in VariantDataset (VDS), Hail MatrixTable (MT), Variant Call Format (VCF), Binary GEN format (BGEN), and PLINK 1.9 bed/bim/fam triplets. Raw data is available in compressed CRAM or BAM format for the WGS reads and IDAT files for array data. We also provide auxiliary tabular data, such as the joint callset QC flagged samples or related pairs. A summary of the file formats for each data type can be found in Table 1.

In this article, we will summarize the genomic data formats and what information is available in each data type. In some cases, we will refer to other documentation when it describes the data format we deliver. This article assumes a general knowledge of genomics and bioinformatics. For more questions on how to get started with genomic data, please reach out to the All of Us user support team at support@researchallofus.org. We also provide a detailed report on the quality of the genomic data with each release in the All of Us Genomic Quality Report available on the User Support Hub.

Genomic Data in the All of Us Research Program

The main deliverables of interest in the All of Us Research Program are the genomic variants, which are delivered in multiple data formats in order to meet researchers' various needs (Table 1). With the v7 data release, we are introducing the new VariantDataset data format, which is a Hail data storage format for large datasets. This format was introduced because of the large size of the All of Us srWGS SNP & Indel dataset and the scaling issues of VCF and Hail MatrixTable formats. 

 

Table 1 – Deliverables for each genomic data type

Reference version

  • srWGS SNP & Indel: hg38/GRCh38 reference
  • Array: hg38/GRCh38 reference. Note: variants are originally called with the hg19 reference, but they are lifted over before release on the RW.
  • srWGS SVs: hg38/GRCh38 reference
  • lrWGS: T2Tv2.0 and grch38_noalt

Raw data

  • srWGS SNP & Indel: CRAM files
  • Array: IDAT files
  • srWGS SVs: CRAM files (same deliverable as srWGS SNP & Indels)
  • lrWGS:
    • BAM files: grch38_noalt standard & haplotagged; T2Tv2.0 standard & haplotagged
    • Graphical Fragment Assembly (GFA) files: primary de novo assembly, one de novo assembly for each chromosome copy
    • Fasta files: one for each GFA file

Variant data

  • srWGS SNP & Indel:
    • VDS: all samples, complete WGS
    • VCFs, Hail MT, BGEN, PLINK bed: joint called (queried from the VDS) over limited genomic regions - ACAF threshold, exome, ClinVar
  • Array:
    • Single sample VCFs (all VCFs have the same variants)
    • Hail MatrixTable (merged)
    • PLINK bed files (merged)
  • srWGS SVs:
    • Joint called VCF
    • Sites-only joint-called VCF
  • lrWGS:
    • VCFs: single sample SNP and Indel phased variants; single sample SNP and Indel unphased variants; joint called SNP and Indel variants; single sample PAV variants; single sample PBSV SVs; single sample Sniffles2 SVs
    • Hail MT: joint called SNP and Indel variants
    • SNF: single sample Sniffles2 SNF

Auxiliary files

  • srWGS SNP & Indel:
    • Functional Annotations (Variant Annotation Table)
    • Relatedness
    • Maximal set of unrelated samples
    • Ancestry
    • Limited region callset UCSC BED files
    • Flagged samples
    • srWGS Genomic metrics file
  • Array: Ancestry and relatedness available for array samples that have srWGS data
  • srWGS SVs:
    • Ancestry and relatedness available for srWGS samples based on the srWGS SNP & Indel deliverables
    • Maximal set of unrelated samples
    • Samples with probable aneuploidies
  • lrWGS:
    • Ancestry and relatedness available for lrWGS samples based on the srWGS SNP & Indel deliverables
    • lrWGS variant metrics

 

Genomic Data File Types

VariantDataset (VDS)

The Hail VariantDataset (VDS) is a data storage format we use for the All of Us srWGS SNP & Indel variant data. Because the All of Us callset is one of the largest in the world, we use the VDS to store variant data efficiently for all samples over the entire genome. The VDS is a sparse Hail data storage format: it stores less data than a dense format while retaining more information. As a comparison, the Hail MatrixTable is a dense variant storage format with every entry populated. For an overview of the VDS, check out ‘The new VDS format for All of Us srWGS data’ article on the User Support Hub.

Most downstream analyses of the VDS involve filtering and converting the VDS into a VCF, Hail MatrixTable (MT), or other dense format (“densifying”). We have performed this step already to cover most use cases with reduced srWGS SNP & Indel variant datasets in VCF, Hail MT, BGEN, and PLINK bed formats over commonly used areas of the genome (see Genomics FAQ: Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK).

Instructions for densifying the VDS are available in the article ‘The new VDS format for All of Us srWGS data’ and the VDS tutorial notebook. 

In the following sections, we describe how the VDS stores variant data, reference data, and how to determine if a variant site is filtered. 

Variant Data

The VDS uses variant level row fields to store data for all samples, including the variant locus (locus), a list of alternate alleles (alleles), and site level filtering data (filters). Local fields store data that only apply to a single sample, including genotype metadata and genotype filtering. The local alleles (LA) array maps the alleles that appear in the individual sample to the list of alternate alleles (alleles). As a result, genotype metadata is stored only for the samples that carry a given genotype.

Some familiar annotations from a VCF or Hail MT are not present in the VDS, but can be rendered when densifying the VDS. The allele count for each alternate allele (AC), the total number of alleles at each site (AN), and the frequency of each alternate allele (AF) are also stored in the Variant Annotation Table (VAT) for all variants that pass filtering.

Tables 2-5 describe the fields in the All of Us VDS. Please see the Hail documentation for more information on the Hail data types.

 

Table 2. VDS column fields: stores sample name

VDS Field Description Hail data type
s Research ID str

 

Table 3. VDS row fields: stores variant data

VDS Field Description Hail data type Example
locus Positional data for the variant. Formatted as chromosome name and position separated by colon. locus<GRCh38> chr1:12807
alleles List of alleles at a locus for all samples (otherwise known as global alleles). The first allele is the reference allele. All the alternate alleles are then listed in alphabetical order. array<str> [“C”, “T”]
filters Site level filtering information. Hard threshold filters include NO_HQ_GENOTYPES, LowQual, and ExcessHet. If no filtering reason is provided or there is a PASS, then the site has passed filtering. set<str> {“LowQual”, “NO_HQ_GENOTYPES”}
as_vqsr Allele-specific VQSR filtering model information for this site. Does not contain information about whether or not the site was filtered. We recommend that most users ignore this field and look at filters for useful filtering information. dict<str, struct { model: str, vqslod: float64, yng_status: str }> {“A”:(“SNP”,-1.15e+00,“G”)}

 

Table 4. VDS entry fields: stores genotype level variant data

VDS Field Description Hail data type Example
GT Genotype for the sample at this locus. The call values are coordinates to the alleles array. GT can appear similar to LGT, however, the LGT call uses coordinates to the local alleles (LA) array. Not standard in the VDS but present in the All of Us VDS. Follows VCF description. call [1/1]
GQ Genotype Quality. Follows VCF description. int32 63
RGQ Reference Genotype Quality. Follows VCF description. int32 101
LGT Local genotype. The coordinates map to LA. LA always includes the reference allele so the call can be [0/1], [1/1], or [1/2]. call [1/1]
LAD Local allele depth, describes the allele depth for one sample. Maps to the alleles described in the local alleles (LA) array. See VCF description. array<int32>  [0,8]
LA Local alleles. The reference allele and allele(s) that appear in the sample are listed as coordinates mapping to the global alleles array. The reference coordinate is always included. array<int32>  [0,1]
FT Boolean containing genotype level filtering. True for PASS, False for FAIL, and NA for (.). In most cases, NA should be treated as PASS. The filtering reason is not provided.  bool True
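The relationship between the local fields (LGT, LAD, LA) and the global alleles array can be sketched in plain Python. This is only an illustration of the indexing described in Tables 3 and 4, not Hail code; the function names and the example site are made up for this sketch.

```python
from typing import List, Tuple


def lgt_to_gt(lgt: Tuple[int, ...], la: List[int]) -> Tuple[int, ...]:
    """Translate a local genotype (indices into LA) into global allele indices."""
    return tuple(la[i] for i in lgt)


def lad_to_ad(lad: List[int], la: List[int], n_alleles: int) -> List[int]:
    """Expand local allele depths to one depth per global allele (0 where the allele is absent)."""
    ad = [0] * n_alleles
    for local_index, global_index in enumerate(la):
        ad[global_index] = lad[local_index]
    return ad


# Hypothetical site with two alternate alleles; the reference allele comes first.
alleles = ["C", "A", "T"]
la = [0, 2]      # this sample carries the reference allele and "T"
lgt = (0, 1)     # heterozygous, expressed in local (LA) coordinates
lad = [7, 8]     # depths for the alleles listed in LA

gt = lgt_to_gt(lgt, la)
ad = lad_to_ad(lad, la, len(alleles))
print(gt, [alleles[i] for i in gt], ad)
```

Here the local call 0/1 corresponds to the global call 0/2, because the sample's second local allele is the third entry of the global alleles array.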

 

Table 5. VDS global fields: filtering metadata for the entire callset

Note: These fields report metadata of the filtering model. See the row filter field filters or entry genotype field FT to see whether a variant did not meet the threshold reported in these fields. 

VDS Field Description Hail data type
tranche_data VQSR model information array<struct { model: str, truth_sensitivity: float64, min_vqslod: float64, filter_name: str }>

truth_sensitivity_snp_threshold SNP sensitivity threshold float64 
truth_sensitivity_indel_threshold Indel sensitivity threshold float64
snp_vqslod_threshold VQSR SNP threshold float64
indel_vqslod_threshold VQSR Indel threshold float64

 

Reference Data

The VDS also stores reference data for each sample as reference blocks in a separate component table reference_data. The row key is the locus and the ref_allele denotes the reference base at the genomic coordinate. Columns are keyed by the sample ID. No data at a particular location indicates that the sample has a variant call.

 

Table 6. VDS reference data column fields: stores sample name

VDS Field Description Hail data type
s Research ID str

 

Table 7. VDS reference data row fields: stores reference data

VDS Field Description Hail data type Example
locus Positional data for the variant. Formatted as chromosome name and position separated by colon. locus<GRCh38> chr1:10029
ref_allele The reference allele at the genomic coordinate str “A”

 

Table 8. VDS reference data entry fields: stores reference blocks

VDS Field Description Hail data type Example
GQ Genotype Quality. Follows VCF description. int32 40
END Indicates the end of the reference block, which is the group of consecutive non-variant sites that have the same genotype quality. All coordinates between the start locus and the end coordinate are called as reference for the sample. int32 10036
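As a rough illustration of how reference blocks work, the sketch below looks up whether a position falls inside one of a sample's reference blocks and, if so, returns the block's GQ. It is plain Python rather than Hail, the RefBlock type and the block values are hypothetical, and it assumes the END coordinate is inclusive.

```python
from typing import List, NamedTuple, Optional


class RefBlock(NamedTuple):
    start: int  # position of the row locus (first base of the block)
    end: int    # END entry field, assumed inclusive in this sketch
    gq: int     # genotype quality shared by every site in the block


def ref_gq_at(blocks: List[RefBlock], pos: int) -> Optional[int]:
    """Return the GQ of the reference block covering pos, or None if no block covers it."""
    for block in blocks:
        if block.start <= pos <= block.end:
            return block.gq
    return None


# Hypothetical reference blocks for one sample on one chromosome.
blocks = [RefBlock(10029, 10036, 40), RefBlock(10040, 10055, 20)]

print(ref_gq_at(blocks, 10030))  # inside the first block
print(ref_gq_at(blocks, 10038))  # between blocks: no reference data here
```

A result of None corresponds to the case described above where no reference data is stored at that location for the sample.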

 

Filtering Information

The variant filtering data is represented in two fields in the VDS: filters and FT (Table 3, Table 4). The filters field contains site level filters, including NO_HQ_GENOTYPES, LowQual, and ExcessHet. If no filtering reason is provided or the filters field contains PASS, then the site has passed filtering. The FT field contains genotype level filtering. The genotype level filtering reasons are not specified in the All of Us VDS; instead, a boolean describes the filtering status of the genotype: True is PASS and False is FAIL. If all genotypes fail at a site, the True or False boolean can also apply to the filters field. The variant filtering process is described in depth in the QC report. All filtered variants are soft filtered, which means the variants are marked but not removed from the callset.

We provide a tutorial notebook for converting VDS to a Hail MT format, including code to transform the FT boolean True or False in the VDS to PASS or FAIL so that it is compatible for converting to a VCF.
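The filtering rules above can be summarized in a few lines of plain Python. This is a sketch of the logic only, not the code from the tutorial notebook; the function names are made up here, and treating a missing FT as usable follows the guidance in Table 4 that NA should be treated as PASS in most cases.

```python
from typing import Optional, Set


def site_passes(filters: Optional[Set[str]]) -> bool:
    """A site passes if no filter reason is recorded or the only value is PASS."""
    return not filters or filters == {"PASS"}


def ft_to_vcf(ft: Optional[bool]) -> str:
    """Render the VDS FT boolean as a VCF-style string: True, False, or missing."""
    if ft is None:
        return "."
    return "PASS" if ft else "FAIL"


def genotype_usable(filters: Optional[Set[str]], ft: Optional[bool]) -> bool:
    """Keep a genotype when the site passes and FT is not an explicit FAIL."""
    return site_passes(filters) and ft is not False


print(ft_to_vcf(True), ft_to_vcf(False), ft_to_vcf(None))
print(genotype_usable({"LowQual", "NO_HQ_GENOTYPES"}, True))
```

Because all filtering is soft, a function like genotype_usable is something each analysis applies for itself; nothing is removed from the callset in advance.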

Hail MatrixTable

A Hail MatrixTable (MT) is a dense variant data storage format: a set of binary files describing a two-dimensional matrix of entry fields where each entry is indexed by row key (variants) and column key (samples). We recommend that researchers become familiar with Hail and the Hail MT format. We provide a complete Hail MT for array data, limited Hail MTs for srWGS SNP & Indel data, and a complete Hail MT for the joint lrWGS SNP & Indel variant calls. Please refer to the published notebooks on how to use Hail MTs.

Array Hail MatrixTable

We have merged the array VCFs into a Hail MT with no additional processing across samples. Each column corresponds to the research ID of the sample and each row corresponds to the variant. Since the single sample array VCFs have identical sites and FILTER values, the FILTER field is populated with the value from a single sample VCF.

In conversion, we have dropped all of the 505 variants from alternate, unlocalized, and unplaced contigs: 436 variants from ALT contigs (e.g. chr19_KI270866v1_alt), 72 from random contigs (e.g. chr1_KI270706v1_random), and 13 from chrUn (e.g. chrUn_KI270742v1). These variants are still in the compressed array VCFs. The array Hail MatrixTable contains 1,826,060 variants. Please refer to the published notebooks on how we generated the Hail MatrixTable from the VCFs.
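For researchers who work from the array VCFs and want to apply a similar restriction themselves, the contig names in the examples above suggest a simple name-based check. The sketch below is an assumption based on those naming patterns (suffixes _alt and _random, prefix chrUn_), not the code used to build the Hail MatrixTable.

```python
def is_primary_contig(contig: str) -> bool:
    """Return False for alternate, unlocalized (random), and unplaced (chrUn) contigs."""
    return not (
        contig.endswith("_alt")
        or contig.endswith("_random")
        or contig.startswith("chrUn_")
    )


contigs = ["chr1", "chr19_KI270866v1_alt", "chr1_KI270706v1_random", "chrUn_KI270742v1", "chrX"]
print([c for c in contigs if is_primary_contig(c)])
```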

srWGS SNP & Indel Hail MatrixTable

We have released srWGS SNP and Indel variants in Hail MatrixTables covering three reduced genomic regions: ACAF threshold, exome, and ClinVar. The ACAF threshold callset contains variants that have a population-specific allele frequency (AF) greater than 1% or a population-specific allele count (AC) greater than 100 in any computed ancestry subpopulations. The exome callset contains variants that are within the exon regions of the Gencode v42 basic transcripts, with padding of 15 bases on either side of each exon. The ClinVar callset contains variants in ClinVar, regardless of pathogenicity. Each reduced MT is available as both a multiallelic and multiallelic split Hail MT, resulting in six total Hail MT deliverables for the srWGS SNP and Indel callset. In the multiallelic split MT, sites with multiple alternate alleles will be split, so each row will only have one alternate allele. In the multiallelic MT, sites with multiple alternate alleles will be retained in the same row.
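To make the difference between the multiallelic and multiallelic split MTs concrete, the following plain-Python sketch splits one site with two alternate alleles into two biallelic rows. It uses one common convention, in which every allele other than the one being kept is recoded as reference; it does not re-normalize alleles and is not the code used to produce the released MTs.

```python
from typing import List, Tuple

Row = Tuple[List[str], Tuple[int, int]]


def split_multiallelic(alleles: List[str], gt: Tuple[int, int]) -> List[Row]:
    """Produce one (alleles, genotype) row per alternate allele for an unphased diploid call."""
    ref, alts = alleles[0], alleles[1:]
    rows = []
    for alt_index, alt in enumerate(alts, start=1):
        # Alleles other than the one kept in this row are recoded as reference (0).
        recoded = tuple(sorted(1 if a == alt_index else 0 for a in gt))
        rows.append(([ref, alt], recoded))
    return rows


# Hypothetical site C -> T,G where the sample carries one copy of each alternate allele.
for row_alleles, row_gt in split_multiallelic(["C", "T", "G"], (1, 2)):
    print(row_alleles, row_gt)
```

In this example the single multiallelic row with genotype 1/2 becomes two rows, each with a heterozygous 0/1 call for its own alternate allele.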

Please see the Genomics FAQ: Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK for a thorough description of these Hail MTs. The complete srWGS SNP and Indel callset across all sites is released as a VDS, which is a Hail sparse data format. We provide a tutorial notebook for converting VDS to a Hail MT format, though we recommend that you stick with the premade Hail MT, if possible, to save time and money.

lrWGS SNP & Indel Joint Callset Hail MT

The lrWGS SNP and Indel variants are released as a joint callset in Hail MT format. The Hail MT was created by joining the single sample VCFs with GLNexus (version 1.4.1). The joint VCF was converted to Hail MT in Hail. This callset is also released in VCF format. Please see the QC report for more information.

Variant Call Format (VCF)

The Variant Call Format (VCF) is a text file that stores genomic variant data in a tabular form (genomic position by sample ID) with a descriptive header.  All of Us VCF files are based on VCF version 4.2 specification.  Most genomic tools that handle variant data analyses  (e.g. Hail, PLINK, Variant Effect Predictor (VEP), Genome Analysis Toolkit (GATK)) support VCF format. 

The information within a VCF is broken into two basic categories: site-level (INFO field) and per genotype (FORMAT field).  Additionally, each site has filtering information in the FILTER field. If the FILTER field is empty (“.”) or “PASS,” then researchers should assume that there is a call at this site (though see Genotype Filter for srWGS data, below).

We release data in VCF format for array data, srWGS SNP & Indel data over limited genomic regions, srWGS SV data, and lrWGS data. Please note that individual VCFs may contain substantially different information. For example, a VCF for srWGS SNP & Indel data will typically have different fields than what would be found in a VCF for SV data. The header in a VCF file explains which fields are present, along with their data types (e.g. number vs. string) and descriptions.
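The split between site-level (INFO), per-genotype (FORMAT), and FILTER information can be seen by parsing a single VCF data line by hand. The sketch below uses only the Python standard library; the record and sample names are invented for illustration, and real analyses would normally use an established VCF tool instead.

```python
from typing import List


def parse_vcf_record(line: str, samples: List[str]) -> dict:
    """Split one tab-delimited VCF data line into site-level and per-sample parts."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, filt, info, fmt = cols[:9]

    # INFO: semicolon-separated KEY=VALUE pairs (or bare flags) describing the site.
    info_fields = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_fields[key] = value if sep else True

    # FORMAT: colon-separated keys, matched against each sample column.
    keys = fmt.split(":")
    genotypes = {
        name: dict(zip(keys, col.split(":")))
        for name, col in zip(samples, cols[9:])
    }
    return {
        "locus": f"{chrom}:{pos}",
        "alleles": [ref] + alt.split(","),
        "site_passes": filt in (".", "PASS"),
        "info": info_fields,
        "genotypes": genotypes,
    }


# Invented example record with two samples.
line = "chr1\t12807\t.\tC\tT\t.\tPASS\tAC=1;AN=4;AF=0.25\tGT:GQ\t0/1:63\t0/0:40"
record = parse_vcf_record(line, ["sample_1", "sample_2"])
print(record["site_passes"], record["info"], record["genotypes"]["sample_1"])
```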

Array VCFs

The current array data includes 312,945 single sample VCFs for 312,945 participants. The array VCFs are sorted and block compressed (vcf.gz) with local tabix index files (vcf.gz.tbi).

Array VCFs in the All of Us genomic dataset will contain the following:

Header

The header field of the VCF contains many attributes which generally describe the processing of the sample in the array.  Many of these are specific to a single sample.

  • arrayType - This contains the name of the genotyping array that was processed.
  • autocallDate - The date that the genotyping array was processed by ‘autocall’ (aka gencall), the Illumina genotype calling software.
  • autocallGender - The gender (sex) that autocall determined for the sample processed.
  • autocallVersion - The version of the autocall/gencall software used.
  • chipWellBarcode - The chip well barcode (a unique identifier for sample as processed on a specific location on the Illumina genotyping array).
  • clusterFile - The cluster file used.
  • extendedIlluminaManifestVersion - The version of the ‘extended Illumina manifest’ used by the VCF generation software.
  • extendedManifestFile - The filename of the ‘extended Illumina manifest’ used by the VCF generation software.
  • fingerprintGender - The gender (sex) determined using an orthogonal fingerprinting technology.  This is populated by an optional parameter used by the VCF generation software.
  • gtcCallRate - The gtc call rate of the sample processed.  This value is generated by the autocall/gencall software and represents the fraction of callable loci that had valid calls.
  • imagingDate - The date that the IDAT files (raw image scans) for the chip well barcode were created.
  • manifestFile - The name of the Illumina manifest (.bpm) file used by the VCF generation software.
  • sampleAlias - The name of the sample.

Note that there are many other attributes in the header (Biotin*, DNP*, Extension*, Hyb*, NP*, NSB*, Restore, String*, TargetRemoval) that are populated with Illumina control values.  They are not described here.

Filtered Sites (FILTER)

There are several filters specific to genotyping array content.  These are:

  • DUPE - This filter is applied if there are multiple rows in the VCF for the same loci and alleles.  That is, if there are two or more rows that share the same chromosome, position, ref allele and alternate alleles, all but one of them will have the ‘DUPE’ filter set.
  • TRIALLELIC - This filter is applied if there is a site at which there are two alternate alleles and neither of them is the same as the reference allele.
  • ZEROED_OUT_ASSAY - This filter is applied if the variant at the site was ‘zeroed out’ in the Illumina cluster file - this is typically done when the calls at the site are intentionally marked as unusual.  Genotypes called at sites that are ‘zeroed out’ will always be no-calls.

Genotype (sample level fields)

These fields describe attributes specific to the sample genotyped on the array.  The FORMAT specifier in the VCF header describes these fields.  They are:

  • GT - GENOTYPE.  This field describes the genotype.  It is a standard field, described in the VCF specification.
  • IGC - Illumina GenCall Confidence Score.  A measure of the call confidence.
  • X - Raw X intensity as scanned from the original genotyping array
  • Y - Raw Y intensity as scanned from the original genotyping array
  • NORMX - Normalized X intensity
  • NORMY - Normalized Y intensity
  • R - Normalized R Value (one of the polar coordinates after the transformation of NORMX and NORMY)
  • THETA - Normalized Theta value (one of the polar coordinates after the transformation of NORMX and NORMY)
  • LRR - Log R Ratio
  • BAF - B Allele Frequency

INFO (site level fields)

These fields describe attributes specific to the probe on an array.  The INFO specifier in the VCF header describes these fields.  They are:

  • AC - Allele Count in genotypes, for each ALT allele.  A standard field, described in the VCF specification
  • AF - Allele Frequency.  A standard field, described in the VCF specification
  • AN - Allele Number.  A standard field, described in the VCF specification
  • ALLELE_A - The A Allele, as annotated in the Illumina manifest (a * suffix indicates this is the reference allele)
  • ALLELE_B - The B Allele, as annotated in the Illumina manifest (a * suffix indicates this is the reference allele)
  • BEADSET_ID - The BeadSet ID.  An Illumina identifier.  Used for normalization.
  • GC_SCORE - The Illumina GenTrain Score.  A quality score describing the probe design
  • ILLUMINA_BUILD - The Genome Build for the design probe sequence, as annotated in the Illumina manifest
  • ILLUMINA_CHR - The chromosome of the design probe sequence, as annotated in the Illumina manifest.
  • ILLUMINA_POS - The position of the design probe sequence (on ILLUMINA_CHR), as annotated in the Illumina manifest.
  • ILLUMINA_STRAND - The strand for the design probe sequence, as annotated in the Illumina manifest.
  • PROBE_A - The allele A probe sequence as annotated in the Illumina manifest.
  • PROBE_B - The allele B probe sequence as annotated in the Illumina manifest.  Note that this is only present on strand ambiguous SNPs.
  • SOURCE - The probe source as annotated in the Illumina manifest.
  • refSNP - The dbSNP rsId for this probe

srWGS SNP & Indel VCFs

We have released srWGS SNP and Indel variants in VCF formats covering three reduced genomic regions: ACAF threshold, exome, and ClinVar. The ACAF threshold callset contains variants that have a population-specific allele frequency (AF) greater than 1% or a population-specific allele count (AC) greater than 100 in any computed ancestry subpopulations. The exome callset contains variants that are within the exon regions of the Gencode v42 basic transcripts, with padding of 15 bases on either side of each exon. The ClinVar callset contains variants in ClinVar, regardless of pathogenicity. 
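The ACAF threshold rule can be written as a one-line test over per-population statistics. In this sketch the population labels and the input layout (a mapping from population to an AC and AF pair) are hypothetical; only the two cutoffs come from the description above.

```python
from typing import Dict, Tuple


def passes_acaf(pop_stats: Dict[str, Tuple[int, float]]) -> bool:
    """True if any population has AF greater than 1% or AC greater than 100."""
    return any(af > 0.01 or ac > 100 for ac, af in pop_stats.values())


# Hypothetical per-population (AC, AF) values for three variants.
rare_everywhere = {"pop_a": (3, 0.0002), "pop_b": (1, 0.0001)}
high_count = {"pop_a": (3, 0.0002), "pop_b": (150, 0.004)}
high_frequency = {"pop_a": (40, 0.02), "pop_b": (0, 0.0)}

print(passes_acaf(rare_everywhere), passes_acaf(high_count), passes_acaf(high_frequency))
```

A variant qualifies as soon as a single population meets either condition, so a variant that is rare overall can still be included.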

Please see the Genomics FAQ: Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK for a thorough description of these VCFs. The complete srWGS SNP and Indel callset across all sites is released as a VDS, which is a Hail sparse data format. We provide a tutorial notebook for converting VDS to VCF format, though we recommend that you stick with the premade smaller callsets if possible to save time and money.

The srWGS limited callset VCFs are sorted and block compressed in bgz format (.vcf.bgz) with a local tabix index (.vcf.bgz.tbi). Each VCF is split into multiple non-overlapping sections of the genome by chromosome in separate files for usability (sharding).

There are some differences between these srWGS SNP & Indel VCFs and the previously released VCFs, which result from these VCFs being generated from the VDS. See these known issues in the QC Report:

  • Known Issue #5: QUAL information has been removed for srWGS SNP & Indel variants
  • Known Issue #6: srWGS callset using new convention for genotype filtering flag
  • For the srWGS SNP and Indel VCFs, the filtering information is contained both in the FILTER column and the filter (FT) tag. The FT tag is at the genotype level and does not give the filtering reason; it contains only PASS or FAIL.

srWGS SNP & Indel VCFs in the All of Us genomic dataset contain the following:

FORMAT fields (per sample-site):  

  • Genotype (GT) -- The GT field specifies the alleles carried by the sample, encoded by a 0 for the reference (REF) allele, 1 for the first alternative (ALT) allele, 2 for the second ALT allele, etc. Since humans are diploid organisms, we expect two alleles (e.g. “0/1”).  Please note that the GT calls on sex chromosomes will have two alleles, even in the case of chrY and chrX in males.
  • Allelic Depth (AD) -- Allelic depths for the reference allele and the alternate allele(s) present at this site. For more information about AD and which reads are counted, see this article on Allele Depth.
  • Genotype Quality (GQ) -- The phred-scaled confidence that the called genotype is correct.  A higher score indicates a higher confidence.  For more information on GQ, please see the GQ documentation.  For more information on interpreting phred-scaled values, please see Phred-scaled quality scores.
  • Reference Genotype Quality (RGQ) -- The phred-scaled confidence that the reference genotypes are correct.  A higher score indicates a higher confidence.  For more information on RGQ, please see the GQ documentation, but note that RGQ applies to the reference, not the variant.  For more information on interpreting phred-scaled values, please see Phred-scaled quality scores.
  • Genotype Filter (FT) --  The srWGS SNP & Indel genotype-level filtering information. The value will be a PASS, FAIL, or NA, depending on the filtering status of the genotype.  As part of our joint callset quality control processing, we run Allele-Specific Variant Quality Score Recalibration, and use the results to populate the genotype filter (FT) field.  An example code snippet for filtering genotypes, in Hail, can be found in the Manipulating Hail Matrix Table tutorial notebook.
    • The cutoff for VQSLOD INDEL filtering is 0.990. 
    • The cutoff for VQSLOD SNP filtering is 0.997.
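Since GQ and RGQ are phred-scaled, it can help to convert them back to probabilities. The standard phred relationship is P = 10^(-Q/10), where P is the probability that the call is wrong; the helper name below is just for this sketch.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a phred-scaled quality Q into the probability that the call is wrong."""
    return 10 ** (-q / 10)


for gq in (10, 20, 40, 63):
    p_wrong = phred_to_error_probability(gq)
    print(f"GQ={gq}: error probability {p_wrong:.2e}, confidence {1 - p_wrong:.6f}")
```

For example, GQ=20 corresponds to a 1 in 100 chance that the genotype is wrong, and every additional 10 points reduces that chance tenfold.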

INFO fields (per site):

Descriptions of the INFO fields can also be found in the header of the VCF.

  • Allele Count (AC) -- the number of times we see each alternate allele for all samples.  For example, a “1/1” genotype would count as 2 observations of the first alternate allele.
  • Allele Number (AN) -- the total number of alleles seen.  Usually, this will be the number of samples times two, since humans are diploid organisms. No-call genotypes (“./.”) are not counted towards AN.
  • Allele Frequency (AF) -- the frequency of the alternate allele in the population that is the callset cohort. This is equivalent to AC/AN.
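The definitions of AC, AN, and AF can be checked with a small worked example. The sketch below computes them from a list of diploid genotypes in plain Python, with None standing in for a no-call allele; the function name and data layout are chosen for this illustration only.

```python
from typing import List, Optional, Tuple

Genotype = Tuple[Optional[int], Optional[int]]


def allele_stats(genotypes: List[Genotype], n_alt: int):
    """Return (AC per alternate allele, AN, AF per alternate allele)."""
    ac = [0] * n_alt
    an = 0
    for gt in genotypes:
        for allele in gt:
            if allele is None:      # no-call alleles do not count towards AN
                continue
            an += 1
            if allele > 0:          # 0 is the reference allele
                ac[allele - 1] += 1
    af = [count / an if an else None for count in ac]
    return ac, an, af


# Four samples at a biallelic site: 0/1, 1/1, ./., 0/0
genotypes = [(0, 1), (1, 1), (None, None), (0, 0)]
ac, an, af = allele_stats(genotypes, n_alt=1)
print(ac, an, af)
```

The no-call sample contributes nothing, so AN is 6 rather than 8, and the 1/1 genotype contributes two observations to AC.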

 

FILTER values (per site):

  • QUAL score does not meet threshold (LowQual) -- sites with this filter have a posterior probability of being variant that is equal to or below the probability of being variant by chance, represented by the expected heterozygosity for humans (QUALapprox lower than 60 for SNPs; 69 for Indels)
    • QUAL tells you how confident we are that there is some kind of variation at a given site. The variation may be present in one or more samples.
  • No high-quality genotypes (NO_HQ_GENOTYPES) -- sites with this filter do not have any genotypes that are considered high quality (GQ>=20, DP>=10, and AB>=0.2 for heterozygotes)
    • Allele Balance (AB) is calculated for each heterozygous variant as the number of reads supporting the least-represented allele over the total number of read observations.  In other words, min(allele depth)/(total depth) for diploid GTs.
  • Excess Heterozygosity (ExcessHet) -- sites with this filter have more heterozygote genotypes than expected by chance under Hardy-Weinberg equilibrium. ExcessHet is a phred-scaled p-value. We cut off anything more extreme than a z-score of -4.5 (p-value of 3.4e-06), which phred-scaled is 54.69.
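Two of the thresholds above can be reproduced with short calculations. The sketch below checks the high-quality genotype criteria (GQ>=20, DP>=10, and AB>=0.2 for heterozygotes) and recomputes the ExcessHet cutoff from the z-score. It is an illustration in plain Python: the function names are made up, allele balance is computed as min(depth of the two called alleles) over the summed AD, and the p-value is assumed to be the one-sided normal tail probability.

```python
import math
from typing import List, Tuple


def allele_balance(ad: List[int], gt: Tuple[int, int]) -> float:
    """min(depth of the two called alleles) / total depth, using sum(AD) as total depth."""
    total = sum(ad)
    if total == 0:
        return 0.0
    return min(ad[gt[0]], ad[gt[1]]) / total


def is_high_quality(gq: int, dp: int, gt: Tuple[int, int], ad: List[int]) -> bool:
    """Apply GQ>=20 and DP>=10, plus AB>=0.2 when the genotype is heterozygous."""
    if gq < 20 or dp < 10:
        return False
    if gt[0] != gt[1]:
        return allele_balance(ad, gt) >= 0.2
    return True


def lower_tail_p(z: float) -> float:
    """One-sided standard normal tail probability P(Z <= z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def phred(p: float) -> float:
    """Phred-scale a probability: -10 * log10(p)."""
    return -10 * math.log10(p)


print(is_high_quality(gq=63, dp=15, gt=(0, 1), ad=[7, 8]))
print(is_high_quality(gq=63, dp=15, gt=(0, 1), ad=[14, 1]))
print(f"{lower_tail_p(-4.5):.2e}", round(phred(3.4e-06), 2))
```

The second genotype fails because only 1 of 15 reads supports the alternate allele, which gives an allele balance well below 0.2.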

srWGS Structural Variant VCF

The srWGS structural variants are released as a joint called VCF for 97,940 samples, sorted and block compressed (.vcf.gz) with a local tabix index (.vcf.gz.tbi). There is a sites-only VCF as well as full VCFs with sample genotypes, sharded by chromosome. The sites-only VCF has no genotype information but retains all site level information. The GATK-SV team has documented the SV VCF format in an article on the GATK site: How to interpret SV VCFs. The format has many similarities to a short variant VCF, but you will see some differences that are necessary to specify SV variant details. The header describes the data fields in the VCF.

The SV VCF is annotated with the GATK tool SVAnnotate. It adds the gene overlap and the predicted functional consequence. These annotations are added in the INFO field. The annotations produced by SVAnnotate are described in detail in the tool documentation. The GTF used for gene annotations was GENCODE v39. 

  • CHROM: The chromosome location of the start position of the SV
  • POS: The start position of the SV
  • ID: Unique identifier for the SV
  • REF: Not commonly used in structural variant VCFs; usually set to N
  • ALT: Information about the SV type, descriptions of the SV types can be found in the header
  • FILTER: Filtering information for the SV
    • HIGH_NCR: Unacceptably high rate of no-call GTs.
    • MULTIALLELIC: Multiallelic CNV site. This FILTER status does not mean that the site is not real, but it should be treated differently from a biallelic SV site.
    • UNRESOLVED: Variant is unresolved. There was some evidence for an SV at this site but it was not able to be resolved completely from the available evidence.
    • VARIABLE_ACROSS_BATCHES: Site appears at variable frequencies across batches. Likely reflects technical batch effects.
    • PASS: None of the above site-level filters were applied.
  • INFO: Important annotations describing the variant at the site level. The annotations are described in depth in the SV VCF header. Some of these annotations include:
    • END: End position of the structural variant
    • CHR2: Second chromosome for interchromosomal events
    • END2: Position of breakpoint on CHR2
    • ALGORITHMS: The original algorithm that called the SV (GATK-SV is an ensemble method)
    • SVLEN: SV length in base pairs
    • SVTYPE: SV type
    • CPX_TYPE: Subtype of complex rearrangement
    • CPX_INTERVALS: Details of complex rearrangement
  • FORMAT: Annotations describing the variant at the genotype level (site and sample specific annotations). Depends on the SV type and the evidence categories that support the SV. All FORMAT annotations are described in the VCF header.
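As an illustration of how the columns and INFO annotations listed above fit together, the sketch below summarizes one sites-only SV record using only the Python standard library. The record, including its ID and annotation values, is invented for this example; real records carry many more INFO annotations, all described in the VCF header.

```python
def summarize_sv(line: str) -> dict:
    """Extract position, type, length, and filter status from one SV VCF data line."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, sv_id, _ref, alt, _qual, filt, info = cols[:8]

    info_fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_fields[key] = value if sep else True

    return {
        "id": sv_id,
        "chrom": chrom,
        "start": int(pos),
        "end": int(info_fields["END"]),
        "alt": alt,
        "svtype": info_fields.get("SVTYPE"),
        "svlen": int(info_fields["SVLEN"]) if "SVLEN" in info_fields else None,
        # An empty list means no site-level filter was applied.
        "filters": [] if filt in (".", "PASS") else filt.split(";"),
    }


# Invented deletion record for illustration.
line = "chr1\t10000\texample_DEL_1\tN\t<DEL>\t.\tPASS\tEND=15000;SVTYPE=DEL;SVLEN=5000"
sv = summarize_sv(line)
print(sv)
```

Keeping the filter values as a list makes it straightforward to treat MULTIALLELIC sites differently from sites that fail other filters, as recommended above.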

lrWGS VCFs

We provide short variants (SNP and Indels) and SVs for lrWGS samples. lrWGS variants are called against both the grch38_noalt and T2Tv2.0 references, as described in the lrWGS BAMs section. All lrWGS VCFs are sorted and block compressed (.vcf.gz) with a local tabix index (.vcf.gz.tbi).

For each reference sequence, lrWGS SNP and Indel variants are provided in two VCFs: one contains phased variants and the other contains unphased variants. There is also a joint callset for the lrWGS SNP and Indel variants in VCF and Hail MT formats. Please see the header of these VCFs for a description of the VCF fields.

Variants from the tool PAV are provided in VCF format for each lrWGS single sample. PAV derives variants from the haplotype resolved assembly (GFA files) and outputs phased calls. Please see the header of the PAV VCFs for a description of the VCF fields.

lrWGS SVs are provided from both the PBSV and Sniffles2 variant callers. Each lrWGS sample has a single sample VCF from each of the two variant callers. Please see the headers of these VCF files for descriptions of the VCF fields. In addition, we output a Sniffles2 binary snf file for use with Sniffles2. Please see the Sniffles2 documentation for more information about snf files.

Binary GEN format (BGEN)

We have released srWGS SNP and Indel variants in Binary GEN format (BGEN) covering three reduced genomic regions. The three reduced datasets are ACAF threshold, exome, and ClinVar. The ACAF threshold callset contains variants that have a population-specific allele frequency (AF) greater than 1% or a population-specific allele count (AC) greater than 100 in any computed ancestry subpopulations. The exome callset contains variants that are within the exon regions of the Gencode v42 basic transcripts, with padding of 15 bases on either side of each exon. The ClinVar callset contains variants in ClinVar, regardless of pathogenicity. 

Please see the Genomics FAQ: Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK for a thorough description of these BGEN files. The complete srWGS SNP and Indel callset across all sites is released as a VDS, which is a Hail sparse data format. We provide a tutorial notebook for converting VDS to BGEN format, though we recommend that you stick with the premade smaller callsets if possible to save time and money.

There is a known issue in the v7 BGEN files: the ‘rsid’ field is empty with the relevant data located in an ‘alternate_id’ field, which causes issues with PLINK and Regenie. Please see the FAQ in this article for more information about how to solve the issue.

PLINK data

We provide PLINK 1.9 data (.bed / .bim / .fam) for array data and srWGS SNP and Indel variants over limited regions. The PLINK files are converted from the Hail MatrixTable using the export_plink command in Hail and contain all information in the Hail MatrixTable. PLINK file type information can be found at the PLINK site. The .bed file is the PLINK binary biallelic genotype table and contains genotype calls. The .bim file is the PLINK extended .map file, and is a text file containing variant information. The .fam file is a text file with sample information for each participant. Please refer to the published notebooks on how to use the PLINK 1.9 data.
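As a rough illustration of the two text files, the sketch below parses one line of a .bim file and one line of a .fam file using the standard PLINK 1.9 column order (six whitespace-separated columns each). The record names and example lines are hypothetical and do not come from an All of Us file.

```python
from collections import namedtuple

# .bim columns: chromosome, variant ID, position in centimorgans,
# base-pair coordinate, allele 1, allele 2.
BimRecord = namedtuple("BimRecord", "chrom variant_id cm_pos bp_pos allele1 allele2")

# .fam columns: family ID, within-family (sample) ID, father ID, mother ID,
# sex code, phenotype value.
FamRecord = namedtuple("FamRecord", "family_id sample_id father_id mother_id sex phenotype")

def parse_bim_line(line):
    chrom, variant_id, cm_pos, bp_pos, allele1, allele2 = line.split()
    return BimRecord(chrom, variant_id, float(cm_pos), int(bp_pos), allele1, allele2)

def parse_fam_line(line):
    return FamRecord(*line.split())

# Hypothetical example lines.
bim = parse_bim_line("chr1\tchr1:10001:T:C\t0\t10001\tC\tT")
fam = parse_fam_line("0 1000000 0 0 0 NA")
```

The .bed file is binary and is best read with PLINK itself or a dedicated library rather than parsed by hand.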

The three reduced datasets for the srWGS SNP and Indel data are ACAF threshold, exome, and ClinVar. The ACAF threshold callset contains variants that have a population-specific allele frequency (AF) greater than 1% or a population-specific allele count (AC) greater than 100 in any computed ancestry subpopulations. The exome callset contains variants that are within the exon regions of the Gencode v42 basic transcripts, with padding of 15 bases on either side of each exon. The ClinVar callset contains variants in ClinVar, regardless of pathogenicity.

Please see the Genomics FAQ: Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK for a thorough description of these PLINK files. The complete srWGS SNP and Indel callset across all sites is released as a VDS, which is a Hail sparse data format. We provide a tutorial notebook for converting VDS to PLINK format, though we recommend that you stick with the premade smaller callsets if possible to save time and money.

Please note that we will provide pgen files in future callset releases.

IDAT files

We provide IDAT files for all array samples. The IDAT file is a binary file containing raw BeadArray data directly from the scanner. There are two files for each sample, corresponding to the red and green intensity values. These values give information about specific nucleotides on the genome. You can read more about the steps to call variants from these IDAT files in the Genomic Quality Report.

For an in-depth description of these files and how to process them, read more about the illuminaio tool.

CRAM and BAM files

We provide raw data for srWGS and lrWGS data as compressed alignment files. The srWGS data is in CRAM format, otherwise known as compressed SAM (sequence alignment map) format. The lrWGS data is in BAM (binary alignment map) format. 

Both CRAM and BAM files can be uncompressed to a SAM file, which contains records describing the reads, their mapping information, and quality score information. The CRAM, BAM, and SAM file formats are described in this specification doc. Refer to the Genomic Quality Report for more information on how variant calling was performed on these raw data files.
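To make the SAM record layout concrete, here is a minimal sketch that splits one SAM alignment line into its 11 mandatory fields, following the column order in the SAM specification. The function name and the example read are illustrative assumptions; in practice you would typically stream records from a CRAM or BAM with a tool such as samtools rather than parse text by hand.

```python
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")

def parse_sam_record(line):
    """Parse one tab-separated SAM alignment line into a dict."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = columns[11:]                       # optional TAG:TYPE:VALUE fields
    record["is_unmapped"] = bool(record["flag"] & 0x4)  # FLAG bit 0x4 marks an unmapped read
    return record

# Hypothetical read, for illustration only.
sam_record = parse_sam_record("read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0")
```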

The raw data is more expensive to use than the variant data because you must pay egress charges, which are the costs to retrieve the data from the cloud for analysis. We do not charge egress for variant data. Please see the Genomics FAQ for Recommendations for processing CRAMs with GATK on the Researcher Workbench.

srWGS CRAMs

The CRAM files for the srWGS data are mapped to the hg38/GRCh38 reference. There is one file for each srWGS sample and the research ID appears in the file name. All samples with SNP and Indel variant calls and all samples with SV variant calls have CRAM data.

lrWGS BAMs

Each sample in the lrWGS data is aligned to two references, grch38_noalt  and T2Tv2.0. grch38_noalt corresponds to the GRCh38 reference with no alternate sequences and T2Tv2.0 corresponds to the T2T-CHM13v2.0 reference, with the EBV contig added from the grch38_noalt reference. 

For each reference, we also provide a haplotagged BAM file, where each read has additional information to distinguish reads that come from different haplotypes. In total, there are four BAM files for each sample with lrWGS data.  

GFA files

We release three Graphical Fragment Assembly (GFA) files for each lrWGS sample, which are de novo graph based assemblies. One GFA file is the primary assembly for the sample and the other two GFA files are the chromosome copy assemblies. The GFA files describe the graph layouts of the contigs.

The assemblies are generated with hifiasm, a tool for haplotype-resolved de novo assembly. Please check out the GFA specification for more details about the GFA format.
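As a small illustration of what a GFA file contains, the sketch below collects segment (S) and link (L) records, assuming the tab-separated GFA version 1 record layout. The function name and the tiny example graph are hypothetical; other record types that may appear in the files are simply skipped here.

```python
def parse_gfa(lines):
    """Collect segments and links from GFA 1 lines.

    Returns (segments, links), where segments maps a segment name to its
    sequence and links is a list of
    (from_name, from_orient, to_name, to_orient, overlap_cigar) tuples.
    """
    segments, links = {}, []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if cols[0] == "S":
            segments[cols[1]] = cols[2]
        elif cols[0] == "L":
            links.append((cols[1], cols[2], cols[3], cols[4], cols[5]))
    return segments, links

# Hypothetical two-contig graph, for illustration only.
example_gfa = [
    "H\tVN:Z:1.0",
    "S\tctg1\tACGT",
    "S\tctg2\tGTTA",
    "L\tctg1\t+\tctg2\t+\t2M",
]
gfa_segments, gfa_links = parse_gfa(example_gfa)
```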

FASTA files

We provide three de novo assemblies as FASTA files for each lrWGS sample, matching the sequences from the GFA files. A FASTA file is a text file representation of genomic data. Each genomic sequence is described in two lines: a description line that starts with a greater-than (">") symbol, followed by a line containing the nucleotide sequence as a string. This pair of lines is repeated for each sequence in the file.
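A FASTA file of this shape can be read with a few lines of Python. The sketch below is a generic reader (the function name and example records are illustrative assumptions); it also tolerates sequences wrapped over several lines, which is common in FASTA files in general.

```python
def read_fasta(lines):
    """Return {sequence_name: sequence} from the lines of a FASTA file."""
    parts_by_name = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""  # first word of the description line
            parts_by_name[name] = []
        elif name is not None:
            parts_by_name[name].append(line)
    return {key: "".join(parts) for key, parts in parts_by_name.items()}

# Hypothetical records, for illustration only.
example_fasta = [">contig1 len=8", "ACGTACGT", ">contig2", "GGCC", "TTAA"]
fasta_sequences = read_fasta(example_fasta)
```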

 

 

Auxiliary Genomic Files

srWGS variant annotation table (VAT)

The Variant Annotation Table (VAT) is a resource provided for all samples with srWGS SNP & Indel data. The VAT gives functional annotations for all passing variants. In addition, the VAT contains site-level annotations such as allele counts for each alternate allele (AC), the total number of alleles at each site (AN), and the frequency of each alternate allele (AF). This table can be used in addition to the VDS to determine variants of interest to your analysis. We provide the annotations as a single merged tsv file (“.tsv.bgz”), which can be loaded into Hail.

srWGS relatedness kinship scores

We calculate relatedness for all samples with srWGS data and report the kinship score of any pair with a score over 0.1. The kinship score is half of the fraction of the genetic material shared: parent-child pairs and full siblings are expected to have a score of 0.25, while identical twins have a score of 0.5. Please see the Hail pc_relate function documentation for more information, including interpretation.

We provide the kinship scores for pairwise samples with kinship scores above 0.1. We do not provide identity kinship scores (i.e. kinship of a sample with itself). Each pair will only appear once (in other words, {sample1, sample2, 0.25} is equivalent to {sample2, sample1, 0.25}).

 

Table 9. srWGS pairwise samples with a kinship score over 0.1 TSV file description

Field name Type Key? Notes
i.s string yes Sample ID of a sample in the pair
j.s string yes Sample ID of the other sample in the pair
kin float no Kinship score (0-0.5)

Column Explanations:

  • Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
  • Type -- Data type. Arrays are possible.
  • Key? -- Whether this field makes up a unique key for the row.  Note that all key fields together make a unique key for the row.
  • Notes -- Any other relevant information.
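Because each pair appears only once and in arbitrary order, a lookup keyed on an unordered pair is convenient. The following is a minimal sketch using the column names from Table 9; the function names and example rows are illustrative assumptions.

```python
import csv
import io

def load_kinship_pairs(handle):
    """Read the pairs TSV into {frozenset({id_1, id_2}): kinship_score}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {frozenset((row["i.s"], row["j.s"])): float(row["kin"]) for row in reader}

def kinship(pairs, sample_a, sample_b):
    """Look up a pair in either order; None means the pair is not listed."""
    return pairs.get(frozenset((sample_a, sample_b)))

# Made-up rows for illustration; real values come from the released TSV.
example_kinship_tsv = (
    "i.s\tj.s\tkin\n"
    "1000001\t1000002\t0.25\n"
    "1000003\t1000004\t0.49\n"
)
kinship_pairs = load_kinship_pairs(io.StringIO(example_kinship_tsv))
```

A pair that is absent from the file was not reported, which per the description above means its kinship score was not over 0.1.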

srWGS SNP & Indel maximal set of unrelated samples

We provide a list of samples to prune in order to remove related samples from the srWGS SNP & Indel cohort. Relatedness is calculated as described in the kinship score description above. This is the minimal list of related samples to prune in order to produce the maximal independent set of unrelated samples.

 

Table 10. List of srWGS SNP & Indel related samples to prune TSV file description

Field name Type Key? Notes
sample_id.s string Yes Research ID of the sample

Column Explanations:

  • Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
  • Type -- Data type. Arrays are possible.
  • Key? -- Whether this field makes up a unique key for the row.  Note that all key fields together make a unique key for the row.
  • Notes -- Any other relevant information.
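Applying the pruning list amounts to a set difference. Here is a minimal sketch that assumes the single column name shown in Table 10; the function names and sample IDs are hypothetical.

```python
import csv
import io

def load_samples_to_prune(handle):
    """Read the related-samples TSV into a set of research IDs."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample_id.s"] for row in reader}

def unrelated_subset(sample_ids, to_prune):
    """Keep the samples that are not on the pruning list, preserving order."""
    return [sample for sample in sample_ids if sample not in to_prune]

# Made-up IDs for illustration.
example_prune_tsv = "sample_id.s\n1000002\n1000004\n"
to_prune = load_samples_to_prune(io.StringIO(example_prune_tsv))
kept_samples = unrelated_subset(["1000001", "1000002", "1000003", "1000004"], to_prune)
```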

srWGS SV maximal set of unrelated samples

We provide a list of samples to prune in order to remove related samples from the srWGS SV cohort. Relatedness is calculated as described in the kinship score description above. This will be the minimal list of related samples to prune in order to produce the maximal independent set of unrelated samples.

The samples are reported in a txt file as a list of research IDs. One research ID is listed per line and there is no header in the file.

srWGS genetic predicted ancestry

We provide genetic ancestry groupings for all samples with srWGS data as a .tsv file, sorted by research ID. The ancestry categories correspond directly to the categorical ancestry definitions used within gnomAD, the Human Genome Diversity Project, and 1000 Genomes:

African/African American (afr), American Admixed/Latino (amr), East Asian (eas), European (eur), Middle Eastern (mid), South Asian (sas), and Other (oth; not belonging to one of the other ancestries or is a balanced admixture).

 

Table 11. srWGS genetic predicted ancestry TSV file description

Field Name Key? Type Nullable? Example Value Notes
research_id yes String No 1000055 This comes from sample metadata. 
ancestry_pred no String No mid The predicted ancestry for the sample, not including “other.”
probabilities no Array[number] No [0.10, 0.99, 0.001, … 0.0] Confidence of each output class (i.e. computed ancestry). Each will have a length equal to the number of possible computed ancestry labels minus one (6). The ancestry “Other” is computed separately based on the confidence of the other classes.
pca_features no Array[number] No [8.1232, 0.01234, 3.1123, …, 0.00132] The principal components of the projection for the sample.  Each value is an array with a length of 16.
ancestry_pred_other no String No oth The predicted ancestry for the sample, including “other.”

Column Explanations:

  • Field name -- The name of the field.  In tsv files, this will appear on the first row of the file.
  • Type -- Data type.  Arrays are possible.
  • Key? -- Whether this field makes up a unique key for the row.  Note that all key fields together make a unique key for the row.
  • Notes -- Any other relevant information.
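The sketch below reads rows shaped like Table 11 and tallies samples per ancestry group. It assumes the two array columns are serialized as JSON-style lists (as in the example values) and checks the array lengths described in the table; the function names and example rows are hypothetical.

```python
import csv
import io
import json
from collections import Counter

def parse_ancestry_row(row):
    """Convert one TSV row (a dict of strings) into typed values."""
    probabilities = json.loads(row["probabilities"])
    pca_features = json.loads(row["pca_features"])
    if len(probabilities) != 6 or len(pca_features) != 16:
        raise ValueError("unexpected array length for " + row["research_id"])
    return {
        "research_id": row["research_id"],
        "ancestry_pred": row["ancestry_pred"],
        "ancestry_pred_other": row["ancestry_pred_other"],
        "probabilities": probabilities,
        "pca_features": pca_features,
    }

def count_ancestry_groups(handle):
    """Tally samples by the prediction that includes the "oth" category."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(parse_ancestry_row(row)["ancestry_pred_other"] for row in reader)

# Made-up rows for illustration; real values come from the released TSV.
_probs = json.dumps([0.05, 0.05, 0.1, 0.1, 0.6, 0.1])
_pcs = json.dumps([0.0] * 16)
example_ancestry_tsv = (
    "research_id\tancestry_pred\tprobabilities\tpca_features\tancestry_pred_other\n"
    f"1000055\tmid\t{_probs}\t{_pcs}\tmid\n"
    f"1000056\teur\t{_probs}\t{_pcs}\toth\n"
)
ancestry_counts = count_ancestry_groups(io.StringIO(example_ancestry_tsv))
```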

srWGS SNP & Indel smaller callset BED files

We provide the genomic territory, otherwise known as interval files, used to create the srWGS SNP and Indel smaller callsets as UCSC BED files. Please see the Genomics FAQ: Smaller callsets for analyzing srWGS SNP & Indel data with Hail MT, VCF, and PLINK for a thorough description of these smaller callsets. The BED files contain the genomic regions for the exome, ACAF threshold, and ClinVar callsets. The complete srWGS SNP and Indel callset across all sites is released as a VDS and these reduced callsets are subsetted using the BED files provided. 

The ACAF threshold BED file contains sites that have a population-specific allele frequency (AF) greater than 1% or a population-specific allele count (AC) greater than 100 in any computed ancestry subpopulations. The exome BED file contains sites that are within the exon regions of the Gencode v42 basic transcripts, with padding of 15 bases on either side of each exon. The ClinVar BED file contains variants in ClinVar, regardless of pathogenicity.
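To illustrate the exon padding and how BED intervals relate to variant positions, here is a minimal sketch. It pads each exon by 15 bases, merges overlapping intervals, and tests whether a 1-based VCF position falls inside a 0-based, half-open BED interval. The function names and coordinates are hypothetical, and the released BED files were not necessarily built with this code.

```python
def pad_and_merge(exons, padding=15):
    """Pad (chrom, start, end) BED intervals on both sides and merge overlaps."""
    padded = sorted(
        (chrom, max(0, start - padding), end + padding) for chrom, start, end in exons
    )
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged

def contains(intervals, chrom, pos_1based):
    """BED start is 0-based and end is exclusive, so test start < pos <= end."""
    return any(c == chrom and s < pos_1based <= e for c, s, e in intervals)

# Made-up exons for illustration.
padded_intervals = pad_and_merge([("chr1", 100, 200), ("chr1", 220, 300), ("chr2", 5, 50)])
```

In this example the two chr1 exons merge once padded, giving `("chr1", 85, 315)`, and the chr2 interval is clipped at 0.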

Flagged srWGS samples

We provide a table listing samples that are flagged as part of the sample outlier QC for the srWGS SNP and Indel joint callset. This includes the specific residual tests that were failed. The schema is described in the table below. The table will be released as a tsv.

Flagged sample tsv schema

  • No fields can have a null value.
  • Count fields do not include filtered variants.
  • For all of the fail_* fields, a value of true indicates that the sample is an outlier and should be flagged.

 

Table 12. Flagged srWGS samples TSV file description

Field Name Type Key? Example Value Notes
s int yes 1000000 Research ID
ancestry_pred string no eur The predicted ancestry for the sample, not including “other.”
probabilities array<float> no [0.10, 0.99, 0.001, … 0.0] Confidence of each output class (i.e. computed ancestry). Each will have a length equal to the number of possible computed ancestry labels minus one (6). The ancestry “Other” is computed separately based on the confidence of the other classes.
pca_features array<float> no [8.1232, 0.01234, 3.1123, …, 0.00132] Each will have a length of 16.
ancestry_pred_other string no oth The predicted ancestry for the sample, including “other.”
snp_count int no 3910035 Number of SNPs called in this sample.
ins_del_ratio float no 0.98814 Ratio of insertion to deletion counts.
del_count int no 427102  
ins_count float no 456515  
snp_het_homvar_ratio float no 2.1119  
indel_het_homvar_ratio float no 2.3994  
ti_tv_ratio float no 1.9967  
singleton int no 15819 IMPORTANT: This is not the number of singletons in a sample. This field is a count of the number of variants not appearing in gnomAD 3.1.
fail_snp_count_residual boolean no true  
fail_ins_del_ratio_residual boolean no false  
fail_del_count_residual boolean no true  
fail_ins_count_residual boolean no false  
fail_snp_het_homvar_ratio_residual boolean no true  
fail_indel_het_homvar_ratio_residual boolean no false  
fail_ti_tv_ratio_residual boolean no true  
fail_singleton_residual boolean no false  
qc_metrics_filters array<string> no ["indel_het_homvar_ratio_residual", "snp_count_residual"] A list of each failed test. These will correspond to all fail_* fields with a value of “true.”
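The relationship between the fail_* fields and qc_metrics_filters can be sketched as follows. This assumes booleans are written as the strings "true" and "false" in the tsv, which is an assumption about serialization rather than something stated in the schema; the function name and example row are hypothetical.

```python
def failed_tests(row):
    """Derive the failed test names from the fail_* fields of one row."""
    failed = []
    for field, value in row.items():
        if field.startswith("fail_") and str(value).lower() == "true":
            failed.append(field[len("fail_"):])  # drop the "fail_" prefix
    return sorted(failed)

# Made-up row with only a few of the fields, for illustration.
example_flagged_row = {
    "s": "1000000",
    "fail_snp_count_residual": "true",
    "fail_ins_del_ratio_residual": "false",
    "fail_ti_tv_ratio_residual": "true",
}
example_failed = failed_tests(example_flagged_row)
```

Comparing this derived list with the qc_metrics_filters value for the same row is a quick consistency check after loading the table.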

srWGS genomic metrics

We provide a table with supplemental genomic QC metrics for each srWGS sample. The schema is described in the table below. The table will be released as a tsv. 

Genomic metrics tsv schema

  • No fields can have a null value.
  • No samples will be in the table if they do not pass the QC thresholds.

 

Table 13. Supplemental genomic metrics for each srWGS sample TSV file description

Field Name Type Key? Example Value Notes
research_id int yes 1000000 Unique identifier for each participant
sample_source string no Whole Blood Sample source (blood or saliva)
site_id string no bi The genome center (GC) where the sample was sequenced.  This will be one of three values (bi = "Broad Institute", uw = "University of Washington", or bcm = "Baylor College of Medicine")
sex_at_birth string no Female Participant provided information for sex at birth
dragen_sex_ploidy string no XX Ploidy output from DRAGEN
mean_coverage float no 107.69 Mean number of overlapping reads at every targeted base of the genome (threshold ≥30x)
genome_coverage float no 97.61 Percent of bases with at least 20x coverage (threshold ≥90% at 20x)
aou_hdr_coverage float no 100 Percent of bases in the All of Us Hereditary Disease Risk gene (AoUHDR) with at least 20x coverage (threshold ≥95% at 20x)
dragen_contamination float no 0.003 Cross-individual contamination rate from DRAGEN
aligned_q30_bases float no 174329894399 Aligned Q30 bases from DRAGEN (threshold ≥8e10)
verify_bam_id2_contamination float no 0.0000104116 Cross-individual contamination rate from VerifyBamID2
biosample_collection_date string   2/11/2020 Dates that biosamples were collected

Column Explanations:

  • Field name -- The name of the field.  In tsv files, this will appear on the first row of the file.
  • Type -- Data type. 
  • Key? -- Whether this field makes up a unique key for the row.  Note that all key fields together make a unique key for the row.
  • Notes -- Any other relevant information.
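The thresholds listed in the Notes column of Table 13 can be encoded as a simple check. Since samples that do not pass QC are not in the table, this sketch is mainly useful as a sanity check after loading the file; the dictionary and function names are illustrative assumptions.

```python
# Minimum values taken from the Notes column of Table 13.
QC_MINIMUMS = {
    "mean_coverage": 30.0,       # mean coverage, threshold >= 30x
    "genome_coverage": 90.0,     # percent of bases at >= 20x, threshold >= 90%
    "aou_hdr_coverage": 95.0,    # percent of AoUHDR bases at >= 20x, threshold >= 95%
    "aligned_q30_bases": 8e10,   # aligned Q30 bases, threshold >= 8e10
}

def failed_thresholds(row):
    """Return the names of metrics that fall below their minimum."""
    return [name for name, minimum in QC_MINIMUMS.items() if float(row[name]) < minimum]

# Row built from the example values in Table 13.
example_metrics_row = {
    "research_id": "1000000",
    "mean_coverage": "107.69",
    "genome_coverage": "97.61",
    "aou_hdr_coverage": "100",
    "aligned_q30_bases": "174329894399",
}
example_metric_failures = failed_thresholds(example_metrics_row)
```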

srWGS SV samples with probable aneuploidies

We provide lists of samples with probable aneuploidies identified during srWGS SV ploidy estimation as tsv files. Ploidy estimation was performed across the srWGS SV samples using coverage estimations over binned regions of the genome as part of the GATK-SV pipeline. Details of this ploidy estimation process are described in the All of Us QC report for the v7 off-cycle data release. 

There are three separate files for samples with probable aneuploidies: mosaic autosomal aneuploidy, mosaic allosomal aneuploidy, and germline allosomal aneuploidy.

srWGS SV samples with probable mosaic aneuploidies

We provide two files describing samples with probable mosaic aneuploidies. One is samples with mosaic autosomal aneuploidy and the second is samples with mosaic allosomal aneuploidy. Both files have the same format, described in Table 14. Note that fewer than 20 samples had more than one probable mosaic autosomal aneuploidy, so these samples appear once per affected chromosome.   

 

Table 14. srWGS SV samples with probable mosaic aneuploidies TSV file description

Field name Type Key? Notes
research_id string no Research ID of the sample
chromosome string no Chromosome for which the sample is predicted to have a mosaic aneuploidy, e.g. chr8 or chrX
estimated_copy_ratio float no Estimated copy ratio (see QC report Ploidy Estimation) for the chromosome with the probable mosaic aneuploidy 
aneuploidy_type string no Type of aneuploidy predicted. For the probable mosaic aneuploidies, the possible values are MOSAIC_GAIN or MOSAIC_LOSS

Column Explanations:

  • Field name -- The name of the field.  In tsv files, this will appear on the first row of the file.
  • Type -- Data type. 
  • Key? -- Whether this field makes up a unique key for the row.  Note that all key fields together make a unique key for the row.
  • Notes -- Any other relevant information.

 

srWGS SV samples with probable germline allosomal aneuploidy

We provide one file describing samples with probable germline allosomal aneuploidies, described in Table 15.

 

Table 15. srWGS SV samples with probable germline allosomal aneuploidy TSV file description

Field name Type Key? Notes
research_id string yes Research ID of the sample
copy_number_chrX integer no Estimated copy number for chrX, rounded to the nearest integer
copy_number_chrY integer no Estimated copy number for chrY, rounded to the nearest integer
aneuploidy_type string no Type of aneuploidy predicted. For the probable germline allosomal aneuploidies, the possible values are JACOBS, KLINEFELTER, and TRIPLE X (contains a space)

Column Explanations:

  • Field name -- The name of the field.  In tsv files, this will appear on the first row of the file.
  • Type -- Data type. 
  • Key? -- Whether this field makes up a unique key for the row.  Note that all key fields together make a unique key for the row.
  • Notes -- Any other relevant information.
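As a rough guide to reading Table 15, the three aneuploidy_type labels correspond to well-known sex chromosome karyotypes: Jacobs syndrome is typically XYY, Klinefelter syndrome is typically XXY, and triple X is XXX. The sketch below maps the two copy-number columns onto those typical karyotypes. It is an assumption that rows in the file follow exactly these combinations; the QC report describes the actual classification.

```python
# Typical (chrX, chrY) copy numbers for each label; an assumption for
# illustration, not the pipeline's classification rule.
TYPICAL_COPY_NUMBERS = {
    "JACOBS": (1, 2),       # XYY
    "KLINEFELTER": (2, 1),  # XXY
    "TRIPLE X": (3, 0),     # XXX
}

def karyotype_string(copy_number_chrx, copy_number_chry):
    """Render integer copy numbers as a string such as "XXY"."""
    return "X" * copy_number_chrx + "Y" * copy_number_chry

def matches_typical_karyotype(row):
    """Check whether a row's copy numbers match the typical pattern for its label."""
    observed = (int(row["copy_number_chrX"]), int(row["copy_number_chrY"]))
    return TYPICAL_COPY_NUMBERS.get(row["aneuploidy_type"]) == observed

# Made-up row for illustration.
example_aneuploidy_row = {
    "research_id": "1000000",
    "copy_number_chrX": "2",
    "copy_number_chrY": "1",
    "aneuploidy_type": "KLINEFELTER",
}
```

Note that "TRIPLE X" contains a space, so split the file on tabs rather than on arbitrary whitespace.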

lrWGS variant metrics

We provide two lrWGS variant metrics files, corresponding to each lrWGS reference, described in Table 16.

 

Table 16. lrWGS variant metrics file description

Field name Type Key? Notes
research_id string yes Research ID of the sample
mosdepth_cov float no Coverage from the mosdepth tool (See the QC report for a description)
aligned_frac_bases float no Fraction of bases aligned to the reference
aligned_num_bases float no Number of bases aligned to the reference
aligned_num_reads float no Number of reads aligned to the reference
aligned_read_length_N50 float no N50 of the aligned reads
aligned_read_length_median float no Median length of the aligned reads
aligned_read_length_mean float no Mean length of the aligned reads
aligned_read_length_stdev float no Standard deviation of the aligned read length
average_identity float no Mean percentage of matches to the reference per aligned read 
median_identity float no Median percentage of matches to the reference per aligned read
dvp_ft_pass_snp_cnt float no Number of PASS SNPs after filtering
pbsv_nonBND_50bpSV_cnt float no Number of SVs >= 50 bp called by PBSV (excluding break-end calls)
snf2_nonBND_50bpSV_cnt float no Number of SVs >= 50 bp called by Sniffles2 (excluding break-end calls)

Column Explanations:

  • Field name -- The name of the field.  In tsv files, this will appear on the first row of the file.
  • Type -- Data type. 
  • Key? -- Whether this field makes up a unique key for the row.  Note that all key fields together make a unique key for the row.
  • Notes -- Any other relevant information.
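For readers unfamiliar with the N50 statistic used in aligned_read_length_N50, the sketch below computes it under the usual definition: the largest length L such that reads of length L or longer contain at least half of all bases. The function name is hypothetical, and the QC report remains the reference for how the released column was computed.

```python
def read_length_n50(lengths):
    """Return the N50 of a collection of read lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached at least half of all bases
            return length
    raise ValueError("lengths must be non-empty")

# Made-up read lengths: total is 20 bases, and the two longest reads
# (6 + 5 = 11) already cover at least half, so the N50 is 5.
example_n50 = read_length_n50([2, 3, 4, 5, 6])
```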

 

 

Frequently Asked Questions (FAQs) Regarding the Genomic Data Organization

1. Where can I find the research ID in the CRAM and IDAT files?

The research ID is in the file names of the CRAM and IDAT files. To correlate research IDs between the variant files and the raw data files, use the research IDs in the file name of the raw data files (CRAM and IDATs). 

2. Where is the rsID stored for each variant?

The rsID for each variant is stored in the Variant Annotation Table (VAT). An rsID is a dbSNP identifier for a variant, not a gene name. If you have an rsID of interest, you can use the VAT to determine the genomic coordinates of the variant for analysis in the Hail MT, VCFs, or PLINK formats.
