Widespread benchmarking of structural variants (SVs) from short-read whole genome sequencing (srWGS) remains challenging, often because orthogonal data for comparison are lacking. The All of Us cohort of samples with srWGS SV data is unusual in the availability of matched genomic datasets (i.e., SNP arrays, srWGS SNPs and Indels, and long-read genome sequencing [lrWGS]). There also exist a number of intrinsic measures that can be used to assess the technical quality of a dataset (e.g., Hardy-Weinberg equilibrium). Combining these methods gives us an opportunity for a high-quality assessment of SVs generated via srWGS. Overall, we assess 7 measures of technical quality for the GATK-SV All of Us dataset, described in the main QC report and this supplemental document.
In the Structural Variant QC Results section of the Genomic Research Data Quality Report, we describe:
- Variant counts (cohort-wide and per-sample) relative to gnomAD V2 and the most recent 1000 Genomes Project high-coverage srWGS callset
- Size distribution of SVs
- Hardy-Weinberg equilibrium
In this benchmarking report, we additionally describe:
- Linkage disequilibrium with srWGS SNPs and Indels
- Patterns of evolutionary constraint
- Benchmarking against long-read sequencing data
- Benchmarking against microarrays
In addition to these QC analyses, this report describes an analysis benchmarking the DRAGEN 3.4.12 aligner against BWA for the discovery of SVs with GATK-SV.