Click here for a downloadable .pdf version.
This document details the quality control (QC) steps that the All of Us Genome Centers (GC) and Data and Research Center (DRC) apply to genomic data in the research pipeline. The pipeline removes or flags samples and variants that fail quality thresholds, and we apply these QC steps before we release the genomic data for research use. We, the All of Us DRC, describe only QC processes that are performed analytically (i.e., after the sample has been genotyped and sequenced). These pipelines are automated unless otherwise noted.

All descriptions and results are limited to the v7 data release, made available in the Researcher Workbench on April 20, 2023. This release contains 312,945 genotyping array (“array”) samples, 245,394 short read whole genome sequencing (srWGS) samples with single nucleotide polymorphism, insertion, and deletion variant calls (SNPs and Indels), 11,390 srWGS samples with structural variant (SV) calls, and 1,027 long read whole genome sequencing (lrWGS) samples with SNP, Indel, and SV calls. The srWGS SV samples and the lrWGS samples are each a subset of the srWGS SNP and Indel samples, which in turn are a subset of the array samples. The samples in the genomic data correspond to the All of Us Curated Data Repository (CDR) release C2022Q4R9 (“v7”). Please see Known Issue #1, however: 20 array samples (less than 0.01%) and six srWGS samples (less than 0.01%) are missing their corresponding CDR data.

This document covers all genomic data types made available to researchers at this time, including small variants (SNPs and Indels), structural variants, raw data, and auxiliary data. Small variants are available for array samples, srWGS samples, and lrWGS samples. Structural variants are available for srWGS samples and lrWGS samples.
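The sample counts and percentages quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical and not part of any All of Us tooling, and the size comparison is merely a necessary condition for the stated subset relationships, not a check of actual sample IDs.

```python
# Sample counts quoted in the v7 overview (names are illustrative only).
ARRAY_SAMPLES = 312_945      # genotyping array samples
SRWGS_SAMPLES = 245_394      # srWGS samples with SNP and Indel calls
SRWGS_SV_SAMPLES = 11_390    # srWGS samples with SV calls
LRWGS_SAMPLES = 1_027        # lrWGS samples

# Samples missing their corresponding CDR data (Known Issue #1).
ARRAY_MISSING_CDR = 20
SRWGS_MISSING_CDR = 6


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def sizes_allow_subsets() -> bool:
    """Check that a subset is never larger than its stated superset.

    This compares counts only; it does not verify sample membership.
    """
    return (
        SRWGS_SAMPLES <= ARRAY_SAMPLES
        and SRWGS_SV_SAMPLES <= SRWGS_SAMPLES
        and LRWGS_SAMPLES <= SRWGS_SAMPLES
    )


array_missing_pct = percent(ARRAY_MISSING_CDR, ARRAY_SAMPLES)
srwgs_missing_pct = percent(SRWGS_MISSING_CDR, SRWGS_SAMPLES)

print(f"Array samples missing CDR data: {array_missing_pct:.4f}%")
print(f"srWGS samples missing CDR data: {srwgs_missing_pct:.4f}%")
print(f"Counts consistent with subset claims: {sizes_allow_subsets()}")
```

Both fractions come out to roughly 0.006% and 0.002%, consistent with the "less than 0.01%" figures stated in the text.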
- QC Report v7.pdf (4 MB)