We performed additional quality analyses beyond our standard dataset QC to verify our methods and provide researchers with deeper insight into the All of Us genomic data. These included investigating differences between saliva and blood samples, and comparing allele frequencies of variants shared with gnomAD to confirm that our data reflects the general population.

Table of contents

Allele Frequencies of All of Us CDRv8 compared to gnomAD v3

Saliva and blood batch effect analysis

Introduction

Beginning in the CDRv8 genomic data release, we included both saliva and blood samples whereas in previous data releases, we only included blood samples. We performed an analysis to determine blood and saliva batch effects.

We did not find appreciable batch effects between blood and saliva samples in SNP variants. However, we identified significant differences in indel counts, with a mean genome-wide difference of less than 0.4%, largely attributable to low complexity regions (LCR). Please note that variant calls are less reliable at LCR sites in all variant datasets.

To address this issue, we suggest that researchers include the sample source (blood or saliva) as a factor in their analysis when studying indels or regions where we see appreciable batch effects. We provide the sample source, for all srWGS samples, in the genomics metrics tsv file.

Methods

We measured the batch effect size using Cohen’s d [1], which is a measure of the differences of the means between two groups. We defined an “appreciable batch effect” as any value of d greater than 0.5 in magnitude, which is the convention for a medium-size effect [1].

We sampled 167,861,000 sites from 2181 blood and 2217 saliva samples from the CDRv6 srWGS SNP and indel callset. We compared the blood and saliva samples using multiple variant metrics across various genome regions. All genome centers are represented in our analysis.

We examined four variant metrics: SNP count, indel count, SNP transition to transversion (Ti/Tv) ratio, and Insertion/Deletion (Ins/Del) ratio. The genome regions we compared are the whole genome, the whole genome excluding low complexity regions (WG excl. LC), the Genome in a Bottle high confidence calling regions (GiaB), high GC content (GC > 0.85), low GC content (GC < 0.25), American College of Medical Genetics (ACMG)’s list of 59 genes (ACMG59), low mappability, segmental duplication regions (SegDupe), and tandem repeat regions (Table 1).

Table 1 -- Regions used in the batch effect analysis

Region	Description
Whole genome	All variant calls that are in the CDRv6 genomic dataset
Whole genome excluding low complexity regions (WG excl. LC)	Low complexity regions removed from the whole genome
Genome in a Bottle (GiaB) high confidence regions	National Institute of Standards and Technology (NIST) benchmark high confidence variant call regions provided by the Genome in a Bottle consortium
GC (GC > 0.85)	High guanine (G) and cytosine (C) content in a genomic region
AT-rich genome region (GC < 0.25)	Region with low GC content
American College of Medical Genetics (ACMG)’s list of 59 genes (ACMG59)	A list of 59 genes that the ACMG recommends be actively searched for pathogenic or likely pathogenic variants during clinical genomic sequencing
Low mappability	Low mappability regions are regions of the genome where sequences cannot be mapped with as high confidence as other regions of the genome. These regions are generally less reliable.
Segmental duplication regions (SegDupe)	Regions that contain sequences with copies in other areas of a genome. These regions are prone to mapping issues.
Tandem repeat regions	Regions that contain sequences with adjacent copies. These regions are prone to mapping issues.

We focused our study on three All of Us genetic ancestry groups: 1KGP-HGDP-AFR-like, 1KGP-HGDP-AMR-like, and 1KGP-HGDP-EUR-like. We did not study the other genetic ancestry groups, including 1KGP-HGDP-EAS-like, 1KGP-HGDP-MID-like, and 1KGP-HGDP-SAS-like, due to having too few samples. We also did not include the remaining samples (OTH), because they are too heterogeneous, which would skew to smaller effect sizes. The All of Us genetic ancestry groups are inferred by measuring the relative genetic similarity of each participant to global reference populations from harmonized continental metadata labels from the Human Genome Diversity Project (HGDP) and 1000 Genomes Project training data. A full description of these genetic ancestry groups is in the All of Us genomic quality report.

Results

As a control, we randomized the batch labels to see what differences might appear by chance alone. In this test, the largest observed difference was d = 0.35 across all genomic regions, which serves as a baseline for interpreting batch effects in the real data.

We summarized the metrics where we found appreciable batch effects across the tested genetic ancestry groups in Table 2.

The SNP count and SNP ti/tv ratio metrics across the entire genome do not indicate appreciable batch effects between blood and saliva samples. We do detect differences in the indel count and ins/del ratios in some regions indicating appreciable batch effects. The regions where we saw the differences are the whole genome, GC < 0.25, low mappability, SegDupe, and tandem repeat regions. The regions with no significant differences indicating batch effects are the WG excl. LC, GiaB, GC > 0.85, and ACMG59.

Our results indicate that the batch effects within the whole genome region are driven by other regions, specifically the low complexity regions, as when they are removed, the WG excl. LC demonstrates no significant batch effects.

Table 2 -- Presence of appreciable batch effects across each region and metric

Region	SNP count	Indel count	SNP Ti/Tv ratio	Ins/Del ratio
Whole genome	No	Yes	No	Yes
WG excl. LC	No	No	No	No
GiaB	No	No	No	No
GC > 0.85	No	No	No	No
GC < 0.25	No	Yes	No	Yes
ACMG59	No	No	No	No
Low mappability	No	Yes	No	Yes
SegDupe	No	Yes	No	Yes
Tandem repeat regions	No	Yes	No	Yes

Whole genome results

The following sections contain the results for each metric for the whole genome region.

SNP count

In examining the SNP counts for the whole genome, we see that across the genetic ancestry groups, there are no significant batch effects (Table 3). We have included the SNP count figures for 1KGP-HGDP-AFR-like and 1KGP-HGDP-EUR-like (Figure 1).

Table 3 -- Whole genome SNP count differences (Cohen’s d)

All of Us genetic ancestry group	Cohen’s d
1KGP-HGDP-AFR-like	d = 0.01
1KGP-HGDP-AMR-like	d = 0.15
1KGP-HGDP-EUR-like	d = 0.05

Figure 1. SNP counts across the whole genome for the 1KGP-HGDP-AFR-like genetic ancestry group (d = 0.01, left) and the 1KGP-HGDP-EUR-like genetic ancestry group (d = 0.05, right).

Indel count

For the entire genome, the mean differences for indel counts are near a medium-size batch effect between blood and saliva samples (Table 4). We have included the indel count figures for 1KGP-HGDP-AFR-like and 1KGP-HGDP-EUR-like (Figure 2).

Table 4 -- Whole genome indel count differences (Cohen’s d)

All of Us genetic ancestry group	Cohen’s d
1KGP-HGDP-AFR-like	d = 0.14
1KGP-HGDP-AMR-like	d = 0.24
1KGP-HGDP-EUR-like	d = 0.5

Figure 2. Indel counts across the whole genome for the 1KGP-HGDP-AFR-like genetic ancestry group (d = 0.14, left) and the 1KGP-HGDP-EUR-like genetic ancestry group (d = 0.5, right).

SNP Ti/Tv ratio

In examining the SNP ti/tv ratios for the whole genome region, we see that the values do not indicate a batch effect between blood and saliva samples (Table 5). We have included the SNP count figures for 1KGP-HGDP-AFR-like and 1KGP-HGDP-EUR-like (Figure 3).

Table 5 -- Whole genome SNP ti/tv ratio differences (Cohen’s d)

All of Us genetic ancestry group	Cohen’s d
1KGP-HGDP-AFR-like	d = 0.07
1KGP-HGDP-AMR-like	d = 0.13
1KGP-HGDP-EUR-like	d = 0.04

Figure 3. SNP ti/tv ratios across the whole genome for the 1KGP-HGDP-AFR-like genetic ancestry group (d = 0.07, left) and the 1KGP-HGDP-EUR-like genetic ancestry group (d = 0.04, right).

Indel Ins/Del ratio

For the entire genome, the mean differences for the indel ins/del ratio are nearing values that indicate a batch effect between saliva and blood samples (Table 6). We have included the indel count figures for 1KGP-HGDP-AFR-like and 1KGP-HGDP-EUR-like (Figure 4).

Table 6 -- Whole genome indel ins/del ratio differences (Cohen’s d)

All of Us genetic ancestry group	Cohen’s d
1KGP-HGDP-AFR-like	d = 0.07
1KGP-HGDP-AMR-like	d = 0.13
1KGP-HGDP-EUR-like	d = 0.04

Figure 4. Indel ins/del ratios across the whole genome for the 1KGP-HGDP-AFR-like genetic ancestry group (d = 0.34, left) and the 1KGP-HGDP-EUR-like genetic ancestry group (d = 0.68, right).

Whole genome minus low complexity (WG excl. LC)

When we remove the low complexity regions from the whole genome, for all of our statistics, we see no significant differences in the means that indicate batch effects between blood and saliva samples (Table 2). This means that even in the indel counts and indel ins/del ratios, there are not significant differences indicating appreciable batch effects (Table 7). We included the indel count figure for 1KGP-HGDP-AFR-like and 1KGP-HGDP-EUR-like (Figure 5).

Other regions that indicated no appreciable batch effects for all metrics were GiaB, GC > 0.85, and ACMG59.

Table 7 -- WG excl. LC indel count differences (Cohen’s d)

All of Us genetic ancestry group	Indel count Cohen’s d
1KGP-HGDP-AFR-like	d = 0.03
1KGP-HGDP-AMR-like	d = 0.2
1KGP-HGDP-EUR-like	d = 0.19

Figure 5. Indel counts for the WG excl. LC region for the 1KGP-HGDP-AFR-like genetic ancestry group (d = 0.03, left) and 1KGP-HGDP-EUR-like genetic ancestry group (d = 0.19, right).

Conclusion

While there are no appreciable significant differences between blood and saliva samples in the SNP calls across the entire genome, we see some differences in indel calling as seen in the indel counts and indel ins/del ratio. The batch effects primarily impact low complexity regions, seen with the significant differences in the whole genome, GC < 0.25, for low mappability, SegDupe, and Tandem repeat regions. We did not see the same impacts in the whole genome once we removed low complexity regions (WG excl. LC), in GiaB, GC > 0.85, or the ACMG59 region.

We observe batch effects in regions where differences between saliva and blood sequencing are expected. Saliva DNA is often less pure and more fragmented, which amplifies mapping challenges and variant-calling issues in these hard-to-call regions. Shorter DNA fragments particularly complicate indel calling, as they make it more difficult to accurately determine the size and sequence of the variant.

We would predict that srWGS SV calls are more impacted, but the srWGS SV pipeline breaks the samples into batches and controls for these effects.

Remediation

We recommend that researchers take into consideration the sample source of saliva or blood if they are including indels in their research within regions where we see batch effects (Table 2). This includes the smaller callsets such as the ACAF Threshold, ClinVar, and the exome. The sample source variable is located in the genomic metrics file in the column sample_source and can be used as a confounding variable in association testing.

Additionally, to control for the differences between blood and saliva samples, you can remove the regions containing differences indicating batch effects, such as low complexity regions.

Even with the observed differences between blood and saliva samples, the inclusion of saliva samples significantly broadens the participant base, providing a valuable resource for ongoing research.

Sources

[1] Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.

Allele Frequencies of All of Us CDRv8 compared to gnomAD v3

We compared the variants in the All of Us dataset to the variants in gnomAD, assessing the concordance of the allele frequencies. The allele frequency (AF) of a variant is equal to the number of times a certain allele is recorded divided by the number of total nucleotides sequenced at that location. The gnomAD v3 and All of Us CDRv8 datasets share around 100 million variants, and these variants have AF values that are almost identical. These patterns persist in the complete dataset and within the shared genetic ancestry groups (AFR, AMR, EAS, EUR, SAS, and OTH). The variants used in this analysis were taken from the All of Us Variant Annotation Table.

We found that there was a positive linear trend when comparing the AF values between All of Us and gnomAD (Figure 6), indicating that the AF values were nearly identical for the same variants. We analyzed the Root Mean Square Error (RMSE), the Pearson correlation coefficient, and the extreme-difference between the two datasets to further analyze the similarity of the All of Us CDRv8 dataset to gnomAD v3 (Table 8). The RMSE of 0.012 indicates the AF values had low variation between the two datasets. The Pearson correlation coefficient of 0.995 indicates the linear correlation between the two datasets. To calculate the extreme-difference, we counted the number of variants with an All of Us AF greater than 0.1 and a gnomAD AF less than or equal to 0.01, dividing by the total number of variants, and multiplying by 100%. We found the extreme-difference to be 0.021%. The values in the All of Us and gnomAD genetic ancestry groups also were consistent with these trends (Table 8).

We only included variants with a high call rate in this analysis because the majority of variants with a sizable discrepancy in AF were not frequently sampled in our datasets. We measured the call rate using the allele number (AN), a count of the total number of alleles that were genotyped at the site. If all samples were genotyped, the AN will be the number of samples times 2. We filtered variants with a gnomAD AF of 0, or an AN in either dataset less than one-fifth of the maximum AN of the dataset (gnomAD: 30,463, All of Us: 165,932).

The minimal differences in AF values between the two datasets can likely be explained by the use of different variant calling methods in the two datasets and because these two datasets include individuals with different genetic ancestry groups.

Figure 6. A two-dimensional histogram of each variant and its corresponding AF in the All of Us CDRv8 and gnomAD v3 datasets, showing a linear correlation. We excluded variants with a low call rate.

Table 8 -- Comparison statistics for the entire callset and each genetic ancestry group, excluding sites with a low call rate.

Analysis group	RMSE	Pearson correlation coefficient	% of variants with an extreme difference in AF	Number of variants
Complete dataset	0.012	0.995	0.021%	105,332,875
1KGP-HGDP-AFR-like (AFR)	0.013	0.996	0.027%	85,439,341
1KGP-HGDP-AMR-like (AMR)	0.016	0.994	0.024%	73,004,766
1KGP-HGDP-EAS-like (EAS)	0.023	0.995	0.058%	26,376,040
1KGP-HGDP-EUR-like (EUR)	0.013	0.996	0.026%	77,675,174
1KGP-HGDP-SAS-like (SAS)	0.020	0.995	0.049%	31,292,104
Remaining individuals (OTH)	0.016	0.996	0.024%	53,024,695

Additional benchmarking and quality analyses on the All of Us short-read WGS SNP & Indel dataset

Saliva and blood batch effect analysis

Introduction

Methods

Table 1 -- Regions used in the batch effect analysis