We performed additional quality analyses beyond our standard dataset QC to verify our methods and provide researchers with deeper insight into the All of Us genomic data. These included investigating differences between saliva and blood samples, and comparing allele frequencies of variants shared with gnomAD to confirm that our data reflects the general population.
Table of contents
Saliva and blood batch effect analysis
Allele Frequencies of All of Us CDRv8 compared to gnomAD v3
Saliva and blood batch effect analysis
Introduction
Beginning in the CDRv8 genomic data release, we included both saliva and blood samples whereas in previous data releases, we only included blood samples. We performed an analysis to determine blood and saliva batch effects.
We did not find appreciable batch effects between blood and saliva samples in SNP variants. However, we identified significant differences in indel counts, with a mean genome-wide difference of less than 0.4%, largely attributable to low complexity regions (LCR). Please note that variant calls are less reliable at LCR sites in all variant datasets.
To address this issue, we suggest that researchers include the sample source (blood or saliva) as a factor in their analysis when studying indels or regions where we see appreciable batch effects. We provide the sample source, for all srWGS samples, in the genomics metrics tsv file.
Methods
We measured the batch effect size using Cohen’s d [1], which is a measure of the differences of the means between two groups. We defined an “appreciable batch effect” as any value of d greater than 0.5 in magnitude, which is the convention for a medium-size effect [1].
We sampled 167,861,000 sites from 2181 blood and 2217 saliva samples from the CDRv6 srWGS SNP and indel callset. We compared the blood and saliva samples using multiple variant metrics across various genome regions. All genome centers are represented in our analysis.
We examined four variant metrics: SNP count, indel count, SNP transition to transversion (Ti/Tv) ratio, and Insertion/Deletion (Ins/Del) ratio. The genome regions we compared are the whole genome, the whole genome excluding low complexity regions (WG excl. LC), the Genome in a Bottle high confidence calling regions (GiaB), high GC content (GC > 0.85), low GC content (GC < 0.25), American College of Medical Genetics (ACMG)’s list of 59 genes (ACMG59), low mappability, segmental duplication regions (SegDupe), and tandem repeat regions (Table 1).
Table 1 -- Regions used in the batch effect analysis
| Region | Description |
| Whole genome | All variant calls that are in the CDRv6 genomic dataset |
| Whole genome excluding low complexity regions (WG excl. LC) | Low complexity regions removed from the whole genome |
| Genome in a Bottle (GiaB) high confidence regions | National Institute of Standards and Technology (NIST) benchmark high confidence variant call regions provided by the Genome in a Bottle consortium |
| GC (GC > 0.85) | High guanine (G) and cytosine (C) content in a genomic region |
| AT-rich genome region (GC < 0.25) | Region with low GC content |
| American College of Medical Genetics (ACMG)’s list of 59 genes (ACMG59) | A list of 59 genes that the ACMG recommends be actively searched for pathogenic or likely pathogenic variants during clinical genomic sequencing |
| Low mappability | Low mappability regions are regions of the genome where sequences cannot be mapped with as high confidence as other regions of the genome. These regions are generally less reliable. |
| Segmental duplication regions (SegDupe) | Regions that contain sequences with copies in other areas of a genome. These regions are prone to mapping issues. |
| Tandem repeat regions | Regions that contain sequences with adjacent copies. These regions are prone to mapping issues. |
We focused our study on three All of Us genetic ancestry groups: 1KGP-HGDP-AFR-like, 1KGP-HGDP-AMR-like, and 1KGP-HGDP-EUR-like. We did not study the other genetic ancestry groups, including 1KGP-HGDP-EAS-like, 1KGP-HGDP-MID-like, and 1KGP-HGDP-SAS-like, due to having too few samples. We also did not include the remaining samples (OTH), because they are too heterogeneous, which would skew to smaller effect sizes. The All of Us genetic ancestry groups are inferred by measuring the relative genetic similarity of each participant to global reference populations from harmonized continental metadata labels from the Human Genome Diversity Project (HGDP) and 1000 Genomes Project training data. A full description of these genetic ancestry groups is in the All of Us genomic quality report.
Results
As a control, we randomized the batch labels to see what differences might appear by chance alone. In this test, the largest observed difference was d = 0.35 across all genomic regions, which serves as a baseline for interpreting batch effects in the real data.
We summarized the metrics where we found appreciable batch effects across the tested genetic ancestry groups in Table 2.
The SNP count and SNP ti/tv ratio metrics across the entire genome do not indicate appreciable batch effects between blood and saliva samples. We do detect differences in the indel count and ins/del ratios in some regions indicating appreciable batch effects. The regions where we saw the differences are the whole genome, GC < 0.25, low mappability, SegDupe, and tandem repeat regions. The regions with no significant differences indicating batch effects are the WG excl. LC, GiaB, GC > 0.85, and ACMG59.
Our results indicate that the batch effects within the whole genome region are driven by other regions, specifically the low complexity regions, as when they are removed, the WG excl. LC demonstrates no significant batch effects.
Table 2 -- Presence of appreciable batch effects across each region and metric
| Region | SNP count | Indel count | SNP Ti/Tv ratio | Ins/Del ratio |
| Whole genome | No | Yes | No | Yes |
| WG excl. LC | No | No | No | No |
| GiaB | No | No | No | No |
| GC > 0.85 | No | No | No | No |
| GC < 0.25 | No | Yes | No | Yes |
| ACMG59 | No | No | No | No |
| Low mappability | No | Yes | No | Yes |
| SegDupe | No | Yes | No | Yes |
| Tandem repeat regions | No | Yes | No | Yes |
Whole genome results
The following sections contain the results for each metric for the whole genome region.
SNP count
In examining the SNP counts for the whole genome, we see that across the genetic ancestry groups, there are no significant batch effects (Table 3). We have included the SNP count figures for 1KGP-HGDP-AFR-like and 1KGP-HGDP-EUR-like (Figure 1).
Table 3 -- Whole genome SNP count differences (Cohen’s d)
| All of Us genetic ancestry group | Cohen’s d |
| 1KGP-HGDP-AFR-like | d = 0.01 |
| 1KGP-HGDP-AMR-like | d = 0.15 |
| 1KGP-HGDP-EUR-like | d = 0.05 |
Figure 1. SNP counts across the whole genome for the 1KGP-HGDP-AFR-like genetic ancestry group (d = 0.01, left) and the 1KGP-HGDP-EUR-like genetic ancestry group (d = 0.05, right).
Indel count
For the entire genome, the mean differences for indel counts are near a medium-size batch effect between blood and saliva samples (Table 4). We have included the indel count figures for 1KGP-HGDP-AFR-like and 1KGP-HGDP-EUR-like (Figure 2).
Table 4 -- Whole genome indel count differences (Cohen’s d)
| All of Us genetic ancestry group | Cohen’s d |
| 1KGP-HGDP-AFR-like | d = 0.14 |
| 1KGP-HGDP-AMR-like | d = 0.24 |
| 1KGP-HGDP-EUR-like | d = 0.5 |
Figure 2. Indel counts across the whole genome for the 1KGP-HGDP-AFR-like genetic ancestry group (d = 0.14, left) and the 1KGP-HGDP-EUR-like genetic ancestry group (d = 0.5, right).
SNP Ti/Tv ratio
In examining the SNP ti/tv ratios for the whole genome region, we see that the values do not indicate a batch effect between blood and saliva samples (Table 5). We have included the SNP count figures for 1KGP-HGDP-AFR-like and 1KGP-HGDP-EUR-like (Figure 3).
Table 5 -- Whole genome SNP ti/tv ratio differences (Cohen’s d)
| All of Us genetic ancestry group | Cohen’s d |
| 1KGP-HGDP-AFR-like | d = 0.07 |
| 1KGP-HGDP-AMR-like | d = 0.13 |
| 1KGP-HGDP-EUR-like | d = 0.04 |
Figure 3. SNP ti/tv ratios across the whole genome for the 1KGP-HGDP-AFR-like genetic ancestry group (d = 0.07, left) and the 1KGP-HGDP-EUR-like genetic ancestry group (d = 0.04, right).
Indel Ins/Del ratio
For the entire genome, the mean differences for the indel ins/del ratio are nearing values that indicate a batch effect between saliva and blood samples (Table 6). We have included the indel count figures for 1KGP-HGDP-AFR-like and 1KGP-HGDP-EUR-like (Figure 4).
Table 6 -- Whole genome indel ins/del ratio differences (Cohen’s d)
| All of Us genetic ancestry group | Cohen’s d |
| 1KGP-HGDP-AFR-like | d = 0.07 |
| 1KGP-HGDP-AMR-like | d = 0.13 |
| 1KGP-HGDP-EUR-like | d = 0.04 |
Figure 4. Indel ins/del ratios across the whole genome for the 1KGP-HGDP-AFR-like genetic ancestry group (d = 0.34, left) and the 1KGP-HGDP-EUR-like genetic ancestry group (d = 0.68, right).
Whole genome minus low complexity (WG excl. LC)
When we remove the low complexity regions from the whole genome, for all of our statistics, we see no significant differences in the means that indicate batch effects between blood and saliva samples (Table 2). This means that even in the indel counts and indel ins/del ratios, there are not significant differences indicating appreciable batch effects (Table 7). We included the indel count figure for 1KGP-HGDP-AFR-like and 1KGP-HGDP-EUR-like (Figure 5).
Other regions that indicated no appreciable batch effects for all metrics were GiaB, GC > 0.85, and ACMG59.
Table 7 -- WG excl. LC indel count differences (Cohen’s d)
| All of Us genetic ancestry group | Indel count Cohen’s d |
| 1KGP-HGDP-AFR-like | d = 0.03 |
| 1KGP-HGDP-AMR-like | d = 0.2 |
| 1KGP-HGDP-EUR-like | d = 0.19 |
Figure 5. Indel counts for the WG excl. LC region for the 1KGP-HGDP-AFR-like genetic ancestry group (d = 0.03, left) and 1KGP-HGDP-EUR-like genetic ancestry group (d = 0.19, right).
Conclusion
While there are no appreciable significant differences between blood and saliva samples in the SNP calls across the entire genome, we see some differences in indel calling as seen in the indel counts and indel ins/del ratio. The batch effects primarily impact low complexity regions, seen with the significant differences in the whole genome, GC < 0.25, for low mappability, SegDupe, and Tandem repeat regions. We did not see the same impacts in the whole genome once we removed low complexity regions (WG excl. LC), in GiaB, GC > 0.85, or the ACMG59 region.
We observe batch effects in regions where differences between saliva and blood sequencing are expected. Saliva DNA is often less pure and more fragmented, which amplifies mapping challenges and variant-calling issues in these hard-to-call regions. Shorter DNA fragments particularly complicate indel calling, as they make it more difficult to accurately determine the size and sequence of the variant.
We would predict that srWGS SV calls are more impacted, but the srWGS SV pipeline breaks the samples into batches and controls for these effects.
Remediation
We recommend that researchers take into consideration the sample source of saliva or blood if they are including indels in their research within regions where we see batch effects (Table 2). This includes the smaller callsets such as the ACAF Threshold, ClinVar, and the exome. The sample source variable is located in the genomic metrics file in the column sample_source and can be used as a confounding variable in association testing.
Additionally, to control for the differences between blood and saliva samples, you can remove the regions containing differences indicating batch effects, such as low complexity regions.
Even with the observed differences between blood and saliva samples, the inclusion of saliva samples significantly broadens the participant base, providing a valuable resource for ongoing research.
Sources
[1] Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
Allele Frequencies of All of Us CDRv8 compared to gnomAD v3
We compared the variants in the All of Us dataset to the variants in gnomAD, assessing the concordance of the allele frequencies. The allele frequency (AF) of a variant is equal to the number of times a certain allele is recorded divided by the number of total nucleotides sequenced at that location. The gnomAD v3 and All of Us CDRv8 datasets share around 100 million variants, and these variants have AF values that are almost identical. These patterns persist in the complete dataset and within the shared genetic ancestry groups (AFR, AMR, EAS, EUR, SAS, and OTH). The variants used in this analysis were taken from the All of Us Variant Annotation Table.
We found that there was a positive linear trend when comparing the AF values between All of Us and gnomAD (Figure 6), indicating that the AF values were nearly identical for the same variants. We analyzed the Root Mean Square Error (RMSE), the Pearson correlation coefficient, and the extreme-difference between the two datasets to further analyze the similarity of the All of Us CDRv8 dataset to gnomAD v3 (Table 8). The RMSE of 0.012 indicates the AF values had low variation between the two datasets. The Pearson correlation coefficient of 0.995 indicates the linear correlation between the two datasets. To calculate the extreme-difference, we counted the number of variants with an All of Us AF greater than 0.1 and a gnomAD AF less than or equal to 0.01, dividing by the total number of variants, and multiplying by 100%. We found the extreme-difference to be 0.021%. The values in the All of Us and gnomAD genetic ancestry groups also were consistent with these trends (Table 8).
We only included variants with a high call rate in this analysis because the majority of variants with a sizable discrepancy in AF were not frequently sampled in our datasets. We measured the call rate using the allele number (AN), a count of the total number of alleles that were genotyped at the site. If all samples were genotyped, the AN will be the number of samples times 2. We filtered variants with a gnomAD AF of 0, or an AN in either dataset less than one-fifth of the maximum AN of the dataset (gnomAD: 30,463, All of Us: 165,932).
The minimal differences in AF values between the two datasets can likely be explained by the use of different variant calling methods in the two datasets and because these two datasets include individuals with different genetic ancestry groups.
Figure 6. A two-dimensional histogram of each variant and its corresponding AF in the All of Us CDRv8 and gnomAD v3 datasets, showing a linear correlation. We excluded variants with a low call rate.
Table 8 -- Comparison statistics for the entire callset and each genetic ancestry group, excluding sites with a low call rate.
| Analysis group | RMSE | Pearson correlation coefficient | % of variants with an extreme difference in AF | Number of variants |
|---|---|---|---|---|
| Complete dataset | 0.012 | 0.995 | 0.021% | 105,332,875 |
| 1KGP-HGDP-AFR-like (AFR) | 0.013 | 0.996 | 0.027% | 85,439,341 |
| 1KGP-HGDP-AMR-like (AMR) | 0.016 | 0.994 | 0.024% | 73,004,766 |
| 1KGP-HGDP-EAS-like (EAS) | 0.023 | 0.995 | 0.058% | 26,376,040 |
| 1KGP-HGDP-EUR-like (EUR) | 0.013 | 0.996 | 0.026% | 77,675,174 |
| 1KGP-HGDP-SAS-like (SAS) | 0.020 | 0.995 | 0.049% | 31,292,104 |
| Remaining individuals (OTH) | 0.016 | 0.996 | 0.024% | 53,024,695 |
Comments
0 comments
Article is closed for comments.