All of Us Best Practices: Deduplication of Overlapping and Related Samples from Two or More Genomic Datasets

If you import genomic data from external datasets into the Researcher Workbench to analyze alongside All of Us data, you may need to remove duplicate records as well as records from individuals who are closely related to one another (hereafter referred to as “related samples”) from the combined dataset.

The All of Us Data User Code of Conduct (DUCC) allows you to import external data into the Researcher Workbench (RW) but prohibits any attempt to link All of Us data at the participant level with participant-level data from other sources without explicit permission from the Resource Access Board (RAB). The best practices below describe an approved method for removing duplicates or related individuals from multiple genomic datasets while staying in compliance with the terms of the DUCC. Although this method refers to “related samples,” it applies to the removal of both related and identical samples from combined datasets.

If you use this method, you do not need permission from the RAB. However, if the proposed method is not feasible for your use case, or if you need to remove overlapping samples for non-genomic datasets, you should contact the RAB by email (aouresourceaccess@od.nih.gov) for additional guidance.

If you re-identify any All of Us participants, accidentally or otherwise, you must notify the RAB immediately. The method below should enable this analysis without re-identification, but you can contact the support team at support@researchallofus.org with additional questions as needed.

Method:

This method uses the Hail (Python) pc_relate function [1] to compute relatedness estimates and find a maximal set of unrelated individuals without identifying direct links between participants. Note: the method may require adjustments based on updates to Hail or variations in your specific use case. Please see the Hail pc_relate function documentation for additional information about how to implement the method to align with your research goals.

Steps:

First, import your external dataset into the Researcher Workbench as a Hail MatrixTable (MT), and confirm that your imported dataset contains the same properties (columns, etc.) as the All of Us genomic dataset. In the code below, your external variant dataset will be referred to as my_variants. The details of this step will depend on the properties and format of your dataset(s). For more information on how to create a Hail MT, see the Hail Matrix Table documentation [2]. Then, set the default reference genome (e.g., Genome assembly GRCh38) to use for relatedness estimation in the next steps.

# load Hail 
import hail as hl

# create a Hail MT set of variants, called "my_variants", that contains the variants of interest for your dataset. (code not shown) 

# set the default reference genome to GRCh38
hl.default_reference(new_default_reference = "GRCh38")

In the next three steps, you will apply a set of filtering parameters to subset, or “prune,” your dataset (my_variants) and choose a set of variants for your relatedness estimation. The variants used in relatedness estimation must be present in all cohorts, accurately called with high confidence, and not correlated with one another.

To start, import the All of Us high-quality (HQ) sites list as a Hail MT (listed below as sites_mt). The variants within the All of Us HQ sites (Appendix J, All of Us Genomic Quality Report [3]) are the sites that are guaranteed to pass those filtering parameters. Because the sites must be present in both datasets, you can subset, or prune, your dataset (my_variants) using semi_join_rows() with the HQ sites list (sites_mt) to save computational power. The following code describes how to filter your Hail MT.

# Import the HQ sites VCF
sites_mt = hl.import_vcf('gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/aux/ancestry/merged_sites_only_intersection.vcf.bgz')

# Filter the my_variants table to only keep variants in the sites-only VCF
my_variants = my_variants.semi_join_rows(sites_mt.rows())

For your pruned dataset (my_variants), follow the same steps described in the All of Us relatedness analysis (Appendix J, All of Us Genomic Quality Report [3]) to select variants from the dataset that meet the criteria below:
- Autosomal, bi-allelic single nucleotide variants (SNVs)
- Allele frequency > 0.1%
- Call rate > 99%
- LD-pruned with a cutoff of r2 = 0.1

The following code outlines how to filter your variants (my_variants) following these criteria. By the end of this step, my_variants should contain all of the samples, but only at the variant sites that meet all of the specified criteria.

# filter for the autosome
my_variants = my_variants.filter_rows(my_variants.locus.in_autosome())
# filter for bi-allelic
my_variants = my_variants.filter_rows(hl.len(my_variants.alleles) == 2)
# filter to only keep SNPs
my_variants = my_variants.filter_rows(hl.is_snp(my_variants.alleles[0], my_variants.alleles[1]))

# filter for AF > 0.1%
my_variants = hl.variant_qc(my_variants)
my_variants = my_variants.filter_rows(my_variants.variant_qc.AF[1] > 0.001)

# filter for call rate > 99%
my_variants = my_variants.filter_rows(my_variants.variant_qc.call_rate > 0.99, keep=True)

# LD-pruning
pruned_variant_table = hl.ld_prune(my_variants.GT, r2=0.1)
my_variants = my_variants.semi_join_rows(pruned_variant_table)
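
To make the allele frequency and call rate thresholds above concrete, here is a minimal plain-Python sketch of how the two values can be computed for a single site. It is only an illustration and is not part of the Hail workflow: the function name site_passes, its parameters, and the toy genotype lists are hypothetical, and it assumes diploid autosomal calls encoded as the number of alternate alleles (0, 1, or 2), with None for a missing call.

```python
def site_passes(alt_allele_counts, min_af=0.001, min_call_rate=0.99):
    """Check one site against the allele frequency and call rate criteria.

    alt_allele_counts holds one entry per sample: the number of alternate
    alleles (0, 1, or 2), or None when the genotype call is missing.
    """
    called = [g for g in alt_allele_counts if g is not None]
    if not called:
        return False
    # Call rate: fraction of samples with a non-missing genotype
    call_rate = len(called) / len(alt_allele_counts)
    # Allele frequency: alternate alleles divided by all called alleles
    # (two alleles per called diploid sample)
    allele_frequency = sum(called) / (2 * len(called))
    return allele_frequency > min_af and call_rate > min_call_rate


# 1,000 samples: 5 heterozygous calls and 5 missing calls
# call rate = 995 / 1000, allele frequency = 5 / 1990
print(site_passes([0] * 990 + [1] * 5 + [None] * 5))   # True

# 1,000 samples with 20 missing calls: call rate = 980 / 1000
print(site_passes([0] * 975 + [1] * 5 + [None] * 20))  # False
```

In the Hail code above, the corresponding values come from the variant_qc annotations (AF and call_rate) rather than from a hand-written function.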

Next, load the All of Us variant callset from the ACAF MT (aou_variants). The Allele Count/Allele Frequency (ACAF) threshold callset is a curated subset of the short-read Whole Genome Sequencing (srWGS) data, containing variants with a population-specific allele frequency (AF) > 1% or allele count (AC) > 100. We use the ACAF MatrixTable for this process because these variants most closely align with the High-Quality (HQ) sites used for relatedness estimation. After reading in the ACAF data, you must filter it using semi_join_rows() and the same sites-only VCF (sites_mt) used in the previous steps. This generates an All of Us variant list that matches the specifications for the HQ variant sites.

# Read in Allele Count/Allele Frequency (ACAF) threshold callset data as a variable called "aou_variants"
# The ACAF threshold callset provides a manageable subset of the srWGS SNP & Indel data
aou_variants = hl.read_matrix_table('gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/acaf_threshold/multiMT/hail.mt')

# Filter the aou_variants table to only keep variants present in the sites-only VCF (HQ sites)
aou_variants = aou_variants.semi_join_rows(sites_mt.rows())

Next, you will create a unified variant set (combined_variants) by subsetting the filtered All of Us MT (aou_variants) to include only those sites present in your filtered external dataset (my_variants), and then combining the samples from both datasets. This ensures that the subsequent relatedness estimation is performed using high-quality variant sites that are common to both your imported data and the All of Us cohort for all individuals. Use the semi_join_rows() method to retain only the rows of aou_variants that match your filtered variant sites (my_variants), then use union_cols() to combine the samples. By default, union_cols() keeps only the variant sites that are present in both datasets. Before combining samples, ensure both datasets have identical row and entry schemas. The details of this step will depend on the properties and format of your dataset(s).

# Keep only the aou_variants sites that are also present in my_variants
aou_variants = aou_variants.semi_join_rows(my_variants.rows())

# Ensure both datasets have identical row and entry schemas and combine variant datasets
combined_variants = my_variants.union_cols(aou_variants)

Use the pc_relate method in the Hail (Python) library to generate relatedness estimates, or kinship scores (see “What is a kinship score and what cutoff should I use for my analysis?” below), from the combined variant set (combined_variants) created in the previous step. The pc_relate function returns a Hail Table [4] (x_samples) with one row for each pair of samples whose kinship score is above the min_kinship parameter (i.e., they are related), annotated with the kinship score for that pair. Use the min_kinship parameter to specify a relatedness cutoff that meets your research needs.

# Use pc_relate() to create a new Hail Table called "x_samples" with one row per pair of samples
# whose kinship score is above min_kinship (0.1 in this example); the "kin" field holds the score.
# pc_relate() takes the genotype calls (.GT) of combined_variants as input.
# k = number of principal components used to control for population structure

x_samples = hl.pc_relate(combined_variants.GT, min_individual_maf=0.01, statistics='kin', k=16, min_kinship=0.1)

Use the Hail method maximal_independent_set [5] on the x_samples Hail Table from the previous step to create a Hail Table of the smallest set of samples to remove based on their kinship scores. This method generates a list of samples to selectively remove, or prune, from the original dataset while keeping the maximal set of unrelated samples.

# create a new Hail Table, "samples_to_remove", that lists the samples in combined_variants that need to be removed so that no remaining pair exceeds the kinship score cutoff.
# x_samples.i and x_samples.j are the two samples in each related pair; keep=False returns the samples to remove rather than the samples to keep.

samples_to_remove = hl.maximal_independent_set(x_samples.i, x_samples.j, keep=False)
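
For intuition about what this step does, the plain-Python sketch below picks samples to drop from a list of related pairs until no related pair remains. It is a simple greedy illustration on made-up sample IDs, not Hail's implementation: maximal_independent_set may use a different algorithm and may return a different, equally valid set. The function name samples_to_drop and the IDs s1 through s5 are hypothetical.

```python
from collections import defaultdict


def samples_to_drop(related_pairs):
    """Greedy illustration: repeatedly drop the sample that appears in the
    most remaining related pairs, until no related pair is left."""
    neighbors = defaultdict(set)
    for a, b in related_pairs:
        if a == b:
            continue
        neighbors[a].add(b)
        neighbors[b].add(a)

    dropped = set()
    while True:
        # Samples that still have at least one related partner
        degree = {s: len(n) for s, n in neighbors.items() if n}
        if not degree:
            break
        # Pick the most-connected sample; break ties alphabetically
        worst = min(degree, key=lambda s: (-degree[s], s))
        dropped.add(worst)
        # Remove the dropped sample from its partners' neighbor sets
        for other in neighbors.pop(worst):
            neighbors[other].discard(worst)
    return dropped


# s2 is related to both s1 and s3; s4 and s5 are related to each other
pairs = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
print(sorted(samples_to_drop(pairs)))  # ['s2', 's4']
```

Dropping s2 resolves two related pairs at once, which is why removing a well-chosen set keeps more samples than removing every sample that appears in any pair.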

To create the final, deduplicated dataset (combined_variants_independent_samples), use the Hail method anti_join_cols to remove the related samples from your filtered, unified dataset (combined_variants).

# create a new Hail MT, "combined_variants_independent_samples" that contains only the samples from combined_variants that do not match the samples in the "samples_to_remove" Table.

combined_variants_independent_samples = combined_variants.anti_join_cols(samples_to_remove)

Additional examples and information

Do you have an example of a set of variants to use for the pc_relate function?

Yes. The All of Us high-quality (HQ) sites (Appendix I, All of Us Genomic Quality Report [3]) are available to view and import as a Hail MT on the RW using the sites-only VCF file path listed in the Controlled CDR Directory under the “Genetic Ancestry” asset [6]. However, the variant sites selected for this step are highly dependent on the characteristics of your dataset, and the exact variants used in the All of Us Genomic Quality Report may not be the best fit for your data.

What is a kinship score and what cutoff should I use for my analysis?

A kinship score is half of the fraction of genetic material shared between two individuals and ranges from 0.0 to 0.5. The All of Us relatedness analysis treats pairs with kinship scores above 0.1 as related, with parent-child or sibling pairs at approximately 0.25 and identical twins or duplicates at approximately 0.5 (Appendix J, All of Us Genomic Quality Report [3]).
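
As a small worked example of the definition above, the sketch below converts a shared fraction of genetic material into a kinship score and applies the 0.1 cutoff. The function names and the default cutoff argument are illustrative only and are not part of Hail or the All of Us tooling; choose a cutoff that fits your own analysis.

```python
def kinship_from_shared_fraction(shared_fraction):
    """Kinship is half of the fraction of genetic material shared."""
    if not 0.0 <= shared_fraction <= 1.0:
        raise ValueError("shared_fraction must be between 0 and 1")
    return shared_fraction / 2


def is_related(kinship, min_kinship=0.1):
    """Return True when a kinship score is above the chosen cutoff."""
    return kinship > min_kinship


# Identical twins or duplicate samples share all of their genetic material
print(kinship_from_shared_fraction(1.0))  # 0.5
# Parent-child and sibling pairs share about half
print(kinship_from_shared_fraction(0.5))  # 0.25
# Both values are above the 0.1 cutoff
print(is_related(0.5), is_related(0.25))  # True True
```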

Is there an example of removing related samples from the All of Us genomic dataset?

For another example of how to use the Hail method anti_join_cols to remove related samples on the Researcher Workbench, see the featured workspace “How to Work with All of Us Genomic Data (Hail - Plink)(v8)” in the notebook “01_Get Started with WGS Data_part1_srWGS SNPINDEL.ipynb.” The example code in this featured workspace uses the HQ sites from the All of Us Genomics Quality Report to prune the data in order to keep only the maximal set of unrelated samples in the dataset.