The simplest way to select a specific variant, or a set of variants within a defined genomic region, from the All of Us Hail MatrixTables (MTs) or Hail VDS is to use the code snippets provided in the “Manipulate Hail MatrixTable” or "Manipulate VDS" notebooks, respectively, in the “How to Work with All of Us Genomic Data” Featured Workspace. These notebooks walk through setting up the appropriate environment, loading Hail, and using the snippets to isolate variants. The snippets are discussed below, first for Hail MTs and then for the Hail VDS.
Hail MatrixTables
This snippet keeps the variants between base pairs 32355000 and 32375000 on chromosome 13. First, define the region of interest in the “test_intervals” list, then pass that list to the filter function below it.
test_intervals = ['chr13:32355000-32375000'] ##change sites here to the variants you are interested in##
mt = hl.filter_intervals(
mt,
[hl.parse_locus_interval(x,)
for x in test_intervals])
Using this snippet, all you need to change is the text in the brackets based on the genomic region you’re isolating.
Here are some examples using the snippet.
To select all of chromosome 13:
test_intervals = ['chr13']
mt = hl.filter_intervals(
mt,
[hl.parse_locus_interval(x,)
for x in test_intervals])
To select two different intervals on different chromosomes:
test_intervals = ['chr1:100M-200M', 'chr16:29.1M-30.2M']
mt = hl.filter_intervals(
mt,
[hl.parse_locus_interval(x,)
for x in test_intervals])
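The 'M' shorthand in the interval strings above stands for megabases (and 'K' for kilobases). As a rough illustration of what 'chr16:29.1M-30.2M' expands to, here is a minimal pure-Python sketch. The helper names expand_position and expand_interval are hypothetical and are not part of Hail or the Featured Workspace notebooks; Hail parses the shorthand itself, so this is only for checking coordinates by eye.

```python
from decimal import Decimal

# Multipliers for the shorthand suffixes seen in strings like 'chr16:29.1M-30.2M'.
_SUFFIX = {"K": 1_000, "M": 1_000_000}


def expand_position(text):
    """Convert '29.1M', '500K', or '32355000' to an integer base-pair position."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIX:
        # Decimal avoids floating-point rounding on values such as 29.1.
        return int(Decimal(text[:-1]) * _SUFFIX[text[-1]])
    return int(text)


def expand_interval(interval):
    """Rewrite 'contig:start-end' with both positions written out in base pairs."""
    contig, span = interval.split(":")
    start, end = span.split("-")
    return f"{contig}:{expand_position(start)}-{expand_position(end)}"


expanded = [expand_interval(x) for x in ['chr1:100M-200M', 'chr16:29.1M-30.2M']]
print(expanded)
```

The expanded strings describe the same regions as the shorthand, so either form can be placed in test_intervals.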
To select a single specific variant (here, the BRCA2 variant chr13-32355250-T-C), end the interval at least 1 bp past the position of the variant of interest, because by default the end position of a parsed interval is not included:
test_intervals = ['chr13:32355250-32355251']
mt = hl.filter_intervals(
mt,
[hl.parse_locus_interval(x,)
for x in test_intervals])
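If you start from variant IDs written as contig-position-ref-alt, the +1 bp rule can be applied automatically. The sketch below is one possible way to do that in plain Python. The helper variant_id_to_interval is a hypothetical name introduced here for illustration and is not provided by Hail or the notebooks.

```python
def variant_id_to_interval(variant_id):
    """Turn 'chr13-32355250-T-C' into 'chr13:32355250-32355251'."""
    # Split from the right so only the last three hyphens are treated as separators.
    contig, pos, _ref, _alt = variant_id.rsplit("-", 3)
    start = int(pos)
    # End 1 bp past the variant position so the position itself falls inside the interval.
    return f"{contig}:{start}-{start + 1}"


test_intervals = [variant_id_to_interval("chr13-32355250-T-C")]
print(test_intervals)
```

The resulting list can be passed to the filter snippet unchanged. Note that an interval selects by position only, so any other alleles recorded at the same position are kept as well.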
Important Notes
We also have a walkthrough video that goes over the use of this snippet, as well as necessary prior steps to load and use the MatrixTable.
It’s possible to select variants based on their rsID or gene name, but this is a bit more complicated because it first requires annotating a MatrixTable with the variant annotation table. To see more about annotating a MatrixTable, please see the “Getting Started with Genomic Data” notebooks in the “How to Work with All of Us Genomic Data” Featured Workspace as well as the following video from the All of Us data science team.
Hail VDS
Note about filtering the VDS: Filtering the VDS is generally significantly more expensive than filtering an MT, so filter with a Hail MT whenever possible.
This snippet keeps the variants between base pairs 32355000 and 32375000 on chromosome 13. First, define the region of interest in the “test_intervals” list, then pass that list to the filter function below it.
test_intervals = ['chr13:32355000-32375000'] ##change sites here to the variants you are interested in##
vds = hl.vds.filter_intervals( vds, [hl.parse_locus_interval(x,) for x in test_intervals])
Using this snippet, all you need to change is the text in the brackets based on the genomic region you’re isolating.
Here are some examples using the snippet.
To select all of chromosome 13:
test_intervals = ['chr13']
vds = hl.vds.filter_intervals( vds, [hl.parse_locus_interval(x,) for x in test_intervals])
To select two different intervals on different chromosomes:
test_intervals = ['chr1:100M-200M', 'chr16:29.1M-30.2M']
vds = hl.vds.filter_intervals( vds, [hl.parse_locus_interval(x,) for x in test_intervals])
To select a single specific variant (here, the BRCA2 variant chr13-32355250-T-C), end the interval at least 1 bp past the position of the variant of interest, because by default the end position of a parsed interval is not included:
test_intervals = ['chr13:32355250-32355251']
vds = hl.vds.filter_intervals( vds, [hl.parse_locus_interval(x,) for x in test_intervals])
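Because filtering the VDS is expensive, it can be worth catching typos in the interval strings before starting the filter. The following is a minimal sketch of such a check in plain Python. The function check_intervals and its pattern are assumptions made for illustration: the pattern deliberately covers only the three forms used in this article (a whole chromosome, a numeric range, and a range with K/M shorthand) and is looser than a real chromosome list, so it is not a full description of what Hail accepts.

```python
import re

# Accepts only the forms shown in this article, e.g.
# 'chr13', 'chr13:32355000-32375000', 'chr16:29.1M-30.2M'.
_POSITION = r"\d+(\.\d+)?[KkMm]?"
_INTERVAL = re.compile(rf"^chr(\d{{1,2}}|X|Y|M)(:{_POSITION}-{_POSITION})?$")


def check_intervals(intervals):
    """Return the list unchanged, or raise ValueError naming the malformed strings."""
    bad = [s for s in intervals if not _INTERVAL.match(s)]
    if bad:
        raise ValueError(f"Unrecognized interval strings: {bad}")
    return intervals


test_intervals = check_intervals(['chr13:32355250-32355251'])
```

A string that fails the check is reported immediately, before any Hail computation is launched.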