How do I select specific variants from the Hail MatrixTables or Hail VDS?


The simplest way to select a specific variant, or a set of variants from a defined genomic region, in the All of Us Hail MatrixTables (MTs) or Hail VDS is to use the code snippets provided in the “Manipulate Hail MatrixTable” notebook or the “Manipulate VDS” notebook, respectively, in the “How to Work with All of Us Genomic Data” Featured Workspace. These notebooks walk through setting up the appropriate environment, loading Hail, and using the snippets to isolate variants. The snippets are discussed below, divided into code for the Hail MTs and code for the Hail VDS. Please ensure you're using a Hail Genomics Analysis environment, which is required to run Hail. You can enable it in the Jupyter panel on the right side of the workspace.

 

[Image: Hail Genomics Analysis environment setting]

 

We also have a walkthrough video that goes over the use of this snippet for MTs, as well as the necessary prior steps to load and use the MT. 

 

Hail MatrixTables

Before using the snippets described in this article to isolate variants of interest, you'll have to run code to load Hail and define the variables used by the snippets. Depending on what you're doing, you may want to customize this code, but the example listed below will work with the snippets. In this example we're using the ClinVar Hail MT, but you can use another srWGS Hail MT or the array MT by using the appropriate environment variable or location in the CDR directory.

 

 

Snippets you can use to filter a genomic region 

This snippet keeps the variants between base pairs 32355000 and 32375000 on chromosome 13. First, define the region of interest in the “test_intervals” list, then filter with the function call below it.

test_intervals = ['chr13:32355000-32375000']  # change this to the region you are interested in
mt = hl.filter_intervals(
    mt,
    [hl.parse_locus_interval(x) for x in test_intervals])

Using this snippet, all you need to change is the text in the brackets based on the genomic region you’re isolating.

 

Here are some examples using the snippet.

To select all of chromosome 13:

test_intervals = ['chr13']
mt = hl.filter_intervals(
    mt,
    [hl.parse_locus_interval(x) for x in test_intervals])

 

To select two different intervals on different chromosomes:

test_intervals = ['chr1:100M-200M', 'chr16:29.1M-30.2M']
mt = hl.filter_intervals(
    mt,
    [hl.parse_locus_interval(x) for x in test_intervals])
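
The “M” suffix in the intervals above is shorthand for millions of base pairs. If you want to see the exact coordinates an interval covers before filtering, the following is a minimal sketch in plain Python. The helper names expand_position and expand_interval are hypothetical and are not part of Hail or the All of Us notebooks; the sketch only handles the simple 'contig:start-end' form used in this article.

```python
from decimal import Decimal

# Multipliers for the shorthand suffixes (k = thousand, M = million base pairs).
_SUFFIX = {"k": 1_000, "m": 1_000_000}


def expand_position(text):
    """Convert a position such as '29.1M' or '32355000' to an integer coordinate."""
    text = text.strip()
    factor = _SUFFIX.get(text[-1].lower())
    if factor is None:
        return int(text)
    # Decimal avoids floating-point rounding when multiplying values like 29.1.
    value = Decimal(text[:-1]) * factor
    if value != value.to_integral_value():
        raise ValueError(f"{text!r} is not a whole number of base pairs")
    return int(value)


def expand_interval(interval):
    """Rewrite 'chr16:29.1M-30.2M' with explicit base-pair coordinates."""
    contig, _, span = interval.partition(":")
    if not span:
        return interval  # whole-chromosome interval such as 'chr13'
    start, end = span.split("-")
    return f"{contig}:{expand_position(start)}-{expand_position(end)}"


print(expand_interval("chr16:29.1M-30.2M"))  # chr16:29100000-30200000
```

This is only a convenience for checking coordinates; the shorthand strings can still be passed to the snippet unchanged.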

 

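To select a single specific variant (here, the BRCA2 variant chr13-32355250-T-C), end the interval at least 1 bp past the position of the variant. By default, hl.parse_locus_interval treats an interval as including its start position and excluding its end position, so the end must be at least one base past the variant. Note that the interval selects by position, so any other variants at the same position are also kept.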

test_intervals = ['chr13:32355250-32355251']
mt = hl.filter_intervals(
    mt,
    [hl.parse_locus_interval(x) for x in test_intervals])
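
If you start from a variant ID rather than a coordinate range, the one-base interval can be built programmatically. The sketch below is an illustration under stated assumptions, not code from the All of Us notebooks: variant_to_interval and filter_to_variant are hypothetical helper names, the variant ID is assumed to have the form contig-position-ref-alt, the data is assumed to be on GRCh38, and the allele comparison assumes the MT stores each variant as a biallelic [ref, alt] row.

```python
def variant_to_interval(variant_id):
    """Turn 'chr13-32355250-T-C' into the one-base interval 'chr13:32355250-32355251'."""
    contig, pos, _ref, _alt = variant_id.split("-")
    start = int(pos)
    # The end is start + 1 because the interval excludes its end position.
    return f"{contig}:{start}-{start + 1}"


def filter_to_variant(mt, variant_id):
    """Keep only the rows of `mt` matching the position and alleles of `variant_id`."""
    # Imported here so variant_to_interval can be used without Hail installed.
    import hail as hl

    _contig, _pos, ref, alt = variant_id.split("-")
    interval = hl.parse_locus_interval(
        variant_to_interval(variant_id), reference_genome="GRCh38")
    # Narrow to the position first, then match the exact ref/alt alleles.
    mt = hl.filter_intervals(mt, [interval])
    return mt.filter_rows(mt.alleles == hl.array([ref, alt]))


print(variant_to_interval("chr13-32355250-T-C"))  # chr13:32355250-32355251
```

In a notebook where the MT is already loaded, this could be called as mt = filter_to_variant(mt, 'chr13-32355250-T-C'). The allele step is optional; without it you get every variant at that position, as with the snippet above.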

 

Important Notes

It’s possible to select variants based on their rsID or gene name, but this is a bit more complicated because it first requires annotating a MatrixTable with the variant annotation table. To see more about annotating a MatrixTable, please see the “Getting Started with Genomic Data” notebooks in the “How to Work with All of Us Genomic Data” Featured Workspace as well as the following video from the All of Us data science team. 

 

 

Hail VDS 

Before using the snippets described in this article to isolate variants of interest from the VDS, you'll have to run code to load Hail and define variables used for the snippets. Depending on what you're doing you may want to customize your own code differently, but the example listed below will work when using the snippets. 

 

 

VDS Filtering Snippet

Note about filtering the VDS: Filtering the VDS is generally significantly more expensive than filtering an MT, so please filter with a Hail MT whenever possible. We also strongly suggest filtering by chromosome before selecting specific variants, which should significantly speed up your analyses. This is described in more detail in this featured workspace.

 

This snippet keeps the variants between base pairs 32355000 and 32375000 on chromosome 13. First, define the region of interest in the “test_intervals” list, then filter with the function call below it.

test_intervals = ['chr13:32355000-32375000']  # change this to the region you are interested in
vds = hl.vds.filter_intervals(vds, [hl.parse_locus_interval(x) for x in test_intervals])

Using this snippet, all you need to change is the text in the brackets based on the genomic region you’re isolating.

 

Here are some examples using the snippet.

To select all of chromosome 13:

test_intervals = ['chr13']
vds = hl.vds.filter_intervals(vds, [hl.parse_locus_interval(x) for x in test_intervals])

 

To select two different intervals on different chromosomes:

test_intervals = ['chr1:100M-200M', 'chr16:29.1M-30.2M']
vds = hl.vds.filter_intervals(vds, [hl.parse_locus_interval(x) for x in test_intervals])

 

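To select a single specific variant (here, the BRCA2 variant chr13-32355250-T-C), end the interval at least 1 bp past the position of the variant. By default, hl.parse_locus_interval treats an interval as including its start position and excluding its end position, so the end must be at least one base past the variant. Note that the interval selects by position, so any other variants at the same position are also kept.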

test_intervals = ['chr13:32355250-32355251']
vds = hl.vds.filter_intervals(vds, [hl.parse_locus_interval(x) for x in test_intervals])
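
The suggestion earlier in this section to filter the VDS by chromosome before selecting specific variants could be sketched as follows. This is a hedged illustration, not the featured workspace's code: contigs_in and filter_vds_chromosome_first are hypothetical helper names, the intervals are assumed to be in the plain 'contig' or 'contig:start-end' form used in this article, and the data is assumed to be on GRCh38. It uses hl.vds.filter_chromosomes to restrict the VDS to the relevant chromosomes, then applies the same interval filter as above.

```python
def contigs_in(intervals):
    """Return the chromosome names used in a list of interval strings, in first-seen order."""
    seen = []
    for interval in intervals:
        contig = interval.split(":", 1)[0]
        if contig not in seen:
            seen.append(contig)
    return seen


def filter_vds_chromosome_first(vds, intervals):
    """Restrict `vds` to the chromosomes in `intervals`, then to the intervals themselves."""
    # Imported here so contigs_in can be used without Hail installed.
    import hail as hl

    # Step 1: drop every chromosome that no interval touches.
    vds = hl.vds.filter_chromosomes(vds, keep=contigs_in(intervals))
    # Step 2: apply the interval filter to the smaller VDS.
    parsed = [hl.parse_locus_interval(x, reference_genome="GRCh38") for x in intervals]
    return hl.vds.filter_intervals(vds, parsed)


print(contigs_in(['chr1:100M-200M', 'chr16:29.1M-30.2M']))  # ['chr1', 'chr16']
```

With a VDS already loaded, this could be called as vds = filter_vds_chromosome_first(vds, test_intervals).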

 
