Selecting Genomic Data: Using the Genomic Extraction Tool


Selecting Genomic Variant Data for Analysis

To access our short-read whole genome sequencing (WGS) data, you can use our point-and-click tools to extract variant data from the genomics dataset and save it as VCF (Variant Call Format) callset files. You can then export these files to a Jupyter Notebook for analysis with Hail, PLINK, or another analysis tool of your choosing. We also provide VCF, Hail, PLINK, and BGEN files for different callsets, as well as a Hail VDS file for the entire dataset.


Note: The extraction process described below should only be used when you want to analyze Whole Genome Sequencing (WGS) data in a smaller subset of participants (fewer than 5,000 participants). This process does not work for array genomic data. For larger cohorts, you will need to pull data directly into a notebook using files in the Controlled CDR Directory.

Genomic Cohort Extraction Cost

Extracting a genomic cohort to VCF files incurs a cost, similar to the costs accrued by your cloud analysis environment. Typically, this process costs approximately $0.02 per extracted sample, but the cost may vary because WGS data size varies slightly across samples. When you run a genomic extraction on a cohort, only participants with corresponding WGS data are extracted to the resulting VCF files, and only these participants affect the cost. To see the exact number of participants with WGS data in your cohort, try adding a criteria requirement of “Whole Genome Sequence”.
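As a rough illustration of the arithmetic above, the following Python sketch estimates the extraction cost from the number of participants with WGS data. The function name and the default rate are illustrative assumptions based on the approximate per-sample figure quoted above; actual charges will vary.

```python
def estimate_extraction_cost(n_wgs_participants, cost_per_sample=0.02):
    """Rough extraction cost estimate in USD (hypothetical helper).

    cost_per_sample defaults to the approximate figure quoted in this
    article; it is an estimate, not a guaranteed price.
    """
    if n_wgs_participants < 0:
        raise ValueError("participant count cannot be negative")
    return n_wgs_participants * cost_per_sample


# Example: a cohort of 3,000 participants, of whom 2,400 have WGS data.
# Only the 2,400 participants with WGS data count toward the cost.
estimated_cost = estimate_extraction_cost(2400)
print(f"Estimated extraction cost: ${estimated_cost:.2f}")
```

Pass the count shown after adding the “Whole Genome Sequence” criterion rather than the full cohort size, since participants without WGS data are not extracted.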

As with other analyses in the workspace, costs are billed either to the workspace creator’s initial credits or to the associated billing account. When using your own billing account, note that charges relating to VCF extraction appear as “BigQuery Analysis” and can be identified by the label “extraction_uuid” in the GCP billing export.
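If you have downloaded rows from your GCP billing export, a minimal Python sketch like the one below can pick out the extraction charges by their label. The row shape is a simplified assumption (a list of dictionaries, each with a "cost" value and a "labels" list of key/value records), and the sample rows are made up for illustration; adapt the field names to your own export.

```python
def extraction_charges(billing_rows):
    """Keep only rows that carry an "extraction_uuid" label."""
    matches = []
    for row in billing_rows:
        # Each label is assumed to be a {"key": ..., "value": ...} record.
        label_keys = {label["key"] for label in row.get("labels") or []}
        if "extraction_uuid" in label_keys:
            matches.append(row)
    return matches


# Made-up rows for illustration only; real exports contain many more fields.
sample_rows = [
    {"service": "BigQuery", "cost": 1.25,
     "labels": [{"key": "extraction_uuid", "value": "example-id-1"}]},
    {"service": "Compute Engine", "cost": 3.10, "labels": []},
    {"service": "BigQuery", "cost": 0.75,
     "labels": [{"key": "extraction_uuid", "value": "example-id-1"}]},
]

extraction_rows = extraction_charges(sample_rows)
extraction_total = sum(row["cost"] for row in extraction_rows)
print(f"{len(extraction_rows)} extraction rows, total ${extraction_total:.2f}")
```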

Note: Costs and credits are not automatically refunded for canceled or failed extraction jobs.

Choosing the Genomic Dataset

To analyze WGS genomic data for a smaller cohort (5,000 participants or fewer), you can choose our prepackaged genomics cohort or concept sets, which consist of genomic variant data to include in your analysis. For larger cohorts, we suggest you start with the prepackaged Hail Matrix Table or the set of VCF files for the entire dataset.

To choose the genomics dataset

  1. In the Cohort Builder tool, build your cohort. You can use all participants or create a custom cohort. Save the cohort.
  2. In the Dataset Builder, choose your saved cohort under Select Cohorts in the far left column.
  3. Check the box next to All whole genome variant data under Select Concept Sets (rows) in the middle column.
  4. Choose VCF files under Select Values (columns) in the far right column.
  5. Click CREATE DATASET in the bottom right of the screen.


A pop-up screen will appear, asking you to title and save your dataset.


Next, select ANALYZE at the bottom right of the screen to begin creating your Jupyter Notebook environment. A pop-up screen will appear, asking whether you would like to run the extraction process. This process uses cloud compute credits to generate code and files from the genomic dataset for use in your analysis environment. Be sure you are ready to begin, because the extraction can be quite expensive, depending on the amount of data you are analyzing. Genomic data extraction runs in the background, and you will be notified when the files are ready for analysis.

Alternatively, you can choose SKIP and still save the Dataset on the next screen without beginning the extraction process or incurring any credit charges.


You will then be asked to export the dataset to a Jupyter Notebook. Select Python as your coding language, which gives you the option to choose your preferred genomic tool. At this time, no genomic tools are available from this screen when R is selected as the coding language. You can select one of our recommended tools, Hail or PLINK, or choose “Other VCF-compatible tool” to generate a code snippet that simply retrieves the VCF files.

Note: We recommend using Python for the extraction process and then creating an R notebook or another Python notebook for the main analysis. It is good practice to keep these two tasks (extraction and analysis) separate.

Then select EXPORT to open the Jupyter Notebook environment.


If you chose to run the extraction in the background while saving your dataset, you can check the status of your VCF extraction by clicking the DNA icon in the help sidebar on the right side of the screen. There you can also see the full list of VCF files you have saved.


Once the extraction process is complete, you will open the notebook created for the extraction and run the auto-generated code. You do not need to create a new notebook or copy the file path anywhere, as it has already been created for you when you clicked EXPORT at the start of your extraction. 
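Once you have VCF files in your environment, a quick sanity check can confirm that they contain the samples and variants you expect. The Python sketch below is not the auto-generated notebook code; it is a minimal standard-library example, and the function names and the tiny demo file are illustrative assumptions. It reads a plain or gzip-compressed VCF and reports the sample IDs from the header and the number of variant records.

```python
import gzip
import os
import tempfile


def open_vcf(path):
    """Open a plain-text or gzip-compressed (.gz) VCF in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def summarize_vcf(path):
    """Return (sample_ids, variant_count) for one VCF file."""
    sample_ids = []
    variant_count = 0
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # metadata lines
            if line.startswith("#CHROM"):
                # Sample IDs follow the nine fixed VCF columns.
                sample_ids = line.rstrip("\n").split("\t")[9:]
                continue
            if line.strip():
                variant_count += 1
    return sample_ids, variant_count


# Demo with a tiny made-up VCF; point summarize_vcf at a real file instead.
demo_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample_a\tsample_b\n"
    "chr1\t1000\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t0/0\n"
    "chr1\t2000\t.\tC\tT\t.\tPASS\t.\tGT\t1/1\t0/1\n"
)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.vcf")
    with open(demo_path, "w") as out:
        out.write(demo_text)
    samples, n_variants = summarize_vcf(demo_path)

print(f"{len(samples)} samples, {n_variants} variant records")
```

Reading line by line keeps memory use low, but for full-scale analysis a dedicated tool such as Hail or PLINK is the better choice.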

Video Demonstration

The use of the extraction tool is also described in a video linked below:


We encourage you to routinely check your workbench account for costs being incurred while using the workbench. To find out more information on how to check your account balance, optimize your workspaces to reduce costs, and for examples of project costs, please see the support articles below:  

Initial credits and how to create a billing account

Suggested Optimizations for Cost and Compute Time

Estimate how much your project will cost

How to work with All of Us Genomic data
