Using the Genomic Extraction Tool


To work with the short-read whole genome sequencing (srWGS) data, you can use Genomic Extraction, a point-and-click tool.

The Genomic Extraction tool extracts variant data from the genomic dataset and saves it as Variant Call Format (VCF) callset files for exporting to a Jupyter Notebook environment, where you can perform analysis using Hail, PLINK, etc.

VCF, Hail, PLINK, and BGEN files are also available for different callsets, along with a Hail VariantDataset (VDS) file for the entire genomic dataset.

Note: Use the Genomic Extraction tool only when you want to analyze whole genome sequencing (WGS) data for a smaller subset of participants (fewer than 5,000 participants). This process will not work for array data.

For larger cohorts, you will need to pull the genomic data directly into Jupyter Notebook using files in the Controlled CDR Directory.
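
The size rule above can be written as a small check to run before you build a dataset. This is only an illustrative sketch: the function name and the returned labels are hypothetical, and the 5,000-participant threshold is taken from the note above.

```python
# Threshold from the note above: the Genomic Extraction tool is meant
# for cohorts with fewer than 5,000 participants.
EXTRACTION_PARTICIPANT_LIMIT = 5000


def recommended_access_path(n_participants: int) -> str:
    """Return a label for the suggested way to access srWGS data.

    The labels are hypothetical names used only for this example.
    """
    if n_participants < 0:
        raise ValueError("n_participants must be non-negative")
    if n_participants < EXTRACTION_PARTICIPANT_LIMIT:
        return "genomic_extraction_tool"
    return "controlled_cdr_directory"


print(recommended_access_path(1200))  # genomic_extraction_tool
print(recommended_access_path(5000))  # controlled_cdr_directory
```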

To use the Genomic Extraction tool

  1. Follow the steps for creating a cohort with the Cohort Builder.
  2. Click the white plus sign on top of a blue circle to the right of “Datasets.”

    After clicking the white plus sign on top of the blue circle, the Dataset Builder page will display. The Dataset Builder includes four sections: Select Cohorts (Participants) on the left, Select Concept Sets (Rows) in the middle, Select Values (Columns) to the right, and Preview Dataset on the bottom.

  3. Select your cohort under the “Select Cohorts (Participants)” column on the left by clicking the checkbox.

    The select cohorts column includes all the workspace cohorts you created using the Cohort Builder. To select your cohort, you will click the checkbox to the left of the cohort.

  4. Select the “Short-read whole genome sequencing data” under the “Select Concept Sets (Rows)” column in the middle by clicking the checkbox.

    The select concept sets column includes all the concept sets available to you. The short-read whole genome sequencing data concept set is a pre-made concept set for genomic analysis. To select the concept set, you will click the checkbox to the left of the concept set.

  5. Select “VCF Files” under the “Select Values (Columns)” column on the right by clicking the checkbox.

    The select values column includes all the values available to you. For the short-read whole genome sequencing data concept set, you will see only VCF files as the values. To select the value, you will click the checkbox to the left of the value. After you’ve selected your cohort, the short-read whole genome sequencing data concept set, and the VCF files value, a blue Create Dataset button will appear at the bottom right of the Dataset Builder screen.

  6. Click “Create Dataset.”

    After clicking the blue create dataset button, a pop-up will appear with two text field boxes for naming and providing a description for your dataset.

  7. Name your dataset and add a description for your dataset.

    In the bottom right of the pop-up, the Save button is grayed out until you name and provide a description for your dataset. Once you’ve provided a name and a description, the Save button will turn blue.

  8. Click “SAVE.”

    After clicking the blue save button, the Dataset Builder will reappear on your screen. In the bottom right of the screen, there are still two buttons, but instead of Create dataset and Analyze, you’ll see Save dataset and Analyze. Save dataset is grayed out unless you make changes to your dataset. The Analyze button is no longer grayed out and is blue.

  9. Click “ANALYZE” in the bottom right to begin creating your Jupyter Notebook environment. A pop-up will appear, asking if you would like to run the extraction process.
    Note: The extraction process uses cloud compute credits to generate the code and files from the genomic dataset. The cost can be significant, depending on the amount of data you are analyzing.

    After clicking the blue analyze button, a pop-up will appear titled Would you like to extract variant data as VCF files.

  10. Decide if you want to start or skip genomic extraction.
    Note: Genomic data extraction runs in the background and incurs compute costs. You can also skip and save your dataset without beginning the extraction process or incurring any compute costs.

    Extraction will generate VCF files for the participants in your dataset, which you can use in your analysis environment. VCF extraction incurs cloud costs. Extraction typically costs $0.02 per extracted sample, but costs may vary.
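
To get a rough sense of the cost before you decide, you can multiply your cohort size by the typical per-sample figure quoted above. The sketch below assumes the $0.02 typical rate applies to every sample; the function name is hypothetical, and actual charges may differ.

```python
from decimal import Decimal

# Typical per-sample cost quoted in this article; actual costs may vary.
TYPICAL_COST_PER_SAMPLE_USD = Decimal("0.02")


def estimate_extraction_cost(
    n_samples: int,
    cost_per_sample: Decimal = TYPICAL_COST_PER_SAMPLE_USD,
) -> Decimal:
    """Return a rough extraction cost estimate in USD, rounded to cents."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    return (cost_per_sample * n_samples).quantize(Decimal("0.01"))


# Example: a cohort of 1,200 participants at the typical rate.
print(estimate_extraction_cost(1200))  # 24.00
```

Decimal is used instead of float so the estimate does not pick up binary rounding noise in the cents.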

If you choose to start genomic extraction

  1. Click “Extract & Continue.”
    Note: Genomic data extraction runs in the background and will notify you when the files are ready for analysis.

    The blue Extract and continue button is in the bottom right of the pop-up. You can also choose to skip, which is linked text to the left of the extract and continue button. After clicking extract and continue, a new pop-up will appear titled Export Dataset. This pop-up is where you will set up how you want to export your dataset.

  2. Select Python as your programming language.
    Note: The genomic tools currently available in the Researcher Workbench do not use R or SAS as a programming language.

    Select programming language is a radio button option with Python, R, and SAS options. Python is selected by default for the export dataset, so no action should be necessary.

  3. Select the notebook or create a new notebook.

    Selecting an existing Jupyter Notebook or creating a new notebook is a drop-down option.

  4. If creating a new notebook, name your notebook.

    The Jupyter Notebook name field is an open text field.

  5. Select your preferred analysis tool: Hail, PLINK, or other VCF-compatible tool.

    Select analysis tool for genetic variant data is a radio button option with Hail, PLINK, and other VCF-compatible tool. After you fill out all the fields on the export dataset pop-up, there are three buttons and links along the bottom of the pop-up: a cancel link on the left that will close the pop-up, a white copy code button in the middle that allows you to copy the code created with the Dataset Builder, and a blue export button on the right that exports your dataset and launches your analysis environment.

  6. Click “EXPORT” to launch the Jupyter Notebook environment.

    After clicking export, the Analysis tab of your workspace will appear with your Jupyter Notebook.
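
Once the extraction has finished and you are working in the notebook, a quick sanity check on a VCF file can confirm that it contains the participants you expect. The sketch below is not the code the Dataset Builder generates. It is a minimal, standard-library-only example that relies on the VCF text layout: metadata lines start with "##", the column header line starts with "#CHROM", and sample columns begin after the nine fixed columns. The function name and the sample IDs in the inline example are hypothetical.

```python
import io


def summarize_vcf(lines):
    """Return (sample_names, variant_count) for an iterable of VCF text lines."""
    samples = None
    n_variants = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            # Skip blank lines and "##" metadata lines.
            continue
        if line.startswith("#CHROM"):
            # Columns 1-9 are fixed (CHROM ... FORMAT); the rest are samples.
            samples = line.split("\t")[9:]
            continue
        # Every remaining line is one variant record.
        n_variants += 1
    if samples is None:
        raise ValueError("no #CHROM header line found")
    return samples, n_variants


# Tiny inline VCF with two hypothetical samples and two variant records.
example = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample_a\tsample_b",
    "chr1\t10177\t.\tA\tC\t.\tPASS\t.\tGT\t0/1\t0/0",
    "chr1\t10352\t.\tT\tA\t.\tPASS\t.\tGT\t0/0\t1/1",
])
samples, n_variants = summarize_vcf(io.StringIO(example))
print(len(samples), n_variants)  # 2 2
```

For a real file, pass an open text handle instead of the inline example. If the file is compressed, opening it with gzip.open(path, "rt") should work, since bgzip output is gzip-compatible. For actual analysis, use Hail, PLINK, or another VCF-compatible tool as selected in step 5.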

To check the status of the genomic extraction

If you chose to run the extraction in the background while you were saving your dataset, you can check the status.

  1. Click the white DNA helix icon.

    The white DNA helix icon is located on the right-hand navigation bar as the last icon in the list of icons. After you click the white DNA helix, a pop-up will appear with your genomic extractions. This includes the dataset name, the status, date started, cost so far, and duration of the extraction so far.

  2. View the status, date started, cost, and duration of the genomic extraction.



  3. Open the notebook you created under the “If you choose to start genomic extraction” steps.
    Note: You do not need to create a new notebook or copy the file path anywhere, as it has already been created for you when you clicked export at the start of your extraction.
