Accessing Genomic Data in All of Us Controlled Tier

Summary

If you are not using the Researcher Workbench to analyze genomic data in the Controlled Tier, you can ignore this notice. On June 4-7, 2022, we are making technical changes to our Google Cloud Storage configuration that may require minor changes to your code. If you don’t make these changes, your attempts to retrieve genomic data may fail with errors. The data itself is not changing, and this update should not affect your analyses.

If this impacts your research, the changes to the CDR Bucket and the code updates you need to make are detailed below.

Changes to CDR Bucket

The Researcher Workbench stores genomic data files in a Google Cloud Storage bucket called the CDR Bucket. This bucket will be changed to the “requester pays” setting on June 7, 2022. The CDR Bucket location is also changing from multi-region to single-region.

After this change, any egress fees will be billed to your workspace and paid using that workspace’s configured payment method (either the initial credits supplied by the All of Us program or a user-provided billing account).

These changes are driven by two upcoming events: the availability of genomic read data (CRAM files) in a future All of Us data release, and a pricing change in Google Cloud:

  • Genomic read data (CRAM files) represent a very large set of files (approximately 20 GB per participant) that may not be useful for the majority of researchers. To optimize storage costs, we are planning to use a lower-cost GCP storage class. This reduces ongoing storage costs but incurs some cost at the time of use. Because this use would be tied to specific research projects, we’re turning on “requester pays” so that those costs are billed to the workspace using this data.
  • Google Cloud has announced pricing changes that introduce egress fees when reading data from a multi-region bucket, effective October 1, 2022. Our CDR bucket is currently multi-region, so these fees would affect most users if we do nothing. To avoid that, we’ll be replacing our CDR bucket with a new, identically named single-region bucket.

Code updates to make in order to access the genomic data 

When a bucket has the “requester pays” setting on, Google Cloud requires any access to that bucket to provide a project for billing. Your code should provide the project associated with your workspace, which is available in the environment variable GOOGLE_PROJECT. The details of how to provide the project vary depending on what tools you’re using to access the CDR Bucket.
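As a minimal sketch of this idea in Python: the helper below reads GOOGLE_PROJECT and passes it as the billing project when listing a few objects. It assumes the google-cloud-storage client library is installed in your environment, which this article does not state. The function names billing_project and list_cdr_objects are illustrative only, and the bucket name is the one used in the gsutil example in this article.

```python
import os


def billing_project() -> str:
    """Return the workspace project from GOOGLE_PROJECT, or fail with a clear message."""
    project = os.environ.get("GOOGLE_PROJECT", "").strip()
    if not project:
        raise RuntimeError(
            "GOOGLE_PROJECT is not set; requester-pays access needs a billing project."
        )
    return project


def list_cdr_objects(bucket_name: str = "fc-aou-datasets-controlled", limit: int = 5) -> list:
    """List a few object names, billing the request to the workspace project."""
    # Assumption: google-cloud-storage is installed. It is imported lazily so
    # that billing_project() still works without it.
    from google.cloud import storage

    project = billing_project()
    client = storage.Client(project=project)
    # user_project names the project that is billed for requester-pays access.
    bucket = client.bucket(bucket_name, user_project=project)
    return [blob.name for blob in bucket.list_blobs(max_results=limit)]
```

Calling list_cdr_objects() from a notebook cell should return up to five object names if your workspace has access to the bucket.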

If you access the CDR bucket using Hail or PySpark, the project will be provided automatically; no code changes are needed.

If you access the CDR bucket using gsutil, Nextflow, Cromwell or dsub, you will need to make the code changes listed below.

gsutil

If you use gsutil to access the CDR bucket, you will need to pass an additional flag in the command:

!gsutil -u $GOOGLE_PROJECT ls gs://fc-aou-datasets-controlled
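If you call gsutil from Python code rather than with a notebook "!" command, the same flag can be added programmatically. The following is a sketch, not an official helper: the function names gsutil_argv and run_gsutil are hypothetical, and it assumes gsutil is on the PATH and GOOGLE_PROJECT is set.

```python
import os
import subprocess


def gsutil_argv(*args: str) -> list:
    """Build a gsutil command line that bills the workspace project."""
    project = os.environ["GOOGLE_PROJECT"]  # raises KeyError if the variable is unset
    # -u is a top-level gsutil option, so it goes before the subcommand.
    return ["gsutil", "-u", project, *args]


def run_gsutil(*args: str) -> str:
    """Run gsutil and return its standard output; raises if gsutil exits non-zero."""
    result = subprocess.run(
        gsutil_argv(*args), check=True, capture_output=True, text=True
    )
    return result.stdout
```

For example, print(run_gsutil("ls", "gs://fc-aou-datasets-controlled")) is the Python equivalent of the command above.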

dsub

If you use dsub to access the CDR bucket, you will need to pass an additional parameter in the command:

--user-project ${GOOGLE_PROJECT}

You can see where this parameter goes in a dsub command below; the remaining arguments are omitted here:

dsub \
  --user-project "${GOOGLE_PROJECT}" \
  ...

Note: If you have an existing dsub command, you will need to add in this parameter. If you are using the dsub Tutorial Notebook, this parameter is already included.
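If you assemble dsub commands in Python, one way to avoid forgetting the parameter is to add it whenever it is missing. This is a sketch under that assumption; with_user_project is a hypothetical helper, not part of dsub.

```python
def with_user_project(argv: list, project: str) -> list:
    """Return a copy of a dsub argument list that includes --user-project."""
    already_set = any(
        arg == "--user-project" or arg.startswith("--user-project=") for arg in argv
    )
    if already_set:
        return list(argv)
    # Insert the flag right after the program name and keep the other arguments.
    return list(argv[:1]) + ["--user-project", project] + list(argv[1:])
```

Pass the result to subprocess.run, using os.environ["GOOGLE_PROJECT"] as the project value.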

Nextflow

If you use Nextflow to access the CDR bucket, you will need to pass an additional parameter in the Nextflow configuration file:

google.enableRequesterPaysBuckets = true

You can see this parameter in a complete Nextflow configuration file below:

profiles {
    gls {
        google.enableRequesterPaysBuckets = true
    }
}

Note: If you have an existing Nextflow script, you will need to add in this parameter. If you are using the Nextflow Tutorial Notebook, the automatically generated configuration file contains this parameter already.
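To check an existing configuration file for this parameter, a small text check may be enough. The sketch below uses a hypothetical helper, ensure_requester_pays, which looks for the dotted assignment form only and appends the setting at the top level of the file when it is absent. It will not recognize the setting written in block form, and it does not check which profile the setting belongs to.

```python
import re

SETTING = "google.enableRequesterPaysBuckets = true"
# Matches the dotted form on its own line, at any indentation level.
_PATTERN = re.compile(
    r"^\s*google\.enableRequesterPaysBuckets\s*=\s*true\b", re.MULTILINE
)


def ensure_requester_pays(config_text: str) -> str:
    """Return the config text, appending the setting if it is not present."""
    if _PATTERN.search(config_text):
        return config_text
    # Make sure the appended setting starts on its own line.
    separator = "" if (not config_text or config_text.endswith("\n")) else "\n"
    return f"{config_text}{separator}{SETTING}\n"
```

Read your configuration file, pass its contents through the function, and write the result back only if it differs from the original text.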

Cromwell 

If you use Cromwell to access the CDR bucket, you will need to use Cromwell version 77. The Cromwell Tutorial Notebook has been updated to use Cromwell version 77.

You can install Cromwell in a Jupyter Notebook with the following commands. Note that you also need Womtool and SDKMAN! to run Cromwell.

!curl https://github.com/broadinstitute/cromwell/releases/download/77/cromwell-77.jar -o cromwell-77.jar -L

!curl https://github.com/broadinstitute/cromwell/releases/download/77/womtool-77.jar -o womtool-77.jar -L

!curl -s "https://get.sdkman.io" -o install_sdkman.sh
!bash install_sdkman.sh

 

Cost impact

For the types of data we’ve released to date, we expect these changes will have negligible impact on the researcher’s cost to access and use data. This is because the files currently offered are stored in the standard storage class, and retrieval of resources from this storage class is free. There is a small fee based on the number of operations (up to $0.05 per 10,000), but we expect total fees to remain negligible for most researchers.
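To get a feel for the scale of the operations fee, the arithmetic can be sketched as follows. The rate used is the upper bound quoted above ($0.05 per 10,000 operations); the function name operations_fee and the operation counts are illustrative, and actual charges depend on Google Cloud's current pricing.

```python
from decimal import Decimal


def operations_fee(num_operations: int, rate_per_10k: str = "0.05") -> Decimal:
    """Estimate the operations fee in US dollars at a given rate per 10,000 operations."""
    # Decimal avoids binary floating-point rounding in currency arithmetic.
    return Decimal(num_operations) / Decimal(10_000) * Decimal(rate_per_10k)
```

At that upper rate, 10,000 operations cost $0.05 and 1,000,000 operations cost $5.00.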

Some files released in the future may use a colder storage class, such as Nearline, which incurs retrieval charges when those files are accessed. We’ll provide more details when those files are released. For more details, see Google Cloud’s storage pricing information.

If you have any questions, please contact our support team via the User Support Hub linked from the Researcher Workbench or by email at support@researchallofus.org. 
