What exactly am I paying for?

Creating a Researcher Workbench account is free. In addition, the All of Us Research Program provides each registered researcher with $300 in initial credits, which can be applied toward the cloud computing costs described below. Once these initial credits are exhausted, you must set up a Google Cloud Platform (GCP) billing account to continue running analyses on the Workbench. Before starting a project on the Researcher Workbench, we recommend reviewing this support article to optimize your analysis and reduce costs.

During the initial credit period, the workspace creator is charged for all usage in the workspace, including usage by collaborators. If you do not want collaborators to spend your initial credits, consider sharing your workspace with ‘read only’ permissions. Using the Cohort Builder or Dataset Builder, reviewing cohorts, and browsing support materials do not incur costs. For example cost breakdowns of Featured Workspaces, which can give you a better idea of what your project may cost, please see this support article.

Note: You can view your spend report at any time in Google Billing in the GCP console: open the “About” tab of your workspace and select “View detailed spend report.” Spend reports show total cost per day. For more on GCP billing, see: View your billing reports and cost trends.

Roughly calculating costs

The equation below gives a general breakdown of the costs accumulated in the Researcher Workbench. Additional factors come into play as you customize your cloud analysis environments and work with specific kinds of data.

Costs = Data + (Compute Resources * Time) + (Storage * Time)

Time: the duration of use, measured in one-minute increments
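
As a rough illustration, the equation above can be written as a small Python function. This is only a sketch, not an official calculator: the function and parameter names are hypothetical, and the inputs in the usage lines are placeholder values chosen for illustration (the rates reuse figures quoted elsewhere in this article).

```python
def estimate_total_cost(data_cost, compute_rate_per_hour, compute_minutes,
                        storage_rate_per_gb_month, storage_gb, storage_months):
    """Costs = Data + (Compute Resources * Time) + (Storage * Time)."""
    # Compute time is tracked in minutes, while rates are quoted per hour.
    compute_cost = compute_rate_per_hour * compute_minutes / 60
    storage_cost = storage_rate_per_gb_month * storage_gb * storage_months
    return round(data_cost + compute_cost + storage_cost, 2)


# Illustrative inputs only: $5 of queries, 40 hours (2,400 minutes) of
# compute at $0.20/hour, and 50 GB stored for 3 months at $0.026/GB/month.
example_total = estimate_total_cost(5.00, 0.20, 2400, 0.026, 50, 3)
print(f"Estimated total: ${example_total:.2f}")
```

Keeping the three terms separate makes it easier to see which one dominates a given project.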

 

Data

The All of Us data are stored on a secure, cloud-based platform, Google Cloud Platform (GCP). BigQuery is a serverless data analytics platform, and the Curated Data Repository (CDR) is securely contained in BigQuery. In the Researcher Workbench, SQL is used to pull CDR data from BigQuery. Queries are billed according to the number of bytes read, so pulling more data from BigQuery incurs higher charges. For example, pulling all columns and rows from the EHR condition domain for a cohort of 10,000 participants can return several GB to TB of data, which corresponds to a higher cost for the data pull (roughly up to several hundred dollars). For this reason, we do not recommend querying the entire table for certain data types, such as the Fitbit minute-level data or some types of genomic data, because of the large volume of data and the resulting cost.

To learn more about ways to optimize using BigQuery, please see this GCP support article.
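
Because queries are billed by bytes read, a rough estimate is the number of bytes read multiplied by a price per unit of data. The sketch below shows that arithmetic. The function name is hypothetical, and the price of $6.25 per TiB is an assumption used only for illustration; it is not a figure from this article, so check current BigQuery pricing before relying on it.

```python
TIB = 2 ** 40  # number of bytes in one tebibyte


def estimate_query_cost(bytes_read, price_per_tib):
    """Estimate the charge for a query that reads `bytes_read` bytes."""
    return round(bytes_read / TIB * price_per_tib, 2)


# ASSUMPTION: placeholder price in USD per TiB, not taken from this article.
ASSUMED_PRICE_PER_TIB = 6.25

# Hypothetical sizes: reading every column of a table versus only a few.
all_columns = estimate_query_cost(2 * TIB, ASSUMED_PRICE_PER_TIB)
few_columns = estimate_query_cost(0.2 * TIB, ASSUMED_PRICE_PER_TIB)
print(f"All columns: ${all_columns:.2f}, selected columns: ${few_columns:.2f}")
```

The comparison illustrates why selecting only the columns and rows you need, rather than the entire table, lowers the cost of a data pull.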

 

Cloud compute resources

The Workbench uses Google Compute Engine (GCE) for computational resources in the cloud and Google Cloud Storage (GCS) for storage in the cloud. When you open an application for analysis (such as a Jupyter Notebook), the application is loaded in a virtual cloud environment that costs money to use. While you are actively working with an application, you are charged at an hourly rate that depends on the size of the machine. The default Jupyter “General Analysis” environment costs around 20 cents per hour when active, as shown in the menu below. When notebooks are not in active use, the machine can be “paused” to minimize compute costs. A “paused” environment still incurs a nominal cloud spend, which depends on how the environment is configured. Once an environment is deleted, no compute cost is incurred.

 

[Image: gae 2.png, the cloud analysis environment menu showing the hourly cost]

 

In the “Jupyter cloud analysis environment” console in the Researcher Workbench, you can choose the type of environment (standard or Dataproc) and customize compute resources such as CPUs, RAM, and disk size. To learn more about compute resources, see the related support article. Increasing computing power increases cost. For example, increasing from 4 CPUs to 8 CPUs changes the cost while running from $0.20 per hour to $0.29 per hour. Certain analyses require more computing resources and will therefore cost more.
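
To make the trade-off concrete, the sketch below compares the two hourly rates from the example above over a hypothetical 40 hours of active use. The function name and the number of hours are illustrative assumptions.

```python
def compute_cost(rate_per_hour, active_hours):
    """Cost of keeping an environment running for `active_hours` hours."""
    return round(rate_per_hour * active_hours, 2)


# Rates from the example above: 4 CPUs at $0.20/hour, 8 CPUs at $0.29/hour.
RATE_4_CPU = 0.20
RATE_8_CPU = 0.29
hours = 40  # hypothetical amount of active time

four_cpu = compute_cost(RATE_4_CPU, hours)
eight_cpu = compute_cost(RATE_8_CPU, hours)
print(f"4 CPUs: ${four_cpu:.2f}, 8 CPUs: ${eight_cpu:.2f}")

# The larger machine is cheaper overall only if it finishes the same work
# in less than this fraction of the smaller machine's time.
break_even_fraction = RATE_4_CPU / RATE_8_CPU
print(f"Break-even runtime fraction: {break_even_fraction:.2f}")
```

In other words, under these rates a larger machine saves money only when it shortens the runtime by roughly a third or more; otherwise it simply costs more per hour.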

 

Storage

All workspaces have an associated “workspace bucket” for storage. Storage cost is $0.026 per GB per month. Workspace storage is typically a very small fraction of the total cost unless users create and import very large files. Unlike compute costs, storage costs are incurred even when users are not actively using workspaces. It is good practice to remove large files that are no longer needed from the workspace bucket.

Furthermore, when a standard General Analysis environment is created, a persistent disk (PD) is attached automatically. Like a USB drive, the PD is storage attached to your environment that keeps your files even if the environment is deleted. We offer two types of PD, standard and solid-state drive (SSD), which are charged per minute and billed as a monthly total. For example, a 120 GB standard PD costs $4.80 a month, while a 120 GB SSD PD costs $20.40 a month. For either PD type, the charge increases with disk size. Therefore, if you are not using the PD as your long-term storage option, we recommend deleting the PD when you delete your analysis environment.
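
The storage figures above can be turned into a quick estimate for other disk sizes. In the sketch below, the per-GB persistent disk rates are derived from the 120 GB examples ($4.80 / 120 and $20.40 / 120), and the assumption that cost scales linearly with size is ours rather than something this article states. The function names are hypothetical.

```python
# Per-GB monthly rates derived from the 120 GB examples above.
# ASSUMPTION: cost scales linearly with disk size.
PD_RATE_PER_GB_MONTH = {"standard": 4.80 / 120, "ssd": 20.40 / 120}
BUCKET_RATE_PER_GB_MONTH = 0.026  # workspace bucket rate quoted above


def monthly_pd_cost(size_gb, disk_type="standard"):
    """Estimated monthly cost of a persistent disk of `size_gb` GB."""
    return round(size_gb * PD_RATE_PER_GB_MONTH[disk_type], 2)


def monthly_bucket_cost(size_gb):
    """Estimated monthly cost of `size_gb` GB in the workspace bucket."""
    return round(size_gb * BUCKET_RATE_PER_GB_MONTH, 2)


for size in (120, 500):
    print(f"{size} GB: standard PD ${monthly_pd_cost(size):.2f}, "
          f"SSD PD ${monthly_pd_cost(size, 'ssd'):.2f}, "
          f"bucket ${monthly_bucket_cost(size):.2f}")
```

Because these charges accrue whether or not you are working, it is worth running this kind of estimate before leaving a large disk or large files in place.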

To learn more about storage options, please see this support article: Storage Options Explained

Other costs that may be incurred for genomic work

If you work with All of Us genomic data, there may be an associated cost depending on the data type and file format. For example, we offer a genomic extraction tool that allows you to extract variant data from our short read WGS (srWGS) data. This process typically costs about $0.02 per extracted sample, so using the tool for 4,500 samples costs roughly $90. Costs may vary because WGS data size varies slightly across samples.

 

CRAMs

The All of Us genomic dataset offers raw whole genome sequencing (WGS) data in compressed CRAM format, as outlined in this support article: How the All of Us Genomic data are organized. Raw genomic data are significantly more expensive to use because you must pay egress charges, which are the costs of retrieving the data from the cloud for analysis. The CRAMs are stored in nearline storage in a 'requester pays' bucket. If you copy all the CRAM files to a local disk, we estimate that your cost will be multiple thousands of dollars. For example, streaming every v7 CRAM file would cost approximately $20,000. Furthermore, working with many CRAM files requires more compute resources, which in turn makes for an expensive analysis environment. CRAM files are also large, so storing them increases storage costs. Therefore, we recommend partial CRAM streaming, as demonstrated in this tutorial notebook. Even then, running the dsub command to stream a portion of one chromosome from all v7 CRAM files will cost around $500. If you use CRAMs, we highly recommend setting up a billing account and connecting it to your workspace before running any stream; otherwise, the stream may exhaust your initial credits and the command will fail before completion. We encourage all users interested in using CRAMs to review this tutorial workspace before starting their project and to consider the associated analysis cost.

 

Workflows

In addition to the costs of data analysis, storage, and compute power, there are costs associated with workflows, which are used to process large volumes of data. The Researcher Workbench supports two workflow engines, Nextflow (version 21.03.0-edge and above) and Cromwell/WDL. We also offer dsub, a command-line tool that makes it easy to submit and run batch scripts in the cloud.

Each of these workflow features incurs its own cost. Cromwell, for example, is an application currently available in the workspace. If you have one Cromwell application in your workspace, Cromwell costs $296/month when running and $148/month when paused. Each additional Cromwell application costs $148/month, whether running or paused. Expressed hourly, Cromwell costs $0.20/hour + ($0.20/hour x number of Cromwell apps running) when running, and $0.20/hour when paused. Unlike Jupyter applications, Cromwell does not auto-pause, so it requires monitoring: if you do not pause the Cromwell instance, this application will incur high costs.
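
The hourly formula above can be checked against the monthly figures with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, the 740 hours per month is inferred from $148/month at $0.20/hour rather than stated in this article, and the single paused app is modeled as zero running apps.

```python
BASE_RATE = 0.20       # $/hour, from the text above
PER_APP_RATE = 0.20    # $/hour per running Cromwell app, from the text above
HOURS_PER_MONTH = 740  # ASSUMPTION: inferred from $148/month at $0.20/hour


def cromwell_hourly_cost(running_apps):
    """Hourly cost with `running_apps` apps running (0 models a paused app)."""
    return BASE_RATE + PER_APP_RATE * running_apps


def cromwell_monthly_cost(running_apps):
    """Monthly cost if that state is left unchanged for the whole month."""
    return round(cromwell_hourly_cost(running_apps) * HOURS_PER_MONTH, 2)


for apps, label in ((0, "paused"), (1, "one app running"), (2, "two apps running")):
    print(f"{label}: ${cromwell_monthly_cost(apps):.2f}/month")
```

Under these assumptions, the results match the monthly figures quoted above, which shows how quickly an unattended running instance adds up.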

 

To learn more about these tools, please see these support articles: How to use Cromwell in the All of Us Researcher Workbench; Workflows in the All of Us Researcher Workbench: Nextflow and Cromwell; Use dsub in the All of Us Researcher Workbench
