Using, customizing, and optimizing Jupyter cloud environments


Cloud environments provide the computing resources needed to use applications on the Researcher Workbench. For example, Jupyter environments power Jupyter Notebooks and the terminal. They consist of CPUs, RAM, and a storage disk, each of which serves a different aspect of the environment. CPUs, or central processing units, execute instructions and perform calculations. RAM, or random access memory, temporarily stores data that is being actively processed. Finally, the storage disk stores and retrieves data for analysis. Environments are unique to each user in a workspace, so every collaborator can have their own separately customized environment.

This article describes how to customize the main components of cloud environments, how to pause and delete environments, and how to use Dataproc clusters for Hail analyses, as well as how to optimize environments for your research. It pertains specifically to Jupyter environments, since those are currently the only kind that can be customized; however, other Workbench applications use fixed cloud environments that work similarly. We also have a video tutorial that provides some basic background about cloud environments.

 

Creating and customizing Jupyter environments 

You can customize a Jupyter environment by clicking the Jupyter icon in the panel on the right side of your workspace, which opens a menu of environment settings that can be changed. Please note that any environment customization may have an associated cost, as described in the Cost of environments section below. Shown below are the default settings for a General Analysis environment, which is one of two default options (the other being the Hail Genomics Analysis environment). Clicking the Create button in the bottom-right corner creates the associated environment, which generally takes about five minutes.

 

custom env image.png

 

create env.png

 

Central processing units (CPUs)

The central processing unit (CPU) is responsible for executing instructions and performing calculations, making it a vital component of any cloud environment. Increasing the number of CPUs increases performance and task-processing speed: with more CPUs, the environment can handle a higher volume of requests and execute multiple tasks simultaneously, improving efficiency and reducing processing times. You may want to increase the CPU count if your notebooks are running slowly and you'd like to speed up the generation of outputs.

The maximum number of CPUs that can be used is 96.

 

Random access memory (RAM)

RAM, which stands for random access memory, is a type of computer memory that the processor can read and write quickly, used to hold data you are actively working with. Essentially, you'll need to make sure that your RAM is greater than the size of the files you're actively working with. Insufficient RAM is a common cause of "kernel died" errors; to prevent them, you'll generally have to significantly increase the amount of RAM available in your environment. A video below discusses how to determine RAM usage using the GCP console.

The maximum amount of RAM that can be enabled is 624 GB.
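A quick sanity check before loading a large file is to compare its size on disk with the RAM you've provisioned, since the in-memory object is usually at least as large as the file. The sketch below writes a small hypothetical stand-in file (in practice you'd point `path` at your own dataset) and compares sizes using only the standard library.

```python
import os
import sys
import tempfile

# Hypothetical stand-in for a dataset; substitute the path to your own file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"0" * 1_000_000)  # 1 MB of placeholder data
    path = f.name

# Size on disk of the file you intend to load.
file_size_mb = os.path.getsize(path) / 1e6

# Once loaded, the in-memory object is at least as large as the file.
with open(path, "rb") as f:
    data = f.read()
in_memory_mb = sys.getsizeof(data) / 1e6

print(f"on disk: {file_size_mb:.1f} MB, in memory: {in_memory_mb:.2f} MB")
os.remove(path)
```

Note that parsing a file into a richer structure (for example, a dataframe) can inflate memory use well beyond the raw file size, so leave generous headroom.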

 

Storage disk 

In addition to RAM, a separate storage disk needs to be attached to save files while an environment is active. Persistent disks (PDs) are used for General Analysis environments, which are convenient because they can "persist" after an environment is deleted, allowing them to be re-attached when a new environment is created. While PDs are a great way to easily save data that can be retrieved when starting a new environment, we still recommend saving important files to your workspace bucket in case you accidentally delete the PD or it gets corrupted. Both standard and solid-state drive PDs can be used; the solid-state versions are faster, but also cost about 4 times as much.

The maximum size of a PD is 4000 GB. As described below, standard disks are the only option when using Dataproc clusters. 
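To decide whether your current disk is large enough, you can check free space from inside a notebook with the standard library. This is a minimal sketch; the path `"/"` is an assumption, and you may want to point it at your home directory instead.

```python
import shutil

# Query the disk backing the given path; "/" is an assumed mount point.
usage = shutil.disk_usage("/")

free_gb = usage.free / 1e9
total_gb = usage.total / 1e9
print(f"{free_gb:.1f} GB free of {total_gb:.1f} GB")
```

If free space is running low, you can either increase the disk size in the environment panel or move files you no longer need locally into the workspace bucket.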

 

GPUs

GPUs have become increasingly popular due to their ability to accelerate complex computational tasks. While it's still in a beta release, the Workbench offers GPU instances that allow users to leverage the power of GPUs for machine learning, data analysis, and other computationally intensive workloads.

 

Cost of environments

Cloud environments cost money (or credits) to use, with their cost based on settings like CPU count, RAM, and disk size. Please see this article if you're interested in learning more about costs in the Researcher Workbench. When setting up an environment, you can see its cost per hour when active and when paused. If you have credits available, these will also be listed.

 

gae 2.png

 

When you change variables, you'll notice that the cost of the environment while active and paused immediately updates. The goal is to ensure we're as transparent as possible about the costs incurred by your cloud environment.  

 

gae 3.png
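The hourly rates shown in the panel can be turned into a rough monthly estimate. Below is a sketch of that arithmetic; the rates are placeholders, not real Workbench prices, so substitute the active and paused rates the panel shows for your configuration.

```python
# Placeholder hourly rates; read the real numbers from the environment
# panel, which shows cost per hour when active and when paused.
ACTIVE_RATE = 0.20   # $/hour while running (hypothetical)
PAUSED_RATE = 0.01   # $/hour while paused (hypothetical)


def monthly_cost(active_hours_per_day, days=30):
    """Estimate a month's cost if the environment is paused whenever idle."""
    paused_hours_per_day = 24 - active_hours_per_day
    return days * (active_hours_per_day * ACTIVE_RATE
                   + paused_hours_per_day * PAUSED_RATE)


always_on = monthly_cost(24)   # never paused
workday = monthly_cost(8)      # paused outside an 8-hour workday
print(f"always on: ${always_on:.2f}/month, paused when idle: ${workday:.2f}/month")
```

Even with a small paused rate, pausing outside working hours cuts the estimate substantially, which is why we recommend pausing whenever you're not actively working.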

Pausing and deleting an environment

When you're finished actively working in an environment, we recommend pausing or deleting the environment in order to save costs. A paused environment still incurs some cost, because the environment isn't completely gone and can be easily restarted, but that cost is generally significantly lower than when active. Deleting the environment ensures you're no longer charged for it, but you will lose any data that isn't saved to a workspace bucket or PD. Note that Jupyter environments are generally auto-deleted about two weeks after they're created (unless manually deleted before then), so please ensure any critical files are routinely saved to the workspace bucket.

Code in your Jupyter Notebooks will not be deleted when you delete an environment or persistent disk. The notebooks themselves are saved in the workspace bucket, which stores files separately from active or paused environments. 

 

Pausing an environment

Active environments can be paused by clicking the green "pause" sign in the Jupyter environment panel.

 

pause env.png

 

Paused environments can be restarted at any time by clicking the run button.

 

restart env.png

 

Deleting an environment

Environments can be manually deleted by clicking the "delete environment" button at the bottom of the cloud environment panel.

 

delete env.png

 

If you used a General Analysis environment, you'll have the option to save or delete the persistent disk associated with it. If you haven't saved any critical outputs to the workspace bucket, we recommend saving the persistent disk, as this allows you to carry those files over into a new environment. NOTE: Jupyter environments are auto-deleted every one to two weeks, so please make sure important files are transferred to the workspace bucket unless you're using a persistent disk.

 

delete pd.png

Dataproc clusters

Dataproc clusters must be enabled to run Hail, a popular open-source framework for genomic analysis that can be used in the Researcher Workbench. By leveraging Dataproc's scalability and managed infrastructure, users can easily process large-scale genomic datasets with Hail, enabling efficient and parallelized analysis. Unlike General Analysis environments, Dataproc clusters can utilize workers to speed up the processing of some computations. In addition to being required for Hail, these environments can be used for other types of processing, though they're more expensive than General Analysis environments.

Dataproc clusters can be enabled by selecting the "Hail Genomics Analysis" option in the main menu of the Jupyter environment panel.

 

hail genomics analysis.png

 

Unlike with General Analysis environments, standard disks are the only option available when using Dataproc clusters. Since persistent disks aren't supported with Dataproc clusters, please ensure that any important files are always saved in your workspace bucket, as the standard disk is deleted whenever your environment is deleted. Like General Analysis environments, these are deleted automatically every one to two weeks unless manually deleted sooner.

 

Workers

In a Dataproc cluster, workers are the nodes that perform the actual data processing tasks. They execute the code and run the computations required to process the data, so they play a crucial role in the overall performance and efficiency of the cluster, and they can be customized as well. Preemptible workers are a special type of worker node offered at a significantly lower cost than regular workers, but they come with a catch: they can be preempted at any time. Preemption means the resources allocated to a preemptible worker can be reclaimed by the cloud provider if there is demand for them elsewhere. Preemptible workers are particularly useful for short-lived, fault-tolerant tasks that can easily be restarted if interrupted.

Generally, it's recommended that no more than 50% of your total workers are preemptible.  
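The 50% guideline above can be expressed as a small helper when planning a cluster. This is a hypothetical sketch for budgeting, not part of the Workbench or Dataproc API.

```python
def split_workers(total_workers, preemptible_fraction=0.5):
    """Split a worker pool, capping preemptibles at the recommended
    50% of the total (hypothetical planning helper)."""
    preemptible = int(total_workers * min(preemptible_fraction, 0.5))
    regular = total_workers - preemptible
    return regular, preemptible


# e.g. planning a 10-worker cluster
regular, preemptible = split_workers(10)
print(regular, preemptible)
```

Keeping at least half the workers regular means a preemption event slows a job down rather than stalling it outright.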

Optimizing cloud environments using Google Console metrics

When doing any kind of large-scale analysis, we recommend optimizing your environment to make sure you have sufficient CPUs, RAM, and disk space for your work. Additionally, you may want to ensure you're not using more than necessary, as more powerful environments are more expensive. Finally, optimizing RAM can be essential to prevent your kernel from dying.

 

Below is a video from our support team that describes how to use GCP console metrics in order to optimize your environment. 
