Accessing Files in the Workspace Bucket or Persistent Disk

The Researcher Workbench offers two general storage options: the workspace bucket and the persistent disk (PD). This article outlines methods to access saved files in each.

Here you may find answers to questions such as:

  • Where can I see a list of items saved to the workspace bucket vs. PD?
  • How can I delete files in the workspace bucket vs. PD?
  • What are common commands to access items in the workspace bucket?

To learn more about storage options and how to save files, please see these support resources: Storage Options Explained; Persistent Disk; How do I access the workspace bucket and copy data to and from it?; How to work with environment variables tutorial notebook (Researcher Workbench login required).

Accessing & Deleting Files - Workspace Bucket

There are multiple ways to view the items currently stored in your workspace bucket, including code snippets and direct access to files via the Jupyter interface.

Accessing files in Workspace Bucket

The workspace buckets available in the All of Us Researcher Workbench are part of Google Cloud Storage. As such, they use Google Cloud’s infrastructure and access methods. Below are several methods for accessing files in the workspace bucket. 

1) Using "Snippets" in Jupyter Notebook 

To see a list of items in your workspace bucket, the easiest way is to use our provided snippets in an active R or Python Jupyter Notebook, as outlined in this support article.

In an active notebook, click the 'Snippets' option towards the right of the top notebook menu. From the drop-down menu, select 'All of Us R and Cloud Storage snippets' or 'All of Us Python and Cloud Storage snippets', which reveals a side menu. In that side menu, click the 'Setup' snippet, which loads into a cell. Then select the cell beneath, navigate to the same 'Cloud Storage snippets' submenu, and hover over '(2) List objects in Workspace Bucket' to reveal 'list_objects_in_bucket.R' or 'list_objects_in_bucket.py'. Click it to load the snippet into the cell, then run both cells to list the objects in your workspace bucket. Here's a screenshot of the menu selection:

snippet.png

2) View bucket files via the Google Storage Interface

Alternatively, you can navigate to the 'ABOUT' tab of your workspace, click 'File management', and view the files in your workspace bucket directly in Google's Storage interface. Note: You cannot download or delete files from the Google Storage interface; you can only check whether a file is saved in the bucket. To download files, use the Jupyter interface, as outlined in this support article.

3) Using command-line tools

The Google Cloud SDK (Software Development Kit) is a set of tools that you can use to manage resources and applications hosted on Google Cloud Platform (GCP). It includes the gsutil command-line tool, which is designed for manipulating data in Google Cloud Storage, such as the workspace buckets in the All of Us Researcher Workbench. To access files stored in workspace buckets, use the gsutil command.

 

Using gsutil for Bucket Access:

You can use gsutil to do a wide variety of bucket and object management tasks, including:

  • Listing (ls) buckets and objects: gsutil ls
  • Moving (mv) objects: gsutil mv
  • Copying (cp) objects: gsutil cp
  • Removing (rm) objects: gsutil rm

To learn more about gsutil please see here

 

Common System Commands in Jupyter Notebooks: 

Below are common commands for copying files to the workspace bucket and accessing those files.

To get bucket name

To view the items in your workspace bucket, you need to know the path to your bucket. These commands can be run to get the bucket path and make it a variable for use in the ls command.

Os module

import os
my_bucket = os.getenv('WORKSPACE_BUCKET')
my_bucket

Using %env

%env WORKSPACE_BUCKET

By using !

!echo $WORKSPACE_BUCKET

In R

my_bucket <- Sys.getenv('WORKSPACE_BUCKET')
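
If the variable is not set (for example, when the code runs outside the Workbench), os.getenv returns None, and later gsutil commands fail with less obvious errors. The following is a minimal sketch of a guard around the lookup; get_workspace_bucket is a hypothetical helper name used only for illustration, not something the Workbench provides.

```python
import os


def get_workspace_bucket():
    """Return the workspace bucket path, or raise if it is not set.

    Hypothetical helper for illustration; not part of the Workbench.
    """
    bucket = os.getenv("WORKSPACE_BUCKET")
    if not bucket:
        raise RuntimeError("WORKSPACE_BUCKET is not set in this environment")
    # Drop any trailing slash so paths can be built as f"{bucket}/name"
    return bucket.rstrip("/")
```

In a notebook, you would call my_bucket = get_workspace_bucket() once and reuse my_bucket in later cells.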

 

To copy file to directory

In Python, you can use the os.system() function or the ! (exclamation mark) to run these commands. For example:

import os
os.system("gsutil cp local_file.txt gs://your-bucket-path/")
# or
!gsutil cp local_file.txt gs://your-bucket-path/

In R, you can use the system() function to execute system commands, including gsutil:

system("gsutil cp local_file.txt gs://your-bucket-path/")
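
Note that os.system() only returns an exit status and does not raise an error if the copy fails. As an alternative, the sketch below uses Python's subprocess.run with check=True so that a failed gsutil call raises an exception. The function names build_copy_command and copy_to_bucket are hypothetical, and the sketch assumes the file should land at the top level of the workspace bucket.

```python
import os
import subprocess


def build_copy_command(source, destination):
    """Return the gsutil cp command as an argument list."""
    return ["gsutil", "cp", source, destination]


def copy_to_bucket(local_path, bucket=None):
    """Copy a local file to the top level of the workspace bucket.

    Hypothetical helper; assumes gsutil is on the PATH, as it is in the
    Workbench notebook environment.
    """
    bucket = bucket or os.environ["WORKSPACE_BUCKET"]
    destination = f"{bucket.rstrip('/')}/{os.path.basename(local_path)}"
    # check=True raises CalledProcessError if gsutil exits with an error
    subprocess.run(build_copy_command(local_path, destination), check=True)
    return destination
```

For example, copy_to_bucket("local_file.txt") would copy the file and return the full gs:// path it was written to.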

To list objects in bucket

In Python

!gsutil ls {my_bucket}
# or
os.system("gsutil ls gs://your-bucket-path/")

In R

system(paste0("gsutil ls -r ", my_bucket), intern=T)
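
The listing commands above print their results to the notebook. If you want the object paths as a Python list (for example, to filter by file type), one option is to capture the output of gsutil ls. This is a minimal sketch; parse_listing and list_bucket are hypothetical names, and the sketch assumes gsutil is available on the PATH.

```python
import os
import subprocess


def parse_listing(stdout):
    """Split gsutil ls output into a list of paths, skipping blank lines."""
    return [line.strip() for line in stdout.splitlines() if line.strip()]


def list_bucket(bucket=None, recursive=False):
    """Return the lines printed by gsutil ls for the workspace bucket."""
    bucket = bucket or os.environ["WORKSPACE_BUCKET"]
    command = ["gsutil", "ls"] + (["-r"] if recursive else []) + [bucket]
    # capture_output=True collects stdout instead of printing it
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return parse_listing(result.stdout)
```

With recursive=True, gsutil may also print directory header lines ending in a colon; this sketch keeps them, so filter them out if you only want object paths.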

 

Delete files from Workspace Bucket

To delete objects in your workspace bucket from a Jupyter notebook with an R or Python kernel, you need to know the exact path of the object you want to delete: you need the path to your workspace bucket, the directory or directories in which the object is stored, and the exact name of the object or a generic name convention or filetype if you want to delete multiple similar objects.

Here are some example R commands to delete objects from your workspace bucket. In Python, you can use the os.system() function or the ! (exclamation mark) to run these commands. Note: please ensure you use names relevant to your directories and files and remove all the square brackets before executing:

# Setup:
library(tidyverse)

# Get the bucket name
my_bucket <- Sys.getenv('WORKSPACE_BUCKET')

# Delete one object from your workspace bucket (replace variables and [] with correct names):
system(paste0("gsutil rm ", my_bucket, "/[directory]/[object_name.txt]"), intern=T)

# Delete multiple objects of the same file type from your workspace bucket
# (replace variables and [] with correct names):
system(paste0("gsutil rm ", my_bucket, "/[directory]/*[.txt]"), intern=T)

# Delete multiple objects of the same name with different file types from your workspace bucket
# (replace variables and [] with correct names):
system(paste0("gsutil rm ", my_bucket, "/[directory]/[object_name].*"), intern=T)
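
For Python users, the R commands above might translate to something like the following sketch. The helper names and the dry_run parameter are hypothetical additions for illustration: with dry_run=True (the default here), the command is only printed, so you can check the path before anything is deleted. Because the command is passed as an argument list, the wildcard reaches gsutil unchanged and gsutil expands it against the bucket.

```python
import os
import subprocess


def build_remove_command(bucket, directory, pattern):
    """Return the gsutil rm command for objects matching pattern."""
    target = f"{bucket.rstrip('/')}/{directory.strip('/')}/{pattern}"
    return ["gsutil", "rm", target]


def remove_from_bucket(directory, pattern, bucket=None, dry_run=True):
    """Delete matching objects from the workspace bucket.

    Hypothetical helper; deletion only happens when dry_run=False.
    """
    bucket = bucket or os.environ["WORKSPACE_BUCKET"]
    command = build_remove_command(bucket, directory, pattern)
    if dry_run:
        print("Would run:", " ".join(command))
        return command
    # check=True raises CalledProcessError if gsutil reports an error
    subprocess.run(command, check=True)
    return command
```

For example, remove_from_bucket("results", "*.txt") prints the command for every .txt object in a hypothetical results directory, and adding dry_run=False performs the deletion.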

 

Accessing & Deleting Files - Persistent Disk 

Viewing the objects saved on your persistent disk (PD) follows a similar structure to the workspace bucket. You can list objects with commands such as:

# List all objects in your working directory of your Persistent Disk:
system("ls ./*", intern=T)

# List all objects in a directory of your PD:
system("ls ./[directory]/*", intern=T)
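
In a Python notebook, the same listing can be done without a shell command, since the PD is an ordinary file system. The sketch below uses the standard pathlib module; list_files is a hypothetical helper name, and the default directory "." refers to the current working directory.

```python
from pathlib import Path


def list_files(directory=".", pattern="*"):
    """Return sorted names of regular files in directory matching pattern."""
    return sorted(p.name for p in Path(directory).glob(pattern) if p.is_file())
```

For example, list_files() lists files in the working directory, and list_files("my_directory", "*.txt") lists only .txt files in a hypothetical directory named my_directory.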

You can also see a list of all objects in the working directory of your Persistent Disk in the Jupyter Menu by clicking 'File' and then 'Open' in the top menu of an active notebook.

Alternatively, in an active notebook, you can right-click the 'Jupyter' icon in the top left and open it in a new tab. This takes you to the main page of the Jupyter menu, which is also part of your PD (you can confirm the location by running system("pwd", intern=T)). From there, click 'workspaces' and then the name of your active workspace to reach the same page you would see after selecting 'File' and then 'Open'.

 

Delete files from Persistent Disk

To delete files from your persistent disk, you use the same command structure, but a period (.) replaces the 'my_bucket' variable to reference the location of the files on disk rather than a bucket path. Unless you saved your data to a specific directory on your persistent disk, it is saved to your working directory, which can be referenced with a period. Because you do not need to reference your bucket path, you do not need to set the 'my_bucket' variable. And since . is not a variable, there is nothing to paste, so you can drop the paste0 command. Your PD, though based on GCP architecture, is not a Google bucket and will not recognize gsutil commands. Please see the example below using R:

# Setup (optional; system() is part of base R):
library(tidyverse)

# Delete one object from your PD (replace variables and [] with correct names):
system("rm ./[object_name.txt]", intern=T)

# Delete multiple objects of the same file type from your PD
# (replace variables and [] with correct names):
system("rm ./*[.txt]", intern=T)

# Delete multiple objects of the same name with different file types from your PD
# (replace variables and [] with correct names):
system("rm ./[object_name].*", intern=T)
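
In Python, a comparable approach is to match files with pathlib and delete them with unlink(). The sketch below is illustrative only: remove_files and its dry_run parameter are hypothetical, and by default the function only reports what would be deleted so you can check the pattern first.

```python
from pathlib import Path


def remove_files(pattern, directory=".", dry_run=True):
    """Delete regular files in directory matching pattern.

    Returns the names of the matched files. Nothing is deleted unless
    dry_run=False is passed explicitly.
    """
    matches = sorted(p for p in Path(directory).glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()  # permanently removes the file from the PD
    return [p.name for p in matches]
```

For example, remove_files("*.txt") returns the .txt files in the working directory without deleting them, and remove_files("*.txt", dry_run=False) deletes them.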

 

Please refer to this article to learn more about persistent disks. 
