Recommendations for processing CRAMs with GATK on the Researcher Workbench


When you run GATK on the Researcher Workbench to process CRAMs, there are some considerations to keep in mind to manage cost. Reading CRAM files incurs egress charges, which are the costs of retrieving (downloading) the data from the cloud. Some tools, including GATK, handle these charges more carefully than others when analyzing CRAMs.

GATK runs an asynchronous prefetcher for cloud data: while retrieving data from the cloud, a separate thread prefetches the next 40 MB (by default) of the file. This greatly improves performance in typical use cases, but it can increase the total data transferred, and therefore the cost, particularly when you run with large numbers of small, non-adjacent intervals scattered widely across the genome. To minimize data transfer cost at the expense of performance in such cases, turn off the cloud prefetcher completely by running GATK with the options --cloud-prefetch-buffer 0 --cloud-index-prefetch-buffer 0. To keep data transfer cost low without giving up prefetching entirely, try running with the minimum of 1 MB of prefetching instead: --cloud-prefetch-buffer 1 --cloud-index-prefetch-buffer 1. For more information, see the GATK documentation.
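As an illustration of where these options go on the command line, here is a minimal sketch. The helper function prefetch_flags, the choice of PrintReads as the tool, and all bucket paths and file names are hypothetical placeholders, not values from the Workbench. The script only prints the command (a dry run) so you can check it before running it against billable data.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print both prefetch options for a buffer size in MB.
# 0 turns the prefetcher off; 1 is the minimum buffer; 40 is the default.
prefetch_flags() {
  local mb="${1:-40}"
  printf -- '--cloud-prefetch-buffer %s --cloud-index-prefetch-buffer %s' "$mb" "$mb"
}

# Placeholder inputs (assumptions): substitute your own paths.
CRAM="gs://example-bucket/sample.cram"
REF="gs://example-bucket/reference.fasta"
INTERVALS="my_intervals.interval_list"

# Dry run: echo the command instead of executing it.
# $(prefetch_flags 0) is left unquoted on purpose so it expands
# into four separate arguments.
echo gatk PrintReads \
  -I "$CRAM" \
  -R "$REF" \
  -L "$INTERVALS" \
  $(prefetch_flags 0) \
  -O subset.bam
```

Changing prefetch_flags 0 to prefetch_flags 1 gives the 1 MB variant. Remove the leading echo to actually run the command.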
