The pharmacogenomics auxiliary dataset includes haplotype calls and predicted phenotypes for 18 genes relevant to human drug metabolism for all samples with short-read whole genome sequencing (srWGS) data. In this article, we will describe the pipeline and our QC processes, and provide additional information about the dataset.
Need additional information?
Pharmacogenomics file paths: Controlled CDR Directory
Getting started with genomic data on the Researcher Workbench: Genomics Featured Workspace
Quality report for the srWGS dataset: All of Us Genomic Quality Report
Table of contents
Introduction
The pharmacogenomics auxiliary dataset includes haplotype calls and predicted phenotypes for 18 genes relevant to human drug metabolism for all samples with srWGS data. Pharmacogenomics, also known as star allele calling, combines pharmacology (the study of drugs) with genomics (the study of genes and their functions). We validated our pipeline on well characterized samples and compared our novel results on All of Us samples to previously observed population allele frequencies and results from long-read sequencing on the same All of Us samples when available.
Pharmacogenomics examines how genetics influence drug response, helping identify markers associated with drug efficacy or toxicity. Testing enables healthcare providers to tailor medications to a patient’s genetic profile, improving outcomes. As a key aspect of precision medicine, pharmacogenomics optimizes treatment plans, dosages, and drug choices to enhance effectiveness and reduce adverse reactions.
Pharmacogenes—genes that influence drug response—are highly polymorphic, meaning they can contain multiple single nucleotide polymorphisms (SNPs) and undergo copy number or structural changes. Determining the exact functional impact of an allele requires phasing or haplotyping of the entire genetic region. Different alleles can exhibit varying levels of activity compared to the reference allele, with this information documented in databases like PharmVar.
This process is sometimes called star-allele calling because many pharmacogenetic haplotypes are assigned unique star-allele identifiers. However, we use the broader term haplotyping, as not all pharmacogenes have defined star alleles. A star-allele diplotype (the combination of haplotypes inherited from both parents), can be used to predict a phenotype using resources such as the Clinical Pharmacogenetics Implementation Consortium (CPIC) guidelines (CPIC website). For some pharmacogenes, a net activity score quantifies how gene activity compares to two reference alleles.
Previously, pharmacogenomics data from the CDRv6 release were generated as part of an All of Us demonstration project and the results are available via a preprint, Frequency of pharmacogenomic variation and medication exposures among All of Us Participants.
Analysis pipeline
The diplotypes and predicted phenotypes are across 18 genes from PharmGKB Tier 1 and Tier2 lists that are supported by the tool Stargazer. We ran Stargazer 2.0.2 on the srWGS variant data, derived from DRAGEN 3.7.8 variant calling software, which improves alignments for reads derived from genes with high homology. The SNP and Indel calls used were filtered using the new VETS joint filtering pipeline, described in the CDRv8 All of Us Genomic Data QC report. A separate analysis pipeline was performed for the gene CYP2D6.
The input VCF was derived from the CDRv8 joint-called VDS, subset to positions containing variants relevant to the allele definitions for our 18 pharmacogenomics genes, then split into single-sample VCFs while retaining homozygous reference calls. Because only SNP and indel data were provided as input to Stargazer, copy number calls are not provided, except for CYP2D6. Stargazer output was post-processed to apply allele function definitions according to CPIC and improve phasing.
Allele definitions are as included in Stargazer 2.0.2. CPIC phenotypes were retrieved from v1 of the API on 2024-06-21.
CYP2D6 pharmacogenomics calling and copy number was performed with the package Cyrius v1.1.1 and run on per-sample cram input. The DRAGEN version and VETS filtering improved the quality of the CYP2B6 calls substantially to the extent that we include it in this release. Structural variation nomenclature was harmonized and phenotypes were applied using the cyp2d6_parser package.
QC results
We performed quality control testing including call rate testing, Genetic Testing Reference Materials (GeT-RM) concordance rate testing, long read concordance evaluation, and per-allele cosine similarity checks. For each gene, we categorized the results as low concordance or high concordance based on their concordance to the long-read WGS callset results and the Stargazer calls on available GET-RM samples. To be included in the high concordance list of genes, Stargazer calls on available GeT-RM samples must be greater than or equal to 90% and lrWGS concordance must be greater than or equal to 90%. We consider CYP2D6 calls to be high concordance based on the high GeT-RM concordance reported in the Cyrius paper.
The list of high concordance genes is CYP2C_CLUSTER, ABCG2, CACNA1S, CFTR, CYP2C9, CYP2D6, G6PD, NUDT15, RYR1, TPMT, VKORC1
The list of low concordance genes is CYP2B6, CYP2C19, CYP3A5, CYP4F2, DPYD, SLCO1B1, UGT1A1
Six samples are missing from one gene - CYP2D6 - due to pipeline issues.
Call rate
We performed call rate testing for all samples for each gene (Table 1). Indeterminate samples may be able to be resolved with an alternate star allele caller like Aldy, or may contain haplotypes that have not been previously characterized.
Table 1. Call rate for each gene
Gene | Call rate | Number of samples |
CYP2C_CLUSTER | 1 | 414830 |
ABCG2 | 1 | 414830 |
CACNA1S | 1 | 414830 |
CFTR | 1 | 414830 |
CYP2B6 | 1 | 414830 |
CYP2C19 | 0.994644 | 414830 |
CYP2C9 | 0.999491 | 414830 |
CYP2D6 | 0.977887 | 414824 |
CYP3A5 | 0.999935 | 414830 |
CYP4F2 | 0.940395 | 414830 |
DPYD | 1 | 414830 |
G6PD | 0.999711 | 414830 |
NUDT15 | 1 | 414830 |
RYR1 | 1 | 414830 |
SLCO1B1 | 0.987817 | 414830 |
TPMT | 0.99994 | 414830 |
UGT1A1 | 0.999496 | 414830 |
VKORC1 | 1 | 414830 |
GeT-RM validations
We performed concordance testing using the Genetic Testing Reference Materials (GeT-RM), maintained by the US Centers for Disease Control. These data include genotype information for pharmacogenes in dozens of cell line samples validated by one or more clinical laboratories. We made adjustments to some of the concordance tests for some genes due to inconsistencies in the data either due to not having updated information to more recently discovered alleles or difficulties of long-range phasing. Details about approaches for each gene are available in the gene-specific details.
For some genes, calls were considered concordant in our analysis when they contained the same variants, but produced different allele definitions because of phasing errors. Where GeT-RM results contain more than two alleles for a sample, diplotypes called by Stargazer that contain any of those alleles are considered a match.
GeT-RM results for CYP2C19 and SLCO1B1 were impacted by a lack of confident homozygous reference genotype output in the 1000 Genomes data file format. For CYP2D6 we rely on published Cyrius evaluations. Note that genes ABCG2, CACNA1S, CFTR, and RYR1 do not currently have published GeT-RM validations.
Table 2. GeT-RM concordance validation
Gene | Concordance | Number of samples |
CYP2C_CLUSTER | 1 | 11 |
CYP2B6 | 0.9204545455 | 88 |
CYP2C9 | 0.9898989899 | 99 |
CYP2C19 | 0.5795454545 | 88 |
CYP3A5 | 1 | 88 |
CYP4F2 | 0.96875 | 64 |
DPYD | 0.8409090909 | 88 |
G6PD | 0.9886363636 | 88 |
NUDT15 | 1 | 4 |
SLCO1B1 | 0.5568181818 | 88 |
TPMT | 0.9772727273 | 88 |
UGT1A1 | 0.984375 | 64 |
VKORC1 | 1 | 88 |
Long read concordance
Many key pharmacogenomics genes have homology to other parts of the reference genome. This can confound short read alignments and reduce variant calling accuracy. Long reads derived from these regions cover more unique sequences on one or both sides of the homology, leading to more confident mapping, alignment, and variant calling.
For selected All of Us samples that have both srWGS and long-read whole genome sequencing (lrWGS) data available, single-sample VCFs containing SNP and indel calls for each were input to Stargazer. Calls were compared for each gene for each sample between the two technologies and results were aggregated per gene. Low concordance is indicative of potentially decreased accuracy with short reads. Additional notes are in the gene-specific details.
The 1005 samples represent the intersection of the All of Us CDRv7 dataset and the CDRv8 srWGS data. DPYD has fewer samples than other genes because of pipeline issues.
Table 3. Long read concordance results
Gene | Concordance | Number of samples |
2C_CLUSTER | 0.968159204 | 1005 |
ABCG2 | 0.9820895522 | 1005 |
CACNA1S | 1 | 1005 |
CFTR | 0.992039801 | 1005 |
CYP2B6 | 0.8616915423 | 1005 |
CYP2C19 | 0.8955223881 | 1005 |
CYP2C9 | 0.9701492537 | 1005 |
CYP3A5 | 0.8447761194 | 1005 |
CYP4F2 | 0.8567164179 | 1005 |
DPYD | 0.8694779116 | 996 |
G6PD | 0.9313432836 | 1005 |
NUDT15 | 1 | 1005 |
RYR1 | 1 | 1005 |
SLCO1B1 | 0.8955223881 | 1005 |
TPMT | 0.9910447761 | 1005 |
UGT1A1 | 0.8686567164 | 1005 |
VKORC1 | 0.968159204 | 1005 |
all_genes | 0.9224047343 | 18081 |
Per-allele cosine similarity
As an additional check, population allele frequencies were compared between our results and the published results from CPIC available through PharmGKB (retrieved June 2024). We used a cosine similarity metric based on vectors of per-population allele frequencies for computed versus published. These tables are available in the PDF at the bottom of this article. Cosine similarity can be underestimated in cases where the published results are derived from a comparatively small cohort size. There are also discrepancies in how the populations are defined for All of Us versus CPIC, especially for African genetic ancestry and Americas samples. CPIC allele frequencies are often compiled from multiple studies and assign participants according to genetic ancestry categories as described on the PharmGKB site.
Note that when the allele frequency across all the intersecting populations is zero or not available for one dataset or the other then the cosine similarity is zero.
Data description
Star allele calls are provided as per-gene TSVs. The fields that are within each TSV vary depending on the gene and are described in Table 4. All TSVs contain the fields person_id, gene, genotype, and phenotype. The TSVs can be concatenated easily, but are provided separately for memory usage considerations. We recommend researchers start by looking at the phenotype field, which contains a pre-computed predicted outcome of the genotype. The genotype field is quite variable depending on the gene.
The diplotype is used to predict the phenotype using CPIC guidelines. Drug transporters and non-drug metabolizing enzymes may not have easily inferred phenotypes. For some genes, especially in cases where the relevant CPIC drug guidelines include additional genes, phenotype information may be given as "variant present" or "variant absent". Example phenotypes are given in gene-specific details.
For some pharmacogenes, a net activity score quantifies how gene activity compares to two reference alleles. They may vary based on the role of the enzyme that they are associated with. The activity score quantifies the activity of the enzyme. For example, activity scores are only appropriate for certain drug-metabolizing enzymes, such as those in the cytochrome P450 family where activity scores are defined by CPIC. There can be a possible phenotype and possible genotype field for some genes. There are two additional columns unique to CYP2D6 output, which are "copy_number" and "cyrius_filter".
Table 4. Pharmacogenomics possible fields file description
Field Name | Key? | Type | Nullable? | Example Value | Notes |
person_id | yes | String | No | 1000055 | |
gene | no | String | No | 2C_CLUSTER | |
genotype | no | String | No | *1/*1 | Various formats appear for this field depending on the gene. Another example value: “rs9923231 reference (C)/rs9923231 reference (C)" |
phenotype | no | String | No | Normal Function | Predicted phenotype based on the diplotype using CPIC guidelines. Example phenotypes are given in gene-specific details. |
activity_score | no | Float | No | 2.0 | Defined by CPIC for drug-metabolizing enzymes. Measures gene activity relative to someone with two reference alleles. Not available for all genes. |
possible_genotype | no | String | Yes | Not available for all genes. | |
possible_phenotype | no | String | Yes | Not available for all genes. | |
copy_number | no | Float | No | 2.0 | Only available for CYP2D6 output. |
cyrius_filter | no | String | No | PASS | Only available for CYP2D6 output. |
Column Explanations:
- Field name -- The name of the field. In tsv files, this will appear on the first row of the file.
- Type -- Data type.
- Key? -- Whether this field makes up a unique key for the row. Note that all key fields together make a unique key for the row.
- Notes -- Any other relevant information.
High concordance genes
CYP2C_CLUSTER
CYP2C_CLUSTER refers to the non-coding SNP rs12777823 upstream of the CYP2C cluster of genes. Only 11 samples were available to compare with the GeT-RM results, but given the perfect long read concordance for those samples and good coverage over this region, we include the 2C cluster in our high concordance results.
Gene name: 2C_CLUSTER. Genotype values for CYP2C_CLUSTER are formatted as “rs12777823 reference (G)/rs12777823 reference (G)”. Phenotypes are denoted as "Variant Present" or "Variant Absent".
ABCG2
GeT-RM validation is not available for this gene. However, ABCG2 only contains a single SNP. Due to high long read call concordance and lack of phasing challenges, this gene is included in the high concordance release.
Gene name: ABCG2. Genotype values for ABCG2 are formatted as “rs2231142 reference (G)/rs2231142 reference (G)”. Phenotypes are denoted as "Normal Function", "Decreased Function", or "Poor Function".
CACNA1S
GeT-RM validation is not available for this gene. Non-reference alleles in CACNA1S are rare with the most common allele reported at 0.01% allele frequency in the East Asian genomic ancestry population in gnomAD. c.3257G>A variant comparison allele frequencies are based on gnomAD for better correspondence.
Gene name: CACNA1S. Genotype values for CACNA1S are formatted as “Reference/Reference”. Phenotypes are denoted as "Malignant Hyperthermia Susceptibility" or "Uncertain Susceptibility".
CFTR
Phenotypes are given as Ivacaftor-responsive or not, in accordance with CPIC guidelines. CFTR alleles can also be categorized by class according to Table 1 of the CPIC guidelines in order to relate variants to treatment strategies. Some CFTR allele frequency cosine similarities are high though AFs appear low because of rounding in the PDF output of the eur.cpic column.
Gene name: CFTR. Genotype values for CFTR are formatted as “Reference/Reference”. Phenotypes are denoted as "normal_function", "decreased_function", or "unknown_function". An activity score is included for CFTR.
CYP2C9
The only discrepancy in the GeT-RM validation is that Stargazer calls a *61 allele that was not included in any of the GeT-RM assays. In sample HG01190 that allele is instead called as *2 in GeT-RM. Concordance is still high enough to merit this gene being included in our high concordance list.
Six CYP2C9 alleles have no sensitivity in our callset due to filtered SNPs in the joint-called VDS. Of these, *7, *59, *66, and *79 are classified as unknown function and *37 and *46 are classified as decreased function, but very rare. CPIC has no observations of these alleles and gnomAD shows missing decreased function alleles to be very rare:
Gene name: CYP2C9. Genotype values for CYP2C9 are formatted as “*1/*2”. Phenotypes are denoted as "Normal Metabolizer", "Intermediate Metabolizer", "Poor Metabolizer”, or “Indeterminate”. An activity score is included for CYP2C9.
CYP2D6
The Cyrius publication reports 99% diplotype accuracy on 144 GeT-RM samples. However, that validation is not reproduced here.
Gene name: CYP2D6. Genotype values for CYP2D6 are formatted as “*2/*4”. An example of a phenotype value is "CYP2D6 Normal Metabolizer". An activity score is included for CYP2C9. The columns "copy_number" and "cyrius_filter" are unique to CYP2D6 output. Filters are applied by the Cyrius tool used to produce CYP2D6 diplotype calls. Filters include: PASS, Not_assigned_to_haplotypes, More_than_one_possible_genotype, and LowQ_high_CN.
G6PD
Note that G6PD is located on chrX and diplotypes are modified based on the participant's X chromosome count derived in the genomic metrics auxiliary data (see the Controlled CDR Directory). If multiple alleles are reported for a sample genotyped as XY, then that sample's diplotype and phenotype are given as indeterminate.
G6PD concordance with GeT-RM and lrWGS each exceed 90%, so this gene is included in the high concordance list.
Gene name: G6PD. Genotype values for G6PD are formatted as “A”. Phenotypes are given as "Normal", "Variable", "Deficient", or "Indeterminate". As of this publication, phenotype terms for G6PD have not been standardized by CPIC.
NUDT15
The Pratt 2022 paper with NUDT15 and TPMT results only genotyped NUDT15 in 18 samples. Only 4 of those samples overlapped with NYGC 1000 Genomes samples. However, concordance with long read calls is 100% and allele frequency cosine similarity is very high for the three populations and nine alleles that overlap CPIC. This gene is included in the high concordance list.
Gene name: NUDT15. Genotype values for NUDT15 are formatted as “*1/*1”. Phenotypes are denoted as "Normal Metabolizer", "Intermediate Metabolizer", "Poor Metabolizer”, or “Indeterminate”.
RYR1
The called genotypes have no sensitivity to c.14512C>G, which does show a phenotype of increased activity. However this allele appears to be very rare. CPIC does not have population allele frequency estimates for this allele, but this variant is absent from gnomAD 4.1.
RYR1 has excellent long read concordance and good allele frequency similarity for many alleles.
Gene name: RYR1. Genotype values for RYR1 are formatted as “Reference/Reference”. An example of a phenotype value for RYR1 is "Uncertain Susceptibility".
TPMT
The called genotypes have no sensitivity to *28. This allele is not in the GeT-RM validations, is classified as unknown function, absent in CPIC, and very low AF in gnomAD. The AF in gnomAD in the EUR genetic ancestry group is 0.000002542.
TPMT has excellent GeT-RM concordance and long read concordance and good allele frequency similarity for many alleles.
Gene name: TPMT. Genotype values for TPMT are formatted as “*1/*1”. Phenotypes are denoted as "Normal Metabolizer", "Intermediate Metabolizer", "Poor Metabolizer”, or “Indeterminate”.
VKORC1
GeT-RM validation assessed VKORC1 as three separate assays for the three known VKORC1 SNPs. Stargazer run on GVCF input (as used in GeT-RM validations) fails to phase the rs72547529/V66M and rs61742245/D36Y SNPs. GeT-RM validations were compared only for the rs9923231/*S1 allele calls. The All of Us participants are only genotyped for the rs9923231
allele. With respect to the rs9923231, validation concordance is high and population allele frequency is in very good agreement so we include this in the list of high concordance genes.
Gene name: VKORC1. Genotype values for VKORC1 are formatted as “rs9923231 reference (C)/rs9923231 reference (C)". Phenotypes are given as "Variant Absent", "Variant Present", or "Indeterminate".
Low concordance genes
CYP2B6
CYP2B6 concordance was evaluated in comparison with updated GeT-RM calls from the Aldy publication (Hari et al. 2023). GeT-RM calls were updated according to the logic detailed in the supplemental notes. For CYP2B6, several diplotype calls were updated to alleles that were called by a consensus of evaluated tools where the called alleles included all the variants in the GeT-RM allele. A few other CYP2B6 diplotypes were updated because none of the tools evaluated confirmed the GeT-RM allele.
CYP2B6 long read concordance is poor. This may be a result of the high degree of self-similarity of the reference genome for parts of this gene, as evidenced by the "self-chain" track in the UCSC Genome Browser. The most common error mode seen in the lrWGS versus srWGS comparison is that lrWGS prefers *6 but srWGS calls *9.
A diagram from the CAVE viewer comparing CYP2B6 alleles at PharmVar shows some of the distinctions.
Population allele frequency comparisons are poor for *19, *20, *21, and *28, which are only present in CPIC in the sub-saharan African population. The All of Us genetic ancestry groups do not distinguish sub-saharan African from African ancestry.
Gene name: CYP2B6. Genotype values for CYP2B6 are formatted as “*1/*6”. An example of a phenotype value for CYP2B6 is “normal_metabolizer”. CYP2B6 also includes an activity_score.
CYP2C19
CYP2C19 concordance was evaluated in comparison with updated GeT-RM calls from the Aldy publication, similar to CYP2B6. CYP2C19 GeT-RM concordance is low due to data format issues with the 1000 Genomes validation data that don't apply to the All of Us pipeline, but lrWGS concordance is ~90%. This gene is included in the low concordance list.
Note that *38 is the "reference" allele with no variants compared to the GRCh38 reference, not *1, which has a SNP at chr10:94842866:A>G.
Gene name: CYP2C19. Genotype values for CYP2C19 are formatted as “*1/*38”. Phenotypes are denoted as "Normal Metabolizer", "Intermediate Metabolizer", "Poor Metabolizer”, or “Indeterminate”.
CYP3A5
This gene is included in the low concordance list.
Gene name: CYP3A5. Genotype values for CYP3A5 are formatted as “*3/*3”. Phenotypes are denoted as "Normal Metabolizer", "Intermediate Metabolizer", "Poor Metabolizer”, or “Indeterminate”. CYP3A5 includes a possible_genotype field.
CYP4F2
Variants on the *2 and *3 alleles are much farther apart than the insert size for short read sequencing, so phasing is challenging.
According to PharmVar "Although CYP4F2 is part of a CPIC guideline, function has not been assigned by CPIC."
Gene name: CYP4F2. Genotype values for CYP4F2 are formatted as “*1/*1”. An example of a phenotype value is "CYP4F2 Variant Absent".
DPYD
The DPYD gene is too large to phase variants across its entirety. As such, the nomenclature typically doesn't follow the star allele convention and variants are not mutually exclusive.
DPYD concordance was evaluated in comparison with update GeT-RM calls from the Aldy publication, similar to CYP2B6. Cosine similarity metrics are based on the first two called alleles, which may result in underestimating the similarity. DPYD calls have no sensitivity to c.46C>G, c.1403C>A, or c.2582A>G due to filtered SNPs, however these are all classified as normal function alleles.
Activity scores are given to quantify DPYD function. In scenarios where a sample carries >2 alleles for DPYD, CPIC guideline recommends using the 2 alleles with the lowest activity to assign activity score for the gene. Therefore a genotype can contain >2 alleles. Qualitative phenotypes are given based on activity scores.
Gene name: DPYD. Genotype values for DPYD are formatted as “c.496A>G/Reference”. An example of a phenotype value is "DPYD Normal Metabolizer”. DPYD includes an activity score formatted as “DPYD Activity Score 2.0”.
SLCO1B1
SLCO1B1 is a large gene containing variants that can be tens of thousands of bases apart. As such, the phasing of variants can be inconclusive. The PCR- and array-based methods used to generate the GeT-RM SLCO1B1 data do not convey any sample-specific phasing information. More information about the GeT-RM assays is available at the CDC website.
The GeT-RM validation performed here allowed for matches if the sample contained the correct variants, but the phasing differed from the GeT-RM call. Specifically:
- *1/*15 is cis version of *5/*37
- *1/*14 is cis version of *4/*37
- *21 became suballele of *20 and then *1/*20 is cis version of *19/*37
Phasing differences do not affect the predicted phenotype in these cases. Additionally, Stargazer reported *1A allele is updated to *1 and *1B becomes *37 to match the CPIC definitions.
Even with all these improvements, the GeT-RM concordance is about 55%. However, concordance with lrWGS calls is nearly 90%, so we believe most of the GeT-RM issues to be attributable to the older assay data.
Gene name: SLCO1B1. Genotype values for SLCO1B1 are formatted as “*1/*15”. Phenotypes are denoted as "Normal Function", "Decreased Function", or "Poor Function". SLCO1B1 includes a possible_genotype field and a possible_phenotype field.
UGT1A1
This gene has not been curated by PharmVar. The SNP defining the *60 allele is far upstream of the UGT1A1 transcript start site and is not included in Stargazer or PharmCAT. *60 GeT-RM calls were converted to reference for the purposes of the evaluation. *80 allele calls present in Stargazer were only produced by one GeT-RM assay and not present in any GeT-RM consensus calls. *80 calls were converted to reference for the purposes of validation. These changes helped UGT1A1 calls achieve a GeT-RM concordance of 98%. However, long read concordance is less than 90%. This may be due to low long-read sensitivity caused by coverage dropout, especially given the TATA repeats in this gene. Due to the low lrWGS concordance, UGT1A1 is included in the low concordance release.
Gene name: UGT1A1. Genotype values for UGT1A1 are formatted as “*1/*1”. Phenotypes are denoted as "Normal Metabolizer", "Intermediate Metabolizer", "Poor Metabolizer”, or “Indeterminate”. UGT1A1 includes a possible_genotype field and a possible_phenotype field.
Comments
0 comments
Article is closed for comments.