In this article, you will find helpful resources for using Jupyter Notebooks within the Researcher Workbench. For more general information about Jupyter Notebooks, visit the beginner guide here. Within notebooks, users can currently analyze data using the following programming languages: R, Python, and SQL. When analyzing data in Jupyter Notebooks, consider applying query optimization to reduce computational costs. Below are suggestions for SQL optimization; another helpful article on SQL optimization is linked here.
Use of Workspace Buckets to save your files in notebooks
It is important to permanently save your files in Workspace Buckets and to know how to load these data back into your notebook for continued analysis. Doing this will save notebook loading time and make your queries more efficient, further reducing the use of your available credits. Please watch the linked video for detailed instructions on how to do this.
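As a quick reference, here is a minimal sketch of the save-and-restore pattern from a Python notebook. It assumes the standard Workbench environment, where the bucket path is exposed in the WORKSPACE_BUCKET environment variable and the gsutil command-line tool is available; the file name is illustrative.

import os
import subprocess

# Workspace bucket path (e.g., gs://...), exposed by the Workbench environment.
bucket = os.getenv('WORKSPACE_BUCKET')

# Save a local file from the notebook VM into the bucket for permanent storage.
subprocess.run(['gsutil', 'cp', 'my_results.csv', f'{bucket}/data/'], check=True)

# In a later session, copy the file back to continue the analysis.
subprocess.run(['gsutil', 'cp', f'{bucket}/data/my_results.csv', '.'], check=True)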
Use of OHDSI (Observational Health Data Sciences and Informatics) Standard Clinical Data Tables (GitHub link)
These tables show core information about the clinical events that occur longitudinally during valid Observation Periods for each Person, as well as demographic information for the Person. The image below provides an entity-relationship diagram highlighting the tables within the Standardized Clinical Data portion of the Observational Medical Outcomes Partnership (OMOP) Common Data Model.
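In practice, each of these clinical event tables joins back to the person table through person_id. Below is a minimal sketch of that pattern from a Python notebook; it assumes the pandas-gbq integration (pandas.read_gbq) and the WORKSPACE_CDR environment variable from the standard Workbench setup, and the column choices are illustrative.

import os
import pandas

# CDR dataset name, exposed by the Workbench environment (assumed).
dataset = os.getenv('WORKSPACE_CDR')

# Each clinical event table (condition_occurrence, measurement, etc.)
# links to demographic data in person via person_id.
conditions = pandas.read_gbq(f"""
    SELECT p.person_id, p.gender_concept_id, c.condition_start_date
    FROM `{dataset}.person` p
    INNER JOIN `{dataset}.condition_occurrence` c
      ON p.person_id = c.person_id
    LIMIT 10
""")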
Select fields instead of using SELECT *
When running exploratory queries, many SQL developers use SELECT * (read as “select all”) as a shorthand to query all available data from a table. However, if a table has many fields, this increases your costs by querying a lot of unnecessary data.
Using a SELECT statement that names only the fields you need will query only the data required for your research.
Inefficient:
SELECT *
FROM `{dataset}.measurement`
Efficient:
SELECT
person_id,
measurement_date,
measurement_concept_id,
value_as_number
FROM `{dataset}.measurement`
The second query is much cleaner and pulls only the required fields from the measurement table.
To keep an index of all tables and field names, query a system view such as INFORMATION_SCHEMA (BigQuery and others) or ALL_TAB_COLUMNS (Oracle).
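Below is a minimal sketch of such an index query against BigQuery from a Python notebook, again assuming pandas.read_gbq and the WORKSPACE_CDR environment variable from the standard Workbench setup.

import os
import pandas

# CDR dataset name, exposed by the Workbench environment (assumed).
dataset = os.getenv('WORKSPACE_CDR')

# List every table and column in the dataset as a quick schema reference.
schema_index = pandas.read_gbq(f"""
    SELECT table_name, column_name, data_type
    FROM `{dataset}`.INFORMATION_SCHEMA.COLUMNS
    ORDER BY table_name, ordinal_position
""")
schema_index.head()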
Avoid SELECT DISTINCT (click here for a helpful article)
SELECT DISTINCT is one way to remove duplicates from a query. It works by grouping all fields in the query to produce distinct results; accomplishing this, however, requires a large amount of processing power. Additionally, data may be grouped to the point of being inaccurate. To avoid using SELECT DISTINCT, select more fields to create unique results.
Inefficient and inaccurate example:
SELECT
DISTINCT person_id
FROM `{dataset}.measurement`
The query above does not account for people with more than one measurement type (e.g., height and weight): each person ID is grouped into a single row. In large databases, grouping a large number of person IDs will cause this query to run slowly.
Efficient and accurate:
SELECT
person_id
, measurement_concept_id
, measurement_date
FROM `{dataset}.measurement`
By adding more fields, unduplicated records are returned without using SELECT DISTINCT. The database does not have to group any fields, and the number of records is accurate.
Create joins with INNER JOIN (not WHERE)
Some SQL developers prefer to make joins with WHERE clauses, such as the following:
SELECT
m.person_id,
m.measurement_source_concept_id,
m.measurement_date,
p.gender_concept_id
FROM `{dataset}.person` p, `{dataset}.measurement` m
WHERE p.person_id = m.person_id
Use INNER JOIN instead:
SELECT
m.person_id,
m.measurement_source_concept_id,
m.measurement_date,
p.gender_concept_id
FROM `{dataset}.measurement` m
INNER JOIN `{dataset}.person` p ON p.person_id = m.person_id
Use WHERE instead of HAVING to define filters
The goal of an efficient query is to pull only the required records from the database. Per the SQL Order of Operations, HAVING statements are calculated after WHERE statements. If the intent is to filter a query based on conditions, a WHERE statement is more efficient.
SELECT
m_ext.src_id,
COUNT(1)
FROM `{dataset}.measurement` m
INNER JOIN `{dataset}.measurement_ext` m_ext on
m.measurement_id=m_ext.measurement_id
GROUP BY m_ext.src_id
HAVING m.measurement_date >= '2020-05-16'
In comparison, a WHERE clause limits the number of records pulled before grouping (note that many databases will reject the HAVING version above outright, because measurement_date is neither grouped nor aggregated):
SELECT
m_ext.src_id,
COUNT(1)
FROM `{dataset}.measurement` m
INNER JOIN `{dataset}.measurement_ext` m_ext on
m.measurement_id=m_ext.measurement_id
WHERE m.measurement_date >= '2020-05-16'
GROUP BY m_ext.src_id
HAVING should only be used when filtering on an aggregated field.
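For instance, here is a minimal sketch of a legitimate HAVING that filters on the aggregate itself; the 1,000-row threshold is arbitrary, and pandas.read_gbq plus WORKSPACE_CDR are assumed as before.

import os
import pandas

# CDR dataset name, exposed by the Workbench environment (assumed).
dataset = os.getenv('WORKSPACE_CDR')

# HAVING filters on the aggregate: keep only sites contributing
# more than 1,000 measurement rows.
site_counts = pandas.read_gbq(f"""
    SELECT m_ext.src_id, COUNT(1) AS n
    FROM `{dataset}.measurement` m
    INNER JOIN `{dataset}.measurement_ext` m_ext
      ON m.measurement_id = m_ext.measurement_id
    GROUP BY m_ext.src_id
    HAVING COUNT(1) > 1000
""")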
Use wildcards at the end of a phrase only
When searching plaintext data, such as cities or names, wildcards create the widest search possible. However, the widest search is also the most inefficient search.
When a leading wildcard is used, especially in combination with an ending wildcard, the database is tasked with searching all records for a match anywhere within the selected field.
Consider the query below to count rows by EHR site:
SELECT
mm.src_id,
COUNT(1) as n
FROM `{dataset}.device_exposure` as m
LEFT JOIN `{dataset}.device_exposure_ext` as mm on m.device_exposure_id = mm.device_exposure_id
WHERE lower(mm.src_id) like '%ehr site%'
Since EHR site IDs all start with the words `ehr site`, with no other words before them, a more efficient query drops the leading wildcard:
SELECT
mm.src_id,
COUNT(1) as n
FROM `{dataset}.device_exposure` as m
LEFT JOIN `{dataset}.device_exposure_ext` as mm on m.device_exposure_id = mm.device_exposure_id
WHERE lower(mm.src_id) like 'ehr site%'
Use LIMIT to sample query results
Before running a query for the first time, ensure the results will be desirable and meaningful by using a LIMIT clause (some database systems use the keyword TOP instead of LIMIT). The LIMIT clause returns only the number of records specified, which prevents taxing the production database with a large query only to find out that the query needs editing or refinement.
In the query below, we limit the output to 10 records:
SELECT
concept_name,
count(distinct person_id) as n_participants
FROM
`{dataset}.concept`
JOIN `{dataset}.concept_ancestor` on (concept_id=ancestor_concept_id)
JOIN `{dataset}.observation` on (descendant_concept_id=observation_concept_id)
JOIN `{dataset}.observation_ext` using(observation_id)
WHERE observation_concept_id not in (40766240,43528428,1585389) --hcau vocab issues
AND concept_class_id='Module'
AND src_id='PPI/PM'
AND questionnaire_response_id is not null
GROUP BY 1
ORDER BY n_participants desc
LIMIT 10
You can see by the sample whether you have a usable data set or not.
Run your query during off-peak hours
In order to minimize the impact of your analytical queries on the production database, talk to a DBA (Database Administrator) about scheduling the query to run at an off-peak time. The query should run when concurrent users are at their lowest number, which is typically the middle of the night (3 a.m. – 5 a.m.).
The more of the following criteria your query meets, the better a candidate it is to run at night:
- Selecting from large tables (>1,000,000 records)
- Cartesian Joins or CROSS JOINs
- Looping statements
- SELECT DISTINCT statements
- Nested subqueries
- Wildcard searches in long text or memo fields
- Multiple schema queries
Initial Optimization Checks
- pip install user-query-optimizer
Additional Tips
- Avoid for loops (at all costs): they are extremely slow in both R and Python; prefer vectorized operations (see the third sketch after this list)
- Don’t reinvent the wheel. Take advantage of libraries such as data.table (CRAN introduction) or dplyr for R, and numpy and pandas for Python
- Profile your code to find out which part is slow
- Use approximate algorithms where exact counts are not needed (e.g., approx_distinct() instead of COUNT(DISTINCT …); see the first sketch after this list)
- Select the columns you want explicitly, rather than using SELECT *
- Filter on partitioned columns
- Extract nested subqueries into a WITH clause
- Filter on the most effective columns, taking advantage of parquet-file ordering
- Eliminate filtering overhead for partitioned columns
- Replace UNION with UNION ALL if duplicates do not need to be removed (see the first sketch after this list)
- Aggregate a series of LIKE clauses into one regexp_like expression
- Push down a complex join condition into a subquery
- Specify GROUP BY targets with numbers for expressions (e.g., GROUP BY 1, as in the LIMIT example above)
- Connect to the database and create a cursor once, outside of any loop, rather than reconnecting for each row; moving this step outside the loop makes it much faster (see the second sketch after this list)
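To make a few of these tips concrete, here are three minimal sketches written for a Python notebook; all table, variable, and file names are illustrative. The first sketch assumes BigQuery, where APPROX_COUNT_DISTINCT is the approximate counterpart of the Presto-style approx_distinct() mentioned above, and it again assumes pandas.read_gbq and the WORKSPACE_CDR environment variable from the standard Workbench setup.

import os
import pandas

# CDR dataset name, exposed by the Workbench environment (assumed).
dataset = os.getenv('WORKSPACE_CDR')

# Several tips at once: explicit columns, a WITH clause instead of a nested
# subquery, UNION ALL (no deduplication pass), and an approximate count.
n_participants = pandas.read_gbq(f"""
    WITH events AS (
        SELECT person_id FROM `{dataset}.measurement`
        UNION ALL
        SELECT person_id FROM `{dataset}.observation`
    )
    SELECT APPROX_COUNT_DISTINCT(person_id) AS n_participants
    FROM events
""")

The second sketch shows the connect-once pattern, using Python’s built-in sqlite3 module as a stand-in for any database driver.

import sqlite3

conn = sqlite3.connect(':memory:')  # connect once, outside the loop
cursor = conn.cursor()              # one cursor, reused for every row
cursor.execute('CREATE TABLE measurements (person_id INTEGER, concept TEXT)')

rows = [(1, 'height'), (2, 'weight'), (3, 'height')]
for row in rows:
    # Reuse the single connection and cursor rather than reconnecting per row.
    cursor.execute('INSERT INTO measurements VALUES (?, ?)', row)
conn.commit()
conn.close()

The third sketch contrasts an explicit Python loop with the equivalent vectorized numpy call.

import numpy

values = numpy.arange(1_000_000, dtype=float)

# Slow: an explicit Python loop over a million elements.
total_loop = 0.0
for v in values:
    total_loop += v

# Fast: one vectorized call performs the same reduction in compiled code.
total_vectorized = numpy.sum(values)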