This document is intended to serve as a reference for those who already know SQL and may find it helpful to map Malloy concepts and syntax to SQL.
Components of a Query
SQL | Malloy | Description / Docs |
---|---|---|
|
|
AS: Names come before definitions in Malloy. |
|
|
SELECT / FROM: Malloy by Example |
|
|
JOIN: Documentation covers more join types and complex relationships. Example |
|
|
GROUP BY: Any field included in Malloy’s group_by selection will be included in both the generated SELECT and GROUP BY . |
|
|
WHERE |
|
|
ORDER BY. By default, ORDER BY is generated following implicit rules; this can be overridden. |
|
|
HAVING |
|
|
LIMIT |
|
|
Pipelines allow the output of one query to be used as the input to the next. |
|
|
CTEs: can be generated through Pipelines and Sources from queries. |
|
|
Subqueries Can be written into SQL Blocks. Example below |
|
|
CTEs: can be generated through Pipelines and Sources from queries. |
Expressions
Many SQL functions supported by the database can simply be used unchanged in Malloy. In certain cases we have implemented what we feel are improvements and simplifications of certain SQL functions. This is intended to serve as a quick reference, more complete documentation can be found here.
SQL | Malloy | Description / Docs |
---|---|---|
|
|
Basic SQL aggregations are supported verbatim, but it’s worth learning about Malloy’s additional aggregate locality / symmetric aggregate handling. |
|
|
Pick is Malloy’s improvement of SQL’s CASE statement. This example also introduces the ? Apply operator, which "applies" a value to another value, condition, or computation. This is most often used with partial comparisons or alternations. |
|
|
Aggregates may be filtered using filter expressions. Doc |
|
|
Type Cast. Also worth reviewing Types doc. |
Working with Time
The Time Expressions reference contains substantially more detail and examples.
SQL | Malloy | Docs |
---|---|---|
|
|
Truncation |
|
|
Extraction |
|
|
Date Diff / Intervals |
|
|
Filter Expressions, Apply Operator |
Not Supported and/or Coming Soon
Feature requests are tracked using Issues on Github.
SQL | Notes |
---|---|
|
Window Functions: For now, reach out for advice on achieving what you need (much is possible with nesting, pipelines, or SQL Blocks). Example. |
NA | Querying arrays as nested objects |
|
Correlated Subqueries: coming soon; note that functionality will be dependent on the database dialect. |
Full Query Examples
Many of the above concepts are best understood in the context of complete queries.
The Basics
We’ll start with a relatively simple SQL query:
SELECT TIMESTAMP_TRUNC(created_at, DAY) as order_date, SUM(sale_price) as total_sale_price, COUNT(DISTINCT order_id) as order_count FROM `malloy-data.ecomm.order_items` WHERE created_at>=TIMESTAMP('2021-01-01', 'UTC') AND created_at<TIMESTAMP('2021-04-01', 'UTC') AND status NOT IN ('Returned','Cancelled') GROUP BY 1 ORDER BY 1 ASC
In Malloy, this is expressed:
run: duckdb.table('data/order_items.parquet') -> { where: created_at = @2021-Q1, status != 'Cancelled' & 'Returned' group_by: order_date is created_at.day aggregate: total_sale_price is sale_price.sum() -- names come before definitions order_count is count(distinct order_id) order_by: order_date asc }
More Complex Example
SELECT CASE WHEN (100.0*(ii.product_retail_price-ii.cost)) / (NULLIF(ii.product_retail_price,0)) >=55 THEN 'High (over 55%)' WHEN (100.0*(ii.product_retail_price-ii.cost)) / (NULLIF(ii.product_retail_price,0))>=45 THEN 'Medium (45% to 55%)' ELSE 'Low (up to 45%)' END AS gross_margin_pct_tier, SUM(ii.product_retail_price) AS total_retail_price, AVG((ii.product_retail_price-ii.cost)) AS avg_gross_margin FROM `malloy-data.ecomm.inventory_items` AS ii LEFT JOIN `malloy-data.ecomm.order_items` AS oi ON ii.id=oi.inventory_item_id LEFT JOIN `malloy-data.ecomm.users` AS u ON oi.user_id = u.id WHERE oi.status NOT IN ('Returned','Cancelled') AND (u.country='USA') GROUP BY 1 ORDER BY 3 ASC
In Malloy, this can be expressed in a query:
run: inventory_items is duckdb.table('data/inventory_items.parquet') extend { join_one: order_items is duckdb.table('data/order_items.parquet') on id = order_items.inventory_item_id join_one: users is duckdb.table('data/users.parquet') on order_items.user_id = users.id dimension: gross_margin is (product_retail_price - cost) gross_margin_pct is gross_margin / nullif(product_retail_price, 0) } -> { group_by: gross_margin_pct_tier is gross_margin_pct ? pick 'High (over 55%)' when >= 0.55 pick 'Medium (45% to 55%)' when >= 0.45 else 'Low (up to 45%)' aggregate: total_retail_price is product_retail_price.sum() avg_gross_margin is gross_margin.avg() where: order_items.status != 'Cancelled' & 'Returned', users.country = 'USA' order_by: avg_gross_margin asc }
Note that if we intend to query these tables and re-use these field definitions frequently, thinking about placing reusable definitions into the model will begin to save us a lot of time in the future.
source: users is duckdb.table('data/users.parquet') source: order_items is duckdb.table('data/order_items.parquet') extend { join_one: users on user_id = users.id dimension: valid_order is status != 'Cancelled' & 'Returned' } source: inventory_items is duckdb.table('inventory_items.parquet') extend { join_one: order_items on id = order_items.inventory_item_id dimension: gross_margin is (product_retail_price - cost) gross_margin_pct is gross_margin / nullif(product_retail_price, 0) gross_margin_pct_tier is gross_margin_pct ? pick 'High (over 55%)' when >= 0.55 pick 'Medium (45% to 55%)' when >= 0.45 else 'Low (up to 45%)' measure: total_retail_price is product_retail_price.sum() avg_gross_margin is gross_margin.avg() } run: inventory_items -> { group_by: gross_margin_pct_tier aggregate: total_retail_price avg_gross_margin where: order_items.valid_order, order_items.users.country = 'USA' order_by: avg_gross_margin asc }
Subqueries / CTEs:
How much of our sales come from repeat customers vs loyal, repeat customers? Written in SQL:
WITH user_facts AS ( SELECT user_id as user_id, COUNT( 1) as lifetime_orders FROM `malloy-data.ecomm.order_items` GROUP BY 1 ORDER BY 2 desc ) SELECT CASE WHEN user_facts.lifetime_orders=1 THEN 'One-Time' ELSE 'Repeat' END AS customer_category, SUM(order_items.sale_price) AS total_sale_price, COUNT(DISTINCT order_items.order_id) AS order_count FROM `malloy-data.ecomm.order_items` AS order_items LEFT JOIN user_facts ON order_items.user_id = user_facts.user_id GROUP BY 1 ORDER BY 2 desc
In Malloy, the user_facts
CTE becomes a source of its own, defined from a query. Any aggregates in this query (for now, just lifetime_orders
) become dimensions of that source.
source: user_facts is duckdb.table('data/order_items.parquet') -> { group_by: user_id aggregate: lifetime_orders is count() } source: order_items is duckdb.table('data/order_items.parquet') extend { join_one: user_facts on user_id = user_facts.user_id } run: order_items -> { group_by: customer_category is user_facts.lifetime_orders ? pick 'One-Time' when = 1 else 'Repeat' aggregate: total_sale_price is sale_price.sum() order_count is count(distinct order_id) }
One can also define a source from a SQL query; see the SQL Sources section for more information on this.
source: order_facts is bigquery.sql(""" SELECT order_id , user_id , ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY created_at) as user_order_sequence_number FROM malloy-data.ecomm.order_items """) source: order_items is bigquery.sql(""" SELECT id , order_id , created_at FROM malloy-data.ecomm.order_items WHERE status NOT IN ('Cancelled', 'Returned') """) extend { join_one: order_facts on order_id = order_facts.order_id measure: order_count is count(distinct order_id) } run: order_items -> { group_by: order_facts.user_order_sequence_number aggregate: order_count }
The above Malloy code will produce this SQL:
SELECT order_facts_0.user_order_sequence_number as user_order_sequence_number, count(distinct order_items.order_id) as order_count FROM ( SELECT id , order_id , created_at FROM malloy-data.ecomm.order_items WHERE status NOT IN ('Cancelled', 'Returned') ) as order_items LEFT JOIN ( SELECT order_id , user_id , ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY created_at) as user_order_sequence_number FROM malloy-data.ecomm.order_items ) AS order_facts_0 ON order_items.order_id=order_facts_0.order_id GROUP BY 1 ORDER BY 2 desc