SQL to Malloy

VSCode

This document is intended to serve as a reference for those who already know SQL and may find it helpful to map Malloy concepts and syntax to SQL.

Components of a Query

SQL	Malloy	Description / Docs
`userName AS user_name`	`user_name is userName`	AS: Names come before definitions in Malloy.
`SELECT id FROM order_items`	`run: order_items -> { select: id }`	SELECT / FROM: Malloy by Example
`LEFT JOIN users` `ON users.id = order_items.user_id`	`join_one: users on users.id = order_items.user_id`	JOIN: Documentation covers more join types and complex relationships. Example
`SELECT` `status` `, COUNT(*) AS items_count` `FROM order_items` `GROUP BY 1`	`run: order_items -> {` `group_by: status` `aggregate: items_count is count()` `}`	GROUP BY: Any field included in Malloy’s group_by selection will be included in both the generated `SELECT` and `GROUP BY`.
`WHERE status = 'Complete'`	`where: status = 'Complete'`	WHERE
`ORDER BY flight_count, avg_elevation`	`order_by: flight_count, avg(elevation)`	ORDER BY: By default, `ORDER BY` is generated following implicit rules; this can be overridden.
`HAVING flight_count > 5`	`having: flight_count > 5`	HAVING
`LIMIT 100` `TOP 100`	`limit: 100`	LIMIT
`SELECT` `...` `FROM (` `SELECT` `...` `FROM order_items` `)`	`order_items -> { ... } -> { ... }`	Pipelines allow the output of one query to be used as the input to the next.
`WITH user_facts AS (...) ...`	`source: user_facts is from(...)`	CTEs: Can be generated through Pipelines and Sources from queries.
`FROM (...) AS user_facts`	`source: user_facts is duckdb.sql("""SELECT ...""")`	Subqueries: Can be written into SQL Blocks. Example below.

Many SQL functions supported by the database can simply be used unchanged in Malloy. In certain cases, we have implemented what we feel are improvements and simplifications of certain SQL functions. This is intended to serve as a quick reference, more complete documentation can be found here.

SQL	Malloy	Description / Docs
`SUM(), AVG(), MAX(), MIN(), COUNT(), etc`	`sum(), avg(), max(), min(), count(), etc...`	Basic SQL aggregations are supported verbatim, but it’s worth learning about Malloy’s additional aggregate locality / symmetric aggregate handling.
`CASE` `WHEN size_n < 3 THEN 'S'` `WHEN size_n <5 THEN 'M'` `ELSE 'L' END`	`size_n ?` `pick 'S' when < 3` `pick 'M' when <5` `else 'L'`	Pick is Malloy’s improvement of SQL’s `CASE` statement. This example also introduces the `?` Apply operator, which "applies" a value to another value, condition, or computation. This is most often used with partial comparisons or alternations.
`COUNT(CASE WHEN status = 'Returned' THEN 1 END),` `AVG(CASE WHEN brand = 'Levi's' THEN price END)`	`count() { where: status = 'Returned' }` `avg_price { where: brand = 'Levi\'s' }`	Aggregates may be filtered using filter expressions. Doc
`CAST(distance AS string),` `distance::string`	`distance::string`	Type Cast. Also worth reviewing Types doc.

Working with Time

The Time Expressions reference contains substantially more detail and examples.

SQL	Malloy	Docs
`TIMESTAMP_TRUNC(created_at, WEEK),` `DATE_TRUNC(order_items.shipped_at, MONTH)`	`created_at.week` `shipped_at.month`	Truncation
`EXTRACT(DAYOFWEEK FROM shipped_at),` `EXTRACT(HOUR FROM created_at)`	`day_of_week(shipped_at)` `hour(created_at)`	Extraction
`DATE_DIFF(DATE(CURRENT_TIMESTAMP()),DATE(created_at), DAY),` `TIMESTAMP_DIFF(TIMESTAMP(shipped_at), created_at, HOUR)`	`days(created_at to now)` `hours(created_at to shipped_at)`	Date Diff / Intervals
`created_at >= TIMESTAMP('2003-01-01', 'UTC') AND created_at < TIMESTAMP('2004-01-01', 'UTC')` `created_at >= TIMESTAMP(DATETIME_SUB(DATETIME(CURRENT_TIMESTAMP()),INTERVAL 1 YEAR)) AND created_at < TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP(DATETIME_SUB(DATETIME( CURRENT_TIMESTAMP()),INTERVAL 1 YEAR))),INTERVAL 1 YEAR))` `(EXTRACT(DAYOFWEEK FROM created_at)) NOT IN (1,7)`	`created_at ? @2003` `created_at ? now - 1 year for 1 year` `not(day_of_week(created_at) = 1 \| 7)`	Filter Expressions, Apply Operator

Not Supported and/or Coming Soon

Feature requests are tracked using Issues on Github.

SQL	Notes
`FIRST_VALUE(product_brand) OVER (PARTITION BY user_id ORDER BY created_at ASC)`	Window Functions: For now, reach out for advice on achieving what you need (much is possible with nesting, pipelines, or SQL Blocks). Example.
NA	Querying arrays as nested objects
`SELECT` `id` `, ( SELECT COUNT(*)` `FROM orders o` `WHERE o.id <= orders.id` `AND o.user_id = orders.user_id` `) as sequence_number` `FROM orders ...`	Correlated Subqueries: coming soon; note that functionality will be dependent on the database dialect.

Full Query Examples

Many of the above concepts are best understood in the context of complete queries.

The Basics

We’ll start with a relatively simple SQL query:

SELECT
  TIMESTAMP_TRUNC(created_at, DAY) as order_date,
  SUM(sale_price) as total_sale_price,
  COUNT(DISTINCT order_id) as order_count
FROM `malloy-data.ecomm.order_items`
WHERE created_at>=TIMESTAMP('2021-01-01', 'UTC')
  AND created_at<TIMESTAMP('2021-04-01', 'UTC')
  AND status NOT IN ('Returned','Cancelled')
GROUP BY 1
ORDER BY 1 ASC

In Malloy, this is expressed:

run: duckdb.table('data/order_items.parquet') -> {
 where: created_at = @2021-Q1, status != 'Cancelled' & 'Returned'
 group_by: order_date is created_at.day
 aggregate:
   total_sale_price is sale_price.sum()            -- names come before definitions
   order_count is count(distinct order_id)
 order_by: order_date asc
}

More Complex Example

SELECT
 CASE
   WHEN (100.0*(ii.product_retail_price-ii.cost)) / (NULLIF(ii.product_retail_price,0)) >=55 THEN 'High (over 55%)'
   WHEN (100.0*(ii.product_retail_price-ii.cost)) / (NULLIF(ii.product_retail_price,0))>=45 THEN 'Medium (45% to 55%)'
   ELSE 'Low (up to 45%)' END AS gross_margin_pct_tier,
 SUM(ii.product_retail_price) AS total_retail_price,
 AVG((ii.product_retail_price-ii.cost)) AS avg_gross_margin
FROM `malloy-data.ecomm.inventory_items` AS ii
LEFT JOIN `malloy-data.ecomm.order_items` AS oi
 ON ii.id=oi.inventory_item_id
LEFT JOIN `malloy-data.ecomm.users` AS u
 ON oi.user_id = u.id
WHERE
 oi.status NOT IN ('Returned','Cancelled')
 AND (u.country='USA')
GROUP BY 1
ORDER BY 3 ASC

In Malloy, this can be expressed in a query:

run: inventory_items is duckdb.table('data/inventory_items.parquet') extend {
  join_one: order_items is duckdb.table('data/order_items.parquet') on id = order_items.inventory_item_id
  join_one: users is duckdb.table('data/users.parquet') on order_items.user_id = users.id
  dimension: 
    gross_margin is (product_retail_price - cost)
    gross_margin_pct is gross_margin / nullif(product_retail_price, 0)
} -> {
  group_by:
    gross_margin_pct_tier is gross_margin_pct ?
      pick 'High (over 55%)' when >= 0.55
      pick 'Medium (45% to 55%)' when >= 0.45
      else 'Low (up to 45%)'
  aggregate:
    total_retail_price is product_retail_price.sum()
    avg_gross_margin is gross_margin.avg()
  where:
    order_items.status != 'Cancelled' & 'Returned',
    users.country = 'USA'
  order_by: avg_gross_margin asc
}

Note that if we intend to query these tables and re-use these field definitions frequently, thinking about placing reusable definitions into the model will begin to save us a lot of time in the future.

source: users is duckdb.table('data/users.parquet')

source: order_items is duckdb.table('data/order_items.parquet') extend {
  join_one: users on user_id = users.id
  dimension: valid_order is status != 'Cancelled' & 'Returned'
}

source: inventory_items is duckdb.table('inventory_items.parquet') extend {
  join_one: order_items on id = order_items.inventory_item_id
  dimension:
    gross_margin is (product_retail_price - cost)
    gross_margin_pct is gross_margin / nullif(product_retail_price, 0)
    gross_margin_pct_tier is gross_margin_pct ?
      pick 'High (over 55%)' when >= 0.55
      pick 'Medium (45% to 55%)' when >= 0.45
      else 'Low (up to 45%)'
  measure:
    total_retail_price is product_retail_price.sum()
    avg_gross_margin is gross_margin.avg()
}

run: inventory_items -> {
  group_by: gross_margin_pct_tier
  aggregate:
    total_retail_price
    avg_gross_margin
  where: order_items.valid_order, order_items.users.country = 'USA'
  order_by: avg_gross_margin asc
}

Subqueries / CTEs:

How much of our sales come from repeat customers vs loyal, repeat customers? Written in SQL:

WITH user_facts AS (
 SELECT
   user_id as user_id,
   COUNT(1) as lifetime_orders
 FROM `malloy-data.ecomm.order_items`
 GROUP BY 1
 ORDER BY 2 desc
)
SELECT
  CASE WHEN user_facts.lifetime_orders=1 THEN 'One-Time' ELSE 'Repeat' END AS customer_category,
  SUM(order_items.sale_price) AS total_sale_price,
  COUNT(DISTINCT order_items.order_id) AS order_count
FROM `malloy-data.ecomm.order_items` AS order_items
LEFT JOIN user_facts
 ON order_items.user_id = user_facts.user_id
GROUP BY 1
ORDER BY 2 desc

In Malloy, the user_facts CTE becomes a source of its own, defined from a query. Any aggregates in this query (for now, just lifetime_orders) become dimensions of that source.

source: user_facts is duckdb.table('data/order_items.parquet') -> {
  group_by: user_id
  aggregate: lifetime_orders is count()
}

source: order_items is duckdb.table('data/order_items.parquet') extend {
  join_one: user_facts on user_id = user_facts.user_id
}

run: order_items -> {
  group_by: customer_category is user_facts.lifetime_orders ?
    pick 'One-Time' when = 1
    else 'Repeat'
  aggregate:
    total_sale_price is sale_price.sum()
    order_count is count(distinct order_id)
}

One can also define a source from a SQL query; see the SQL Sources section for more information on this.

source: order_facts is bigquery.sql("""
  SELECT
    order_id
    , user_id
    , ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY created_at) as user_order_sequence_number
  FROM malloy-data.ecomm.order_items
""")

source: order_items is bigquery.sql("""
  SELECT
    id
    , order_id
    , created_at
  FROM malloy-data.ecomm.order_items
  WHERE status NOT IN ('Cancelled', 'Returned')
""") extend {
  join_one: order_facts on order_id = order_facts.order_id
  measure: order_count is count(distinct order_id)
}

run: order_items -> {
  group_by: order_facts.user_order_sequence_number
  aggregate: order_count
}

The above Malloy code will produce this SQL:

SELECT
  order_facts_0.user_order_sequence_number as user_order_sequence_number,
  count(distinct order_items.order_id) as order_count
FROM (
  SELECT
    id
    , order_id
    , created_at
  FROM malloy-data.ecomm.order_items
  WHERE status NOT IN ('Cancelled', 'Returned')
) as order_items
LEFT JOIN (
  SELECT
    order_id
    , user_id
    , ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY created_at) as user_order_sequence_number
  FROM malloy-data.ecomm.order_items
) AS order_facts_0
  ON order_items.order_id=order_facts_0.order_id
GROUP BY 1
ORDER BY 2 desc