Malloy Documentation
search

This document is intended to serve as a reference for those who already know SQL and may find it helpful to map Malloy concepts and syntax to SQL.

Components of a Query

SQL Malloy Description / Docs
userName AS user_name
 user_name is userName  
AS: Names come before definitions in Malloy.
SELECT id FROM order_items
 run: order_items -> { select: id }
SELECT / FROM: Malloy by Example
LEFT JOIN users ON users.id = order_items.user_id 
 join_one: users on users.id = order_items.user_id
JOIN: Documentation covers more join types and complex relationships. Example
SELECT    status   , COUNT(*) AS items_count FROM order_items  GROUP BY 1 
 run: order_items -> {   group_by: status   aggregate: items_count is count() }
GROUP BY: Any field included in Malloy’s group_by selection will be included in both the generated SELECT and GROUP BY.
WHERE status = 'Complete'
 where: status = 'Complete'
WHERE
ORDER BY flight_count, avg_elevation
 order_by: flight_count, avg(elevation)
ORDER BY: By default, ORDER BY is generated following implicit rules; this can be overridden.
HAVING flight_count > 5
 having: flight_count > 5
HAVING
 LIMIT 100 TOP 100 
 limit: 100 
LIMIT
SELECT   ... FROM (   SELECT     ...   FROM order_items )
 order_items -> { ... } -> { ... }
Pipelines allow the output of one query to be used as the input to the next.
 WITH user_facts AS (...) ... 
 source: user_facts is from(...) 
CTEs: Can be generated through Pipelines and Sources from queries.
FROM (...) AS user_facts
 source: user_facts is duckdb.sql("""SELECT ...""") 
Subqueries: Can be written into SQL Blocks. Example below.

Expressions

Many SQL functions supported by the database can simply be used unchanged in Malloy. In certain cases, we have implemented what we feel are improvements and simplifications of certain SQL functions. This is intended to serve as a quick reference, more complete documentation can be found here.

SQL Malloy Description / Docs
SUM(), AVG(), MAX(), MIN(), COUNT(), etc 
 sum(), avg(), max(), min(), count(), etc... 
Basic SQL aggregations are supported verbatim, but it’s worth learning about Malloy’s additional aggregate locality / symmetric aggregate handling.
 CASE   WHEN size_n < 3 THEN 'S'   WHEN size_n <5 THEN 'M' ELSE 'L' END 
 size_n ?   pick 'S' when < 3   pick 'M' when <5  else 'L' 
Pick is Malloy’s improvement of SQL’s CASE statement. This example also introduces the ? Apply operator, which "applies" a value to another value, condition, or computation. This is most often used with partial comparisons or alternations.
 COUNT(CASE WHEN status = 'Returned' THEN 1 END), AVG(CASE WHEN brand = 'Levi's' THEN price END) 
 count() { where: status = 'Returned' } avg_price { where: brand = 'Levi\'s' } 
Aggregates may be filtered using filter expressions. Doc
 CAST(distance AS string), distance::string 
distance::string
Type Cast. Also worth reviewing Types doc.

Working with Time

The Time Expressions reference contains substantially more detail and examples.

SQL Malloy Docs
 TIMESTAMP_TRUNC(created_at, WEEK), DATE_TRUNC(order_items.shipped_at, MONTH) 
 created_at.week shipped_at.month 
Truncation
 EXTRACT(DAYOFWEEK FROM shipped_at), EXTRACT(HOUR FROM created_at)
 day_of_week(shipped_at) hour(created_at) 
Extraction
 DATE_DIFF(DATE(CURRENT_TIMESTAMP()),DATE(created_at), DAY), TIMESTAMP_DIFF(TIMESTAMP(shipped_at), created_at, HOUR)
 days(created_at to now) hours(created_at to shipped_at) 
Date Diff / Intervals
created_at >= TIMESTAMP('2003-01-01', 'UTC') AND created_at < TIMESTAMP('2004-01-01', 'UTC') 
created_at >= TIMESTAMP(DATETIME_SUB(DATETIME(CURRENT_TIMESTAMP()),INTERVAL 1 YEAR)) AND created_at < TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP(DATETIME_SUB(DATETIME( CURRENT_TIMESTAMP()),INTERVAL 1 YEAR))),INTERVAL 1 YEAR)) 
(EXTRACT(DAYOFWEEK FROM created_at)) NOT IN (1,7) 
 created_at ? @2003 

 created_at ? now - 1 year for 1 year 

not(day_of_week(created_at) = 1 | 7) 
Filter Expressions, Apply Operator

Not Supported and/or Coming Soon

Feature requests are tracked using Issues on Github.

SQL Notes
FIRST_VALUE(product_brand) OVER (PARTITION BY user_id ORDER BY created_at ASC) 
Window Functions: For now, reach out for advice on achieving what you need (much is possible with nesting, pipelines, or SQL Blocks). Example.
NA Querying arrays as nested objects
 SELECT  id   , ( SELECT COUNT(*)     FROM orders o     WHERE o.id <= orders.id     AND o.user_id = orders.user_id   ) as sequence_number FROM orders ...
Correlated Subqueries: coming soon; note that functionality will be dependent on the database dialect.

Full Query Examples

Many of the above concepts are best understood in the context of complete queries.

The Basics

We’ll start with a relatively simple SQL query:

SELECT
  TIMESTAMP_TRUNC(created_at, DAY) as order_date,
  SUM(sale_price) as total_sale_price,
  COUNT(DISTINCT order_id) as order_count
FROM `malloy-data.ecomm.order_items`
WHERE created_at>=TIMESTAMP('2021-01-01', 'UTC')
  AND created_at<TIMESTAMP('2021-04-01', 'UTC')
  AND status NOT IN ('Returned','Cancelled')
GROUP BY 1
ORDER BY 1 ASC

In Malloy, this is expressed:

run: duckdb.table('data/order_items.parquet') -> {
 where: created_at = @2021-Q1, status != 'Cancelled' & 'Returned'
 group_by: order_date is created_at.day
 aggregate:
   total_sale_price is sale_price.sum()            -- names come before definitions
   order_count is count(distinct order_id)
 order_by: order_date asc
}

More Complex Example

SELECT
 CASE
   WHEN (100.0*(ii.product_retail_price-ii.cost)) / (NULLIF(ii.product_retail_price,0)) >=55 THEN 'High (over 55%)'
   WHEN (100.0*(ii.product_retail_price-ii.cost)) / (NULLIF(ii.product_retail_price,0))>=45 THEN 'Medium (45% to 55%)'
   ELSE 'Low (up to 45%)' END AS gross_margin_pct_tier,
 SUM(ii.product_retail_price) AS total_retail_price,
 AVG((ii.product_retail_price-ii.cost)) AS avg_gross_margin
FROM `malloy-data.ecomm.inventory_items` AS ii
LEFT JOIN `malloy-data.ecomm.order_items` AS oi
 ON ii.id=oi.inventory_item_id
LEFT JOIN `malloy-data.ecomm.users` AS u
 ON oi.user_id = u.id
WHERE
 oi.status NOT IN ('Returned','Cancelled')
 AND (u.country='USA')
GROUP BY 1
ORDER BY 3 ASC

In Malloy, this can be expressed in a query:

run: inventory_items is duckdb.table('data/inventory_items.parquet') extend {
  join_one: order_items is duckdb.table('data/order_items.parquet') on id = order_items.inventory_item_id
  join_one: users is duckdb.table('data/users.parquet') on order_items.user_id = users.id
  dimension: 
    gross_margin is (product_retail_price - cost)
    gross_margin_pct is gross_margin / nullif(product_retail_price, 0)
} -> {
  group_by:
    gross_margin_pct_tier is gross_margin_pct ?
      pick 'High (over 55%)' when >= 0.55
      pick 'Medium (45% to 55%)' when >= 0.45
      else 'Low (up to 45%)'
  aggregate:
    total_retail_price is product_retail_price.sum()
    avg_gross_margin is gross_margin.avg()
  where:
    order_items.status != 'Cancelled' & 'Returned',
    users.country = 'USA'
  order_by: avg_gross_margin asc
}

Note that if we intend to query these tables and re-use these field definitions frequently, thinking about placing reusable definitions into the model will begin to save us a lot of time in the future.

source: users is duckdb.table('data/users.parquet')

source: order_items is duckdb.table('data/order_items.parquet') extend {
  join_one: users on user_id = users.id
  dimension: valid_order is status != 'Cancelled' & 'Returned'
}

source: inventory_items is duckdb.table('inventory_items.parquet') extend {
  join_one: order_items on id = order_items.inventory_item_id
  dimension:
    gross_margin is (product_retail_price - cost)
    gross_margin_pct is gross_margin / nullif(product_retail_price, 0)
    gross_margin_pct_tier is gross_margin_pct ?
      pick 'High (over 55%)' when >= 0.55
      pick 'Medium (45% to 55%)' when >= 0.45
      else 'Low (up to 45%)'
  measure:
    total_retail_price is product_retail_price.sum()
    avg_gross_margin is gross_margin.avg()
}

run: inventory_items -> {
  group_by: gross_margin_pct_tier
  aggregate:
    total_retail_price
    avg_gross_margin
  where: order_items.valid_order, order_items.users.country = 'USA'
  order_by: avg_gross_margin asc
}

Subqueries / CTEs:

How much of our sales come from repeat customers vs loyal, repeat customers? Written in SQL:

WITH user_facts AS (
 SELECT
   user_id as user_id,
   COUNT(1) as lifetime_orders
 FROM `malloy-data.ecomm.order_items`
 GROUP BY 1
 ORDER BY 2 desc
)
SELECT
  CASE WHEN user_facts.lifetime_orders=1 THEN 'One-Time' ELSE 'Repeat' END AS customer_category,
  SUM(order_items.sale_price) AS total_sale_price,
  COUNT(DISTINCT order_items.order_id) AS order_count
FROM `malloy-data.ecomm.order_items` AS order_items
LEFT JOIN user_facts
 ON order_items.user_id = user_facts.user_id
GROUP BY 1
ORDER BY 2 desc

In Malloy, the user_facts CTE becomes a source of its own, defined from a query. Any aggregates in this query (for now, just lifetime_orders) become dimensions of that source.

source: user_facts is duckdb.table('data/order_items.parquet') -> {
  group_by: user_id
  aggregate: lifetime_orders is count()
}

source: order_items is duckdb.table('data/order_items.parquet') extend {
  join_one: user_facts on user_id = user_facts.user_id
}

run: order_items -> {
  group_by: customer_category is user_facts.lifetime_orders ?
    pick 'One-Time' when = 1
    else 'Repeat'
  aggregate:
    total_sale_price is sale_price.sum()
    order_count is count(distinct order_id)
}

One can also define a source from a SQL query; see the SQL Sources section for more information on this.

source: order_facts is bigquery.sql("""
  SELECT
    order_id
    , user_id
    , ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY created_at) as user_order_sequence_number
  FROM malloy-data.ecomm.order_items
""")

source: order_items is bigquery.sql("""
  SELECT
    id
    , order_id
    , created_at
  FROM malloy-data.ecomm.order_items
  WHERE status NOT IN ('Cancelled', 'Returned')
""") extend {
  join_one: order_facts on order_id = order_facts.order_id
  measure: order_count is count(distinct order_id)
}

run: order_items -> {
  group_by: order_facts.user_order_sequence_number
  aggregate: order_count
}

The above Malloy code will produce this SQL:

SELECT
  order_facts_0.user_order_sequence_number as user_order_sequence_number,
  count(distinct order_items.order_id) as order_count
FROM (
  SELECT
    id
    , order_id
    , created_at
  FROM malloy-data.ecomm.order_items
  WHERE status NOT IN ('Cancelled', 'Returned')
) as order_items
LEFT JOIN (
  SELECT
    order_id
    , user_id
    , ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY created_at) as user_order_sequence_number
  FROM malloy-data.ecomm.order_items
) AS order_facts_0
  ON order_items.order_id=order_facts_0.order_id
GROUP BY 1
ORDER BY 2 desc