跳到主要内容

WITH Clause

WITH 子句允许你定义公用表表达式(CTE)。 常规(非递归)公用表表达式本质上是仅在特定查询范围内可见的视图。 CTE 之间可以相互引用,也可以嵌套。递归 CTE 可以引用自身。

基础 CTE 示例

创建一个名为 cte 的 CTE,并在主查询中使用它:

WITH cte AS (SELECT 42 AS x)
SELECT * FROM cte;
x
42

创建两个 CTE:cte1cte2,其中第二个 CTE 引用了第一个 CTE:

WITH
cte1 AS (SELECT 42 AS i),
cte2 AS (SELECT i * 100 AS x FROM cte1)
SELECT * FROM cte2;
x
4200

你可以为 CTE 指定列名:

WITH cte(j) AS (SELECT 42 AS i)
FROM cte;

CTE 物化

Goose 默认将 CTE 作为 materialized(物化)处理,这意味着 CTE 会被计算一次, 并将结果存储到临时表中。不过在某些条件下,Goose 可以将 CTE inline(内联)到主查询中, 这表示 CTE 不会被物化,其定义会在每个引用位置被复制。 内联基于以下启发式规则:

  • CTE 被引用不超过一次。
  • CTE 不包含 VOLATILE 函数。
  • CTE 使用 AS NOT MATERIALIZED 且未使用 AS MATERIALIZED
  • CTE 不执行分组聚合。

可以通过 AS MATERIALIZED 显式启用物化,也可以通过 AS NOT MATERIALIZED 禁用物化。请注意,即使满足启发式规则,内联也不总是可行。例如,如果 CTE 包含 read_csv 函数,则无法内联。

以下查询会调用同一个 CTE 三次:

WITH t(x) AS (⟨complex_query⟩)
SELECT *
FROM
t AS t1,
t AS t2,
t AS t3;

内联会为每个引用复制 t 的定义,结果如下所示:

SELECT *
FROM
(⟨complex_query⟩) AS t1(x),
(⟨complex_query⟩) AS t2(x),
(⟨complex_query⟩) AS t3(x);

如果 complex_query 开销较大,使用 MATERIALIZED 关键字对其进行物化可以提升性能。在这种情况下,complex_query 只会计算一次。

WITH t(x) AS MATERIALIZED (⟨complex_query⟩)
SELECT *
FROM
t AS t1,
t AS t2,
t AS t3;

如果希望禁用物化,可使用 NOT MATERIALIZED

WITH t(x) AS NOT MATERIALIZED (⟨complex_query⟩)
SELECT *
FROM
t AS t1,
t AS t2,
t AS t3;

通常不建议显式使用物化提示,因为 Goose 的查询优化器能够根据查询结构和上述启发式规则决定何时物化或内联 CTE。不过在某些场景下,使用 MATERIALIZEDNOT MATERIALIZED 显式控制行为仍可能有益。

递归 CTE

WITH RECURSIVE 允许定义可自引用的 CTE。注意查询必须以可终止的方式构造,否则可能陷入无限循环。

示例:斐波那契数列

WITH RECURSIVE 可用于递归计算。下面示例展示了如何用 WITH RECURSIVE 计算前十个斐波那契数:

WITH RECURSIVE FibonacciNumbers (
RecursionDepth, FibonacciNumber, NextNumber
) AS (
-- Base case
SELECT
0 AS RecursionDepth,
0 AS FibonacciNumber,
1 AS NextNumber
UNION ALL
-- Recursive step
SELECT
fib.RecursionDepth + 1 AS RecursionDepth,
fib.NextNumber AS FibonacciNumber,
fib.FibonacciNumber + fib.NextNumber AS NextNumber
FROM
FibonacciNumbers fib
WHERE
fib.RecursionDepth + 1 < 10
)
SELECT
fn.RecursionDepth AS FibonacciNumberIndex,
fn.FibonacciNumber
FROM
FibonacciNumbers fn;
FibonacciNumberIndexFibonacciNumber
00
11
21
32
43
55
68
713
821
934

示例:树遍历

WITH RECURSIVE 可用于遍历树结构。比如下面这组标签层级:

Example graph Example graph

CREATE TABLE tag (id INTEGER, name VARCHAR, subclassof INTEGER);
INSERT INTO tag VALUES
(1, 'U2', 5),
(2, 'Blur', 5),
(3, 'Oasis', 5),
(4, '2Pac', 6),
(5, 'Rock', 7),
(6, 'Rap', 7),
(7, 'Music', 9),
(8, 'Movies', 9),
(9, 'Art', NULL);

以下查询返回从节点 Oasis 到树根(Art)的路径。

WITH RECURSIVE tag_hierarchy(id, source, path) AS (
SELECT id, name, [name] AS path
FROM tag
WHERE subclassof IS NULL
UNION ALL
SELECT tag.id, tag.name, list_prepend(tag.name, tag_hierarchy.path)
FROM tag, tag_hierarchy
WHERE tag.subclassof = tag_hierarchy.id
)
SELECT path
FROM tag_hierarchy
WHERE source = 'Oasis';
path
[Oasis, Rock, Music, Art]

图遍历

WITH RECURSIVE 子句可用于在任意图上表达图遍历。不过如果图中存在环,查询必须执行环检测以避免无限循环。 一种做法是将遍历路径存储在 list 中,并在用新边扩展路径前,检查其终点是否已被访问(见后续示例)。

以下有向图来自 LDBC Graphalytics benchmarkExample graph Example graph

CREATE TABLE edge (node1id INTEGER, node2id INTEGER);
INSERT INTO edge VALUES
(1, 3), (1, 5), (2, 4), (2, 5), (2, 10), (3, 1),
(3, 5), (3, 8), (3, 10), (5, 3), (5, 4), (5, 8),
(6, 3), (6, 4), (7, 4), (8, 1), (9, 4);

请注意该图包含有向环,例如节点 1、5、8 之间。

枚举从某节点出发的所有路径

以下查询返回从节点 1 出发的 所有路径

WITH RECURSIVE paths(startNode, endNode, path) AS (
SELECT -- Define the path as the first edge of the traversal
node1id AS startNode,
node2id AS endNode,
[node1id, node2id] AS path
FROM edge
WHERE startNode = 1
UNION ALL
SELECT -- Concatenate new edge to the path
paths.startNode AS startNode,
node2id AS endNode,
array_append(path, node2id) AS path
FROM paths
JOIN edge ON paths.endNode = node1id
-- Prevent adding a repeated node to the path.
-- This ensures that no cycles occur.
WHERE list_position(paths.path, node2id) IS NULL
)
SELECT startNode, endNode, path
FROM paths
ORDER BY length(path), path;
startNodeendNodepath
13[1, 3]
15[1, 5]
15[1, 3, 5]
18[1, 3, 8]
110[1, 3, 10]
13[1, 5, 3]
14[1, 5, 4]
18[1, 5, 8]
14[1, 3, 5, 4]
18[1, 3, 5, 8]
18[1, 5, 3, 8]
110[1, 5, 3, 10]

注意该查询结果不限制为最短路径。例如对于节点 5,结果既包含路径 [1, 5],也包含 [1, 3, 5]

枚举从某节点出发的无权最短路径

在大多数情况下,枚举所有路径并不实用或不可行。通常只关心 (无权)最短路径。为此,需要调整 WITH RECURSIVE 查询的递归部分:仅在节点尚未访问时才纳入。这里通过子查询检查先前路径是否已包含该节点来实现:

WITH RECURSIVE paths(startNode, endNode, path) AS (
SELECT -- Define the path as the first edge of the traversal
node1id AS startNode,
node2id AS endNode,
[node1id, node2id] AS path
FROM edge
WHERE startNode = 1
UNION ALL
SELECT -- Concatenate new edge to the path
paths.startNode AS startNode,
node2id AS endNode,
array_append(path, node2id) AS path
FROM paths
JOIN edge ON paths.endNode = node1id
-- Prevent adding a node that was visited previously by any path.
-- This ensures that (1) no cycles occur and (2) only nodes that
-- were not visited by previous (shorter) paths are added to a path.
WHERE NOT EXISTS (
FROM paths previous_paths
WHERE list_contains(previous_paths.path, node2id)
)
)
SELECT startNode, endNode, path
FROM paths
ORDER BY length(path), path;
startNodeendNodepath
13[1, 3]
15[1, 5]
18[1, 3, 8]
110[1, 3, 10]
14[1, 5, 4]
18[1, 5, 8]

枚举两个节点之间的无权最短路径

WITH RECURSIVE 也可用于查找 两个节点之间所有(无权)最短路径。为了在到达终点节点后尽快停止递归查询,这里使用了一个窗口函数来检查新加入的节点中是否包含终点节点。

以下查询返回节点 1(起点)与节点 8(终点)之间的所有无权最短路径:

WITH RECURSIVE paths(startNode, endNode, path, endReached) AS (
SELECT -- Define the path as the first edge of the traversal
node1id AS startNode,
node2id AS endNode,
[node1id, node2id] AS path,
(node2id = 8) AS endReached
FROM edge
WHERE startNode = 1
UNION ALL
SELECT -- Concatenate new edge to the path
paths.startNode AS startNode,
node2id AS endNode,
array_append(path, node2id) AS path,
max(CASE WHEN node2id = 8 THEN 1 ELSE 0 END)
OVER (ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS endReached
FROM paths
JOIN edge ON paths.endNode = node1id
WHERE NOT EXISTS (
FROM paths previous_paths
WHERE list_contains(previous_paths.path, node2id)
)
AND paths.endReached = 0
)
SELECT startNode, endNode, path
FROM paths
WHERE endNode = 8
ORDER BY length(path), path;
startNodeendNodepath
18[1, 3, 8]
18[1, 5, 8]

USING KEY 的递归 CTE

USING KEY 会改变常规递归 CTE 的行为。

在每次迭代中,常规递归 CTE 会将结果行追加到并集表(union table)中,并最终由该表构成 CTE 的整体结果。相比之下,带 USING KEY 的 CTE 可以更新此前迭代中已放入并集表的行:如果当前迭代产生键为 k 的行,它会替换并集表中同键 k 的行(类似字典)。如果并集表中还没有该键,则新行按常规追加。

这使 CTE 能对并集表内容进行细粒度控制。避免“只追加”行为通常可显著减小并集表规模,从而改善查询运行时间和内存占用,并且让我们可以在迭代尚未结束时访问并集表(常规递归 CTE 无法做到)。在 WITH RECURSIVE T(...) USING KEY ... 中,表 T 表示上一轮迭代新增的行(递归 CTE 的常规语义),而 recurring.T 表示迄今为止构建的并集表。通过引用 recurring.T,可以将较复杂算法更自然、清晰地表达为 SQL。

示例:USING KEY

这是一个递归 CTE 示例,其中 USING KEY 指定了键列(a)和负载列(b)。 负载列对应会被覆盖更新的列。 第一轮迭代中有两个不同键:12。 这两个键会产生两行新数据:(1, 3)(2, 4)。 下一轮迭代会产生一个新键 3,从而生成新行。 同时还会生成 (2, 3),其中 2 是上一轮已存在的键。 这会用新的负载 3 覆盖旧的负载 4

WITH RECURSIVE tbl(a, b) USING KEY (a) AS (
SELECT a, b
FROM (VALUES (1, 3), (2, 4)) t(a, b)
UNION
SELECT a + 1, b
FROM tbl
WHERE a < 3
)
SELECT *
FROM tbl;
ab
13
23
33

使用 VALUES

你可以在 CTE 的初始(锚点)部分使用 VALUES 子句:

WITH RECURSIVE tbl(a, b) USING KEY (a) AS (
VALUES (1, 3), (2, 4)
UNION
SELECT a + 1, b
FROM tbl
WHERE a < 3
)
SELECT *
FROM tbl;

示例:USING KEY 引用并集表

除了将并集表当作字典使用之外,我们现在还可以在查询中直接引用它。这使你不仅能使用上一轮迭代结果,也能使用更早轮次的结果。这个新特性让某些算法更容易实现。

一个示例是连通分量算法。该算法会为每个节点确定与其连通的最小 ID 节点。为此,我们利用并集表中的条目跟踪每个节点当前发现的最小 ID;如果新到达的行包含更小 ID,就更新该值。 Example graph Example graph

CREATE TABLE nodes (id INTEGER);
INSERT INTO nodes VALUES (1), (2), (3), (4), (5), (6), (7), (8);
CREATE TABLE edges (node1id INTEGER, node2id INTEGER);
INSERT INTO edges VALUES
(1, 3), (2, 3), (3, 7), (7, 8), (5, 4), (6, 4);
WITH RECURSIVE connected_components(id, comp) USING KEY (id) AS (
SELECT n.id, n.id AS comp
FROM nodes AS n
UNION (
SELECT DISTINCT ON (previous_iter.id) previous_iter.id, initial_iter.comp
FROM
recurring.connected_components AS previous_iter,
connected_components AS initial_iter,
edges AS e
WHERE ((e.node1id, e.node2id) = (previous_iter.id, initial_iter.id)
OR (e.node2id, e.node1id) = (previous_iter.id, initial_iter.id))
AND initial_iter.comp < previous_iter.comp
ORDER BY initial_iter.id ASC, previous_iter.comp ASC)
)
TABLE connected_components
ORDER BY id;
idcomp
11
21
31
44
54
64
71
81

限制

Goose 不支持互相递归的 CTE。详见 Goose 仓库中的相关 issue 与讨论

语法