FILTER 子句
在 SELECT 语句中,FILTER 子句可以选择性地跟在聚合函数之后。它会像 WHERE 子句过滤行那样,过滤传入该聚合函数的数据行,但作用范围仅限于该特定聚合函数。
这种写法在多种场景下都很有用,例如为多个聚合分别使用不同过滤条件,或将数据集透视为列视图。与下文讨论的传统 CASE WHEN 方法相比,FILTER 在数据透视场景中语法更简洁。
有些聚合函数不会自动排除 NULL 值,因此在某些情况下,使用 FILTER 子句可以得到有效结果,而 CASE WHEN 方法则不行。这种情况会出现在 first 和 last 函数中;它们常用于非再聚合型透视操作,此时目标只是把数据重新排布为列,而不是再次聚合。对于 list 和 array_agg 函数,FILTER 在处理 NULL 时也更好:CASE WHEN 会把 NULL 保留在列表结果里,而 FILTER 会将其移除。
示例
返回以下结果:
- 总行数
- 满足
i <= 5的行数 i为奇数的行数
SELECT
count() AS total_rows,
count() FILTER (i <= 5) AS lte_five,
count() FILTER (i % 2 = 1) AS odds
FROM generate_series(1, 10) tbl(i);
| total_rows | lte_five | odds |
|---|---|---|
| 10 | 5 | 5 |
仅统计满足某条件的行数时,也可以不使用
FILTER子句,而是使用布尔sum聚合函数,例如sum(i <= 5)。
可以使用不同的聚合函数,也允许使用多个 WHERE 表达式:
SELECT
sum(i) FILTER (i <= 5) AS lte_five_sum,
median(i) FILTER (i % 2 = 1) AS odds_median,
median(i) FILTER (i % 2 = 1 AND i <= 5) AS odds_lte_five_median
FROM generate_series(1, 10) tbl(i);
| lte_five_sum | odds_median | odds_lte_five_median |
|---|---|---|
| 15 | 5.0 | 3.0 |
FILTER 子句还可用于将数据从行透视到列。这是静态透视,因为在 SQL 中列必须在运行前定义。不过,这类语句可以在宿主编程语言中动态生成,以利用 Goose 的 SQL 引擎进行快速、可超出内存规模的数据透视。
先生成一个示例数据集:
CREATE TEMP TABLE stacked_data AS
SELECT
i,
CASE WHEN i <= rows * 0.25 THEN 2022
WHEN i <= rows * 0.5 THEN 2023
WHEN i <= rows * 0.75 THEN 2024
WHEN i <= rows * 0.875 THEN 2025
ELSE NULL
END AS year
FROM (
SELECT
i,
count(*) OVER () AS rows
FROM generate_series(1, 100_000_000) tbl(i)
) tbl;
按年份将数据“透视”展开(把每个年份移到独立列中):
SELECT
count(i) FILTER (year = 2022) AS "2022",
count(i) FILTER (year = 2023) AS "2023",
count(i) FILTER (year = 2024) AS "2024",
count(i) FILTER (year = 2025) AS "2025",
count(i) FILTER (year IS NULL) AS "NULLs"
FROM stacked_data;
下面的语法与上面的 FILTER 子句会产生相同结果:
SELECT
count(CASE WHEN year = 2022 THEN i END) AS "2022",
count(CASE WHEN year = 2023 THEN i END) AS "2023",
count(CASE WHEN year = 2024 THEN i END) AS "2024",
count(CASE WHEN year = 2025 THEN i END) AS "2025",
count(CASE WHEN year IS NULL THEN i END) AS "NULLs"
FROM stacked_data;
| 2022 | 2023 | 2024 | 2025 | NULLs |
|---|---|---|---|---|
| 25000000 | 25000000 | 25000000 | 12500000 | 12500000 |
但是,当使用不会忽略 NULL 值的聚合函数时,CASE WHEN 方法无法按预期工作。first 函数就属于这类情况,因此此时更推荐使用 FILTER。
按年份将数据“透视”展开(把每个年份移到独立列中):
SELECT
first(i) FILTER (year = 2022) AS "2022",
first(i) FILTER (year = 2023) AS "2023",
first(i) FILTER (year = 2024) AS "2024",
first(i) FILTER (year = 2025) AS "2025",
first(i) FILTER (year IS NULL) AS "NULLs"
FROM stacked_data;
| 2022 | 2023 | 2024 | 2025 | NULLs |
|---|---|---|---|---|
| 1474561 | 25804801 | 50749441 | 76431361 | 87500001 |
当 CASE WHEN 子句首次求值返回 NULL 时,下面的写法会产生 NULL 值:
SELECT
first(CASE WHEN year = 2022 THEN i END) AS "2022",
first(CASE WHEN year = 2023 THEN i END) AS "2023",
first(CASE WHEN year = 2024 THEN i END) AS "2024",
first(CASE WHEN year = 2025 THEN i END) AS "2025",
first(CASE WHEN year IS NULL THEN i END) AS "NULLs"
FROM stacked_data;
| 2022 | 2023 | 2024 | 2025 | NULLs |
|---|---|---|---|---|
| 1228801 | NULL | NULL | NULL | NULL |