Full-Text Search Extension

Full-Text Search is an extension to Goose that allows searching through strings, similar to SQLite's FTS5 extension.

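Installing and Loading

The fts extension is autoloaded on demand from the official extension repository on first use. To install and load it manually, run: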

INSTALL fts;
LOAD fts;

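Usage

The extension adds two PRAGMA statements to Goose: one to create an index and one to drop it. It also adds a scalar macro, stem, which the extension uses internally.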

`PRAGMA create_fts_index`

create_fts_index(input_table, input_id, *input_values, stemmer = 'porter',
                 stopwords = 'english', ignore = '(\\.|[^a-z])+',
                 strip_accents = 1, lower = 1, overwrite = 0)

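PRAGMA that creates an FTS index for the specified table.

Name	Type	Description
`input_table`	`VARCHAR`	Qualified name of the specified table, e.g., `'table_name'` or `'main.table_name'`
`input_id`	`VARCHAR`	Column name of the document identifier, e.g., `'document_identifier'`
`input_values...`	`VARCHAR`	Column names of the text fields to be indexed (vararg), e.g., `'text_field_1'`, `'text_field_2'`, ..., `'text_field_N'`, or `'*'` for all columns of type `VARCHAR` in `input_table`
`stemmer`	`VARCHAR`	The type of stemmer to use. One of `'arabic'`, `'basque'`, `'catalan'`, `'danish'`, `'dutch'`, `'english'`, `'finnish'`, `'french'`, `'german'`, `'greek'`, `'hindi'`, `'hungarian'`, `'indonesian'`, `'irish'`, `'italian'`, `'lithuanian'`, `'nepali'`, `'norwegian'`, `'porter'`, `'portuguese'`, `'romanian'`, `'russian'`, `'serbian'`, `'spanish'`, `'swedish'`, `'tamil'`, `'turkish'`, or `'none'` if no stemming is to be used. Default `'porter'`
`stopwords`	`VARCHAR`	Qualified name of a table containing a single `VARCHAR` column with the stopwords, or `'none'` if no stopwords are to be used. Default `'english'` (a built-in list of 571 English stopwords)
`ignore`	`VARCHAR`	Regular expression of patterns to be ignored. Default `'(\\.|[^a-z])+'`
`strip_accents`	`BOOLEAN`	Whether to remove accents (e.g., convert `á` to `a`). Default `1`
`lower`	`BOOLEAN`	Whether to convert all text to lowercase. Default `1`
`overwrite`	`BOOLEAN`	Whether to overwrite an existing index on the table. Default `0`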

This PRAGMA builds the index under a newly created schema. The schema name is derived from the input table name: if the index is created on 'main.table_name', the schema is named 'fts_main_table_name'.
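
As a sketch of how the optional parameters and the derived schema name fit together, the statement below indexes a hypothetical table `main.documents`. The table and column names are assumptions for illustration, and the parameter syntax follows the signature shown above. Under the naming rule just described, the index would be expected to live in a schema named `fts_main_documents`.

```sql
-- Hypothetical table and column names, used only for illustration.
-- Index one text column with the 'english' stemmer, no stopword filtering,
-- and replace any index that already exists on the table.
PRAGMA create_fts_index(
    'main.documents', 'document_identifier', 'text_content',
    stemmer = 'english', stopwords = 'none', overwrite = 1
);
```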

`PRAGMA drop_fts_index`

drop_fts_index(input_table)

Drops the FTS index for the specified table.

Name	Type	Description
`input_table`	`VARCHAR`	Qualified name of the input table, e.g., `'table_name'` or `'main.table_name'`
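
A minimal sketch of dropping an index, assuming one was previously created on a table named `documents` (a hypothetical name at this point in the page):

```sql
-- Remove the FTS index built on the (assumed) documents table.
PRAGMA drop_fts_index('documents');
```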

`match_bm25` Function

match_bm25(input_id, query_string, fields := NULL, k := 1.2, b := 0.75, conjunctive := 0)

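Once the index has been built, this retrieval macro is created and can be used to search it.

Name	Type	Description
`input_id`	`VARCHAR`	Column name of the document identifier, e.g., `'document_identifier'`
`query_string`	`VARCHAR`	The string to search for in the index
`fields`	`VARCHAR`	Comma-separated list of fields to search in, e.g., `'text_field_2, text_field_N'`. Default `NULL` (search all indexed fields)
`k`	`DOUBLE`	Parameter k₁ in the Okapi BM25 retrieval model. Default `1.2`
`b`	`DOUBLE`	Parameter b in the Okapi BM25 retrieval model. Default `0.75`
`conjunctive`	`BOOLEAN`	Whether to make the query conjunctive, i.e., all terms in the query string must be present for a document to be retrieved. Default `0`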

`stem` Function

stem(input_string, stemmer)

Reduces words to their stem. Used internally by the extension.

Name	Type	Description
`input_string`	`VARCHAR`	The column or constant to be stemmed
`stemmer`	`VARCHAR`	The type of stemmer to use. One of `'arabic'`, `'basque'`, `'catalan'`, `'danish'`, `'dutch'`, `'english'`, `'finnish'`, `'french'`, `'german'`, `'greek'`, `'hindi'`, `'hungarian'`, `'indonesian'`, `'irish'`, `'italian'`, `'lithuanian'`, `'nepali'`, `'norwegian'`, `'porter'`, `'portuguese'`, `'romanian'`, `'russian'`, `'serbian'`, `'spanish'`, `'swedish'`, `'tamil'`, `'turkish'`, or `'none'` if no stemming is to be used
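
Although `stem` is intended for internal use, calling it directly can help when checking how a given stemmer treats a term. A small sketch, following the signature above; the input word is an arbitrary example:

```sql
-- Stem a constant with the 'porter' stemmer,
-- and pass the same constant through with 'none' (no stemming).
SELECT stem('feathers', 'porter') AS stemmed,
       stem('feathers', 'none') AS unstemmed;
```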

Example Usage

Create a table and fill it with text data:

CREATE TABLE documents (
    document_identifier VARCHAR,
    text_content VARCHAR,
    author VARCHAR,
    doc_version INTEGER
);
INSERT INTO documents
    VALUES ('doc1',
            'The mallard is a dabbling duck that breeds throughout the temperate.',
            'Hannes Mühleisen',
            3),
           ('doc2',
            'The cat is a domestic species of small carnivorous mammal.',
            'Laurens Kuiper',
            2
           );

Build the index, making both the text_content and author columns searchable:

PRAGMA create_fts_index(
    'documents', 'document_identifier', 'text_content', 'author'
);

Search the author field index for documents written by Muhleisen. This retrieves doc1:

SELECT document_identifier, text_content, score
FROM (
    SELECT *, fts_main_documents.match_bm25(
        document_identifier,
        'Muhleisen',
        fields := 'author'
    ) AS score
    FROM documents
) sq
WHERE score IS NOT NULL
  AND doc_version > 2
ORDER BY score DESC;

document_identifier	text_content	score
doc1	The mallard is a dabbling duck that breeds throughout the temperate.	0.0

Search for documents about small cats. This retrieves doc2:

SELECT document_identifier, text_content, score
FROM (
    SELECT *, fts_main_documents.match_bm25(
        document_identifier,
        'small cats'
    ) AS score
    FROM documents
) sq
WHERE score IS NOT NULL
ORDER BY score DESC;

document_identifier	text_content	score
doc2	The cat is a domestic species of small carnivorous mammal.	0.0
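
The `conjunctive` parameter can be sketched in the same way. Assuming the `documents` table and index from this example, the query below requires every term of the query string to appear in a document. Since neither sample document mentions both a duck and something small, it would be expected to match fewer documents than the default, non-conjunctive form of the same query.

```sql
-- Conjunctive search: a document matches only if it contains all query terms.
SELECT document_identifier, text_content, score
FROM (
    SELECT *, fts_main_documents.match_bm25(
        document_identifier,
        'small duck',
        conjunctive := 1
    ) AS score
    FROM documents
) sq
WHERE score IS NOT NULL
ORDER BY score DESC;
```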

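Warning: The FTS index does not update automatically when the input table changes. As a workaround for this limitation, recreate the index to refresh it.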
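
One way to apply this workaround, sketched with the `documents` table from the example above, is to rerun the index creation with `overwrite = 1` after the table has changed:

```sql
-- Rebuild the index so that it reflects the current contents of documents.
PRAGMA create_fts_index(
    'documents', 'document_identifier', 'text_content', 'author',
    overwrite = 1
);
```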
