全文搜索
Goose 通过 fts 扩展支持全文搜索。
全文索引可让查询快速检索长文本中某个词的所有出现位置。
示例:莎士比亚语料库
下面示例展示如何为莎士比亚戏剧语料构建全文索引。
CREATE TABLE corpus AS
SELECT * FROM 'https://${uri}/shakespeare.parquet';
DESCRIBE corpus;
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| line_id | VARCHAR | YES | NULL | NULL | NULL |
| play_name | VARCHAR | YES | NULL | NULL | NULL |
| line_number | VARCHAR | YES | NULL | NULL | NULL |
| speaker | VARCHAR | YES | NULL | NULL | NULL |
| text_entry | VARCHAR | YES | NULL | NULL | NULL |
每行文本内容在 text_entry,每行唯一键在 line_id。
创建全文搜索索引
首先创建索引,并指定表名、唯一 id 列和待索引列。
这里我们仅索引 text_entry,即戏剧台词文本列。
PRAGMA create_fts_index('corpus', 'line_id', 'text_entry');
现在可以使用 Okapi BM25 排序函数进行查询。
未命中的行会返回 NULL 分数。
莎士比亚如何提到 butter?
SELECT
fts_main_corpus.match_bm25(line_id, 'butter') AS score,
line_id, play_name, speaker, text_entry
FROM corpus
WHERE score IS NOT NULL
ORDER BY score DESC;
| score | line_id | play_name | speaker | text_entry |
|---|---|---|---|---|
| 4.427313429798464 | H4/2.4.494 | Henry IV | Carrier | As fat as butter. |
| 3.836270302568675 | H4/1.2.21 | Henry IV | FALSTAFF | prologue to an egg and butter. |
| 3.836270302568675 | H4/2.1.55 | Henry IV | Chamberlain | They are up already, and call for eggs and butter; |
| 3.3844488405497115 | H4/4.2.21 | Henry IV | FALSTAFF | toasts-and-butter, with hearts in their bellies no |
| 3.3844488405497115 | H4/4.2.62 | Henry IV | PRINCE HENRY | already made thee butter. But tell me, Jack, whose |
| 3.3844488405497115 | AWW/4.1.40 | Alls well that ends well | PAROLLES | butter-womans mouth and buy myself another of |
| 3.3844488405497115 | AYLI/3.2.93 | As you like it | TOUCHSTONE | right butter-womens rank to market. |
| 3.3844488405497115 | KL/2.4.132 | King Lear | Fool | kindness to his horse, buttered his hay. |
| 3.0278411214953107 | AWW/5.2.9 | Alls well that ends well | Clown | henceforth eat no fish of fortunes buttering. |
| 3.0278411214953107 | MWW/2.2.260 | Merry Wives of Windsor | FALSTAFF | Hang him, mechanical salt-butter rogue! I will |
| 3.0278411214953107 | MWW/2.2.284 | Merry Wives of Windsor | FORD | rather trust a Fleming with my butter, Parson Hugh |
| 3.0278411214953107 | MWW/3.5.7 | Merry Wives of Windsor | FALSTAFF | Ill have my brains taen out and buttered, and give |
| 3.0278411214953107 | MWW/3.5.102 | Merry Wives of Windsor | FALSTAFF | to heat as butter; a man of continual dissolution |
| 2.739219044070792 | H4/2.4.115 | Henry IV | PRINCE HENRY | Didst thou never see Titan kiss a dish of butter? |
与标准索引不同,全文索引不会在底层数据变化后自动更新,
因此你需要在合适时机执行 PRAGMA drop_fts_index(my_fts_index) 并重建索引。
关于生成语料表的说明
更多细节请参见 “Generating a Shakespeare corpus for full-text searching from JSON”。
- 列为:line_id、play_name、line_number、speaker、text_entry。
- 全文搜索要求每行都具有唯一键。
line_idKL/2.4.132表示《李尔王》第 2 幕第 4 场第 132 行。