
Full-Text Search

Goose supports full-text search through the fts extension. A full-text index lets a query quickly find every occurrence of a word in long text.
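If the extension is not already available in a session, it may need to be installed and loaded before any of the functions on this page can be used. The two statements below are an assumption about how Goose manages extensions; this page does not show them, so treat them as a sketch and check the extension documentation for the exact syntax.

```sql
-- Assumption: Goose installs and loads extensions with INSTALL / LOAD.
INSTALL fts;
LOAD fts;
```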

Example: Shakespeare Corpus

The following example builds a full-text index over a corpus of Shakespeare's plays.

CREATE TABLE corpus AS
SELECT * FROM 'https://${uri}/shakespeare.parquet';
DESCRIBE corpus;
column_name  column_type  null  key   default  extra
line_id      VARCHAR      YES   NULL  NULL     NULL
play_name    VARCHAR      YES   NULL  NULL     NULL
line_number  VARCHAR      YES   NULL  NULL     NULL
speaker      VARCHAR      YES   NULL  NULL     NULL
text_entry   VARCHAR      YES   NULL  NULL     NULL

The text of each line is stored in text_entry, and the unique key for each line is in line_id.

Creating a Full-Text Search Index

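First, create the index, specifying the table name, the unique id column, and the column(s) to index. Here we index only text_entry, the column that holds the text of each line of the plays.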

PRAGMA create_fts_index('corpus', 'line_id', 'text_entry');

The table can now be queried with the Okapi BM25 ranking function. Rows that do not match the query get a NULL score.

How does Shakespeare use the word butter?

SELECT
    fts_main_corpus.match_bm25(line_id, 'butter') AS score,
    line_id, play_name, speaker, text_entry
FROM corpus
WHERE score IS NOT NULL
ORDER BY score DESC;
score               line_id      play_name                 speaker       text_entry
4.427313429798464   H4/2.4.494   Henry IV                  Carrier       As fat as butter.
3.836270302568675   H4/1.2.21    Henry IV                  FALSTAFF      prologue to an egg and butter.
3.836270302568675   H4/2.1.55    Henry IV                  Chamberlain   They are up already, and call for eggs and butter;
3.3844488405497115  H4/4.2.21    Henry IV                  FALSTAFF      toasts-and-butter, with hearts in their bellies no
3.3844488405497115  H4/4.2.62    Henry IV                  PRINCE HENRY  already made thee butter. But tell me, Jack, whose
3.3844488405497115  AWW/4.1.40   Alls well that ends well  PAROLLES      butter-womans mouth and buy myself another of
3.3844488405497115  AYLI/3.2.93  As you like it            TOUCHSTONE    right butter-womens rank to market.
3.3844488405497115  KL/2.4.132   King Lear                 Fool          kindness to his horse, buttered his hay.
3.0278411214953107  AWW/5.2.9    Alls well that ends well  Clown         henceforth eat no fish of fortunes buttering.
3.0278411214953107  MWW/2.2.260  Merry Wives of Windsor    FALSTAFF      Hang him, mechanical salt-butter rogue! I will
3.0278411214953107  MWW/2.2.284  Merry Wives of Windsor    FORD          rather trust a Fleming with my butter, Parson Hugh
3.0278411214953107  MWW/3.5.7    Merry Wives of Windsor    FALSTAFF      Ill have my brains taen out and buttered, and give
3.0278411214953107  MWW/3.5.102  Merry Wives of Windsor    FALSTAFF      to heat as butter; a man of continual dissolution
2.739219044070792   H4/2.4.115   Henry IV                  PRINCE HENRY  Didst thou never see Titan kiss a dish of butter?
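Because the BM25 score is an ordinary column in the result, it can be combined with regular SQL predicates and a LIMIT. The variation below is a sketch that reuses the same index and the same match_bm25 call, restricts the matches to a single play, and keeps only the five best-scoring lines; the play name and the cutoff of five are arbitrary choices for illustration.

```sql
-- Top five 'butter' matches, restricted to one play.
SELECT
    fts_main_corpus.match_bm25(line_id, 'butter') AS score,
    line_id, speaker, text_entry
FROM corpus
WHERE score IS NOT NULL
  AND play_name = 'Henry IV'
ORDER BY score DESC
LIMIT 5;
```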

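Unlike standard indexes, a full-text index is not updated automatically when the underlying data changes, so you need to run PRAGMA drop_fts_index(my_fts_index) and recreate the index when appropriate.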
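A rebuild after the corpus table has changed might look like the sketch below. It assumes that drop_fts_index accepts the name of the indexed table ('corpus' here); this page writes the argument only as my_fts_index and does not spell out what it refers to, so check the fts extension reference before relying on it.

```sql
-- Assumption: drop_fts_index takes the name of the indexed table.
PRAGMA drop_fts_index('corpus');

-- Recreate the index with the same arguments used earlier on this page.
PRAGMA create_fts_index('corpus', 'line_id', 'text_entry');
```

Queries issued between the two statements would have no index to search, so it is probably safest to run the pair together.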

Note on Generating the Corpus Table

For more details, see “Generating a Shakespeare corpus for full-text searching from JSON”.

  • The columns are: line_id, play_name, line_number, speaker, text_entry.
  • Full-text search requires every row to have a unique key.
  • The line_id KL/2.4.132 means King Lear, Act 2, Scene 4, Line 132.
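Since the index depends on a unique key and DESCRIBE reports line_id as nullable, it can be worth checking the column before building the index. The two queries below are a minimal sketch in plain SQL and do not rely on the fts extension; the aliases occurrences and missing_ids are only illustrative names.

```sql
-- Lists any line_id that appears more than once.
-- An empty result means the non-NULL keys are unique.
SELECT line_id, count(*) AS occurrences
FROM corpus
GROUP BY line_id
HAVING count(*) > 1;

-- Counts rows that have no key at all.
SELECT count(*) AS missing_ids
FROM corpus
WHERE line_id IS NULL;
```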