Skip to main content

Full-Text Search

Goose supports full-text search via the fts extension. A full-text index allows for a query to quickly search for all occurrences of individual words within longer text strings.

Example: Shakespeare Corpus

Here's an example of building a full-text index of Shakespeare's plays.

CREATE TABLE corpus AS
SELECT * FROM 'https://${uri}/shakespeare.parquet';
DESCRIBE corpus;
column_namecolumn_typenullkeydefaultextra
line_idVARCHARYESNULLNULLNULL
play_nameVARCHARYESNULLNULLNULL
line_numberVARCHARYESNULLNULLNULL
speakerVARCHARYESNULLNULLNULL
text_entryVARCHARYESNULLNULLNULL

The text of each line is in text_entry, and a unique key for each line is in line_id.

Creating a Full-Text Search Index

First, we create the index, specifying the table name, the unique id column, and the column(s) to index. We will just index the single column text_entry, which contains the text of the lines in the play.

PRAGMA create_fts_index('corpus', 'line_id', 'text_entry');

The table is now ready to query using the Okapi BM25 ranking function. Rows with no match return a NULL score.

What does Shakespeare say about butter?

SELECT
fts_main_corpus.match_bm25(line_id, 'butter') AS score,
line_id, play_name, speaker, text_entry
FROM corpus
WHERE score IS NOT NULL
ORDER BY score DESC;
scoreline_idplay_namespeakertext_entry
4.427313429798464H4/2.4.494Henry IVCarrierAs fat as butter.
3.836270302568675H4/1.2.21Henry IVFALSTAFFprologue to an egg and butter.
3.836270302568675H4/2.1.55Henry IVChamberlainThey are up already, and call for eggs and butter;
3.3844488405497115H4/4.2.21Henry IVFALSTAFFtoasts-and-butter, with hearts in their bellies no
3.3844488405497115H4/4.2.62Henry IVPRINCE HENRYalready made thee butter. But tell me, Jack, whose
3.3844488405497115AWW/4.1.40Alls well that ends wellPAROLLESbutter-womans mouth and buy myself another of
3.3844488405497115AYLI/3.2.93As you like itTOUCHSTONEright butter-womens rank to market.
3.3844488405497115KL/2.4.132King LearFoolkindness to his horse, buttered his hay.
3.0278411214953107AWW/5.2.9Alls well that ends wellClownhenceforth eat no fish of fortunes buttering.
3.0278411214953107MWW/2.2.260Merry Wives of WindsorFALSTAFFHang him, mechanical salt-butter rogue! I will
3.0278411214953107MWW/2.2.284Merry Wives of WindsorFORDrather trust a Fleming with my butter, Parson Hugh
3.0278411214953107MWW/3.5.7Merry Wives of WindsorFALSTAFFIll have my brains taen out and buttered, and give
3.0278411214953107MWW/3.5.102Merry Wives of WindsorFALSTAFFto heat as butter; a man of continual dissolution
2.739219044070792H4/2.4.115Henry IVPRINCE HENRYDidst thou never see Titan kiss a dish of butter?

Unlike standard indexes, full-text indexes don't auto-update as the underlying data is changed, so you need to PRAGMA drop_fts_index(my_fts_index) and recreate it when appropriate.

Note on Generating the Corpus Table

For more details, see the “Generating a Shakespeare corpus for full-text searching from JSON”.

  • The Columns are: line_id, play_name, line_number, speaker, text_entry.
  • We need a unique key for each row in order for full-text searching to work.
  • The line_id KL/2.4.132 means King Lear, Act 2, Scene 4, Line 132.