Skip to main content

Full-Text Search Extension

Full-Text Search is an extension to Goose that allows for search through strings, similar to SQLite's FTS5 extension.

Installing and Loading

The fts extension will be transparently autoloaded on first use from the official extension repository. If you would like to install and load it manually, run:

INSTALL fts;
LOAD fts;

Usage

The extension adds two PRAGMA statements to Goose: one to create, and one to drop an index. Additionally, a scalar macro stem is added, which is used internally by the extension.

PRAGMA create_fts_index

create_fts_index(input_table, input_id, *input_values, stemmer = 'porter',
stopwords = 'english', ignore = '(\\.|[^a-z])+',
strip_accents = 1, lower = 1, overwrite = 0)

PRAGMA that creates a FTS index for the specified table.

NameTypeDescription
input_tableVARCHARQualified name of specified table, e.g., 'table_name' or 'main.table_name'
input_idVARCHARColumn name of document identifier, e.g., 'document_identifier'
input_values...VARCHARColumn names of the text fields to be indexed (vararg), e.g., 'text_field_1', 'text_field_2', ..., 'text_field_N', or '\*' for all columns in input_table of type VARCHAR
stemmerVARCHARThe type of stemmer to be used. One of 'arabic', 'basque', 'catalan', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hindi', 'hungarian', 'indonesian', 'irish', 'italian', 'lithuanian', 'nepali', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'serbian', 'spanish', 'swedish', 'tamil', 'turkish', or 'none' if no stemming is to be used. Defaults to 'porter'
stopwordsVARCHARQualified name of table containing a single VARCHAR column containing the desired stopwords, or 'none' if no stopwords are to be used. Defaults to 'english' for a pre-defined list of 571 English stopwords
ignoreVARCHARRegular expression of patterns to be ignored. Defaults to `'(\.
strip_accentsBOOLEANWhether to remove accents (e.g., convert á to a). Defaults to 1
lowerBOOLEANWhether to convert all text to lowercase. Defaults to 1
overwriteBOOLEANWhether to overwrite an existing index on a table. Defaults to 0

This PRAGMA builds the index under a newly created schema. The schema will be named after the input table: if an index is created on table 'main.table_name', then the schema will be named 'fts_main_table_name'.

PRAGMA drop_fts_index

drop_fts_index(input_table)

Drops a FTS index for the specified table.

NameTypeDescription
input_tableVARCHARQualified name of input table, e.g., 'table_name' or 'main.table_name'

match_bm25 Function

match_bm25(input_id, query_string, fields := NULL, k := 1.2, b := 0.75, conjunctive := 0)

When an index is built, this retrieval macro is created that can be used to search the index.

NameTypeDescription
input_idVARCHARColumn name of document identifier, e.g., 'document_identifier'
query_stringVARCHARThe string to search the index for
fieldsVARCHARComma-separated list of fields to search in, e.g., 'text_field_2, text_field_N'. Defaults to NULL to search all indexed fields
kDOUBLEParameter k1 in the Okapi BM25 retrieval model. Defaults to 1.2
bDOUBLEParameter b in the Okapi BM25 retrieval model. Defaults to 0.75
conjunctiveBOOLEANWhether to make the query conjunctive i.e., all terms in the query string must be present in order for a document to be retrieved

stem Function

stem(input_string, stemmer)

Reduces words to their base. Used internally by the extension.

NameTypeDescription
input_stringVARCHARThe column or constant to be stemmed.
stemmerVARCHARThe type of stemmer to be used. One of 'arabic', 'basque', 'catalan', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hindi', 'hungarian', 'indonesian', 'irish', 'italian', 'lithuanian', 'nepali', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'serbian', 'spanish', 'swedish', 'tamil', 'turkish', or 'none' if no stemming is to be used.

Example Usage

Create a table and fill it with text data:

CREATE TABLE documents (
document_identifier VARCHAR,
text_content VARCHAR,
author VARCHAR,
doc_version INTEGER
);
INSERT INTO documents
VALUES ('doc1',
'The mallard is a dabbling duck that breeds throughout the temperate.',
'Hannes Mühleisen',
3),
('doc2',
'The cat is a domestic species of small carnivorous mammal.',
'Laurens Kuiper',
2
);

Build the index, and make both the text_content and author columns searchable.

PRAGMA create_fts_index(
'documents', 'document_identifier', 'text_content', 'author'
);

Search the author field index for documents that are authored by Muhleisen. This retrieves doc1:

SELECT document_identifier, text_content, score
FROM (
SELECT *, fts_main_documents.match_bm25(
document_identifier,
'Muhleisen',
fields := 'author'
) AS score
FROM documents
) sq
WHERE score IS NOT NULL
AND doc_version > 2
ORDER BY score DESC;
document_identifiertext_contentscore
doc1The mallard is a dabbling duck that breeds throughout the temperate.0.0

Search for documents about small cats. This retrieves doc2:

SELECT document_identifier, text_content, score
FROM (
SELECT *, fts_main_documents.match_bm25(
document_identifier,
'small cats'
) AS score
FROM documents
) sq
WHERE score IS NOT NULL
ORDER BY score DESC;
document_identifiertext_contentscore
doc2The cat is a domestic species of small carnivorous mammal.0.0

Warning The FTS index will not update automatically when the input table changes. A workaround of this limitation can be recreating the index to refresh.