Roger Binns — Sun 17 November 2024
TLDR: APSW now has
comprehensive support for SQLite Full Text Search.
SQLite has a nice full text search implementation named FTS5. (Yes, the older FTS3 and FTS4 also exist.)
Full text search engines are in theory fairly simple - break content
into terms (typically words), index where in the content each term
occurs, and answer a query's terms by consulting that index. Add in a
way of ranking the results, typically by how rare each term is overall
and how well represented it is in an item of content. (The five-decade-old
BM25 can do that.)
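As a rough illustration of the idea (not FTS5's actual code), a minimal BM25 scorer really is this small - rare terms get a high IDF weight, and term frequency is dampened and normalized by document length:

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_len,
               k1=1.2, b=0.75):
    """Minimal BM25 sketch: rarer terms score higher (IDF), and terms
    well represented relative to document length score higher too."""
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)        # occurrences in this document
        df = doc_freq.get(term, 0)        # documents containing the term
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)
        score += idf * (tf * (k1 + 1)) / (
            tf + k1 * (1 - b + b * len(doc_terms) / avg_len))
    return score
```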
Make the query language a bit more complex so you have AND, OR, and
NOT. Perhaps add NEAR and a way of including or excluding columns.
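FTS5's query language has all of these. A quick demonstration using the standard library sqlite3 module (assuming your Python's SQLite build includes FTS5, as most do):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
con.execute("INSERT INTO docs VALUES('dinner', "
            "'slow cooked lamb with rosemary')")

for query in (
    "lamb AND rosemary",        # AND is also implicit between terms
    "lamb NOT beef",
    "NEAR(lamb rosemary, 10)",  # terms within 10 tokens of each other
    "{body}: lamb",             # restrict matching to the body column
):
    rows = con.execute("SELECT title FROM docs WHERE docs MATCH ?",
                       (query,)).fetchall()
    print(query, "->", rows)
```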
FTS5 does all of that very well, and is fast. It offers a C API for
writing your own tokenizers (converting content into terms/words), and
for writing your own auxiliary functions, such as for ranking or
showing highlights. Several months ago I decided to wrap those APIs
in APSW, figuring it would take a few weeks at most. (C code requires
lots of testing.) It did of course take several months, but the
result is something I am very proud of.
The tokenizer API is a little
confusing. It is how object oriented C is done, where you have
structures of function pointers and pass this as the
first parameter. I did get confused over what the analogues to
classes and instances were, and had to revise my first implementation.
The FTS5 tokenizer is based on Unicode 6.1 (2012) and stays that way
for stability. Unicode is now at version 16, with annual
updates. Python has the unicodedata module, which is more
up to date, so I implemented a tokenizer using that as the data source.
I soon discovered that it wasn't really workable, for two reasons. The
first is that splitting text into words by whitespace doesn't work.
Some popular languages like Chinese and Japanese don't use spaces,
some use spaces between syllables, some use other schemes, and even
English gets complicated when you have punctuation - is "don't" one
word or two? What about "run-down"?
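You can see the problem in a few lines of plain Python:

```python
# Whitespace splitting fails outside the simplest cases
print("東京に住んでいます".split())    # Japanese: no spaces, one giant "word"
print("don't stop".split())           # is "don't" one word or two?
print("a run-down building".split())  # hyphenated words stay glued together
```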
The second is that operating on each codepoint separately breaks apart
what are single user-perceived characters (aka grapheme clusters).
You can have a codepoint plus one or more combining accents and marks.
It was especially noticeable with emoji, where 🤦🏼‍♂️ is actually 5
different codepoints.
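The standard library makes that easy to verify:

```python
import unicodedata

# One user-perceived character, five codepoints
for ch in "🤦🏼‍♂️":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
# U+1F926 FACE PALM
# U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3
# U+200D ZERO WIDTH JOINER
# U+2642 MALE SIGN
# U+FE0F VARIATION SELECTOR-16
```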
That led me to Unicode Technical Report #29, all about text
segmentation. It tells you how to correctly break text into grapheme
clusters, words, and sentences. The FTS5 snippet function tries to
determine sentences behind the scenes, so all three would be useful. I
also have a function to format query tables
that needs to break text into lines. For example, in English you don't
want to break between a word and a comma immediately following it, but
rather after the comma. The Unicode Technical Report #14 line breaking
algorithm addresses that. There were no practical Python libraries
implementing all of these, so I went off and joined the very small club
who had implemented all of TR29 and TR14.
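Here is roughly how that segmentation surfaces in apsw.unicode. The iterator names below are from my reading of the module's documentation, so treat this as a sketch and check the docs for exact signatures:

```python
import apsw.unicode

text = "G'day! The naïve 🤦🏼‍♂️ paid £5. Then he left."

# TR29 segmentation (function names as I read them from the
# apsw.unicode docs - verify against the documentation)
print(list(apsw.unicode.grapheme_iter(text)))   # user-perceived characters
print(list(apsw.unicode.word_iter(text)))       # words
print(list(apsw.unicode.sentence_iter(text)))   # sentences

# TR14 line breaking: segments you may break between - after the
# comma, never between a word and the comma that follows it
print(list(apsw.unicode.line_break_iter("one word, then another")))
```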
With that lengthy diversion over, I was able to implement a
UnicodeWords
tokenizer. It performs really well no matter the language, punctuation,
or (lack of) whitespace. Just kidding - I had yet another diversion.
I needed Unicode information way beyond what is in the unicodedata
module. For example, I needed to handle extended pictographics (emoji
and similar) correctly. I needed to be able to strip accents,
diacritics, and combining marks. To get my formatted query tables
correctly aligned on the terminal I needed width information. And it
bugged me that unicodedata lags the Unicode standard by years. I stay
up to date with SQLite, so why shouldn't I stay up to date with
Unicode? That resulted in apsw.unicode,
and now UnicodeWords does well.
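Using it looks something like the following. The Table.create method and parameter names are as I read them from the apsw.fts5 documentation, so take this as a sketch rather than gospel:

```python
import apsw
import apsw.fts5

con = apsw.Connection(":memory:")

# Table.create issues the CREATE VIRTUAL TABLE and makes the APSW
# tokenizers available (API surface from my reading of the docs)
table = apsw.fts5.Table.create(
    con, "recipes", columns=["name", "ingredients"],
    tokenize=["unicodewords"],
)

con.execute("INSERT INTO recipes VALUES(?, ?)",
            ("Lamb stew", "lamb rosemary onions"))
for (name,) in con.execute(
        "SELECT name FROM recipes WHERE recipes MATCH ?", ("rosemary",)):
    print(name)
```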
Then it was off to wrap the auxiliary function API.
That was straightforward, although it has a lot of terminology, and
my feedback resulted in the overview being added.
At this point all the pieces were in place for anyone to use APSW to
get the most out of FTS5 from Python. But you had to build everything
yourself. I would have to document how you do things like stop words,
stemming, or synonyms. Each is only in the region of 10 lines of
code, which was confirmation that the API was good, and it was easier
to just go ahead and implement them.
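As an illustration, a stop words filter really is about that size. This sketch assumes the apsw.fts5.StopWordsTokenizer wrapper and Connection.register_fts5_tokenizer behave as described in the comments; check the documentation for the exact protocol:

```python
import apsw
import apsw.fts5

STOP_WORDS = {"the", "a", "an", "and", "of", "to"}

# Assumed usage: the wrapper turns a per-token test function into a
# tokenizer that filters another tokenizer's output
@apsw.fts5.StopWordsTokenizer
def nostop(token: str) -> bool:
    return token in STOP_WORDS  # True means omit this token

con = apsw.Connection(":memory:")
con.register_fts5_tokenizer("nostop", nostop)

# Wrap the unicodewords tokenizer with the stop words filter
con.execute("CREATE VIRTUAL TABLE docs USING fts5("
            "body, tokenize='nostop unicodewords')")
```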
The builtin trigram tokenizer was not aware of grapheme clusters, so I
also had to add an ngram tokenizer. (It irritates me how some sites
require typing 3 characters before they do completions, so I made sure
any value can be used.) I also ended up adding library code to deal
with the argument parsing that all the tokenizers needed, and another
to handle stripping accents and case folding on behalf of any other
tokenizer, so the functionality didn't have to be built into each one
individually.
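You can see why codepoint-based trigrams misbehave - slicing a string in threes cuts straight through multi-codepoint graphemes (plain Python, no APSW needed):

```python
# Codepoint trigrams slice through the 5-codepoint emoji, producing
# fragments no query will ever contain as the user typed them
s = "🤦🏼‍♂️ok"
print([s[i:i + 3] for i in range(len(s) - 2)])
```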
The SQLite website's own search feeds its HTML documentation to FTS5.
There is no provided HTML tokenizer, and I saw several questions on the forum
about indexing HTML content. Python has an HTML parser, so I used that
for the parsing, but it was also important to map the tokens back to
their offsets in the UTF-8 of the source document. This gets fun with
entity and character references, where there is no one to one mapping
of offsets. That led to more C code to manage offset mapping, as pure
Python code was too slow. Some people on the forum also wanted JSON
tokenizers, so I made one of those too.
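A quick standard library demonstration of the offset problem - the five source characters of &amp;amp; collapse to a single character of text, so source offsets and text offsets drift apart:

```python
from html.parser import HTMLParser

class Extract(HTMLParser):
    def handle_data(self, data):
        # getpos() is where this run of text started in the source;
        # "&amp;" is 5 source characters but 1 character of data
        print(repr(data), "starting at", self.getpos())

Extract().feed("<p>fish &amp; chips</p>")
```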
I wanted a Python class to wrap an FTS5 table, especially so you could
call methods and access properties rather than constructing SQL. It
became necessary to parse the SQL making up the virtual table
declaration, such as getting column names and the various options.
That required parsing code, especially as there are many ways quoting
can be done. The tokenize argument is a quoted string containing
quoted strings, so it gets hairy. (Far more quoting options work than
the documentation says.)
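For example, this is one of the many forms that works - the outer single quotes belong to SQL, while the inner double quotes are parsed again by FTS5 itself:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A quoted string (SQL level) containing quoted strings (FTS5 level)
con.execute("""
    CREATE VIRTUAL TABLE docs USING fts5(
        body,
        tokenize = 'porter unicode61 "remove_diacritics" "2"'
    )
""")
```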
I also wanted functionality to do query suggestion - if you make a
spelling mistake in a query then it should suggest a better
alternative. Even a correctly spelled word could be a rare spelling,
with something more popular being appropriate. Implementing that
required being able to parse, manipulate, and create queries, which
meant dealing with the finer points of the FTS5 query grammar, implicit
AND, and whether you should have a PHRASES type (answer: no). It was then
possible, using the token information,
to implement query suggestion, which was fantastic to then use.
I had one remaining item on my wish list - more_like, where given
some existing rows, you can come up with additional ones that are like
them. This makes it easy to implement infinite scrolling - just
more_like what you have already shown each time. It was also great
when testing on a large recipe database - given the ingredients from
one recipe it would suggest similar ones.
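On the wrapped table that ends up looking something like the following. The method names query_suggest and more_like are from my reading of the apsw.fts5.Table docs (the table object is from the earlier sketch, and shown_so_far is a hypothetical list of rowids already displayed):

```python
# Hedged sketch - check the apsw.fts5.Table docs for exact signatures
better = table.query_suggest("lamg AND rosemarry")
if better:
    print("Did you mean:", better)

# Infinite scroll: hand back what was already shown, get similar rows
for match in table.more_like(shown_so_far):
    print(match)
```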
Query suggestion and more_like are purely statistical - they have
no understanding of meaning, such as jogger being similar to runner.
But they still work well enough. Solutions that do have that
understanding require you to start measuring in gigabytes, which goes
against the Lite part of SQLite.
I also had to write documentation, with a lot of things to cover. The
final result when printed would be about 50 pages. Various reading
time tools estimate about 50 minutes to read. I had to read it all
multiple times for proofreading and editing.
Now that all is done, I'll provide some statistics. Measuring lines of
code is somewhat meaningful, being correlated with development time,
complexity, number of bugs, testing effort, and so on. It is also as
silly as measuring a painting by weight. Items marked 🆕 are what I had
to write as described above, while the others provide context.
| Lines of code | Description |
| --- | --- |
| 2,100 | 🆕 Python test code for FTS5 and Unicode functionality |
| 8,400 | Python test code for everything else |
| 1,100 | 🆕 C code to glue the FTS5 API to the Python C API |
| 1,200 | C code to glue SQLite prepared statements and execution to the Python C API |
| 1,750 | C code to glue SQLite virtual tables to the Python C API |
| 2,000 | 🆕 C code to implement TR29 text segmentation, TR14 line breaking, case folding, accent stripping, UTF-8 to str offset mapping, and various related functionality |
| 2,250 | C code to glue the SQLite virtual file system (VFS) to the Python C API |
| 1,100 | 🆕 Python code to provide an API to the Unicode functionality, text wrapping, and similar, plus a command line tool |
| 860 | 🆕 Python code for wrapping an FTS5 table, all the tokenizers, argument parsing, and a command line tool |
| 1,200 | 🆕 Python tooling code to download and parse Unicode data files and build the tables in the source (not included above) |
Category: misc – Tags: unicode, python, apsw, sqlite