
APSW in colour (async)

I've implemented full async support in APSW with documentation here. This is about the how of the implementation. The why is that I kept getting requests for it, and async programming has become very popular.

What is async?

A common problem for programmers is how to run several tasks at the same time, such as some database, network, and file access. You don't want to do them one after the other because that will take the sum of the times of the individual items.

You will eventually end up with a solution that lets you start the multiple tasks and wait for them all to complete, such as Python's concurrent.futures. It works, but quickly gets clumsy when you want to add timeouts, and cancelling tasks if their peers failed.

The "secret" about code that does networking is that it spends most of its time doing nothing, just waiting for more data to be received and responses to be sent. Rather than use heavyweight threads and processes, it instead uses an event loop in a single thread, managing task progress along with incoming and outgoing data, as well as timeouts. In the olden days that meant a loop using select; the modern version is io_uring. Writing code this way is painful though, especially when network code goes through multiple phases, each tracking its own context, any of which could error. It made code hard to read, hard to write, impossible to debug, and earned names like callback hell and pyramid of doom.

A new pattern emerged called async / await. The idea is that calls don't return the result directly, but rather something you can await when you need the result or error. It is easier to read, write, debug etc.

# old way - does them one after the other
response1 = fetch("url1")
response2 = fetch("url2")
response3 = fetch("url3")

# new way - ask an async framework to start them all first ...
task1 = asyncio.create_task(fetch("url1"))
task2 = asyncio.create_task(fetch("url2"))
task3 = asyncio.create_task(fetch("url3"))

# ... and then process the responses as
# they complete
response1 = await task1
if response1.status == 200:
    # etc
    ...

But this has an important consequence. fetch in the first block returns a direct result, while in the second block it returns something that has to be await-ed. In language implementations it also required functions like fetch to be marked as async because they needed different internals, and there are restrictions because only async functions can work with an event loop.

This led to the humorous (to developers) post What Colour is Your Function. It describes the world today for many programming languages including Python where you have to colour your functions.

Consequences

It is very common for there to exist libraries providing some functionality, and a parallel variant that provides an async interface to the same functionality. Often the async version just calls the non-async version in the background, while providing equivalent async marked APIs.

But this is a lot of work - the async variant has to essentially duplicate the API surface of the non-async library, and keep it in sync. In practice that means the async ones have a subset of the API, and lag behind changes. Even documentation and type information can lag, because this is thankless work.

A few libraries now do it the other way around, writing the async version as the primary, and then having a secondary library to make sync versions of the same code. It is still a lot of work and duplication.

Async Python SQLite

The Python standard library provides sqlite3, but it isn't async.

Several folks have made async versions that call it, with these being the ones I found: aiosqlite, aiosqlite3, asqlite, py-asqlite3, asqlite3, fastsqlite3, rapptz-asqlite, anysqlite3, sqlite-anyio, anyio-sqlite, anysqlite

There are multiple Python event loops including asyncio (part of Python standard library), trio (a better API), curio (OG that recently became EOL), and anyio (abstracts the actual implementation). The various async versions support different subsets.

Behind the scenes, the implementations are the same. Provide an async function that gathers its parameters, and send them to a background thread for execution, providing an awaitable result. Some go further and provide iteration of query results, so you can use async for. None of them provide the full API of the sqlite3 module.
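That shared pattern can be sketched in a few dozen lines. This is an illustrative toy, not any wrapper's actual code (the class and method names are mine): a single worker thread owns the sqlite3 connection, async methods queue work to it, and results come back via a Future completed with call_soon_threadsafe.

```python
import asyncio
import queue
import sqlite3
import threading

class WorkerDB:
    """Toy async SQLite wrapper: one worker thread owns the connection."""

    def __init__(self, path):
        self._q = queue.Queue()
        self._thread = threading.Thread(target=self._worker, args=(path,))
        self._thread.start()

    def _worker(self, path):
        # the only code that ever touches the sqlite3 connection
        con = sqlite3.connect(path)
        while True:
            item = self._q.get()
            if item is None:
                con.close()
                return
            loop, fut, fn = item
            try:
                result = fn(con)
            except Exception as exc:
                loop.call_soon_threadsafe(fut.set_exception, exc)
            else:
                loop.call_soon_threadsafe(fut.set_result, result)

    async def execute(self, sql):
        # gather parameters in the event loop, ship to the worker,
        # and hand back an awaitable result
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        self._q.put((loop, fut, lambda con: con.execute(sql).fetchall()))
        return await fut

    def close(self):
        self._q.put(None)
        self._thread.join()

async def main():
    db = WorkerDB(":memory:")
    try:
        return await db.execute("select 1 + 1")
    finally:
        db.close()

result = asyncio.run(main())
print(result)   # [(2,)]
```

The real wrappers add connection state, cursors, and batching on top, but the queue-plus-Future core is the same.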

And none of them let you use an async function in the numerous callbacks SQLite provides. So for example you can't register a function named fetch and then call it from SQL.

The existence of so many async SQLite wrappers shows the demand is there, but the maintenance, event loop support, and Python version updates over time are hard work and hard to keep up with.

APSW async principles

I had been considering providing async APSW for a while, and kept thinking about the trade-offs and functionality. That led to a list of principles.

Virtual tables

Virtual tables are the natural case for async. The examples given are to fetch information via network requests to the cloud, and that is a perfect match for async. Note this is calling async code from APSW, not APSW being async itself, but the user would want both at the same time.

No separate package

I am not going to publish a separate async-apsw or similar package. It would require separate documentation, separate maintenance, a system could end up with different versions of sync and async APSW installed etc.

Complete

I am not going to do a small subset of the APSW API. It is important to me that everything you can do from C code with SQLite, you can do from Python code using APSW. That means you should be able to do all the things in async too.

Worker thread

The SQLite C code is inherently synchronous and does not have non-blocking or cooperative scheduling APIs. The only practical approach is to run the SQLite code in a background worker thread. It is possible to have pools of threads and not care which one is used for each call, and APSW is thread safe for this, but that makes things harder to debug, document, and test. It would also result in deadlocks on re-entrant calls. So one thread is the best solution, especially as it ensures a predictable order of operations.

Ergonomic

A common pattern with async wrappers is having different names for doing the same thing as the sync versions. So if a developer wants to register a callback there would be a register_callback method and a register_async_callback method. Wherever possible that should be avoided, because the code should be able to figure out which is needed.

Developers using the sync interface must continue to find using the APSW interface simple and pleasant. That must also be the case for async developers.

Automated maintenance

I am not going to do duplicate manual work. As each SQLite release adds and updates APIs, I have to manually update the APSW C code to reflect that. But I really am not going to repeat the work again for async.

APSW has extensive checking code and tooling behind the scenes for argument parsing, error handling, SQLite mutex handling, database of all objects, methods, attributes, docstrings, and more. I often deliberately update code wrong to verify the tooling detects problems. It must also catch issues with async code too.

What I dreaded

It seemed like the approach would be an automated version of what the other wrappers wrote by hand. For example for Connection.set_busy_timeout I'd generate something like this for an AsyncConnection class.

async def set_busy_timeout(self, milliseconds: int) -> None:
    # there would have to be a call to send execution to the worker
    # thread (_exec_in_thread in this case).  The AsyncConnection
    # would have the underlying sync connection as a member.
    return await self._exec_in_thread(
        self._sync_connection.set_busy_timeout,
        milliseconds)

But it is ugly. I want immediate exceptions for arguments such as passing a string instead of an int for milliseconds. And I'd have a very hard time verifying the code is correct with automated tooling. It gets more complicated adding in cursors, blobs, backups, sessions etc. (Again notice how the human written async wrappers omit virtually all of those.)

I also have the vast majority of the coding of APSW in C. This results in better performance because of the direct connection of the Python C API and the SQLite C API, but also I extensively use statement expressions in debug and testing builds to verify the integrity of Python and SQLite objects, mutex and GIL state, etc. Those are not visible at the Python level.

No arbitrariness

Unless there are overwhelming good reasons, there shouldn't be limitations. They are annoying to test, annoying to document, annoying to developers. For example that means you should be able to have sync and async connections in the same process. You should be able to use asyncio, trio, anyio, or anything else that comes along in the future.

I should provide defaults, but where more specialised behaviour is needed it should be possible to do so.

It isn't easy

Doing it right and complete is hard. The first 90% is trivial - generate code like above. But the last 10% is the hard part and what has to be right. I look through the issues and pull requests of existing similar projects to see what the edge cases and feature requests are. It is important to look at the closed issues too. This has saved me several times from bad design decisions, and shows what should be covered in the test suite. For example here are the issues for the popular aiosqlite project.

It begins

My first test was to see what happens if I provide an async function as a callback. The answer was that calling it gave back a coroutine object. At the C level these can be detected by calling PyCoro_CheckExact. A call to asyncio.run_coroutine_threadsafe will send the coroutine to the event loop, and let me wait for the result.

Contextvars for the win

To call run_coroutine_threadsafe you need the loop to run it on. In the olden days you could use thread local storage to stash a value somewhere. These days Python has contextvars which work like thread local storage for normal code, but are also async aware.

That made life extremely easy. Stash the event loop in a contextvar, and everything works. I did some manual code that ran in the async event loop receiving a call and forwarding it to the worker thread passing along the loop, and had async callbacks in APSW.
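Here is a runnable sketch of that mechanism in pure Python (APSW's real implementation is in C and uses PyCoro_CheckExact; all the names here are illustrative): the loop is stashed in a contextvar, the copied context travels to the worker thread, and a coroutine result is shipped back with run_coroutine_threadsafe.

```python
import asyncio
import contextvars
import inspect

event_loop = contextvars.ContextVar("event_loop")

def invoke_callback(callback, *args):
    # runs in the worker thread
    result = callback(*args)
    if inspect.iscoroutine(result):
        # send the coroutine back to the event loop and block the
        # worker until the loop has evaluated it
        future = asyncio.run_coroutine_threadsafe(result, event_loop.get())
        result = future.result()
    return result

async def upper(text):
    # an async callback, as might be registered with SQLite
    return text.upper()

async def main():
    event_loop.set(asyncio.get_running_loop())
    ctx = contextvars.copy_context()   # carries the stashed loop along
    # forward the call to a worker thread, running in the copied context
    return await asyncio.get_running_loop().run_in_executor(
        None, lambda: ctx.run(invoke_callback, upper, "hello"))

answer = asyncio.run(main())
print(answer)   # HELLO
```

The worker thread blocks on future.result() while the event loop stays free to run the coroutine, which is exactly what lets SQLite call async Python functions.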

I use PyObject_VectorcallMethod (and the non-Method suffix variant) exclusively in APSW to call callbacks. It was trivial to make a version that checked if the result was a coroutine, and if so use the contextvar to send it to the event loop for evaluation.

A quick #define later, and every single callback in APSW now supported async functions. There are about 120 of these callbacks!

(It later got changed to thread local for semantic and efficiency reasons, although a controller could do either depending on needs.)

Controller

At this point I started formalising the Python code that would make everything work. It needed to start and stop the worker thread, send calls to the worker, and send coroutines back to the event loop for execution. I wrapped this up as a Controller interface.

This abstraction is necessary to support different event loops, but also to allow different implementation decisions. For example it is possible to log every call, gather statistics about how long they take to execute, change the priority of the worker thread, and many other details. I provide a standard implementation, but developers can easily make their own, as they are under 50 lines of code.

The controller to use is of course found by consulting a contextvar.
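The shape of such a controller might look like the following. These method names are hypothetical, not the actual apsw.aio API - the point is the small surface area: start and stop the worker, send calls one way, and send coroutines back the other.

```python
import abc
from typing import Any, Callable, Coroutine

class Controller(abc.ABC):
    """Hypothetical outline of a controller's responsibilities."""

    @abc.abstractmethod
    def start(self) -> None:
        """Start the worker thread."""

    @abc.abstractmethod
    def stop(self) -> None:
        """Stop the worker thread, after draining queued work."""

    @abc.abstractmethod
    def submit(self, fn: Callable[..., Any], *args: Any) -> Any:
        """Called from the event loop: run fn in the worker thread,
        returning an awaitable for the result."""

    @abc.abstractmethod
    def run_coroutine(self, coro: Coroutine[Any, Any, Any]) -> Any:
        """Called from the worker thread: run coro back in the event
        loop, blocking until its result is available."""
```

A logging or statistics-gathering controller only needs to wrap submit and run_coroutine, which is why custom ones stay so small.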

Calls

The C code wrapping each SQLite C API starts something like this, with the pattern being check for closed, figure out the arguments, acquire mutexes and then do the work.

static PyObject*
Connection_something(PyObject *self, PyObject *const *fast_args, Py_ssize_t fast_nargs, PyObject *fast_kwnames)
{
    /* Sets an exception and returns if connection is closed */
    CHECK_CLOSED();

    {
        /* tool generated and updated code that converts
           parameters into C types goes here */
    }

    /* check database mutex can be acquired in order to
       correctly handle concurrency */
    DBMUTEX_ENSURE();

    /* do the actual calls into SQLite here */

I could easily change the mutex acquire to fault if the connection is async and this is not the worker thread.

This path meant I didn't need separate sync and async classes - the connection knows which it is. I added a simple helper after the parameter parsing that if it is an async connection, uses the controller to call the same function with the same parameters but in the worker thread, returning an awaitable object for the result.

This means the same C implementation supports both sync and async execution paths.

Timeouts and cancellation

None of the existing SQLite wrappers support timeouts or cancellations. Using my original example of fetching network resources for virtual tables, they can sometimes take a very long time so it should be possible to set an upper limit on how long something takes, because SQLite is blocked while waiting.

Best practice is to use structured concurrency when you have a group of tasks running at the same time. The concept is that you logically group them and wait for the completion of the group. If any fail, then any outstanding tasks are cancelled. You'd ideally also want a timeout for the group.

It took me two attempts to get this right. The timeouts and cancellations are in the async event loop, and need to be carried to the worker thread, and back again with any async callbacks.

I had originally started out using asyncio's Future as the model, but that turned out inadequate. Each of the async frameworks has the concept of event loop time, which can advance differently than real time - for example if you are testing a 10 minute timeout you don't actually want to wait 10 minutes. It is necessary to track deadlines in event loop time, and separately in the worker thread, where I use time.monotonic. Trio and AnyIO let you get the current_effective_deadline but asyncio doesn't, so I had to provide a contextvar for that.
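The conversion between the two clocks is simple arithmetic: measure how much time remains on the loop's clock, then re-anchor that remainder against time.monotonic(). A sketch (current_effective_deadline and current_time are real Trio/AnyIO APIs that would supply the two arguments; the function name is mine):

```python
import time

def monotonic_deadline(effective_deadline: float, loop_now: float) -> float:
    # effective_deadline and loop_now are both on the event loop's
    # clock (eg trio.current_effective_deadline() and
    # trio.current_time()); only their difference is meaningful
    remaining = effective_deadline - loop_now
    # re-anchor the remaining seconds onto the worker thread's clock
    return time.monotonic() + remaining
```

The subtraction is what makes fake or accelerated loop clocks work - the absolute values never leave the event loop.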

Fine tuning

Benchmarking insights

It is always important to make sure the code performs as expected. I wrote a quick naive test comparing to aiosqlite, and found performance was dismal. Tracking it down showed the root cause was iterating rows returned from a query. Python uses a __next__ method behind the scenes to get the next iteration value. I was doing just that, requiring a trip through the worker thread for each row. That turned out to be very expensive.

aiosqlite was getting batches of rows (default 64) in the worker thread, and then iterating the batch directly in the event loop. That meant doing the same, which included making the batch size configurable via a contextvar. I initially made the C code be async, but later had to redo it so Python code did the work. I put effort into ensuring this batching is invisible - for example if 7 rows and then an exception happen, code will see exactly that. (aiosqlite will only provide the exception.)
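The batching idea can be sketched like this, using the stdlib sqlite3 module as a stand-in (this is illustrative, not APSW's code): the worker thread gathers up to a batch of rows per trip, and any exception is delivered only after the rows that preceded it.

```python
import asyncio
import sqlite3

BATCH_SIZE = 64

def fetch_batch(cursor, n=BATCH_SIZE):
    # runs in the worker thread: gather up to n rows, capturing any
    # error so rows seen before it are still delivered in order
    batch, error = [], None
    try:
        for _ in range(n):
            row = cursor.fetchone()
            if row is None:
                break
            batch.append(row)
    except Exception as exc:
        error = exc
    return batch, error

async def iterate(cursor):
    loop = asyncio.get_running_loop()
    while True:
        batch, error = await loop.run_in_executor(None, fetch_batch, cursor)
        for row in batch:
            yield row            # yielded in the event loop, no thread trip
        if error is not None:
            raise error          # surfaces after the preceding rows
        if len(batch) < BATCH_SIZE:
            return

async def main():
    # check_same_thread=False because the cursor is used from the
    # executor thread
    con = sqlite3.connect(":memory:", check_same_thread=False)
    con.execute("create table t(x)")
    con.executemany("insert into t values (?)", [(i,) for i in range(5)])
    cur = con.execute("select x from t order by x")
    return [row async for row in iterate(cur)]

rows = asyncio.run(main())
print(rows)   # [(0,), (1,), (2,), (3,), (4,)]
```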

Benchmarking is now included in the tools and doc. I was also surprised at how slow the inter-thread communication was. As a result of feedback in a discussion group, I made a standalone repository measuring only that performance - python-async-bench.

The notable result is just how much slower each framework's builtin mechanism for dispatching work to threads is.

AnyIO

I originally started writing an AnyIO controller before removing it. The point of AnyIO is to be agnostic of the underlying event loop so a specific controller isn't necessary. Later when documenting the deadline handling for AnyIO users, I had to explain how the current_effective_deadline had to be measured relative to the current_time and then adjusted to time.monotonic() to set the deadline contextvar.

That is a ridiculous burden, so I made a dedicated AnyIO controller which automatically handled the deadlines correctly with no extra coding needed.

Auto controller

Originally you had to specify the controller to use - eg apsw.aio.AsyncIO for asyncio, apsw.aio.Trio for trio, etc. I was constantly changing the event loop framework in my exploration code, so it got tedious very quickly. If you use the wrong one, the various framework APIs will complain about there being no event loop.

The SniffIO project showed me how to do that detection, although I had to figure out AnyIO because it explicitly is intended to not be detected!

That let me make the default be auto-detection.
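The detection idea can be shown with the standard library alone: asyncio reveals itself through get_running_loop(). This is a simplified sketch with a made-up function name - the real detection, like SniffIO, also recognises Trio, and needed extra work for AnyIO:

```python
import asyncio

def detect_framework():
    # if asyncio has a running loop in this thread, we are under asyncio
    try:
        asyncio.get_running_loop()
        return "asyncio"
    except RuntimeError:
        return None   # a fuller version would check for Trio/AnyIO here

async def probe():
    return detect_framework()
```

Calling detect_framework() outside any loop returns None, while inside asyncio.run it returns "asyncio", which is enough to pick a default controller.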

Documentation and type stubs

I still had the coloured function problem in the sense that although I only had one implementation that automatically behaved as sync or async, I still needed to document that for each method and attribute.

I also needed the type stubs as used by IDEs and type checkers to believe there are actually flavoured classes - ie Connection and AsyncConnection, Backup and AsyncBackup, Session and AsyncSession etc with each method and attribute in the async version declared awaitable as implemented.

I implemented a test that called all the methods and accessed all the attributes, noting what they did. That was compared to a JSON file, so now I had the behaviour encoded as data. It also meant that all my assertions were exercised so I could verify the C code was doing the right thing. This will also catch future changes to the codebase.

For documentation I wasn't going to duplicate the classes, so instead I added a badge in each that gives more explanatory text when clicked on.

Duplicating the classes for the type stubs then gave me an idea of what the effort and frustration would have been automatically generating an async wrapper in Python forwarding to the sync methods. It requires getting every little detail right, not just about the methods but also their parameters. The async classes allow both sync and async callbacks in many places so I also had to ensure that was reflected.

Mistake: tasks not coroutines

Because my original exploration had given me a coroutine, my initial work focussed on them as the async unit of work. How do I run one back in the event loop, how do I cancel them, how do I set a timeout etc. That worked out okay until I started fleshing out the test code and realised that while a coroutine represented some executing async code, the actual unit that all the frameworks cared about was a task.

A task wraps a coroutine, and it is tasks that they supervise, cancel, schedule etc. When all I had was a coroutine, the correct path forward was to get it seen as part of a task.

That made some code a lot simpler, and like many mistakes was a valuable learning opportunity.

Test all the things

It takes a lot longer to write the test code than the original working code. But I still test everything, including all the corner cases and combinations, no matter how redundant it seems. In this case it had two benefits - one was finding that the working code had the occasional small issue, and the other was revealing my coroutine vs task mistake above.

It even led to late changes like allowing close() to be called on async objects. Normally close() raises an exception, and you must use the async version aclose(), which is still recommended. But aclose() will only work while there is a running event loop and it is await-ed.

This interacts with a second implementation decision, which is that the worker thread is not marked as daemon - ie the Python process will not exit until the thread exits. The reason is that there could still be queued operations like committing transactions, backups completing, optimize running etc, so an abrupt exit could roll back all that.

If the event loop exits for whatever reason and the database wasn't closed, the process will hang forever. Allowing the close() call means it can still be manually closed in that circumstance. The database will also be automatically closed when garbage collection happens but it is easy to have a reference that keeps it alive.

Success

It works, it is robust, it is easy to use, and meets all the original goals, especially future maintenance as SQLite and Python update their C APIs. The final result provides complete async support across the entire APSW API, including callbacks, virtual tables, and advanced SQLite features. The implementation supports asyncio, Trio, and AnyIO without requiring separate packages or duplicated APIs.

This is ably demonstrated by the tour.

Thanks

aiosqlite

aiosqlite provided a complete working implementation of how to go about this, together with example usage, documentation, open and closed issues etc. I often wondered why they made particular choices, tried alternatives and quickly found out.

If it meets your needs, use it!

AI

It is customary these days to say how you used AI/llms. What I did not do is hook up a coding agent, and have it do any of the work.

My view is that the closer to the core, the more important it is for the human to make design decisions and their trade-offs, and it is especially important to write the code so that you understand what, why, and how. It also exposes patterns and allows iteration. Even making mistakes is important because they can be learned from. And pretty much all of APSW is core - I have maintained it since 2004 and have no intention of stopping, so current details affect the future.

So what did I use AI for? I had the free versions of Claude, ChatGPT, and Gemini open in adjacent web browser tabs and would frequently post the same prompt into all of them, to compare the results. The "best" answers ebbed and flowed, with Claude the worst, ChatGPT the most helpful, and Gemini getting better after several instructions to not give multi page answers and exposition.

Better web search

Unfortunately regular web search is almost useless these days. My situation is more complicated because I am receiving calls from async, whereas most content is focussed on making calls in async.

Making matters worse is that searching the Python docs is annoying because it covers both the C and Python APIs, and I only want one of them, and the ReadTheDocs search for Trio and AnyIO is downright frustrating for finding things.

The AIs excel at giving the specific magical incantation to do things, versus trying to piece it together yourself, saving a lot of time.

Naming things

This is a lot harder than it sounds. Names have expectations attached to them so it is harmful to use something misleading. But you also want something familiar, otherwise you've imposed cognitive overhead with an unfamiliar name. The AIs are very good at giving a list and reasons for each entry. Often I'd end up combining suggestions from them.

Code Queries and Suggestions

It was nice to express a concept and have them show how they would code it, or paste in a snippet of my code that didn't work. It is a lot like rubber duck debugging with instant feedback. Or I'd try to bend an API a particular way it didn't like, and the AI would give plausible explanations as to why that wasn't allowed.

I especially liked it when they provided suggestions I had never thought of, for example detecting the anyio event loop is running by looking for a stack frame's code object matching the anyio.run method code object.

Code and data

Ok, there are 15 lines of AI generated code buried deep inside some of my processing that produces the documentation where I needed to expand the contents of something that is arbitrarily nested in square brackets.

The CSS for the badges in the doc that show the sync/async applicability of each function was also generated by them, with me combining the best bits of each. They also helped with some other CSS.

My absolute favourite usage is the generation of test and example data.

Category: misc – Tags: python, apsw, sqlite, async


20 years of APSW

APSW is my Python SQLite wrapper that glues together the Python C API and the SQLite C API, so you can use SQLite from Python.

It has now been 20 years since I first released APSW! That means time for a retrospective.

Genesis

In 2003 I got a new cell phone, back when they had very small screens and physical keyboards with only numbers. Trying to maintain contacts and other information was painful, so I ended up creating BitPim which did all the work over a USB cable. It was implemented in Python, using wxPython for the GUI.

Behind the scenes it needed to store information about contacts, calendars etc so I used the most expedient format possible - a Python dictionary (keys and values). It wasn't very large (many phones had limits of 100 contacts and calendar events).

It was trivial to read and write the dictionary to disk as is, and substantially similar to JSON. That also made it easy to edit, backup, compare etc outside of BitPim. But two things annoyed me...

1. You did manual save and load from the menus. That was completely normal for programs of the time, where the user was expected to manage the movement of data between permanent storage (ie still present after power events and reboots) and transient storage (ie in the memory of the process only). This sort of thing should be automatic, and thankfully is the majority of the time these days.

2. I wanted undo and redo, but not just the simple approach commonly taken. I wanted to see what the data looked like at any point in time, a simple example being to see what a contact's phone number was 7 months ago. No information should ever be lost or deleted, and you should easily be able to recover older values.

It was getting later in 2004 and I had to solve this. The solution was obviously structured data, which meant a database.

Storage Options

The various networked databases were not an option. They were large and feature-laden, required complex configuration including administrators, users, and permissions, and dwarfed the size of BitPim. Installation instructions would have been a nightmare, especially since BitPim is cross platform and easy to install and run.

There was a family of local key value stores generally called dbm, which in theory could work, but would involve creating my own layer on top to structure and version data, execute queries, and similar work. In addition dbm was often not available on Windows.

SQLite

The first release of SQLite in late 2000 had used dbm as the storage backend and added a layer of SQL on top. SQLite was intended as a developer alternative when the "real" database wasn't available. SQLite 2, released a year later, dropped dbm for a custom btree storage engine. SQLite 1 and 2 only stored strings behind the scenes. (The hard part of storage is getting transactions right.)

In mid 2004 SQLite 3 was released and was a real local database, having learned the lessons from SQLite 2. It included:

  • More effective file format
  • Unicode strings
  • Typed storage in the backend (ie not just strings), including binary objects
  • Bound parameters
  • Better transaction locking model
  • Collations
  • 64 bit rowids
  • More advanced query optimisation
  • Improvements in SQL compliance, error reporting etc

This was the obvious perfect fit for BitPim storage.

pysqlite

The existing Python binding to SQLite was named pysqlite. It had supported the earlier SQLite versions and upgraded to SQLite 3. (It was eventually adopted as the standard library sqlite3 module.) I started using it in BitPim, but quickly encountered several issues.

DBAPI

DBAPI / PEP249 defines a standard abstraction for accessing databases from Python, which does map well onto the common networked databases. It does not map well onto SQLite, so pysqlite had to resort to parsing SQL text to manage transactions. The exceptions didn't map well to SQLite's errors. I felt like I was fighting it just trying to do normal SQLite things, and had no interest in other databases.

Unicode

Unicode strings were a new feature in Python 2 and SQLite 3, and pysqlite mostly got them right, but there were a few places it did not, and fixing them could cause backwards compatibility problems.

Threading

SQLite at that time could not be used across threads. Because BitPim had a GUI it meant the main thread was used for the GUI, and work was done in background threads. Even if you are diligent, destructors can run in any thread. It was important to me that cross thread usage was caught and became an error, and not silently ignored.

Ergonomics

The cursor was not an iterator so you couldn't use for row in cursor.execute('...') which was Pythonic. I added an iterator wrapper but it made the cursor a lot slower.
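A wrapper of that sort is only a few lines. This sketch is hypothetical, not the actual BitPim-era code, but it shows the shape - and each row going through an extra Python-level fetchone call is presumably where the slowdown came from:

```python
import sqlite3

class IterCursor:
    """Hypothetical wrapper making a DBAPI cursor usable in a for loop."""

    def __init__(self, cursor):
        self._cursor = cursor

    def execute(self, sql, params=()):
        self._cursor.execute(sql, params)
        return self          # enables: for row in cursor.execute(...)

    def __iter__(self):
        return self

    def __next__(self):
        # one Python-level fetchone call per row
        row = self._cursor.fetchone()
        if row is None:
            raise StopIteration
        return row

con = sqlite3.connect(":memory:")
rows = list(IterCursor(con.cursor()).execute("select 1 union select 2 order by 1"))
print(rows)   # [(1,), (2,)]
```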

An annoyance (that still persists) is you could only execute one statement at a time. Multiple semi-colon separated statements in a query string gives an error.

Tracing and debugging

There was no support for tracing queries or returned rows. This is important during development.

Because SQLite does all its work in C, it is highly desirable to see what Python values it is working with especially when errors occur. You got no clue.

Error handling

Both SQLite and Python have error indicators. SQLite uses an integer code and error string, while Python uses a thread local pending exception. If a callback had an exception then all you got was an exception with the message user-defined function raised exception which was not useful. (Which function? What exception?) Even today all you get is this.

I'll write my own

It was apparent that many of these issues would never be addressed (DBAPI compliance was important), or could not be done in the timeline I needed for BitPim. Since BitPim would only ever use SQLite, there was absolutely no need for abstractions, and I wanted to have every feature of SQLite available to me.

Naming things

At the time it looked like there would be many SQLite wrappers, especially as some supported SQLite 2. I wanted an alternative that did things the SQLite way, and imaginatively came up with Another Python SQLite Wrapper, expecting it to be one of many. It ended up being one of two.

Coding

I recall it taking about a month to code up the first release. That involved mapping between the SQLite C API and the Python C API. An early decision I made was that the module would only be C code. Even today sqlite3 is split between Python code and C code.

Back then it would have meant potential deployment problems having to keep the Python files and compiled C extension in sync. By having only one output file that was not a problem that could occur.

It was then a matter of supporting SQLite versions and Python versions. It was only in January 2022 that I removed support for Python 2.3 which was the version current in 2004 when the project started!

Today APSW does consist of both Python code and compiled C code, and being out of sync is not an issue.

BitPim

It was a simple matter to switch BitPim from dictionaries on disk to dictionaries from SQLite. I was delighted to add undo and data version history so that data would never be lost from that point on.

Principles

I didn't know it at the time, but some principles emerged over the years.

Dependencies

It is easy and often a good idea to use third party packages for functionality you need. PyPI has 600 thousand packages covering almost anything you can think of. When APSW started the ecosystem was a lot less mature, and depending on third party packages was a lot more difficult.

To this day APSW has no third party runtime dependencies, and for compilation and documentation uses the de facto Python standards, described below. The standard library is used.

Overall that was the right decision because it means I am responsible for every line of code, which means I can ensure the appropriate levels of testing, documentation, and interoperability.

Version Churn

SQLite kept updating, and Python kept updating. To make best use of SQLite I decided to keep up with updates, and eventually decided that everything you could do from C you should be able to do from Python. After all, SQLite was adding things for a reason. I also decided to have APSW match the corresponding version of SQLite - ie you wouldn't be able to use an old version of SQLite with a new version of APSW. To do so would hugely increase the combinations of testing needed. However you can use a new SQLite with an old version of APSW just fine.

For 17 years I added support for each new version of Python, and never dropped any. Python used to have what felt like an arbitrary release schedule, with the 3.8 release in October 2019 finally bringing it on to a predictable annual schedule. It was also increasingly difficult to support the older versions, and unnecessary because modern practice is not to support end of life software.

I did try to keep breaking changes to code using APSW to a minimum. Other than corners of error handling, I believe there has been no change causing user code churn.

Tooling

I have a fair amount of tooling code. It reduces manual maintenance: keeping code and documentation in step, source code checking (for example verifying that Connections, Cursors, blobs etc are checked to be open before functions operate on them), getting an API list from the SQLite website to verify all APIs and constants are used, and even spell checking, where mis-spellings are added whenever they are found. All of the tooling is full of assertions and errors on any uncertainty.

This has turned out excellent for project maintenance because it avoids even minor oversights.

Source Code Control

I originally hosted everything on my web site, and used CVS for version control. Then things moved to SourceForge with Subversion source control. Then it was Google Code hosting with Mercurial source control. The final move was GitHub and git source control.

Despite all those transitions, full fidelity of the source changes is available back to April 2006. The changelog goes all the way back to the beginning.

Documentation

The documentation was originally hand written HTML. It worked but meant there was no way to programmatically ensure it was complete and up to date. I was an early adopter of Sphinx which has been fantastic.

Behind the scenes I have comments in C source that are extracted to make documentation and help text, tools to check the entire SQLite API is wrapped, tools to help with type annotations in the doc, and several other doc related scripts.

Building

When I started, the Python standard library included distutils which could be used for both compiled code and pure Python packages. The Python Packaging Authority now maintains a descendant for compiled code named setuptools which still works well for the purpose.

Perhaps the best thing they do is cibuildwheel which is able to compile and test the code under the supported Python versions, operating systems, chip architectures, and ABIs (eg musl vs glibc on Linux). It produces 50 different builds making it trivial for anyone with any of those combinations to install APSW.

Testing

For a few months I had a Python file that ran, exercising each of the APIs. I quickly moved to the standard library unittest and still use that (many third party frameworks have come and gone).

Very little of the testing code is about things that work, as both SQLite and Python fundamentally work and are themselves tested. The vast majority is API boundaries, and errors and exceptions. It gets especially gnarly because multiple exceptions can occur inside the top level call into SQLite (eg VFS error recovery, finalizers, callbacks). For Python 2 support I was stuck with unraisable errors, while Python 3 added chained exceptions.

I used coverage analysis of both Python and C code to make sure all the error/exception conditions were triggered. That wasn't sufficient because some operations, such as allocating small amounts of memory or Python objects, or appending to a list, are practically impossible to make fail. I used to have to wrap each location in a macro so that I could cause failure on demand, but that was tedious. It also meant the code didn't look normal, and was obfuscated for editors and analyzers.

Finally statement expressions came to the rescue, where I could macroize calls without having to do something manual in each location. The statement expression would query if that location (file, line, arg list) should fail and how, allowing my harness to then iterate through all failure locations. There are almost two thousand locations. This gets combined with a debug build of Python that does far more assertions and error checking to catch code issues - the most likely being calling into the Python C API while an exception is pending.
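The C statement-expression machinery doesn't translate directly to Python, but the shape of the harness does. Here is a toy sketch (all names hypothetical, not the actual APSW harness) of iterating through failure locations and verifying each is handled:

```python
# Toy fault-injection harness (hypothetical names; the real APSW
# harness works at the C level using statement expressions)
class FaultInjector:
    def __init__(self):
        self.fail_at = None   # which call number should fail
        self.calls = 0

    def check(self, location):
        self.calls += 1
        if self.calls == self.fail_at:
            raise MemoryError(f"injected failure at {location}")

injector = FaultInjector()

def do_work():
    # Each check() marks a location that could fail in real code
    injector.check("allocate buffer")
    injector.check("append to list")
    injector.check("build result object")
    return "ok"

def run_all_failures():
    # Iterate through every failure location in turn, verifying that
    # each injected failure is surfaced as a clean exception
    results = []
    for n in range(1, 4):
        injector.fail_at = n
        injector.calls = 0
        try:
            do_work()
            results.append("no error")
        except MemoryError:
            results.append("handled")
    return results
```

The real harness does this for almost two thousand locations, keyed by file, line, and argument list rather than a call counter.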

Because C code is involved, that means memory issues. I used valgrind to check everything was good. Its cachegrind tool is used for detailed profiling. Sanitizers significantly improved the testing because they see the C level source code and so catch issues better - valgrind interprets the executable and so has to deduce what is going on.

Overall adding error handling to the code, and the corresponding testing is the vast majority of any code changes - usually three times as much as the actual normal code path. The good news is this gives a lot of assurance errors are handled well, even though most users are unlikely to ever encounter them!

A release test today involves:

  • Running my checking tools that look for code issues, missing SQLite APIs etc
  • Running coverage to check all lines of C and Python (including the test suite itself) are run by testing
  • 32 and 64 bit build and test of all supported Python versions on Windows, because it is most likely to have differences
  • Megatest of all these permutations under Linux
    • 32 and 64 bit
    • Every supported Python version
    • All supported SQLite versions
    • Python in regular build, and in debug, assertions, all warnings build
    • APSW in regular and debug build (including SQLite debug/assertions configuration)
    • Default APSW SQLite configuration (everything on), and Debian default SQLite configuration (only some things on)
  • The Github actions build then also tests the build on each operating system and configuration

It is very rare for the tests to catch any problems, and it is delightful when they do.

Python code

For the longest time APSW was almost 100% C code. It was just gluing together the C APIs. Several years ago I realised that users were trying to do higher level things, and it was preferable to go beyond documentation, and also add that as a standard part of APSW.

I'm very pleased that best practices has been popular - almost all queries and code show it being used.

The ext module is a nice collection of things. My favourite is that three lines of code turn any Python function with positional and keyword arguments into a virtual table.

Size

The video below shows how the source tree has changed over time. It is often the same set of files for each release, updating C code, Python test code, tools, and documentation.

Lines of code is a silly metric, like weighing art, but is correlated with functionality, complexity, effort etc. Here are some comparisons between early 2006 when APSW was functionally complete, and the end of 2024 when it is very complete.

Kind                                       2006    2024
C source files                                1      20
Lines of C source                         2,461  15,901
Python files (tools, tests, everything)       3      41
Lines of Python                             764  19,486
Lines in main test                          652   8,800
All test code                               652  11,992
Documentation files                           1      27
Lines of documentation                    1,200  10,400
Makefile targets                              5      46

Popularity Contest

So just how popular is APSW?

There are 750 stars on GitHub putting it around position 45 thousand.

APSW is available packaged in most Linux, BSD, and even termux (Android).

From PyPI there are a million monthly downloads, which puts it around position two thousand among all PyPI projects.

Final words

I've seen it said that project maintenance is a lot like gardening. You need to do occasional weeding and pruning, spruce things up, and generally just keep your hand in. I agree.

It has kept me up to date on Python's C API as well as Python library (I'm a fan of dataclasses). SQLite's changes (click the right column) have also been interesting, and I'm glad I did the work to wrap extensions like FTS5.

The long list of differences to the standard module shows the work continues to be worth doing.

Category: misc – Tags: python, apsw, sqlite


SQLite configuration

SQLite is really easy to compile into your application. However Linux and BSD maintainers prefer to have a single centralized shared library and have all platform provided programs use that. You can see a (very long) list of platforms and their SQLite version here.

SQLite can be compiled with many options such as full text search which almost all seem to enable, even though it defaults off. That got me curious to see what they all do differently from the default.

I went off and investigated, with the per platform details at the end.

Commentary

It was noticeable just how arbitrary each platform seemed. With the exception of OpenBSD, every platform sets several configuration settings to their default values. This does not result in any warnings or errors. It is generally the result of things like JSON which used to be off by default, and had to be explicitly enabled, but is now on by default. I don't know of any reasonable way for maintainers to find out that they are redundantly setting things to their default values.

COLUMN_METADATA

Column metadata is enabled by everyone. I did see some comments saying that Firefox needs it (and UNLOCK_NOTIFY), but it is enabled even by platforms that don't ship Firefox.

My own experience with Python developers has been that many also use COLUMN_METADATA, and I suspect many other dynamic languages like Ruby, Perl, and Lua do too. The problem is you can't map the text of a SQL query back to named columns because SQL queries are too difficult to parse and manipulate. Other databases provide the column and table names for query results, so code out there expects it. Even the Python database spec (DBAPI aka PEP249) requires it.

Best practice is to explicitly name the result columns (using AS) and this works without requiring COLUMN_METADATA.

SELECT name AS name, SUM(abc) AS abc_count, two.name AS two_name
  FROM contacts, two;
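For illustration, Python's standard sqlite3 module (not APSW) shows how those explicit names come back through the cursor, with no dependence on COLUMN_METADATA:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE contacts(name TEXT, abc INTEGER)")

# Explicit AS names surface through the cursor's description
cur = con.execute("SELECT name AS name, SUM(abc) AS abc_count FROM contacts")
names = [d[0] for d in cur.description]  # ["name", "abc_count"]
```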

Extensions

FTS3 and FTS4 are enabled by default, but FTS5 is not. There is a good description of the differences. I measured how big extensions compiled on my system are, as a percentage of the SQLite library with all extensions enabled (1.6MB)

Extension   Size       Percent
FTS3 & 4    93kb       6%
FTS5        200kb      13%
JSON        50kb       3%
Session     53kb       3%
Soundex     813 bytes  0%

MAX_VARIABLE_NUMBER

This one I can't make any sense of. You can use numbered bindings, which start at 1.

SELECT name, email FROM contacts WHERE name = ? AND age < ?;

When programmatically executing this, you would supply bindings number 1 and 2. It is also possible to explicitly number them, so in the following example you'd provide 97 bindings although only numbers 73 and 97 are used.

SELECT ?73, ?97;

You can also provide bindings by name using :, $, or @ like this, providing the bindings as a hash/dictionary type object to the API:

SELECT :foo, $bar, @bam;
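Using Python's standard sqlite3 module for illustration, both styles look like this (sqlite3 documents the `:name` form for named bindings):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Positional bindings are numbered from 1 in the order they appear
row = con.execute("SELECT ?, ?", ("alice", 21)).fetchone()

# Named bindings take a mapping instead of a sequence
row2 = con.execute("SELECT :foo, :bar", {"foo": 1, "bar": 2}).fetchone()
```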

The default limit on numbered bindings is 32,767. I cannot think of any scenario under which you would want a quarter or half a million of them, yet that happens!

USE_URI

URI filenames let you easily provide extra parameters when opening a database. You can programmatically turn on URI filenames, but unfortunately that has to be done as the very first thing after loading SQLite, before using it, which is hard to arrange. You can also use SQLITE_OPEN_URI when calling sqlite3_open_v2, added in 2007.
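With Python's standard sqlite3 module, for example, URI filenames are opted into per connection:

```python
import sqlite3

# A URI filename lets extra parameters ride along with the name;
# here an in-memory database using a shared cache
con = sqlite3.connect("file:demo?mode=memory&cache=shared", uri=True)
value = con.execute("SELECT 1").fetchone()[0]
```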

I'm guessing this default was added to some platforms years ago and never changed.

Security

SECURE_DELETE ensures deleted data is overwritten with zeroes. Without it, old data can still be visible to anyone examining the database file contents. It can also be controlled by a pragma. There is a good case to be made for it to be on by default to avoid unintentionally leaking data, for example when databases are backed up, or shared with other people.
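As a sketch with Python's sqlite3 module, the pragma can be turned on per connection and verified:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Even without the compile-time default, secure delete can be
# switched on for this connection via the pragma
con.execute("PRAGMA secure_delete = ON")
enabled = con.execute("PRAGMA secure_delete").fetchone()[0]  # 1 when on
```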

API_ARMOR helps defend against programming errors, which are especially easy to make in C. As a maintainer I would enable it, because a more complex program using SQLite often uses lots of libraries, and all it takes is a bug in any one of them to cause problems. Every few months an issue comes up in the SQLite forums where that has happened.

STRICT_SUBTYPE is off by default, with threats in the SQLite code that it may be turned on by default in the future. Maintainers should probably turn this on.

LIKE_DOESNT_MATCH_BLOBS should be on by default, because using LIKE doesn't make sense against blobs. For example what encoding should be used?

DQS isn't set by anyone, but really should be. The default setting allows developers to accidentally use double quotes for strings when they should be used for identifiers (column and table names etc) only. This is a frequent cause of bugs on the SQLite forums, because programming languages use double quotes, and it is easy to assume SQL does too. It doesn't.
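A small illustration of the correct quoting, using Python's sqlite3 module. With DQS on, a mistyped double-quoted name silently becomes a string instead of an error, which is exactly the bug described above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(name TEXT)")
con.execute("INSERT INTO t VALUES('alice')")

# Single quotes are the SQL string literal syntax; double quotes
# are for identifiers such as column and table names
row = con.execute("SELECT \"name\" FROM t WHERE name = 'alice'").fetchone()
```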

Miscellaneous

STAT4 should get you better queries with multi-column indices.

I can't imagine why offset would be needed.

MAX_SCHEMA_RETRY defaults to 50, so I don't know why half that amount would make any difference.

Summary

Other than COLUMN_METADATA, there isn't much consistency between the platforms. SQLite is trying to serve many users, from tiny devices with trivial amounts of memory and storage, to gigantic servers with terabytes of databases, and corresponding amounts of CPU and RAM. Heck the default cache size is 2MB which is wrong for everyone.

So far platform maintainers are making decisions about what to configure for their platform, but that does result in redundancy, duplicated effort, and some amount of guessing.

SQLite doesn't tend to change things because there is an implicit guarantee that if you have a working program using SQLite, and you upgrade the SQLite library, then your program will continue to work.

There are recommended options which are appropriate when compiling SQLite for your own application, but not usable when providing SQLite for a platform used by many applications.

Details

Each entry shows what looked different to me as I did this by visual inspection, looking at the default build rules for each platform. You can get more information by looking at https://www.sqlite.org/compile.html and I've left off SQLITE, ENABLE etc from the names. The most recent version of each platform is what I looked at.

You can use pragma compile_options on your own platform to see what is in use. SQLite can have different defaults between the plain library and the cli making this a little more confusing.
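For example, from Python's sqlite3 module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Each row is one compile-time option, eg "THREADSAFE=1"
options = [row[0] for row in con.execute("PRAGMA compile_options")]
```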

Platform — configuration differences

MacOS
  • CKPTFULLFSYNC
  • JOURNAL_SIZE_LIMIT=32768
  • LOOKASIDE=1200,102
  • COLUMN_METADATA
  • FTS5
  • LOCKING_STYLE=1
  • NORMALIZE
  • PREUPDATE_HOOK
  • SESSION
  • SNAPSHOT
  • SQLLOG
  • STMT_SCANSTATUS
  • UPDATE_DELETE_LIMIT
  • HAS_CODEC_RESTRICTED
  • MAX_VARIABLE_NUMBER=500000
  • THREADSAFE=2
  • USE_URI

Debian
  • SECURE_DELETE
  • COLUMN_METADATA
  • SOUNDEX
  • UNLOCK_NOTIFY
  • ALLOW_ROWID_IN_VIEW
  • UPDATE_DELETE_LIMIT
  • LOAD_EXTENSION
  • LIKE_DOESNT_MATCH_BLOBS
  • USE_URI
  • MAX_SCHEMA_RETRY=25
  • PREUPDATE_HOOK
  • SESSION
  • STRICT_SUBTYPE
  • MAX_VARIABLE_NUMBER=250000

Fedora
  • COLUMN_METADATA
  • DISABLE_DIRSYNC
  • SECURE_DELETE
  • UNLOCK_NOTIFY

Gentoo
  • LOAD_EXTENSION
  • API_ARMOR
  • COLUMN_METADATA
  • EXPLAIN_COMMENTS
  • HIDDEN_COLUMNS
  • MEMSYS5
  • NORMALIZE
  • OFFSET_SQL_FUNC
  • PREUPDATE_HOOK
  • GEOPOLY
  • SESSION
  • STMT_SCANSTATUS
  • UNLOCK_NOTIFY
  • UPDATE_DELETE_LIMIT
  • SOUNDEX
  • USE_URI

Arch
  • COLUMN_METADATA
  • UNLOCK_NOTIFY
  • SECURE_DELETE
  • STAT4
  • MAX_VARIABLE_NUMBER=250000
  • MAX_EXPR_DEPTH=10000
  • FTS5

Termux
  • COLUMN_METADATA
  • UPDATE_DELETE_LIMIT

FreeBSD
  • DEFAULT_FILE_PERMISSIONS
  • DIRECT_READ
  • LOAD_EXTENSION
  • FTS5
  • COLUMN_METADATA
  • SECURE_DELETE
  • UNLOCK_NOTIFY
  • USE_URI

OpenBSD
  • FTS5
  • UNLOCK_NOTIFY
  • COLUMN_METADATA

Brew (homebrew)
  • API_ARMOR
  • COLUMN_METADATA
  • FTS5
  • MEMORY_MANAGEMENT
  • STAT4
  • UNLOCK_NOTIFY
  • MAX_VARIABLE_NUMBER=250000
  • USE_URI
  • PREUPDATE_HOOK
  • SESSION

Macports
  • DISABLE_INTRINSIC
  • COLUMN_METADATA
  • FTS5
  • SECURE_DELETE
  • STAT4
  • UNLOCK_NOTIFY
  • SOUNDEX

Nix
  • COLUMN_METADATA
  • FTS5
  • STMT_SCANSTATUS
  • UNLOCK_NOTIFY
  • SOUNDEX
  • SECURE_DELETE
  • MAX_VARIABLE_NUMBER=250000
  • MAX_EXPR_DEPTH=10000

Haikuports
  • COLUMN_METADATA
  • UNLOCK_NOTIFY
  • SECURE_DELETE
  • FTS5

msys2
  • COLUMN_METADATA
  • DISABLE_DIRSYNC
  • SOUNDEX
  • PREUPDATE_HOOK
  • SESSION
  • (They also configure --enable-all)

Category: misc – Tags: sqlite


SQLite Full Text Search added to APSW

TLDR: APSW now has comprehensive support for SQLite Full Text Search.

SQLite has a nice full text search implementation named FTS5. (Yes the older FTS3 and 4 also exist.)

Full text search engines are in theory fairly simple - break content into terms (typically words), index where in the content the terms exist, reply to a query's terms by consulting that index. Add in a way of ranking the results typically by how rare each term is overall, and how well represented it is in an item of content. (The 5 decade old BM25 can do that.) Make the query language a bit more complex so you have AND, OR, and NOT. Perhaps a NEAR and a way of including or excluding columns.
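That theory can be sketched in a few lines of Python. This toy (not how FTS5 is implemented - it uses BM25 and stores term positions) shows the inverted index and rarity weighting:

```python
from collections import defaultdict
import math

# Toy full text search: an inverted index plus rarity-based scoring
documents = {
    1: "cheap flights to tokyo",
    2: "tokyo travel guide",
    3: "cheap cheap bargains",
}

index = defaultdict(set)           # term -> documents containing it
for doc_id, text in documents.items():
    for term in text.split():      # real tokenizers are far smarter
        index[term].add(doc_id)

def search(query):
    scores = defaultdict(float)
    for term in query.split():
        matching = index.get(term, set())
        if not matching:
            continue
        # Rarer terms contribute more - the core idea behind BM25
        rarity = math.log(len(documents) / len(matching))
        for doc_id in matching:
            scores[doc_id] += rarity
    return sorted(scores, key=scores.get, reverse=True)
```

A query like "cheap tokyo" ranks document 1 first because it matches both terms.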

FTS5 does all that very well, and is fast. It offers a C API for writing your own tokenizers (converting content into terms/words), and for writing your own auxiliary functions such as for ranking or showing highlights. Several months ago I decided to wrap those APIs in APSW, figuring it would take a few weeks at most. (C code requires lots of testing.) It did of course take several months, but the result is something I am very proud of.

The tokenizer API is a little confusing. It is object oriented C, where you have structures of methods and pass this as the first parameter. I did get confused over what were the analogues to classes and instances, and had to revise my first implementation.

The FTS5 tokenizer is based on Unicode 6.1 (2012) and stays that way for stability. Unicode is now at version 16, with annual updates. Python has the unicodedata module which is more up to date, so I implemented a tokenizer using that as the data source.

I soon discovered that it wasn't really workable for two reasons. The first is that splitting text into words by whitespace doesn't work. Some popular languages like Chinese and Japanese don't use spaces, some use spaces between syllables, some use other schemes, and even English gets complicated when you have punctuation - is don't one or two words? What about run-down?

The second is that operating on each codepoint separately breaks apart what are single user perceived characters (aka grapheme clusters). You can have a codepoint and one or more combining accents and marks. It was especially noticeable with emoji where 🤦🏼‍♂️ is actually 5 different codepoints.
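You can see this directly in Python, where strings count codepoints rather than user perceived characters:

```python
# The facepalm emoji is one grapheme cluster made of five codepoints:
# face palm, skin tone modifier, zero width joiner, male sign, and a
# variation selector
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

count = len(facepalm)                       # 5 codepoints
utf8_size = len(facepalm.encode("utf-8"))   # 17 bytes
```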

That led me to Unicode Technical Report #29 all about text segmentation. It tells you how to correctly break text into grapheme clusters, words, and sentences. The FTS5 snippet function tries to determine sentences behind the scenes, so all 3 would be useful. I also have a function to format query tables that needs to break text into lines. For example in English you don't want to break between a word and a comma immediately following it, but rather after the comma. Unicode Technical Report #14 line breaking algorithm addresses that. There were no practical Python libraries implementing all of these, so I went off joining the very small club who had implemented all of TR29 & 14.

With that lengthy diversion over I was able to implement a UnicodeWords tokenizer. It performs really well no matter the language, punctuation, or (lack of) whitespace. Just kidding - I had yet another diversion. I needed Unicode information, way beyond what is in the unicodedata module. For example I needed to handle extended pictographics (emoji and similar) correctly. I needed to be able to strip accents, diacritics, and combining marks. To get my formatted query tables correctly aligned on the terminal I needed width information. And it bugged me that unicodedata lags the Unicode standard by years. I stay up to date with SQLite, so why shouldn't I stay up to date with Unicode? That resulted in apsw.unicode and now UnicodeWords does well.
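For simple cases, accent stripping can be sketched with the standard library's unicodedata (apsw.unicode goes further, with up to date Unicode data):

```python
import unicodedata

def strip_accents(text):
    # Decompose to NFD so accents become separate combining
    # codepoints, then drop the combining marks (category Mn)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed
                   if unicodedata.category(c) != "Mn")
```

For example strip_accents("café naïve") gives "cafe naive".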

Then it was off to wrap the auxiliary function API. That was straightforward, although it has a lot of terminology, with my feedback resulting in the overview being added.

At this point all the pieces were in place for anyone to use APSW to get the most out of FTS5 from Python. But you had to build everything yourself. I would have to document how you do things like stop words, stemming, or synonyms. They are only in the region of 10 lines of code, which was confirmation that the API was good, and it was easier to just go ahead and implement them.
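The stop words case gives a flavour of why so little code is needed. This sketch uses hypothetical names, not the actual APSW tokenizer API, which also carries byte offsets and colocated tokens:

```python
# Hypothetical sketch of a filtering tokenizer: wrap an existing
# token stream and drop the stop words
STOP_WORDS = {"the", "a", "an", "and", "or"}

def simple_tokens(text):
    # Stand-in for a real tokenizer such as UnicodeWords
    return (word.lower() for word in text.split())

def without_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]
```

So without_stop_words(simple_tokens("The cat and the hat")) keeps just "cat" and "hat".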

The builtin trigram tokenizer was not aware of grapheme clusters, so I also had to add an ngram tokenizer. (It irritates me how some sites require typing 3 characters before they do completions, so I made sure any value can be used.) I also ended up adding library code to deal with argument parsing that all the tokenizers needed. And one to deal with stripping accents and case folding for any other tokenizer so the functionality didn't have to be built in to each individually.

The SQLite website's own search uses its HTML documentation as what is fed to FTS5. There is no provided HTML tokenizer, and I saw several questions on the forum about indexing HTML content. Python has an HTML parser so I used that for the parsing, but it was also important to track the offsets in UTF8 of the source document to the tokens. This gets fun with entity and character references where there is no one to one mapping of offsets. That led to more C code to manage offset mapping, as pure Python code was too slow. Some people on the forum also wanted JSON tokenizers so I made one of those too.
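A minimal sketch of the parsing half, using the standard library parser. The real implementation must also map each extracted character back to its UTF8 byte offset in the original document, which is where the fun with entity references comes in:

```python
from html.parser import HTMLParser

# Minimal text extractor; convert_charrefs turns &amp; into &
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)

    def text(self):
        return "".join(self.pieces)

extractor = TextExtractor()
extractor.feed("<p>Ben &amp; Jerry</p>")
```

Note how "Ben &amp; Jerry" extracts to "Ben & Jerry" - the five source bytes of &amp; become one character, so offsets no longer line up one to one.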

I wanted a Python class to wrap a FTS5 table, especially so you could call methods and access properties rather than constructing SQL. It became necessary to parse the SQL making up the virtual table declaration, such as getting column names and the various options. That required parsing code, especially as there are many ways quoting can be done. The tokenize argument is a quoted string containing quoted strings, so it gets hairy. (Far more quoting options work than the documentation says.)

I also wanted functionality to do query suggestion - if you make a spelling mistake in a query then it should suggest a better alternative. Even spelling something correctly could be a rare spelling and something more popular is appropriate. To implement that required being able to parse and manipulate (and create) queries. That required dealing with the finer points of the FTS5 query grammar, implicit AND, and should you have a PHRASES type (answer: no). It was then possible using the token information to implement query suggestion, which was fantastic to then use.
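A toy analogue of the spelling half: the real implementation walks the FTS5 token index and weighs candidates by how often they occur, while here difflib just finds close spellings in a known vocabulary:

```python
import difflib

# Hypothetical vocabulary; in FTS5 this would come from the token index
vocabulary = ["tomato", "potato", "basil", "oregano"]

def suggest(term):
    # Return the closest known spelling, or None if nothing is close
    matches = difflib.get_close_matches(term, vocabulary, n=1)
    return matches[0] if matches else None
```

So suggest("tomatto") comes back with "tomato", while an unrecognisable term gets no suggestion.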

I had one remaining item on my wish list - more_like where given some existing rows, you can come up with additional ones that are like those. This makes it easy to implement infinite scrolling - just more_like what you have already shown each time. It was also great when testing on a large recipe database - given ingredients from one recipe it would suggest similar ones.

The query suggestion and more like are purely statistical - they have no understanding of meaning such as jogger being similar to runner. But they still work well enough. Solutions that do have that understanding require you to start measuring in gigabytes, which goes against the Lite part of SQLite.

I also had to write documentation, with a lot of things to cover. The final result when printed would be about 50 pages. Various reading time tools estimate about 50 minutes to read. I had to read it all multiple times for proofreading and editing.

Now that all is done I'll provide some statistics. Measuring lines of code is somewhat meaningful as being correlated with development time, complexity, number of bugs, testing effort etc. It is also as silly as measuring a painting by weight. Items marked 🆕 are what I had to write as described above, while the others provide context.

Lines of code  Description
        2,100  🆕 Python test code for FTS5 and Unicode functionality
        8,400  Python test code for everything else
        1,100  🆕 C code to glue the FTS5 API to the Python C API
        1,200  C code to glue SQLite prepared statements and execution to the Python C API
        1,750  C code to glue SQLite virtual tables to the Python C API
        2,000  🆕 C code to implement TR29 text segmentation, TR14 line breaking, case folding, accent stripping, UTF8 to str offset mapping, and various related functionality
        2,250  C code to glue the SQLite virtual file system (VFS) to the Python C API
        1,100  🆕 Python code to provide an API to the unicode functionality, text wrapping, and similar, plus a command line tool
          860  🆕 Python code for wrapping a FTS5 table, all the tokenizers, argument parsing, and a command line tool
        1,200  🆕 Python tooling code to download and parse Unicode data files and build the tables in the source, which aren't included above

Category: misc – Tags: unicode, python, apsw, sqlite

Contact me