Index ¦ Archive ¦ RSS > Category: misc

Babylon 5

Roger Binns — Fri 02 January 2026

I first watched the Babylon 5 TV show smeared over 5 years three decades ago! It was enjoyable, and notable for the visuals, references between episodes, and Internet communities. But you had to catch the episodes when they aired, either directly or on your VCR.

Having digitized my DVDs, I have finally been able to binge watch the whole show again, while on the treadmill.

Wow.

It was so much better than I remembered. I was always eager to know what happens next, and was delighted at all the forward and backward references. The story telling depth and breadth was staggering. Sadly I have finished the watch, and felt I had to say something.

Category: misc

Updating home tech

Roger Binns — Thu 06 November 2025

I noticed that a fair bit of my home tech had changed recently. Each one is small, but they do add up. Here is a look at them, with a wave goodbye at the ones that served in the past, and what the new ones bring. Fortunately there have recently been lots of sales and promotions making it affordable to upgrade.

"We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run."

—Amara's Law

Contents

Internet service

Out: Fixed wireless local ISP

I was saddest about this after being a customer for over two decades, starting in the days of dial up and then DSL! Sadly they have way oversubscribed their network, and weren't doing anything to address it. I used to get ~140Mbps (and could still in the small hours), but that kept creeping down until 30Mbps during the day was the norm and streaming in the evening became unusable.

You can get a Speedtest CLI and running that over a week every 15 minutes to an endpoint at the ISP showed that peak streaming hours were getting 2Mbps and sometimes even failed to run speedtest and timed out! Speeds out to the Internet were even worse.

I had also been annoyed that my queries about IPv6 support were met with a confused response about what it is, and why anyone would care.

In: National cable company

I now get gigabit speed which is fantastic. Downloads now take seconds that before that measured in half-hours. It is especially noticeable when software updates come out - there is no waiting for them to download.

I'm most pleased that I also get IPv6. I can finally check everything works correctly. Most systems prefer IPv6 over IPv4. This matters because systems on my network become directly addressable and there is no need for complicated NAT. (Firewall rules still apply so systems inside need to initiate communications.)

Remote access

Out: Firewall holes and port forwarding

In order to access my systems remotely I had to configure firewall holes and port forwarding which directly expose internal systems to the onslaught of constant ssh and web attacks. It got more complicated if I wanted something beyond just ssh access. If I wanted SSL/TLS with Lets Encrypt certificates such as a feed reader, then that too had to be widely exposed. I also had to ensure the external IP address was known and tracked.

In: Tailscale

Tailscale solved all the problems. I have no open ports, and no services exposed to the outside. It is available on a wide variety of systems and devices which now all transparently work together, no matter where I am. I no longer track or even know IP addresses for external or internal systems.

The serve functionality lets me deploy internal services that are only accessible to devices on my tailnet, but still get a Lets Encrypt certificate. (You can use funnel to expose them externally.)

There is a constant beat of functionality being added that address issues you didn't know you had with simple solutions.

Router and wifi

Due to where the Internet connection comes in and getting coverage, I needed one router/access point, and another access point further away. I run an outside ethernet cable between them (for performance) but also want wifi backhaul as backup should the cable fail.

This limits what systems could be used, because I will only use systems that have local administration - ie do not use/require some sort of cloud logins, subscriptions etc.

Out: Netgear Orbi

I had long used Netgear products even including DSL modems. They always had a wide range of model numbers and often successor versions of the same model. This always made firmware updates tricky.

When they came out with their Orbi product I thought it would be a simplification. I was expecting firmware updates across the entire range at the same time, with all models having about the same features and security fixes.

That did not happen, and it remained the same Netgear approach as before with random model numbers, multiple versions, and sporadic updates.

The systems did work, and didn't require lots of administration. But as the number of devices on my network grew and the ISP bandwidth decreasing, it became increasingly frustrating that I had no client visibility. What systems were using the bandwidth? When?

In: Ubiquiti

You can't tell what you will get with Ubiquiti before you actually get it. There are no online demos or similar. It is a system where it is best (but not required) to go all in - ie get as many networking components from them as possible.

The leap of faith is rewarded. Everything works together. You have fantastic visibility and control. I got a Dream Router (combo router, access point, network switch) plus an access point and two switches. I have fine grained access control, multiple VLANs, and restrictions on the ever increasing number of IoT devices on the network. No subscriptions and no cloud required.

The devices are also power over ethernet centric. This drastically reduced the number of power adapters I needed and removed power cables snaking up to equipment. It simplified what was on battery backup.

And then they did what I expected Orbi to do - periodic system updates across the range improving and adding functionality. I couldn't be happier!

Doorbell camera

Out: Netgear Arlo

I did go all in on an Arlo camera system (see next section) including the doorbell. It was better than nothing, but was slow to respond.

In: Aqara G410

The Aqara G410 is an excellent system, and they took a very clever approach. Almost everyone else (including Arlo) have a single piece of camera hardware that you install outside. It has to include camera, speaker, wifi, battery backup etc. That makes it bulky, and wifi has to go through the walls and door.

The Aqara G410 has the outside camera system as simple as possible - just the camera, button, speaker, backup batteries (~6 month life!), and a low power radio. Separate from that is what they call the chime, but it is actually an indoor networking hub. It talks the low power radio to the camera, and then forwards on to Matter, Homekit etc. It can also act as a Zigbee receiver and a Matter hub. The Zigbee functionality let me add some no neutral light switches which was an unexpected bonus.

While there is cloud functionality, I don't use that and only use HomeKit. Now when somebody is at the door, I get live video footage on my watch, almost instantly!

Outdoor cameras

The main use of cameras is monitoring what is going on outside, primarily pets and animal visitors. What is the dog barking at? How often do skunks come around in the middle of the night? How close to the skunk do the cats get? The cameras are active all the time,

Because of the locations it isn't possible to run power or ethernet, so I required battery powered wifi cameras.

Shoutout to the Youtube channel The Hook Up with excellent content covering outdoor cameras, and other devices including local administration.

Out: Netgear Arlo

Arlo was originally a good solution. The batteries lasted a long time, and solar panels made it indefinite. You used a hub which had its own specifically configured wifi for range and reliability. Eventually HomeKit support was added.

But times changed, and so did Arlo. The system is cloud based, so all videos are uploaded. This required paying a subscription which was okay in the beginning, but kept getting more expensive especially as they want to be in the business of having human review and respond to security incidents.

The system kept getting slower and not capturing enough (it saves battery) or excessive recordings of leaves moving in the wind. The HomeKit support become useless.

Doubling the annual subscription cost was the final straw.

In: Reolink

Reolink has an all local solution, with low prices (and frequent sales). For the price of the Arlo annual subscription I could replace 3+ cameras a year! There are a wide variety of cameras, so I ended up with several different types.

For example the Duo models are two cameras in one providing a 180° view which let me replace several pairs of Arlo with one Reolink Duo. There are models with pre-record buffer which is helpful for getting footage of what happened before the event trigger. I even got one that has pan and tilt which is occasionally useful.

I was also able to use two PoE cameras thanks to the Ubiquiti switches, which perform so much better than wifi cameras, and are much cheaper!

A Home Hub gathers footage from all the cameras in one place, and Tailscale makes it accessible from anywhere. Ubiquiti lets me keep the cameras and Home Hub completely isolated from the rest of my network except for incoming viewers. A neat Home Hub feature is time lapse recording which is great for a 24 hour view of an area as a curiosity.

Reolink aren't perfect - for example camera firmware updates are erratic. But they are considerably cheaper, do work, have steadily improving cameras every year, and are all local.

Electrical usage monitoring

Our electricity bill was a bit high, and it was quite difficult to work out where it was all going. There is some constant background activity like air filters, fish tanks, items that turn on and off like lights, items that go to sleep when not in use like computers and TV, and appliances like air fryer, fridge, washer, and dryer.

California also keeps making electricity more and more expensive. It is over 50c per kWh!

Out: CSV analysis

The power company lets me download a CSV of power consumption with a one hour granularity. This provides for some analysis, but is very difficult to figure out root causes of power consumption.

Eventually I suspected the dryer was using rather a lot, and was able to estimate how much it was using. It seemed excessive.

In: Emporia

I eventually discovered the Emporia Utility Connect. The electric smart meter can be configured (via a tedious form on the utility website) to send consumption every 30 seconds to a targeted zigbee device. The Utility Connect receives that data, and uses wifi to upload it to Emporia.

This made life a lot easier. By turning circuit breakers or individual devices off and on, it required waiting 30 seconds to see the change in overall consumption to narrow down usage.

That doesn't help with items like fridges that turn on and off during the day. I later got some smart plugs to monitor individual devices. In the Emporia app you see power consumption of each smart plug as a portion of the whole from the Utility Connect. The fridge uses 1.6kWh per day.

Washer dryer

It was becoming apparent that the washing machine would need replacing within a few years, and that reducing electricity consumption of the dryer as soon as possible would be good.

But just like everything consumer related keeps getting bigger, the same has been happening with laundry appliances. There were no dryers on the market the same size as the existing dryer, only ones considerably smaller, with everything else had become a lot physically bigger. That meant they wouldn't fit in the space available and we had nowhere else to put them.

I spent a lot of time with tape measures, using spec sheets trying to work out dimensions especially including exhaust hoses and even thickness and angle of the power plugs. That was before guessing just how much more efficient than the current dryer they could be based on the manuals, reviews, marketing etc. There was just no solution I could find.

Out: Generic top loader washer, electric dryer

While top loading washing machines are easier to use, it did have an infuriating tendency to have an unbalanced load when trying to do the spin dry portion of the cycle. There would be a massive thumping sound as it seemed to try destroy itself, then lots of water added, followed by further thumping spin attempts until it would eventually give up. I would often have to take the wet items out, repack them inside attempting to get the balance right and retry the spin dry. The machine seemed to be starting to decline overall through age and wear and tear.

The dryer seemed to be using huge amounts of electricity. It also pumped in the hot air and out the moist air both at the back, which meant inconsistent drying of items, and probably longer drying cycles.

In: GE Combo Washer Dryer (heat pump)

One day when walking through a hardware store we passed this system, pricey but on a deep sale. It didn't require a vent hose, saving many inches in depth. And no other space behind was required. It took more poring through manuals and tape measures but looked like it would barely fit. Hours of youtube videos laid out all the potential flaws and benefits.

I looked at all the equivalent competition, but they were all deeper and required more clearance. It was essentially this system, or nothing. During a later sale, we got it.

It did fit, but I had to remove the doors of the laundry space and replace them with a curtain - it was that tight.

The system has been an absolute delight. It is a front loader, and I've never had to deal with unbalanced loads again. There is an app that works well, and they correctly handled how behind the scenes networking is done (thanks Ubiquiti).

Laundry immediately seemed to come out looking a little brighter. I'm reasonably certain the reason is because it has automatic detergent dosing based on the quantity of laundry, and is using the right amount for each load. Being a front loader means the wash water is better distributed through the load, and I most likely was using too much detergent with the previous machine.

It also has a pet mode which runs the blower before starting the wash that catches a lot of the pet hair in the lint trap, rather than washing it into the clothes as before.

The controls are even simple and logical. One of the competitors has an Android touch screen tablet as the control panel and over 50 cycles, most slight variations of each other. I know which is more likely to last over a decade.

On the electricity side, our total monthly usage has decreased by 15%. It will likely be even more saving during winter (rainy season) when outside drying is less available, and inside air is not pumped outside. The washing machine portion was separately energy star certified, as well as heat pumps using considerably less power than resistive heating for the dryer. I'm surprised at how low the numbers are when the app shows estimated electricity usage. The payback period is going to be a handful of years, and even sooner if California electricity prices keep climbing.

DVDs

We own almost 800 DVDs of movies, and box sets of many TV series (almost 3,500 episodes it turned out).

Out: Never watching them

Since streaming became pervasive, it was always easier to quickly put on whatever service we used. It kept track of progress no matter where you watched it. You got 4K and great sound.

The DVD player required finding the disc you wanted, putting it in the player, putting away whatever disc was already in there, finding the remote control, sitting through unskippable initial content, and eventually getting to see the content.

In practise, we didn't do that, leaving the DVDs gathering dust.

In: Handbrake then Jellyfin

DVDs have menus, multiple sound tracks, and lots of extra content. I really wanted to keep access to that, but software that worked with ISO images of the discs and handled that functionality was thin. Modern video formats don't support the menus or linking main features with their extras.

Eventually I bit the bullet and decided to convert to something we could watch as easily as streaming. Jellyfin is used for managing and accessing the library of content, and has clients for all the platforms.

Turning the physical DVDs into digital content took 3 months of 24 hour operation. I used two USB DVD drives to read the discs to ISO files, and Handbrake to encode it.

The process was doing all this simultaneously:

Use lsdvd to suggest a title for the disc
Use ddrescue to read the contents. Some discs had bad sectors because of their age. Sometimes one drive would do all it could, and then putting the disc in the other drive was able to read the remaining sectors just fine. ddrescue is very good at helping with this.
Load the ISO in VLC. This is necessary to navigate the menus, and get the title/track number for the content. If a movie was available in both pan and scan, and letterboxed I needed to ensure I picked the latter. Sometimes there are extended versions, so again I wanted that. And TV series would have multiple episodes per disc.
Load the ISO in Handbrake. I would need to ensure the output file had the correct naming convention (especially important for TV episodes), and that I picked the intended soundtracks and subtitles. Handbrake does have defaults for a first pass, but for some content (eg Das Boot) the original language is preferred in addition to English. I would then add that title to the Handbrake queue, which it continued working on in the background.
Delete ISOs that had been encoded after checking the encoding, to make space for new ones

While the results are DVD quality, it turns out hardly noticeable, and good content is always good content. It all worked and I'm delighted with the results. The DVDs are now all in storage containers. Sadly all that extra content won't be viewed.

Category: misc

20 years of APSW

Roger Binns — Fri 20 December 2024

APSW is my Python SQLite wrapper that glues together the Python C API and the SQLite C API, so you can use SQLite from Python.

It has now been 20 years since I first released APSW! That means time for a retrospective.

Contents

Genesis

In 2003 I got a new cell phone back when they were very small screens and physical keyboards of only the numbers. Trying to maintain contacts and other information was painful so I ended up creating BitPim which did all the work over a USB cable. It was implemented in Python using wxPython for the GUI.

Behind the scenes it needed to store information about contacts, calendars etc so I used the most expedient format possible - a Python dictionary (keys and values). It wasn't very large (many phones had limits of 100 contacts and calendar events).

It was trivial to read and write the dictionary to disk as is, and substantially similar to JSON. That also made it easy to edit, backup, compare etc outside of BitPim. But two things annoyed me...

1. You did manual save and load from the menus. That was completely normal for programs of the time, where the user was expected to manage the movement of data storage between permanent (ie still present after power events and reboots) and transient (ie in the memory of the process only). This sort of thing should be automatic, and thankfully is the majority of time these days.

2. I wanted undo and redo, but not just the simple approach commonly taken. I wanted to see what the data looked like at any point in time, a simple example being to see what a contact's phone number was 7 months ago. No information should ever be lost or deleted, and you should easily be able to recover older values.

It was getting later in 2004 and I had to solve this. The solution was obviously structured data, which meant a database.

Storage Options

The various networked databases were not an option. They were large and feature full, required complex configuration including administrators, users, and permissions, and dwarfed the size of BitPim. Installation instructions would be a nightmare especially since BitPim is cross platform and easy to install and run.

There were a family of local key value stores generally called dbm, which in theory could work, but would involve creating my own layer on top to structure and version data, execute queries, and similar work. In addition dbm often weren't available on Windows.

SQLite

The first release of SQLite in late 2000 had used dbm as the storage backend and added a layer of SQL on top. SQLite was intended as a developer alternative when the "real" database wasn't available. SQLite 2 released a year later dropped dbm for a custom btree storage engine. SQLite 1 and 2 only stored strings behind the scenes. (The hard part of storage is getting transactions right.)

In mid 2004 SQLite 3 was released and was a real local database, having learned the lessons from SQLite 2. It included:

More effective file format
Unicode strings
Typed storage in the backend (ie not just strings), including binary objects
Bound parameters
Better transaction locking model
Collations
64 bit rowids
More advanced query optimisation
Improvements in SQL compliance, error reporting etc

This was the obvious perfect fit for BitPim storage.

pysqlite

The existing Python binding to SQLite was named pysqlite. It had supported the earlier SQLite versions and upgraded to SQLite 3. (It was eventually adopted as the standard library sqlite3 module.) I started using it in BitPim, but quickly encountered several issues.

DBAPI

DBAPI / PEP249 defines a standard abstraction for accessing databases from Python, which does map well onto the common networked databases. It does not map well onto SQLite, so pysqlite had to resort to parsing SQL text to manage transactions. The exceptions didn't map well to SQLite's errors. I felt like I was fighting it just trying to do normal SQLite things, and had no interest in other databases.

Unicode

Unicode strings had been a new feature in Python 2 and SQLite 3, and pysqlite mostly got them right, there were a few places that were not, and doing so could cause backwards compatibility problems.

Threading

SQLite at that time could not be used across threads. Because BitPim had a GUI it meant the main thread was used for the GUI, and work was done in background threads. Even if you are diligent, destructors can run in any thread. It was important to me that cross thread usage was caught and became an error, and not silently ignored.

Ergonomics

The cursor was not an iterator so you couldn't use for row in cursor.execute('...') which was Pythonic. I added an iterator wrapper but it made the cursor a lot slower.

An annoyance (that still persists) is you could only execute one statement at a time. Multiple semi-colon separated statements in a query string gives an error.

Tracing and debugging

There was no support for tracing queries or returned rows. This is important during development.

Because SQLite does all its work in C, it is highly desirable to see what Python values it is working with especially when errors occur. You got no clue.

Error handling

Both SQLite and Python have error indicators. SQLite uses an integer code and error string, while Python uses a thread local pending exception. If a callback had an exception then all you got was an exception with the message user-defined function raised exception which was not useful. (Which function? What exception?) Even today all you get is this.

I'll write my own

It was apparent that many of these issues would not be addressed ever (DBAPI compliance was important), or could be done in the timeline I needed for BitPim. Since BitPim would only ever use SQLite, there was absolutely no need for abstractions, and I wanted to have every feature of SQLite available to me.

Naming things

At the time it looked there would be many SQLite wrappers, especially as some supported SQLite 2. I wanted an alternative that did things the SQLite way, and imaginatively came up with Another Python SQLite Wrapper, expecting it to be one of many. It ended up being one of two.

Coding

I recall it taking about a month to code up the first release. That involved mapping between the SQLite C API and the Python C API. An early decision I made was that the module would only be C code. Even today sqlite3 is split between Python code and C code.

Back then it would have meant potential deployment problems having to keep the Python files and compiled C extension in sync. By having only one output file that was not a problem that could occur.

It was then a matter of supporting SQLite versions and Python versions. It was only in January 2022 that I removed support for Python 2.3 which was the version current in 2004 when the project started!

Today APSW does consist of both Python code and compiled C code, and being out of sync is not an issue.

BitPim

It was a simple matter to switch BitPim from dictionaries on disk to dictionaries from SQLite. I was delighted to add undo and data version history so that data would never be lost from that point on.

Principles

I didn't know it at the time, but some principles ended up happening over the years.

Dependencies

It is easy and often a good idea to use third party packages for functionality you need. PyPI has 600 thousand packages covering almost anything you can think of. When APSW started the ecosystem was a lot less mature, and depending on third party packages was a lot more difficult.

To this day APSW has no runtime third party packages, and for compilation and documentation uses the de facto Python standards, described below. The standard library is used.

Overall that was the right decision because it means I am responsible for every line of code, which means I can ensure the appropriate levels of testing, documentation, and interoperability.

Version Churn

SQLite kept updating, and Python kept updating. To make best use of SQLite I decided to keep up with updates, and eventually decided that everything you could do from C you should be able to do from Python. After all SQLite was adding things for a reason. I also decided to have APSW match the corresponding version of SQLite - ie you wouldn't be able to use an old version of SQLite with a new version of APSW. To do so would hugely increase the combinations of testing needed. However you can use new SQLite with old version of APSW just fine.

For 17 years I added support for each new version of Python, and never dropped any. Python used to have what felt like an arbitrary release schedule, with the 3.8 release in October 2019 finally bringing it on to a predictable annual schedule. It was also increasingly difficult to support the older versions, and unnecessary because modern practice is not to support end of life software.

I did try to keep breaking changes to code using APSW to a minimum. Other than corners of error handling, I believe there has been no change causing user code churn.

Tooling

I haver a fair amount of tooling code. This is to reduce manual maintenance such as keeping code and documentation in step, source code checking such as checking that Connections, Cursors, blobs etc are all verified to be open before functions operate on them, getting an API list from the SQLite website to verify all APIs and constants are used, and even spelling checking where mis-spellings are added whenever they are found. All of the tooling is full of assertions and errors on any uncertainty.

This has turned out excellent for project maintenance because it avoids even minor oversights.

Source Code Control

I originally hosted everything on my web site, and used CVS for version control. Then things moved to SourceForge with Subversion source control. Then it was Google Code hosting with Mercurial source control. The final move was GitHub and git source control.

Despite all those transitions, full fidelity of the the source changes is available back to April 2006. The changelog goes all the way back to the beginning.

Documentation

The documentation was originally hand written HTML. It worked but meant there was no way to programmatically ensure it was complete and up to date. I was an early adopter of Sphinx which has been fantastic.

Behind the scenes I have comments in C source that are extracted to make documentation and help text, tools to check the entire SQLite API is wrapped, tools to help with type annotations in the doc, and several other doc related scripts.

Building

When I started the Python standard library included distutils which could be used for both compiled code and pure Python packages. The Python Packaging Authority now maintains a descendent for compiled code named setuptools which still works well for the purpose.

Perhaps the best thing they do is cibuildwheel which is able to compile and test the code under the supported Python versions, operating systems, chip architectures, and ABIs (eg musl vs glibc on Linux). It produces 50 different builds making it trivial for anyone with any of those combinations to install APSW.

Testing

For a few months I had a Python file that run exercising each of the APIs. I quickly moved to the standard library unittest and still use that (many third party frameworks have come and gone).

Very little of the testing code is about things that work, as both SQLite and Python fundamentally work and are themselves tested. The vast majority is API boundaries, and errors and exceptions. It gets especially gnarly because multiple exceptions can occur inside the top level call into SQLite (eg VFS error recovery, finalizers, callbacks). For Python 2 support I was stuck with unraisable errors, while, while Python 3 added chained exceptions.

I used coverage analysis of both Python and C code to make sure all the error/exception conditions were caused. That wasn't sufficient because there are functions such as allocating small amounts of memory, Python objects, or operations like appending to a list that are practically impossible to make fail. I used to have to wrap each location in a macro so that I could cause failure on demand, but that was tedious. It also meant the code didn't look normal, and was obfuscated for editors and analyzers.

Finally statement expressions came to the rescue, where I could macroize calls without having to do something manual in each location. The statement expression would query if that location (file, line, arg list) should fail and how, allowing my harness to then iterate through all failure locations. There are almost two thousand locations. This gets combined with a debug build of Python that does far more assertions and error checking to catch code issues - the most likely being calling into the Python C API while an exception is pending.

Because C code is involved, that means memory issues. I used valgrind to check everything was good. Its cachegrind mode is used for detailed profiling. Sanitizers significantly improved the testing because they saw the C level source code and so better catch issues - valgrind interprets the executable and so has to deduce what is going on.

Overall adding error handling to the code, and the corresponding testing is the vast majority of any code changes - usually three times as much as the actual normal code path. The good news is this gives a lot of assurance errors are handled well, even though most users are unlikely to ever encounter them!

A release test today involves:

Running my checking tools that looks for code issues, missing SQLite APIs etc
Running coverage to check all lines of C and Python (including the test suite itself) are run by testing
32 and 64 bit build and test of all supported Python versions on Windows, because it is most likely to have differences
Megatest of all these permutations under Linux
- 32 and 64 bit
- Every supported Python version
- All supported SQLite versions
- Python in regular build, and in debug, assertions, all warnings build
- APSW in regular and debug build (including SQLite debug/assertions configuration)
- Default APSW SQLite configuration (everything on), and Debian default SQLite configuration (only some things on)
The Github actions build then also tests the build on each operating system and configuration

It is very rare for the tests to catch any problems, and is delightful when they do.

Python code

For the longest time APSW was almost 100% C code. It was just gluing together the C APIs. Several years ago I realised that users were trying to do higher level things, and it was preferable to go beyond documentation, and also add that as a standard part of APSW.

I'm very pleased that best practices has been popular - almost all queries and code shows it being used.

The ext module is a nice collection of things. My favourite is three lines of code turns any Python function with positional and keyword arguments into a virtual table.

Size

The video below shows how the source tree has changed over time. It is often the same set of files for each release, updating C code, Python test code, tools, and documentation.

Lines of code is a silly metric, like weighing art, but is correlated with functionality, complexity, effort etc. Here are some comparisons between early 2006 when APSW was functionally complete, and the end of 2024 when it is very complete.

Kind	2006	2024
C source files	1	20
Lines of C source	2,461	15,901
Python files (tools, tests, everything)	3	41
Lines of Python	764	19,486
Lines in main test	652	8,800
All test code	652	11,992
Documentation files	1	27
Lines of documentation	1,200	10,400
Makefile targets	5	46

Popularity Contest

So just how popular is APSW?

There are 750 stars on GitHub putting it around position 45 thousand.

APSW is available packaged in most Linux, BSD, and even termux (Android).

From PyPI there are a million monthly downloads which puts it in the two thousands of all PyPI projects.

Final words

I've seen it said that project maintenance is a lot like gardening. You need to do occasional weeding and pruning, spruce things up, and generally just keep your hand in. I agree.

It has kept me up to date on Python's C API as well as Python library (I'm a fan of dataclasses). SQLite's changes (click the right column) have also been interesting, and I'm glad I did the work to wrap extensions like FTS5.

The long list of differences to the standard module shows the work continues to be worth doing.

Category: misc – Tags: python, apsw, sqlite

SQLite configuration

Roger Binns — Thu 12 December 2024

SQLite is really easy to compile into your application. However Linux and BSD maintainers prefer to have a single centralized shared library and have all platform provided programs use that. You can see a (very long) list of platforms and their SQLite version here.

SQLite can be compiled with many options such as full text search which almost all seem to enable, even though it defaults off. That got me curious to see what they all do different than the default.

I went off and investigated, with the per platform details at the end.

Commentary

It was noticeable just how arbitrary each platform seemed. With the exception of OpenBSD, every platform had several configuration settings to their defaults. This does not result in any warnings of errors. It is generally the result of things like JSON which used to be off by default, and had to be explicitly enabled, but is now on by default. I don't know of any reasonable way for maintainers to find out that they are redundantly setting things to their default values.

COLUMN_METADATA

Column metadata is enabled by everyone. I did see some comments saying that Firefox needs it (and UNLOCK_NOTIFY), but it is enabled even by platforms that don't ship Firefox.

My own experience with Python developers has been that many also use COLUMN_METADATA, and I suspect many other dynamic languages like Ruby, Perl, and Lua do too. The problem is you can't map the text of a SQL query back to named columns because SQL queries are too difficult to parse and manipulate. Other databases provide the column and table names for query results, so code out there expects it. Even the Python database spec (DBAPI aka PEP249) requires it.

Best practice is to explicitly name the result columns (using AS) and this works without requiring COLUMN_METADATA.

SELECT name AS name, SUM(abc) AS abc_count, two.name AS two_name
  FROM contacts, two;

Extensions

FTS3 and FTS4 are enabled by default, but FTS5 is not. There is a good description of the differences. I measured how big extensions compiled on my system are, as a percentage of the SQLite library with all extensions enabled (1.6MB)

Extension	Size	Percent
FTS3 & 4	93kb	6%
FTS5	200kb	13%
JSON	50kb	3%
Session	53kb	3%
Soundex	813 bytes	0%

MAX_VARIABLE_NUMBER

This I can't make any sense of. You can use numbered bindings, that start at 1.

SELECT name, email FROM contacts WHERE name = ? AND age < ?;

When programmatically executing this, you would supply bindings number 1 and 2. It is also possible to explicitly number them, so in this example you'd provide 97 of them although only 73 and 97 are used.

SELECT ?73, ?97;

You can also provide bindings by name using :, $, or @ like this, providing the bindings as a hash/dictionary type object to the API:

SELECT :foo, $bar, @bam;

The default limit on numbered bindings is 32,767. I cannot think of any scenario under which you would want there to be a quarter or half a million of them, yet that happens!

USE_URI

URI filenames let you easily provide extra parameters when opening a database. You can programmatically turn on URI filenames, but unfortunately that has to be done very first thing after loading SQLite but before using it which is hard to arrange. You can also use SQLITE_OPEN_URI when calling sqlite3_open_v2 added in 2007.

I'm guessing this default was added to some platforms years ago and never changed.

Security

SECURE_DELETE ensures deleted data is overwritten with zeroes. Without it, old data can still be visible to anyone examining the database file contents. It can also be controlled by a pragma. There is a good case to be made for it to be on by default to avoid unintentionally leaking data, for example when databases are backed up, or shared with other people.

API_ARMOR helps defend against programming errors, which are especially easy to make in C. As a maintainer I would enable it, because a more complex program using SQLite often uses lots of libraries, and all it takes is a bug in any one of them to cause problems. Every few months an issue comes up in the SQLite forums where that has happened.

STRICT_SUBTYPE is off by default, with threats in the SQLite code that it may be turned on by default in the future. Maintainers should probably turn this on.

LIKE_DOESNT_MATCH_BLOBS should be on by default, because using LIKE doesn't make sense against blobs. For example what encoding should be used?

DQS isn't set by anyone, but really should be. The default setting allows developers to accidentally use double quotes for strings when they should be used for identifiers (column and table names etc) only. This is a frequent cause of bugs on the SQLite forums, because programming languages use double quotes, and it is easy to assume SQL does too. It doesn't.

Miscellaneous

STAT4 should get you better queries with multi-column indices.

I can't imagine why offset would be needed.

MAX_SCHEMA_RETRY defaults to 50, so I don't know why half that amount would make any difference.

Summary

Other than COLUMN_METADATA, there isn't much consistency between the platforms. SQLite is trying to serve many users, from tiny devices with trivial amounts of memory and storage, to gigantic servers with terabytes of databases, and corresponding amounts of CPU and RAM. Heck the default cache size is 2MB which is wrong for everyone.

So far platform maintainers are making decisions about what to configure for their platform, but that does result in redundancy, duplicated effort, and some amount of guessing.

SQLite doesn't tend to change things because there is an implicit guarantee that if you have a working program using SQLite, and you upgrade the SQLite library, then your program will continue to work.

There are recommended options which are appropriate when compiling SQLite for your own application, but not usable when providing SQLite for a platform used by many applications.

Details

Each entry shows what looked different to me as I did this by visual inspection looking at the default build rules for each platform. You can get more information by looking at https://www.sqlite.org/compile.html and I've left off SQLITE, ENABLE etc from the names. The most recent version of the platform is what I looked at,

You can use pragma compile_options on your own platform to see what is in use. SQLite can have different defaults between the plain library and the cli making this a little more confusing.

Platform	Configuration difference
MacOS	CKPTFULLFSYNC JOURNAL_SIZE_LIMIT=32768 LOOKASIDE=1200,102 COLUMN_METADATA FTS5 LOCKING_STYLE=1 NORMALIZE PREUPDATE_HOOK SESSION SNAPSHOT SQLLOG STMT_SCANSTATUS UPDATE_DELETE_LIMIT HAS_CODEC_RESTRICTED MAX_VARIABLE_NUMBER=500000 THREADSAFE=2 USE_URI
Debian	SECURE_DELETE COLUMN_METADATA SOUNDEX UNLOCK_NOTIFY ALLOW_ROWID_IN_VIEW UPDATE_DELETE_LIMIT LOAD_EXTENSION LIKE_DOESNT_MATCH_BLOBS USE_URI MAX_SCHEMA_RETRY=25 PREUPDATE_HOOK SESSION STRICT_SUBTYPE MAX_VARIABLE_NUMBER=250000
Fedora	COLUMN_METADATA DISABLE_DIRSYNC SECURE_DELETE UNLOCK_NOTIFY
Gentoo	LOAD_EXTENSION API_ARMOR COLUMN_METADATA EXPLAIN_COMMENTS HIDDEN_COLUMNS MEMSYS5 NORMALIZE OFFSET_SQL_FUNC PREUPDATE_HOOK GEOPOLY SESSION STMT_SCANSTATUS UNLOCK_NOTIFY UPDATE_DELETE_LIMIT SOUNDEX USE_URI
Arch	COLUMN_METADATA UNLOCK_NOTIFY SECURE_DELETE STAT4 MAX_VARIABLE_NUMBER=250000 MAX_EXPR_DEPTH=10000 FTS5
Termux	COLUMN_METADATA UPDATE_DELETE_LIMIT
FreeBSD	DEFAULT_FILE_PERMISSIONS DIRECT_READ LOAD_EXTENSION FTS5 COLUMN_METADATA SECURE_DELETE UNLOCK_NOTIFY USE_URI
OpenBSD	FTS5 UNLOCK_NOTIFY COLUMN_METADATA
Brew (homebrew)	API_ARMOR COLUMN_METADATA FTS5 MEMORY_MANAGEMENT STAT4 UNLOCK_NOTIFY MAX_VARIABLE_NUMBER=250000 USE_URI PREUPDATE_HOOK SESSION
Macports	DISABLE_INTRINSIC COLUMN_METADATA FTS5 SECURE_DELETE STAT4 UNLOCK_NOTIFY SOUNDEX
Nix	COLUMN_METADATA FTS5 STMT_SCANSTATUS UNLOCK_NOTIFY SOUNDEX SECURE_DELETE MAX_VARIABLE_NUMBER=250000 MAX_EXPR_DEPTH=10000
Haikuports	COLUMN_METADATA UNLOCK_NOTIFY SECURE_DELETE FTS5
msys2	COLUMN_METADATA DISABLE_DIRSYNC SOUNDEX PREUPDATE_HOOK SESSION (They also `configure --enable-all`)

Category: misc – Tags: sqlite

SQLite Full Text Search added to APSW

Roger Binns — Sun 17 November 2024

TLDR: APSW now has comprehensive support for SQLite Full Text Search.

SQLite has a nice full text search search implementation named FTS5. (Yes the older FTS3 and 4 also exist.)

Full text search engines are in theory fairly simple - break content into terms (typically words), index where in the content the terms exist, reply to a query's terms by consulting that index. Add in a way of ranking the results typically by how rare each term is overall, and how well represented it is in an item of content. (The 5 decade old BM25 can do that.) Make the query language a bit more complex so you have AND, OR, and NOT. Perhaps a NEAR and a way of including or excluding columns.

FTS5 does all that very well, and is fast. It offers a C API for writing your own tokenizers (converting content in terms/words), and for writing your own auxiliary functions such as for ranking or showing highlights. Several months ago I decided to wrap those APIs in APSW figuring it would take a few weeks at most. (C code requires lots of testing.) It did of course take several months, but the result is something I am very proud of.

The tokenizer API is a little confusing. It is how object oriented C is done where you have structures of methods, passing this as the first parameter. I did get confused over what were the analogues to classes and instances, and had to revise my first implementation.

The FTS5 tokenizer is based on Unicode 6.1 (2012) and stays that way for stability. We are now in version 16 with there being annual updates. Python has the unicodedata module which is more up to date, so I implemented a tokenizer using that as the data source.

I soon discovered that it wasn't really workable for two reasons. The first is that splitting text into words by whitespace doesn't work. Some popular languages like Chinese and Japanese don't use spaces, some use spaces between syllables, some use other schemes, and even English gets complicated when you have punctuation - is don't one or two words? What about run-down?

The second is that operating on each codepoint separately breaks apart what are single user perceived characters (aka grapheme clusters). You can have a codepoint and one or more combining accents and marks. It was especially noticeable with emoji where 🤦🏼‍♂️ is actually 5 different codepoints.

That led me to Unicode Technical Report #29 all about text segmentation. It tells you how to correctly break text into grapheme clusters, words, and sentences. The FTS5 snippet function tries to determine sentences behind the scenes, so all 3 would be useful. I also have a function to format query tables that needs to break text into lines. For example in English you don't want to break between a word and a comma immediately following it, but rather after the comma. Unicode Technical Report #14 line breaking algorithm addresses that. There were no practical Python libraries implementing all of these, so I went off joining the very small club who had implemented all of TR29 & 14.

With that lengthy diversion over I was able to implement a UnicodeWords tokenizer. It performs really well no matter language, punctuation, and (lack of) whitespace. Just kidding - I had yet another diversion. I needed Unicode information, way beyond what is in the unicodedata module. For example I needed to handle extended pictographic (emoji and similar) correctly. I needed to be able to strip accents, diacritics, and combining marks. To get my formatted query tables correctly aligned on the terminal I needed width information. And it bugged me that unicodedata lags the Unicode standard by years. I stay up to date with SQLite, so why shouldn't I stay up to date with Unicode. That resulted in apsw.unicode and now UnicodeWords does well.

Then it was off to wrap the auxiliary function API. That was straightforward, although it has a lot of terminology, with my feedback resulting in the overview being added.

At this point all the pieces were in place for anyone to use APSW to get the most out of FTS5 from Python. But you had to build everything yourself. I would have to document how you do things like stop words, stemming, or synonyms. They are only in the region of 10 lines of code which was confirmation that the API was good, and easier to to just go ahead and implement.

The builtin trigram tokenizer was not aware of grapheme clusters, so I also had to add an ngram tokenizer. (It irritates me how some sites require typing 3 characters before they do completions, so I made sure any value can be used.) I also ended up adding library code to deal with argument parsing that all the tokenizers needed. And one to deal with stripping accents and case folding for any other tokenizer so the functionality didn't have to be built in to each individually.

The SQLite website's own search uses its HTML documentation as what is fed to FTS5. There is no provided HTML tokenizer, and I saw several questions on the forum about indexing HTML content. Python has an HTML parser so I used that for the parsing, but it was also important to track the offsets in UTF8 of the source document to the tokens. This gets fun with entity and character references where there is no one to one mapping of offsets. That led to more C code to manage offset mapping, as pure Python code was too slow. Some people on the forum also wanted JSON tokenizers so I made one of those too.

I wanted a Python class to wrap a FTS5 table, especially so you could call methods and access properties rather than constructing SQL. It become necessary to parse the SQL making up the virtual table declaration such as getting column names and the various options. That required parsing code, especially as there are many ways quoting can be done. The tokenize argument is a quoted string containing quoted strings, so it gets hairy. (Far more quoting options work than the documentation says.)

I also wanted functionality to do query suggestion - if you make a spelling mistake in a query then it should suggest a better alternative. Even spelling something correctly could be a rare spelling and something more popular is appropriate. To implement that required being able to parse and manipulate (and create) queries. That required dealing with the finer points of the FTS5 query grammar, implicit AND, and should you have a PHRASES type (answer: no). It was then possible using the token information to implement query suggestion, which was fantastic to then use.

I had one remaining item on my wish list - more_like where given some existing rows, you can come up with additional ones that are like those. This makes it easy to implement infinite scrolling - just more_like what you have already shown each time. It was also great when testing on a large recipe database - given ingredients from one recipe it would suggest similar ones.

The query suggestion and more like are purely statistical - they have no understanding of meaning such as jogger being similar to runner. But they still work well enough. Solutions that do have that understanding require you to start measuring in gigabytes which going against the Lite part of SQLite.

I also had to write documentation, with a lot of things to cover. The final result when printed would be about 50 pages. Various reading time tools estimate about 50 minutes to read. I had to read it all multiple times for proofreading and editing.

Now that all is done I'll provide some statistics. Measuring lines of code is somewhat meaningful as being correlated with development time, complexity, number of bugs, testing effort etc. It is also as silly as measuring a painting by weight. Items marked 🆕 are what I had to write as described above, while the others provide context.

Lines of code	Description
2,100	🆕Python test code for FTS5 and Unicode functionality
8,400	Python test code for everything else
1,100	🆕C code to glue FTS5 API to Python C API
1,200	C code to glue SQLite prepared statements and execution to Python C API
1,750	C code to glue SQLite virtual tables to Python C API
2,000	🆕C code to implement TR29 text segmentation, TR14 line breaking, case folding, accent stripping, UTF8 to str offset mapping, and various related functionality
2,250	C code to SQLite virtual file system (VFS) to Python C API
1,100	🆕Python code to provide API to unicode functionality, text wrapping, and similar, plus command line tool
860	🆕Python code for wrapping FTS5 table, all the tokenizers, argument parsing, and a command line tool.
1,200	🆕Python tooling code to download and parse Unicode data files and build the tables in the source which aren't included above.

Category: misc – Tags: unicode, python, apsw, sqlite

A small club: Implementing Unicode TR29 and TR14

Roger Binns — Fri 05 July 2024

I recently implemented Unicode Technical Report #29 which covers how to turn "text" into "letters", words, and sentences. I also ended up implementing Technical Report #14 which covers line breaking - good places to start text on the next line when the current line is getting too long.

My guess is that less than 10 developers have fully implemented both of these, and I have now joined that very small club.

November 2024 Unicode 16.0 came out - see the end for my thoughts

Contents

How hard could it be?
Codepoints
The fateful decision
The consequences
Lessons learned
Performance
Table unrolling
Unicode 16 update

How hard could it be?

Unicode aims to have one way of representing human text for computers. Before its widespread adoption, different computer systems and even programs had completely different and very incompatible ways of representing text, and it was a mess to read, write, print, copy, and backup text.

This is an extraordinarily hard problem to solve. Are upper and lower case A` the same letter? What about bold, and italic? What about rotated? Fonts? Is ß a different letter or just a different way of drawing two lower case s? Ligatures? Is a lower case e in French without an accent the same as e in English? Is Roman numeral ⅰ different than i? That barely scratches the surface of Western Europe, and now throw in the whole world with a rich diversity and history of writing. Thankfully experts have been working on this for several decades.

Codepoints

Unicode assigned a number (codepoint) per "letter". This doesn't work because some writing systems use a base letter (eg a consonant) and then add marks on the letter (eg vowels), and having one codepoint for each combination would be a very large number.

This led to combining codepoints that could modify a base, allowing adding accents, vowels, and other marks to represent human writing,

The fateful decision

There were two possible ways of representing these values - for example letter-e and combing-acute-accent

Modifier first

combing-acute-accent then letter-e

Base first

letter-e then combing-acute-accent

If modifier first had been chosen, then each letter as perceived by a human (ie with all the modifiers applied) would be easy to determine by computer code - accumulate modifiers until you find a base. You now have one user perceived letter (aka grapheme cluster).

Base first was chosen. That means code sees a codepoint, but has to keep looking ahead to see if there are more modifiers that apply to the base. It gets way more complicated with zero width joiners, variation selectors, regional indicators, and Indic syllabic categories to name a few.

The consequences

Because of the complexity, most programming languages just pretend that each codepoint is one "letter" and takes one unit of storage, but often worse (well worth a read) conflating byte encoding, grapheme clusters, and indexing. They then require third party libraries to get words, sentences, and line breaking.

I've been adding support for SQLite's full text search to APSW (my Python SQLite wrapper) and it became very clear very quickly that doing things properly was necessary.

SQLite has a unicode61 tokenizer but it works on individual codepoints and their categories on a version of Unicode from 2012 (to remain stable, and Lite). It doesn't implement any rules for words other than whitespace and punctuation which makes it unsuitable for the languages used by billions of people. It also doesn't handle regional indicators, variation selectors, zero width joiners, some emoji etc.

There is a standard International Components for Unicode library available that deals with all the Unicode rules, text segmentation, locale specific customisation, message generation, and localization. It is available as a 5MB Python binding, to the 5MB of code and 30MB of tables making up ICU. ICU isn't installed on every platform, and when it is you have no control over the version.

The ICU solution is also impractical to distribute for cross platform libraries like APSW, so I decided to implement the rules. My final self contained library was 500kb of code and data with no platform dependencies.

Lessons learned

There are three sets of rules covering grapheme cluster, words, and sentences, and another set for line breaking.

I ended up rewriting/refactoring my code six times, learning a lot as a result.

No errors

It isn't immediately apparent, but there is no sequence of codepoints that can be considered an error. No matter how absurd a sequence is supplied, the code has to handle it, even when multiple rules apply and contradict each other.

Test data

Unicode publish test data of around 2,000 tests each for grapheme clusters, word, and sentence breaks. This isn't perfect as I had cases of bugs being present but still passing all the tests. Without those tests, there is zero chance I could have written correct code. I was often puzzling over test failures and the list of rules to work out exactly what sequence were intended to apply. I wish they would add even more test rules with longer sequences and even more combinations that hit multiple rules.

Break points

Break points turned out to be an excellent way of determining boundary locations. You do need to do an analysis step on segments - eg does it contain letters or numerals if you are looking for words.

State machines

It looks like you can do a state machine and that is what you see in the Rust implementation as well as the Python grapheme package (abandoned?). It is possible to do so for the grapheme clusters, but not word and sentence. Adding some adaptive code could perhaps work. In the end I found that you end up with far too many states (word segmentation has 25 different rules), and that rules can contradict. For example you should break after LF but not before ZWJ so what happens if they are in sequence? Adaptive code ends up having to do look ahead, and look behind (either reversing rules or adding more state).

The biggest problem with a state machine is it becomes very hard to map the code and states back to the series of rules. And as the rules get adjusted over time, it will become harder to merge the changes in.

What worked for me

I did my initial exploration in Python. This was far quicker to iterate, debug, update data structures etc. At the end I then converted to C for significantly improved performance, but starting in C would have taken considerably longer.
The code matches the rules in the same order and expressed the same way as the specification. This makes it easier to verify they match, and allows for easier changes as rules change over time. (Complex state machines make that a lot harder.)
Use look ahead for matching (I did briefly try look behind). You can mark your current spot, and then advance trying to match a rule on succeeding codepoints. If a rule match fails, revert back to the mark, and try the next rule. This is in theory less efficient, but in practise little real world text is like this.
I had to implement more Unicode tables. Some rules want information outside of that specified in the TR14/29 tables. For example the rules want to know if a codepoint is extended pictographic, or in a particular category. While some of the information may be available separately in other platform libraries, it may not be the same Unicode version.
The tables and rules look like they map onto enumerations. However they work far better as a bitset. Most tables fit within 32 bits, although some require 64.

You can combine all the necessary data into the single bitset. For example the grapheme cluster rules also want to know the Indic syllabic categories which is a separate table. I could combine that information with the grapheme cluster tables in the same bitset, avoiding a separate lookup.

Bitsets also allows code like the following:
```
if char & (L | V | LV | LVT): ...
```
and:
```
MidNumLetQ  = (MidNumLet | Single_Quote)

if char & MidNumLetQ: ...
```
... instead of:
```
if char == L || char == V || char == LV || ...
```
The various tables end up as generated code. (Every project I looked at used a Python based tool to do that.) I put a block before each table in a comment giving statistics on how popular each value was, how many lines there were in the table, and other similar information. This made it a lot easier to get an overview - the tables have thousands of rows - and also helped verify that code changes had the expected effect.
It was easy to add other tables. Because I also have terminal output I could add width information, and even a table for which Unicode version a codepoint was added.

Performance

Performance matters because these are used for full text indexing, which could be hundreds of megabytes if not gigabytes of text. There is no point in doing a C Python extension unless it performs well. So here is a benchmark. The source text is the UN Declaration of Human Rights which has been translated into 300 languages. All 300 are concatenated together producing a 5.5MB file. 60 codepoints used in many of the TR14/29 tests are combined and repeatedly shuffled with the source text, and appended until there are 50 million codepoints.

For each library, how long it takes to process that text and return each segment is measured, producing a codepoints per second result. Note that this is measuring both how long it takes to find each break location, as well as how long Python takes to return that substring/segment of text.

ie more codepoints per second is better.

Benchmarking apsw.unicode         unicode version 15.1
grapheme codepoints per second:    5,122,888    segments:  49,155,642
    word codepoints per second:   13,953,370    segments:   9,834,540
sentence codepoints per second:   93,699,093    segments:     909,603
    line codepoints per second:   16,840,718    segments:   9,548,987

Benchmarking uniseg               unicode version 15.0.0
grapheme codepoints per second:      567,514    segments:  49,149,753
    word codepoints per second:      570,943    segments:  23,617,670
sentence        EXCEPTION KeyError('OTHER')
    line codepoints per second:      528,677    segments:   9,545,722

Benchmarking grapheme             unicode version 13.0.0
grapheme codepoints per second:    1,777,689    segments:  49,163,151

Benchmarking pyicu                unicode version 15.1
grapheme codepoints per second:    3,778,235    segments:  49,155,642
    word codepoints per second:    8,375,000    segments:  20,050,729
sentence codepoints per second:   98,976,577    segments:     910,041
    line codepoints per second:   15,864,217    segments:   9,531,173

Here are the size measurements adding together Python bytecode and any stripped shared libraries.

 32kb    grapheme
144kb    uniseg
600kb    apsw.unicode (can be 200kb smaller)
 40mb    pyicu

Table unrolling

The tables end up with each row containing three integer fields:

Start codepoint
End codepoint
Category

Everybody stores them sorted, and then uses a binary search to find the row matching a codepoint. There are 2 to 4 thousand rows in each table meaning up to 12 comparisons are needed to find a particular row, which is negligible to a CPU.

The contents of the tables are known at compile time, and do not change. That led me to thinking that instead of generating tables, I could generate the comparisons instead. (This is similar to loop unrolling).

The resulting compiled code turned out to be smaller than the corresponding data tables, so I went with that. (It is even branch predictor friendly which tables are not.) The densest compilation results in code about 60% the size of tables, while 16 byte aligned is around 95% (but with lots of padding nops).

msvc defaults to densest and shaves 200kb off the binary size, while clang and gcc both do alignment. Examining the assembly code showed that it matched the comparisons.

Unicode 16 update

Unicode 16 came out in September, and I eagerly updated to it. The updates to codepoint tables were all easily handled. There were no rule changes to TR29 segmentation, and all tests passed.

TR14 line breaking did have rule changes which were fairly painful to implement in my C code. The compile test debug cycle is far slower and less ergonomic than almost anything else. The code is analogous to implementing regular expressions by hand while doing lookahead and backtracking, and you often need to figure out what particular rules did match but shouldn't have, or vice versa.

That leads me to one big lesson learned/suggestion for anyone else implementing the TR14 & TR29 rules:

Do not write code directly - always use an intermediate

representation that you generate code from, or interpret directly at

runtime.

It will make the development process a lot more pleasant, but also

allows tailoring the rules for different locales.

I should have realised this originally, but my Python implementation was productive. The Unicode consortium developers figured that out, and this is used for the C/Java library, while this is used for the Rust library. I am amused that they use a completely different syntax.

Category: misc – Tags: unicode, python, apsw

Python 3 C extension experience update

Roger Binns — Sat 02 December 2023

APSW is my Python wrapper for SQLite's C interface gluing it to Pythons C interface.

Contents

Looking back
Opportunity
Most important code
Most important code, part 2
MVP tool: cvise
CPython API
- Fastcall/Vectorcall
Error handling
Python type annotations

Looking back

It has been two years since I ended Python 2 and early Python 3 support. While I was proud that you could continue to use any Python version of the previous two decades with corresponding SQLite versions and have your code continue to work without maintenance, it isn't how modern software development works.

Components interact with other components, which in turn do the same. There are many additional tools such as build systems, test systems, documentation generation, and cloud based full scale system integration verification. It is practically impossible to make all versions of all the things work with all the other versions of the other things, even if my little corner did.

I ended up with a simple end of life policy - I do one more release after a particular Python version goes end of life. That means APSW will not be the weak link.

Opportunity

Only supporting modern Python versions let me delete a bunch of C and Python code, and some documentation. Non developers may not realise it, but one of the great joys of being a developer is when you get to delete stuff. That frees up brain space for better things. The last things removed were some documentation saying that strings are Unicode (a legacy of Python 2), and removing the u prefix from some strings (like u"hello world") for the same reason. It felt good.

I could also take advantage of dataclasses in Python code and FASTCALL in C code. This week I greatly appreciated the walrus operator in some new code. I even had problems in some example code that needed Python to consume a noticeable amount of CPU time, but performance enhancements made that more difficult!

Most important code

If you change code for the better, there is always a probability you will have broken something else without realising it. This is why developers are wary of changing code, instead adding more, or copying and pasting.

That makes the tests the most important code by far. APSW's test code is a similar size to the C code, and tries to exercise all possible routes through the C. In addition to checking everything works, far more effort is expended on doing everything wrong, and ensuring that all combinations of problems that could happen do happen.

That thoroughness of the test code is what made it possible to reduce and improve the code base. It would have been essentially impossible otherwise.

When a new API is added to SQLite, my rough estimates on effort are:

5% - Calling the API
15% - Adding error handling, especially all the things that could go wrong. More on this below.
5% - Updating documentation
75% - Updating the test suite, exercising all the paths, running under all the validation tools and analyzers

Most important code, part 2

It may not seem like much, but most of the start of C functions looks like this:

/** .. method:: __init__(filename: str,
                         flags: int = SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE,
                         vfs: Optional[str] = None,
                         statementcachesize: int = 100)

Opens the named database ...

Of course that isn't code - it is a comment before the actual C code. It serves many duties.

Documentation: I use Sphinx for documentation, and that text ends up in the documentation in restructured text format.
Docstring: All Python objects have documentation, easily done in Python code. For C implemented objects the text has to be available to the C compiler, so the text is extracted and available in a header file processed to keep within C syntax.
Text signature: C objects can have a __text_signature__ used by inspect which is in yet another format with dollar signs and no return types. This StackOverflow answer has some more details.
Argument parsing: The str and int and default values are all relevant when you need C values to call into SQLite (const char *, and int or long long respectively). I used to use PyArg_ParseTupleAndKeywords which takes a format string with many Python extras, and escapes for your own conversions. Because the format string could disagree with the documented string, I have a tool that converts the comment into the correct defaults and formats. When I adopted FASTCALL it was then a simple matter of generating code to do the parsing directly. It also meant I can produce far better clearer error messages. (There is a Python internal use tool that is similar.)
Type stubs: For Python code implemented in C, you can provide type stubs which are Python syntax for those C implemented items. I discovered that Visual Studio Code would display any docsrings included with the typing information, so the APSW type stubs include that too. When editing Python code, you can't tell that APSW is implemented in C.

That is a lot of heavy lifting for some "code".

MVP tool: cvise

cvise is a tool that takes a C file known to cause a problem, and reduces the size while verifying it still causes the problem.

While reworking the code base, I could easily detect problems. There are many places where C code calls Python which calls C code which calls Python, with the flow going through CPython's internals, APSW's code, and SQLite's code. All 3 projects have copious amounts of assertion checking so it is easy to detect something has happened that shouldn't.

I'd be able to cause problems using my 10,000 line test suite, but to narrow down and understand it you want a lot less, ideally less than 10 lines. Trying to do so manually is tedious and time consuming, and often the actual nature of the problem is different than you think it is.

cvise takes a --not-c flag, so I could feed it my test suite, which would rapidly hack it down to just enough to still reproduce the problem. It was always surprising and delightful, because doing that manually is very tedious.

CPython API

There has been a lot of work behind the scenes to clean up the APIs. I ended up with backport code to make newer apis available on older supported Pythons.

My favourite has been Py_NewRef which let many places go from two lines of code to one, and is also easy to search for.

Fastcall/Vectorcall

This was the most impactful. Vectorcall is the faster way of making calls, and fastcall is the faster way of receiving calls. Traditionally calling from C code required building a tuple for positional arguments and a dictionary for keyword arguments.

// Old style - convenient but slow making a Python tuple
PyObject_Call(object, "lssL", updatetype, databasename, tablename, rowid);

// New style - directly put arguments in a C array
PyObject *vargs[] = {NULL,
                     PyLong_FromLong(updatetype),
                     PyUnicode_FromString(databasename),
                     PyUnicode_FromString(tablename),
                     PyLong_FromLongLong(rowid)};
if (vargs[1] && vargs[2] && vargs[3] && vargs[4])
  PyObject_Vectorcall(object, vargs + 1, 4 | PY_VECTORCALL_ARGUMENTS_OFFSET, NULL);

The PY_VECTORCALL_ARGUMENTS_OFFSET magic allows that first NULL element to be used by Python's internal machinery instead of having to allocate a new array. Python will automatically create a new tuple if the receiving C code does things the old way.

I was curious what the benefits are on the receiving end and made a project that did benchmarking, as well as answering some related questions. It took 22 time units to receive the new style, and 158 to do it the old way - 7 times slower! With each Python release the former has been getting quicker and the latter slower.

Error handling

Writing code in Python is delightful. You don't have to worry about errors, and in the rare circumstances they happen, your program is stopped with an exception. You can add code at whatever level of the call hierarchy is most relevant if you want code to handle future occurrences.

In C code it is a totally different matter. Different APIs return error information in different ways. Errors have to be handled immediately. As an example, here is the code to get SQLite's compilation options in a Python list of strings - a total of 4 lines of code.

PyObject *list = PyList_New(0);
for(int i=0; sqlite3_compileoption_get(i); i++)
  PyList_Append(list, PyUnicode_FromString(sqlite3_compileoption_get(i)));
return list;

Now lets add error checking, and it has grown from 4 to 14 lines.

PyObject *list = PyList_New(0);
if(!list) // error indicated by NULL return
    return NULL;
for(int i=0; sqlite3_compileoption_get(i); i++)
{
    PyObject *option = PyUnicode_FromString(sqlite3_compileoption_get(i));
    if(!option) // error indicated by NULL return
    {
        Py_DECREF(list);
        return NULL;
    }
    int append = PyList_Append(list, option);
    Py_DECREF(option);  // list took a reference or failed
    if(append == -1) // error indicated by -1 return, or non-zero?
    {
        Py_DECREF(list);
        return NULL;
    }
}
return list;

Lots of repeated cleanups in error handling, so we resort to goto. It is the same number of lines of code, but a more useful pattern for more complex code when there are far more items needing cleanup.

PyObject *option = NULL, *list = PyList_New(0);
if(!list)
    goto error;
for(int i=0; sqlite3_compileoption_get(i); i++)
{
    option = PyUnicode_FromString(sqlite3_compileoption_get(i));
    if(!option)
        goto error;
    int append = PyList_Append(list, option);
    if(append == -1)
        goto error;
    Py_CLEAR(option);
}
return list;

error:
Py_XDECREF(list);
Py_XDECREF(option);
return NULL;

Note how doing error handling in C triples the code size, and this is calling one simple API. I estimate that around 75% of the C code in APSW is error handling. Here are the top 10 goto label names illustrating the point:

157 finally
75 error
57 pyexception
39 fail
18 param_error
11 end
8 errorexit
5 error_return
3 success
3 out_of_range

It gets worse

Looking back at the code above, the reason for failures would be running out of memory. Do you know how often those specific lines of code will be the ones that run out of memory?

NEVER

That's right - as a developer I had to write 3 times as much code as necessary, to handle conditions that will never happen in the real world.

The computers running this code also had to do all that extra checking for conditions that never happen. Your computer is doing that for all the software it is running - what a waste.

... and worse

Not only did I write all that extra code, I can't even make it fail, nor could you. A lot of code is shipped without it ever having been run by the developers, nor does it get run in production. If it ever did run there is no certainty it would do the right thing.

And this was a trivial example.

My reaction

There are however adversaries who are very good at at making things happen. Virtually every security issue you hear about it is because they have figured out the weakest bits of software, and how to tickle things just right so that kind of code does get executed.

The consequences of error handling not being done correctly are one or more of:

Nothing
Something that was expected to happen didn't
Resources are leaked
Corruption of state
Memory corruption
Corruption being exported into storage or the network
Delays
Infinite loops
Invariants no longer holding
Exposing private information

I don't want my code to be responsible for any of that.

How I test

I originally did it manually. The code looked something like this where FAIL is a macro taking the name of the location, the happy path, and the failure path. Testing would then set each name to fail, and execute the code.

PyObject *list;
FAIL("CompileList", list = PyList_New(0), list = PyErr_NoMemory());

That was very tedious, makes the code unreadable, and the various editors and other tools couldn't understand it, format it etc.

Statement expressions to the rescue with generated code, given a list of function names with this for PyList_New:

 1 #define PyList_New(...) \
 2 ({                                                                                                                                 \
 3     __auto_type _res_PyList_New = 0 ? PyList_New(__VA_ARGS__) : 0;                                                                 \
 4                                                                                                                                   \
 5     _res_PyList_New = (typeof (_res_PyList_New))APSW_FaultInjectControl("PyList_New", __FILE__, __func__, __LINE__, #__VA_ARGS__); \
 6                                                                                                                                   \
 7     if ((typeof (_res_PyList_New))0x1FACADE == _res_PyList_New)                                                                    \
 8       _res_PyList_New = PyList_New(__VA_ARGS__);                                                                                  \
 9     else if ((typeof(_res_PyList_New))0x2FACADE == _res_PyList_New)                                                                \
10     {                                                                                                                              \
11         PyList_New(__VA_ARGS__);                                                                                                   \
12         _res_PyList_New = (typeof (_res_PyList_New))18;                                                                            \
13     }                                                                                                                              \
14     _res_PyList_New;                                                                                                               \
15 })

Line 3 sets up a variable to store the return value without knowing what the return type is
Line 5 calls APSW_FaultInjectControl giving the function name, filename, calling function name, line number, and stringized arguments. The combination of all those uniquely identifies a location even when there are multiple calls on the same line.
Line 7 looks for the 0x1FACADE return value to mean go ahead and call the function normally as seen on line 8.
Line 9 looks for the 0x2FACADE return value to mean go ahead and call the function, but pretend it returned 18. This is necessary for closing functions because I do want them to close. 18 is a valid SQLite error code.
Line 14 provides the final value which came from the call on line 5 unless that returned 0x1FACADE/ 0x2FACADE.

There is a little more to it, but this lets me cause all the various calls to fail in various ways and have all that error checking code I wrote actually run.

Yes it found bugs that static analysis can't because of all the calling between CPython, SQLite, and APSW. Python has an error indicator that does cause some internal routines to behave differently when it is set. For example they may short circuit and return immediately doing nothing, or they may clear the indicator hiding that an error happened. The main interpreter loop gets upset when the indicator is set and C code returns values as though there were no problems.

I feel better knowing all my code runs, handles errors correctly, and that the errors never get hidden. (And yes I am proud of my magic hex constants.)

Python type annotations

Python always let you rapidly develop code without having to be excruciating precise in details. Duck typing is wonderful. But there has been an increasing tension because there are more and more components to interact with, and there is greater version churn (see the start of this post!).

In the olden days you referred to a component's documentation and memorized what you used most frequently. That isn't practical any more. The question is "when do you want to know about problems in the code?" The answer is as soon as possible, as it gets more expensive (time, effort, and often money) the later you find out, with the worst case being once customers depend on it. Ideally you want to know as your finger rises from typing something.

Type annotations have let that happen, especially for simpler problems. There are annotations for all of APSW's C and Python code (except the shell which is on the todo list). I've found it quite difficult to express duck typing, and some concepts are impossible like the number of arguments to a callable depends on a parameter when some other function was called. But I appreciate the effort made by all the tools, and it does save me effort as I type.

You do however still get amusing error messages for correct code due to limitations of annotations and the tools. It is reminiscent of C++ template errors. I leave you with one example you should not read, deliberately left as one long line and a scrollbar.

example-code.py:95: error: Argument 1 to "set_exec_trace" of "Cursor" has incompatible type "Callable[[Cursor, str, Union[Sequence[Union[None, int, float, bytes, str]], Dict[str, Union[None, int, float, bytes, str]]]], bool]"; expected "Optional[Callable[[Cursor, str, Union[Dict[str, Union[None, int, float, bytes, str]], Tuple[Union[None, int, float, bytes, str], ...], None]], bool]]"

Category: misc – Tags: apsw, python

Exit review: XFCE

Roger Binns — Sat 28 October 2023

I've been using XFCE for about a decade, after switching from Gnome. As a developer I have lots of windows scattered across multiple screens, and find a taskbar the most convenient way to manage them. Gnome 3 went a different way which didn't work for me.

XFCE worked very well, getting out of the way and letting me be productive. The apps like file managers, window managers, image viewers, etc are all fine, and you can run any others anyway.

But I've been wanting to experience Wayland for a while, and XFCE aren't there yet. It is a lot of work for the volunteers.

So I've switched to KDE for the next decade. Everything is Wayland. The default apps are busier.

Thank you XFCE for the last decade.

Category: misc – Tags: exit review

Exit review: Running my own email server

Roger Binns — Wed 28 June 2023

It has been over a year since I stopped running my own email server for me and a few friends. I had been doing so for over two decades!

Why run your own?

Email was and remains a ubiquitous communications mechanism, both for people and automation. When I started running my own server there were very few providers, and they had very low limits. There would be restrictions on attachment sizes and formats, and developer emails would often be rejected as spam. There was little in the way of configurability of incoming email.

Running my own email server let me remove all the restrictions. I accepted emails up to a gigabyte in size because sometimes that was necessary in the days before Dropbox. I was able to have whatever processing rules I wanted, and had full insight into all of the details that were going on.

What do you need?

You need to have an entire system with several configurations and components all working together.

Static firewall: You can filter out countries, internet service providers, cloud providers etc that aren't worth even accepting connections from
Dynamic firewall: A second layer of filtering based on observed undesirable behaviour. For example IP addresses sending you spam can be filtered for a time period.
Spam control (generic): Various techniques are used to stop spam no matter which user it is going to. For example greylisting is very effective, dcc and rbl tell you if other systems have seen the same message. I also filtered all messages through a virus scanner.
Spam control (user specific): Spamassassin has many rules with weighting added together to come up with a per message score. It includes how similar the email is to previous emails you received, which you have classified as good or spam, plus many other rules.
Filtering rules: You want to make messages matching various criteria be placed in folders, forwarded, rejected etc using per user scriptable rules.
IMAP server: To actually read the email using email clients or the programs builtin to various desktop and mobile devices you need one of these.
Webmail: And sometimes you want to use a browser, so you need something that presents a web front end.
Mail transfer: This component receives incoming email, and sends outgoing.
Other: You need to ensure there are backups, have authentication, logging, monitoring, DNS records, SSL/TLS certificates etc. Some of the components can use or even require database servers.

Over time there have been open source software projects that address these needs, including more integrated ones that address many at once. There are a nice variety each in different sweet spots.

Exit review

It is a positive experience having to construct a working system. You are exposed to several components that have to work together, read lots of documentation, create configuration, and deal with upgrades and improvements. Seeing how others have addressed that makes you better at them too. I had a working system all those decades that served us very well.

The reality though is you are really running a spam detection and rejection system. There was a new attempt every 3 seconds never ending. Each one results in your system logging what happened and why, and you are acutely aware that overall you are putting more effort into controlling each spam message than the senders put into the message.

I've since switched to Fastmail (obligatory affiliate link). They have done all the work listed above, but I can still see what is happening as though it was my system. For example they too use Spamassassin. What is noticeable is just how many and how large the headers are on each message, almost all generated in the service of detecting spam. It is nice that it is someone else's duty to maintain now.

Category: misc – Tags: exit review

APSW 3.37 is the last with Python 2 / early Python 3 support

Roger Binns — Wed 08 December 2021

This release of APSW (my Python wrapper for SQLite's C interface) is the last that will support Python 2, and earlier versions of Python 3 (before 3.7).

If you currently use APSW with Python 2/early Python 3, then you will want to pin the APSW version to 3.37. You will still be able to use this version of APSW with future versions of SQLite (supported till 2050). But new C level APIs won't be covered. The last C level API additions were 3.36 in June 2021 adding serialization and 3.37 in December 2021 adding autovacuum control

What does APSW support now ...

APSW supports every Python version 2.3 onwards (released 2003). It doesn't support earlier versions as there was no GIL API (needed for multi-threading support).

The downloads for the prebuilt Windows binary gives an idea of just how many Python versions that is (15). (Python 3.0 does actually work, but is missing a module used by the test suite.)

Each release does involve building and testing all the combinations of 15 Python versions, 32 and 64 bit environment, and both UCS2 and UCS4 Unicode size for Python < 3.3, on multiple operating systems.

There are ~13k lines of C code making up APSW, with ~7k lines of Python code making up the test suite. It is that test suite that gives the confidence that all is working as intended.

... and why?

I wanted to make sure that APSW is the kind of module I would want to use. The most frustrating thing as a developer is that you want to change one thing (eg one library) and then find that forces you to change the versions of other components, or worse the runtime and dev tools (eg compiler).

I never made the guarantee, but it turned out to be:

You can change the APSW (and SQLite) versions, and nothing else. No other changes will be required and everything will continue to work, probably better.

This would apply to any project otherwise untouched since 2004!

There are two simple reasons:

Because I could - I do software development for a living, and not breaking things is a good idea (usually)

I would have to delete code that works

What happens next?

I am going to delete code that works, but it is mainly in blocks saying doing one thing for Python 2, another for early Python 3, and another for current Python 3.

My plan is to incrementally remove Python 2/early 3 code from the Python test suite and the C code base together, while updating documentation (only Python 3 types need to be mentioned). The test suite and coverage testing will hopefully catch any problems early.

I will be happy that the code base, testing, documentation, and tooling will all become smaller. That makes things less complex.

Other thoughts

The hardest part of porting APSW from Python 2 to 3 was the test suite which had to remain valid to both environments. For example it is easy to create invalid Unicode strings in Python 2 which I had to make sure the test suite checked.

It was about 10 times the amount of work making the Python test suite code changes, vs the C level API work. Python 3 wasn't that much different in terms of the C API (just some renaming and unification of int and long etc).

Category: misc – Tags: apsw, python

Contact me