Index ¦ Archives ¦ RSS

A small club: Implementing Unicode TR29 and TR14

I recently implemented Unicode Technical Report #29 which covers how to turn "text" into "letters", words, and sentences. I also ended up implementing Technical Report #14 which covers line breaking - good places to start text on the next line when the current line is getting too long.

My guess is that less than 10 developers have fully implemented both of these, and I have now joined that very small club.

How hard could it be?

Unicode aims to have one way of representing human text for computers. Before its widespread adoption, different computer systems and even programs had completely different and very incompatible ways of representing text, and it was a mess to read, write, print, copy, and backup text.

This is an extraordinarily hard problem to solve. Are upper and lower case A` the same letter? What about bold, and italic? What about rotated? Fonts? Is ß a different letter or just a different way of drawing two lower case s? Ligatures? Is a lower case e in French without an accent the same as e in English? Is Roman numeral different than i? That barely scratches the surface of Western Europe, and now throw in the whole world with a rich diversity and history of writing. Thankfully experts have been working on this for several decades.

Codepoints

Unicode assigned a number (codepoint) per "letter". This doesn't work because some writing systems use a base letter (eg a consonant) and then add marks on the letter (eg vowels), and having one codepoint for each combination would be a very large number.

This led to combining codepoints that could modify a base, allowing adding accents, vowels, and other marks to represent human writing,

The fateful decision

There were two possible ways of representing these values - for example letter-e and combing-acute-accent

Modifier first

combing-acute-accent then letter-e

Base first

letter-e then combing-acute-accent

If modifier first had been chosen, then each letter as perceived by a human (ie with all the modifiers applied) would be easy to determine by computer code - accumulate modifiers until you find a base. You now have one user perceived letter (aka grapheme cluster).

Base first was chosen. That means code sees a codepoint, but has to keep looking ahead to see if there are more modifiers that apply to the base. It gets way more complicated with zero width joiners, variation selectors, regional indicators, and Indic syllabic categories to name a few.

The consequences

Because of the complexity, most programming languages just pretend that each codepoint is one "letter" and takes one unit of storage, but often worse (well worth a read) conflating byte encoding, grapheme clusters, and indexing. They then require third party libraries to get words, sentences, and line breaking.

I've been adding support for SQLite's full text search to APSW (my Python SQLite wrapper) and it became very clear very quickly that doing things properly was necessary.

SQLite has a unicode61 tokenizer but it works on individual codepoints and their categories on a version of Unicode from 2012 (to remain stable, and Lite). It doesn't implement any rules for words other than whitespace and punctuation which makes it unsuitable for the languages used by billions of people. It also doesn't handle regional indicators, variation selectors, zero width joiners, some emoji etc.

There is a standard International Components for Unicode library available that deals with all the Unicode rules, text segmentation, locale specific customisation, message generation, and localization. It is available as a 5MB Python binding, to the 5MB of code and 30MB of tables making up ICU. ICU isn't installed on every platform, and when it is you have no control over the version.

The ICU solution is also impractical to distribute for cross platform libraries like APSW, so I decided to implement the rules. My final self contained library was 500kb of code and data with no platform dependencies.

Lessons learned

There are three sets of rules covering grapheme cluster, words, and sentences, and another set for line breaking.

I ended up rewriting/refactoring my code six times, learning a lot as a result.

No errors

It isn't immediately apparent, but there is no sequence of codepoints that can be considered an error. No matter how absurd a sequence is supplied, the code has to handle it, even when multiple rules apply and contradict each other.

Test data

Unicode publish test data of around 2,000 tests each for grapheme clusters, word, and sentence breaks. This isn't perfect as I had cases of bugs being present but still passing all the tests. Without those tests, there is zero chance I could have written correct code. I was often puzzling over test failures and the list of rules to work out exactly what sequence were intended to apply. I wish they would add even more test rules with longer sequences and even more combinations that hit multiple rules.

Break points

Break points turned out to be an excellent way of determining boundary locations. You do need to do an analysis step on segments - eg does it contain letters or numerals if you are looking for words.

State machines

It looks like you can do a state machine and that is what you see in the Rust implementation as well as the Python grapheme package (abandoned?). It is possible to do so for the grapheme clusters, but not word and sentence. Adding some adaptive code could perhaps work. In the end I found that you end up with far too many states (word segmentation has 25 different rules), and that rules can contradict. For example you should break after LF but not before ZWJ so what happens if they are in sequence? Adaptive code ends up having to do look ahead, and look behind (either reversing rules or adding more state).

The biggest problem with a state machine is it becomes very hard to map the code and states back to the series of rules. And as the rules get adjusted over time, it will become harder to merge the changes in.

What worked for me

  • I did my initial exploration in Python. This was far quicker to iterate, debug, update data structures etc. At the end I then converted to C for significantly improved performance, but starting in C would have taken considerably longer.

  • The code matches the rules in the same order and expressed the same way as the specification. This makes it easier to verify they match, and allows for easier changes as rules change over time. (Complex state machines make that a lot harder.)

  • Use look ahead for matching (I did briefly try look behind). You can mark your current spot, and then advance trying to match a rule on succeeding codepoints. If a rule match fails, revert back to the mark, and try the next rule. This is in theory less efficient, but in practise little real world text is like this.

  • I had to implement more Unicode tables. Some rules want information outside of that specified in the TR14/29 tables. For example the rules want to know if a codepoint is extended pictographic, or in a particular category. While some of the information may be available separately in other platform libraries, it may not be the same Unicode version.

  • The tables and rules look like they map onto enumerations. However they work far better as a bitset. Most tables fit within 32 bits, although some require 64.

    You can combine all the necessary data into the single bitset. For example the grapheme cluster rules also want to know the Indic syllabic categories which is a separate table. I could combine that information with the grapheme cluster tables in the same bitset, avoiding a separate lookup.

    Bitsets also allows code like the following:

    if char & (L | V | LV | LVT): ...
    

    and:

    MidNumLetQ  = (MidNumLet | Single_Quote)
    
    if char & MidNumLetQ: ...
    

    ... instead of:

    if char == L || char == V || char == LV || ...
    
  • The various tables end up as generated code. (Every project I looked at used a Python based tool to do that.) I put a block before each table in a comment giving statistics on how popular each value was, how many lines there were in the table, and other similar information. This made it a lot easier to get an overview - the tables have thousands of rows - and also helped verify that code changes had the expected effect.

  • It was easy to add other tables. Because I also have terminal output I could add width information, and even a table for which Unicode version a codepoint was added.

Performance

Performance matters because these are used for full text indexing, which could be hundreds of megabytes if not gigabytes of text. There is no point in doing a C Python extension unless it performs well. So here is a benchmark. The source text is the UN Declaration of Human Rights which has been translated into 300 languages. All 300 are concatenated together producing a 5.5MB file. 60 codepoints used in many of the TR14/29 tests are combined and repeatedly shuffled with the source text, and appended until there are 50 million codepoints.

For each library, how long it takes to process that text and return each segment is measured, producing a codepoints per second result. Note that this is measuring both how long it takes to find each break location, as well as how long Python takes to return that substring/segment of text.

ie more codepoints per second is better.

Benchmarking apsw.unicode         unicode version 15.1
grapheme codepoints per second:    5,122,888    segments:  49,155,642
    word codepoints per second:   13,953,370    segments:   9,834,540
sentence codepoints per second:   93,699,093    segments:     909,603
    line codepoints per second:   16,840,718    segments:   9,548,987

Benchmarking uniseg               unicode version 15.0.0
grapheme codepoints per second:      567,514    segments:  49,149,753
    word codepoints per second:      570,943    segments:  23,617,670
sentence        EXCEPTION KeyError('OTHER')
    line codepoints per second:      528,677    segments:   9,545,722

Benchmarking grapheme             unicode version 13.0.0
grapheme codepoints per second:    1,777,689    segments:  49,163,151

Benchmarking pyicu                unicode version 15.1
grapheme codepoints per second:    3,778,235    segments:  49,155,642
    word codepoints per second:    8,375,000    segments:  20,050,729
sentence codepoints per second:   98,976,577    segments:     910,041
    line codepoints per second:   15,864,217    segments:   9,531,173

Here are the size measurements adding together Python bytecode and any stripped shared libraries.

 32kb    grapheme
144kb    uniseg
600kb    apsw.unicode (can be 200kb smaller)
 40mb    pyicu

Table unrolling

The tables end up with each row containing three integer fields:

  • Start codepoint
  • End codepoint
  • Category

Everybody stores them sorted, and then uses a binary search to find the row matching a codepoint. There are 2 to 4 thousand rows in each table meaning up to 12 comparisons are needed to find a particular row, which is negligible to a CPU.

The contents of the tables are known at compile time, and do not change. That led me to thinking that instead of generating tables, I could generate the comparisons instead. (This is similar to loop unrolling).

The resulting compiled code turned out to be smaller than the corresponding data tables, so I went with that. (It is even branch predictor friendly which tables are not.) The densest compilation results in code about 60% the size of tables, while 16 byte aligned is around 95% (but with lots of padding nops <https://en.wikipedia.org/wiki/NOP_(code)>).

msvc defaults to densest and shaves 200kb off the binary size, while clang and gcc both do alignment. Examining the assembly code showed that it matched the comparisons.

Category: misc – Tags: unicode, python, apsw


Python 3 C extension experience update

APSW is my Python wrapper for SQLite's C interface gluing it to Pythons C interface.

Looking back

It has been two years since I ended Python 2 and early Python 3 support. While I was proud that you could continue to use any Python version of the previous two decades with corresponding SQLite versions and have your code continue to work without maintenance, it isn't how modern software development works.

Components interact with other components, which in turn do the same. There are many additional tools such as build systems, test systems, documentation generation, and cloud based full scale system integration verification. It is practically impossible to make all versions of all the things work with all the other versions of the other things, even if my little corner did.

I ended up with a simple end of life policy - I do one more release after a particular Python version goes end of life. That means APSW will not be the weak link.

Opportunity

Only supporting modern Python versions let me delete a bunch of C and Python code, and some documentation. Non developers may not realise it, but one of the great joys of being a developer is when you get to delete stuff. That frees up brain space for better things. The last things removed were some documentation saying that strings are Unicode (a legacy of Python 2), and removing the u prefix from some strings (like u"hello world") for the same reason. It felt good.

I could also take advantage of dataclasses in Python code and FASTCALL in C code. This week I greatly appreciated the walrus operator in some new code. I even had problems in some example code that needed Python to consume a noticeable amount of CPU time, but performance enhancements made that more difficult!

Most important code

If you change code for the better, there is always a probability you will have broken something else without realising it. This is why developers are wary of changing code, instead adding more, or copying and pasting.

That makes the tests the most important code by far. APSW's test code is a similar size to the C code, and tries to exercise all possible routes through the C. In addition to checking everything works, far more effort is expended on doing everything wrong, and ensuring that all combinations of problems that could happen do happen.

That thoroughness of the test code is what made it possible to reduce and improve the code base. It would have been essentially impossible otherwise.

When a new API is added to SQLite, my rough estimates on effort are:

  • 5% - Calling the API
  • 15% - Adding error handling, especially all the things that could go wrong. More on this below.
  • 5% - Updating documentation
  • 75% - Updating the test suite, exercising all the paths, running under all the validation tools and analyzers

Most important code, part 2

It may not seem like much, but most of the start of C functions looks like this:

/** .. method:: __init__(filename: str,
                         flags: int = SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE,
                         vfs: Optional[str] = None,
                         statementcachesize: int = 100)

Opens the named database ...

Of course that isn't code - it is a comment before the actual C code. It serves many duties.

Documentation
I use Sphinx for documentation, and that text ends up in the documentation in restructured text format.
Docstring
All Python objects have documentation, easily done in Python code. For C implemented objects the text has to be available to the C compiler, so the text is extracted and available in a header file processed to keep within C syntax.
Text signature
C objects can have a __text_signature__ used by inspect which is in yet another format with dollar signs and no return types. This StackOverflow answer has some more details.
Argument parsing
The str and int and default values are all relevant when you need C values to call into SQLite (const char *, and int or long long respectively). I used to use PyArg_ParseTupleAndKeywords which takes a format string with many Python extras, and escapes for your own conversions. Because the format string could disagree with the documented string, I have a tool that converts the comment into the correct defaults and formats. When I adopted FASTCALL it was then a simple matter of generating code to do the parsing directly. It also meant I can produce far better clearer error messages. (There is a Python internal use tool that is similar.)
Type stubs
For Python code implemented in C, you can provide type stubs which are Python syntax for those C implemented items. I discovered that Visual Studio Code would display any docsrings included with the typing information, so the APSW type stubs include that too. When editing Python code, you can't tell that APSW is implemented in C.

That is a lot of heavy lifting for some "code".

MVP tool: cvise

cvise is a tool that takes a C file known to cause a problem, and reduces the size while verifying it still causes the problem.

While reworking the code base, I could easily detect problems. There are many places where C code calls Python which calls C code which calls Python, with the flow going through CPython's internals, APSW's code, and SQLite's code. All 3 projects have copious amounts of assertion checking so it is easy to detect something has happened that shouldn't.

I'd be able to cause problems using my 10,000 line test suite, but to narrow down and understand it you want a lot less, ideally less than 10 lines. Trying to do so manually is tedious and time consuming, and often the actual nature of the problem is different than you think it is.

cvise takes a --not-c flag, so I could feed it my test suite, which would rapidly hack it down to just enough to still reproduce the problem. It was always surprising and delightful, because doing that manually is very tedious.

CPython API

There has been a lot of work behind the scenes to clean up the APIs. I ended up with backport code to make newer apis available on older supported Pythons.

My favourite has been Py_NewRef which let many places go from two lines of code to one, and is also easy to search for.

Fastcall/Vectorcall

This was the most impactful. Vectorcall is the faster way of making calls, and fastcall is the faster way of receiving calls. Traditionally calling from C code required building a tuple for positional arguments and a dictionary for keyword arguments.

// Old style - convenient but slow making a Python tuple
PyObject_Call(object, "lssL", updatetype, databasename, tablename, rowid);

// New style - directly put arguments in a C array
PyObject *vargs[] = {NULL,
                     PyLong_FromLong(updatetype),
                     PyUnicode_FromString(databasename),
                     PyUnicode_FromString(tablename),
                     PyLong_FromLongLong(rowid)};
if (vargs[1] && vargs[2] && vargs[3] && vargs[4])
  PyObject_Vectorcall(object, vargs + 1, 4 | PY_VECTORCALL_ARGUMENTS_OFFSET, NULL);

The PY_VECTORCALL_ARGUMENTS_OFFSET magic allows that first NULL element to be used by Python's internal machinery instead of having to allocate a new array. Python will automatically create a new tuple if the receiving C code does things the old way.

I was curious what the benefits are on the receiving end and made a project that did benchmarking, as well as answering some related questions. It took 22 time units to receive the new style, and 158 to do it the old way - 7 times slower! With each Python release the former has been getting quicker and the latter slower.

Error handling

Writing code in Python is delightful. You don't have to worry about errors, and in the rare circumstances they happen, your program is stopped with an exception. You can add code at whatever level of the call hierarchy is most relevant if you want code to handle future occurrences.

In C code it is a totally different matter. Different APIs return error information in different ways. Errors have to be handled immediately. As an example, here is the code to get SQLite's compilation options in a Python list of strings - a total of 4 lines of code.

PyObject *list = PyList_New(0);
for(int i=0; sqlite3_compileoption_get(i); i++)
  PyList_Append(list, PyUnicode_FromString(sqlite3_compileoption_get(i)));
return list;

Now lets add error checking, and it has grown from 4 to 14 lines.

PyObject *list = PyList_New(0);
if(!list) // error indicated by NULL return
    return NULL;
for(int i=0; sqlite3_compileoption_get(i); i++)
{
    PyObject *option = PyUnicode_FromString(sqlite3_compileoption_get(i));
    if(!option) // error indicated by NULL return
    {
        Py_DECREF(list);
        return NULL;
    }
    int append = PyList_Append(list, option);
    Py_DECREF(option);  // list took a reference or failed
    if(append == -1) // error indicated by -1 return, or non-zero?
    {
        Py_DECREF(list);
        return NULL;
    }
}
return list;

Lots of repeated cleanups in error handling, so we resort to goto. It is the same number of lines of code, but a more useful pattern for more complex code when there are far more items needing cleanup.

PyObject *option = NULL, *list = PyList_New(0);
if(!list)
    goto error;
for(int i=0; sqlite3_compileoption_get(i); i++)
{
    option = PyUnicode_FromString(sqlite3_compileoption_get(i));
    if(!option)
        goto error;
    int append = PyList_Append(list, option);
    if(append == -1)
        goto error;
    Py_CLEAR(option);
}
return list;

error:
Py_XDECREF(list);
Py_XDECREF(option);
return NULL;

Note how doing error handling in C triples the code size, and this is calling one simple API. I estimate that around 75% of the C code in APSW is error handling. Here are the top 10 goto label names illustrating the point:

  • 157 finally
  • 75 error
  • 57 pyexception
  • 39 fail
  • 18 param_error
  • 11 end
  • 8 errorexit
  • 5 error_return
  • 3 success
  • 3 out_of_range

It gets worse

Looking back at the code above, the reason for failures would be running out of memory. Do you know how often those specific lines of code will be the ones that run out of memory?

NEVER

That's right - as a developer I had to write 3 times as much code as necessary, to handle conditions that will never happen in the real world.

The computers running this code also had to do all that extra checking for conditions that never happen. Your computer is doing that for all the software it is running - what a waste.

... and worse

Not only did I write all that extra code, I can't even make it fail, nor could you. A lot of code is shipped without it ever having been run by the developers, nor does it get run in production. If it ever did run there is no certainty it would do the right thing.

And this was a trivial example.

My reaction

There are however adversaries who are very good at at making things happen. Virtually every security issue you hear about it is because they have figured out the weakest bits of software, and how to tickle things just right so that kind of code does get executed.

The consequences of error handling not being done correctly are one or more of:

  • Nothing
  • Something that was expected to happen didn't
  • Resources are leaked
  • Corruption of state
  • Memory corruption
  • Corruption being exported into storage or the network
  • Delays
  • Infinite loops
  • Invariants no longer holding
  • Exposing private information

I don't want my code to be responsible for any of that.

How I test

I originally did it manually. The code looked something like this where FAIL is a macro taking the name of the location, the happy path, and the failure path. Testing would then set each name to fail, and execute the code.

PyObject *list;
FAIL("CompileList", list = PyList_New(0), list = PyErr_NoMemory());

That was very tedious, makes the code unreadable, and the various editors and other tools couldn't understand it, format it etc.

Statement expressions to the rescue with generated code, given a list of function names with this for PyList_New:

 1 #define PyList_New(...) \
 2 ({                                                                                                                                 \
 3     __auto_type _res_PyList_New = 0 ? PyList_New(__VA_ARGS__) : 0;                                                                 \
 4                                                                                                                                   \
 5     _res_PyList_New = (typeof (_res_PyList_New))APSW_FaultInjectControl("PyList_New", __FILE__, __func__, __LINE__, #__VA_ARGS__); \
 6                                                                                                                                   \
 7     if ((typeof (_res_PyList_New))0x1FACADE == _res_PyList_New)                                                                    \
 8       _res_PyList_New = PyList_New(__VA_ARGS__);                                                                                  \
 9     else if ((typeof(_res_PyList_New))0x2FACADE == _res_PyList_New)                                                                \
10     {                                                                                                                              \
11         PyList_New(__VA_ARGS__);                                                                                                   \
12         _res_PyList_New = (typeof (_res_PyList_New))18;                                                                            \
13     }                                                                                                                              \
14     _res_PyList_New;                                                                                                               \
15 })
  • Line 3 sets up a variable to store the return value without knowing what the return type is
  • Line 5 calls APSW_FaultInjectControl giving the function name, filename, calling function name, line number, and stringized arguments. The combination of all those uniquely identifies a location even when there are multiple calls on the same line.
  • Line 7 looks for the 0x1FACADE return value to mean go ahead and call the function normally as seen on line 8.
  • Line 9 looks for the 0x2FACADE return value to mean go ahead and call the function, but pretend it returned 18. This is necessary for closing functions because I do want them to close. 18 is a valid SQLite error code.
  • Line 14 provides the final value which came from the call on line 5 unless that returned 0x1FACADE/ 0x2FACADE.

There is a little more to it, but this lets me cause all the various calls to fail in various ways and have all that error checking code I wrote actually run.

Yes it found bugs that static analysis can't because of all the calling between CPython, SQLite, and APSW. Python has an error indicator that does cause some internal routines to behave differently when it is set. For example they may short circuit and return immediately doing nothing, or they may clear the indicator hiding that an error happened. The main interpreter loop gets upset when the indicator is set and C code returns values as though there were no problems.

I feel better knowing all my code runs, handles errors correctly, and that the errors never get hidden. (And yes I am proud of my magic hex constants.)

Python type annotations

Python always let you rapidly develop code without having to be excruciating precise in details. Duck typing is wonderful. But there has been an increasing tension because there are more and more components to interact with, and there is greater version churn (see the start of this post!).

In the olden days you referred to a component's documentation and memorized what you used most frequently. That isn't practical any more. The question is "when do you want to know about problems in the code?" The answer is as soon as possible, as it gets more expensive (time, effort, and often money) the later you find out, with the worst case being once customers depend on it. Ideally you want to know as your finger rises from typing something.

Type annotations have let that happen, especially for simpler problems. There are annotations for all of APSW's C and Python code (except the shell which is on the todo list). I've found it quite difficult to express duck typing, and some concepts are impossible like the number of arguments to a callable depends on a parameter when some other function was called. But I appreciate the effort made by all the tools, and it does save me effort as I type.

You do however still get amusing error messages for correct code due to limitations of annotations and the tools. It is reminiscent of C++ template errors. I leave you with one example you should not read, deliberately left as one long line and a scrollbar.

example-code.py:95: error: Argument 1 to "set_exec_trace" of "Cursor" has incompatible type "Callable[[Cursor, str, Union[Sequence[Union[None, int, float, bytes, str]], Dict[str, Union[None, int, float, bytes, str]]]], bool]"; expected "Optional[Callable[[Cursor, str, Union[Dict[str, Union[None, int, float, bytes, str]], Tuple[Union[None, int, float, bytes, str], ...], None]], bool]]"

Category: misc – Tags: apsw, python


Exit review: XFCE

I've been using XFCE for about a decade, after switching from Gnome. As a developer I have lots of windows scattered across multiple screens, and find a taskbar the most convenient way to manage them. Gnome 3 went a different way which didn't work for me.

XFCE worked very well, getting out of the way and letting me be productive. The apps like file managers, window managers, image viewers, etc are all fine, and you can run any others anyway.

But I've been wanting to experience Wayland for a while, and XFCE aren't there yet. It is a lot of work for the volunteers.

So I've switched to KDE for the next decade. Everything is Wayland. The default apps are busier.

Thank you XFCE for the last decade.

Category: misc – Tags: exit review


Exit review: Running my own email server

It has been over a year since I stopped running my own email server for me and a few friends. I had been doing so for over two decades!

Why run your own?

Email was and remains a ubiquitous communications mechanism, both for people and automation. When I started running my own server there were very few providers, and they had very low limits. There would be restrictions on attachment sizes and formats, and developer emails would often be rejected as spam. There was little in the way of configurability of incoming email.

Running my own email server let me remove all the restrictions. I accepted emails up to a gigabyte in size because sometimes that was necessary in the days before Dropbox. I was able to have whatever processing rules I wanted, and had full insight into all of the details that were going on.

What do you need?

You need to have an entire system with several configurations and components all working together.

Static firewall
You can filter out countries, internet service providers, cloud providers etc that aren't worth even accepting connections from
Dynamic firewall
A second layer of filtering based on observed undesirable behaviour. For example IP addresses sending you spam can be filtered for a time period.
Spam control (generic)
Various techniques are used to stop spam no matter which user it is going to. For example greylisting is very effective, dcc and rbl tell you if other systems have seen the same message. I also filtered all messages through a virus scanner.
Spam control (user specific)
Spamassassin has many rules with weighting added together to come up with a per message score. It includes how similar the email is to previous emails you received, which you have classified as good or spam, plus many other rules.
Filtering rules
You want to make messages matching various criteria be placed in folders, forwarded, rejected etc using per user scriptable rules.
IMAP server
To actually read the email using email clients or the programs builtin to various desktop and mobile devices you need one of these.
Webmail
And sometimes you want to use a browser, so you need something that presents a web front end.
Mail transfer
This component receives incoming email, and sends outgoing.
Other
You need to ensure there are backups, have authentication, logging, monitoring, DNS records, SSL/TLS certificates etc. Some of the components can use or even require database servers.

Over time there have been open source software projects that address these needs, including more integrated ones that address many at once. There are a nice variety each in different sweet spots.

Exit review

It is a positive experience having to construct a working system. You are exposed to several components that have to work together, read lots of documentation, create configuration, and deal with upgrades and improvements. Seeing how others have addressed that makes you better at them too. I had a working system all those decades that served us very well.

The reality though is you are really running a spam detection and rejection system. There was a new attempt every 3 seconds never ending. Each one results in your system logging what happened and why, and you are acutely aware that overall you are putting more effort into controlling each spam message than the senders put into the message.

I've since switched to Fastmail (obligatory affiliate link). They have done all the work listed above, but I can still see what is happening as though it was my system. For example they too use Spamassassin. What is noticeable is just how many and how large the headers are on each message, almost all generated in the service of detecting spam. It is nice that it is someone else's duty to maintain now.

Category: misc – Tags: exit review


APSW 3.37 is the last with Python 2 / early Python 3 support

This release of APSW (my Python wrapper for SQLite's C interface) is the last that will support Python 2, and earlier versions of Python 3 (before 3.7).

If you currently use APSW with Python 2/early Python 3, then you will want to pin the APSW version to 3.37. You will still be able to use this version of APSW with future versions of SQLite (supported till 2050). But new C level APIs won't be covered. The last C level API additions were 3.36 in June 2021 adding serialization and 3.37 in December 2021 adding autovacuum control

What does APSW support now ...

APSW supports every Python version 2.3 onwards (released 2003). It doesn't support earlier versions as there was no GIL API (needed for multi-threading support).

The downloads for the prebuilt Windows binary gives an idea of just how many Python versions that is (15). (Python 3.0 does actually work, but is missing a module used by the test suite.)

Many Python versions supported ...

Each release does involve building and testing all the combinations of 15 Python versions, 32 and 64 bit environment, and both UCS2 and UCS4 Unicode size for Python < 3.3, on multiple operating systems.

There are ~13k lines of C code making up APSW, with ~7k lines of Python code making up the test suite. It is that test suite that gives the confidence that all is working as intended.

... and why?

I wanted to make sure that APSW is the kind of module I would want to use. The most frustrating thing as a developer is that you want to change one thing (eg one library) and then find that forces you to change the versions of other components, or worse the runtime and dev tools (eg compiler).

I never made the guarantee, but it turned out to be:

You can change the APSW (and SQLite) versions, and nothing else. No other changes will be required and everything will continue to work, probably better.

This would apply to any project otherwise untouched since 2004!

There are two simple reasons:

  • Because I could - I do software development for a living, and not breaking things is a good idea (usually)
  • I would have to delete code that works

What happens next?

I am going to delete code that works, but it is mainly in blocks saying doing one thing for Python 2, another for early Python 3, and another for current Python 3.

My plan is to incrementally remove Python 2/early 3 code from the Python test suite and the C code base together, while updating documentation (only Python 3 types need to be mentioned). The test suite and coverage testing will hopefully catch any problems early.

I will be happy that the code base, testing, documentation, and tooling will all become smaller. That makes things less complex.

Other thoughts

The hardest part of porting APSW from Python 2 to 3 was the test suite which had to remain valid to both environments. For example it is easy to create invalid Unicode strings in Python 2 which I had to make sure the test suite checked.

It was about 10 times the amount of work making the Python test suite code changes, vs the C level API work. Python 3 wasn't that much different in terms of the C API (just some renaming and unification of int and long etc).

Category: misc – Tags: apsw, python


I scanned 3,768 photos and 2,799 slides

Does a physical photo you never look at really exist?

Our current devices and online services do a fantastic job of managing digital photos. There is face recognition, content recognition, maps, timelines etc. And it is all backed up in the cloud.

Meanwhile the physical photos languish inside a box, itself inside another box. All it takes is a few house moves over the years. There is a local company that will do scanning, but it is quite expensive and you still need to do most of the work yourself of extracting photos from albums, unsticking stacks of photos from each other, sorting out landscape from portrait orientation shots and more. The individual photos just aren't that valuable.

I also care about the physical to digital conversion parameters. For example what resolution do you scan the photos at? The higher the number the better the detail, the longer the scanning takes, the larger the file sizes become, and the detail may not actually be present in the print anyway. There are also all sorts of corrections for colour, blurring, dust, tone etc.

The reason I care is because I never want to do the scanning again! Consequently I pick high levels of fidelity, and almost no processing. The processing is very hard to undo when it makes a mistake. I also prefer capturing the photos as is, since that is how they do look now.

How hard could it be?

I briefly tried using mobile apps and phone to scan. That turns out not to be useful with terrible capture quality and apps I did not like. I resorted to a well reviewed flatbed scanner (surprisingly cheap) and a separate slide scanner.

The process itself is simple - load scanner, press buttons, wait, repeat. You do have to have focus of efficiency - an additional 1 second per item would add 2 hours to the total scanning time!

I finished my own photos quickly - all 200 of them. Then I volunteered to scan all family photos which is where the totals came from. That total is about a quarter of the size of my digital photos collection taken in the last two decades.

A small selection of slides and photos awaiting scanning

A small selection of slides and photos awaiting scanning

What did I learn?

It was fun. The photos themselves are like a time machine, with the oldest from 1907. People used to get dressed up in the olden days for photos! But just like other people's vacation photos (which many were), the pictures are mundane unless you were there. The backgrounds were interesting because they showed how things were then.

What surprised me the most was the sheer number of different print sizes. There was absolutely no consistency or standardisation at all. The scanner will scan multiple photos at once providing there is enough gap between them. Most of my time was spent fitting as many as possible onto the glass, like a real world tetris.

The colour reproduction was not what I expected. I had expected fading and yellowing, based on age. There was very little of that, and what there was had no age pattern.

There were a few non-photo items such as newspaper clippings, and two school report cards from the 1930s. They considered deportment the primary subject to grade!

One correlation I did note was the amount of notes on the back of photos and slides. The older they were, the more writing there was. By the 1970s there was usually nothing while the 1930s would have copious information about where, who, and why. Another was how many photos of an event there would be. For example a kid birthday party in the 1950's might have one picture. steadily increasing to 30 or more in 1990s.

Any tips?

Keep your fingers very dry! Any moisture (eg condensation from a cold drink you just had a sip of) will cause photos to stick together (even more), or to the scanner glass.

I put the photos/sides after scanning into batches of 100, separately bagging them with a numbered label corresponding to the folder name. This is to make it easier to go from the digital scan to finding the physical photo and copy any notes across.

Bonus Time Machine: Rare Historical Photographs

Sizes

Slides were all the same size, with the actual image size being in metric and the cardboard frame being in inches! (There were about 10 slides that were a different size.) I plotted photo sizes and how many at each size.

Photo sizes and count

The x axis is photo area, while the y axis is how many were at that size. Photos before the 90s usually had a white border, which the scanning software usually crops out. Sometimes it had writing, and in the 1950s would have the processing date.

Category: misc – Tags: photos


The aviation business as seen by a coder

A while ago I was flying across the Atlantic in a half full $250 million Boeing 747, wondering how it all worked financially. Multiplying the few hundred paid (round trip!) by passengers and 20 years didn't seem like it would pay for the plane let alone crews, fuel, maintenance and everything else. I even visited an airline once on business, and asked an employee during lunch break how an airline actually makes a profit. They did not know!

So here I am going to answer that, and also show the parallels to the software industry. The sources listed at the end include where I have picked up much of this information over the years.

Note

Unless otherwise stated, numbers given are for 2019. They are general ballparks for mainstream passenger airlines, with some variance throughout the industry, and US centric. The numbers for specific airlines and aircraft of interest to you are usually publicly available.

There are two primary parts to the business:

  • Operating an airline
  • Making planes

The software business often has companies that both make software for distribution to all, and then separately operate that software as a service. The aviation business has been separated for almost a century.

Operating an airline

Simple: You spend vast quantities of people, money, and time. You will also outsource a lot. In return you will get back slightly more than you spent. In numbers it may cost you 12.4 cents per seat per mile flown, and you get back 12.6 cents per seat per mile, averaged across your entire operation.

The single most profitable thing is flying a full load of paying passengers, the bigger the plane the better. Until the 1990s a load factor of 65% was considered good. These days 90%+ is the target and that is usually the break-even point for a low cost carrier. Not filling your plane will lose you money, the bigger the plane the worse the loss.

The good news is that most expenses are proportional to flying time. For example the flight crew, cabin crew, fuel, maintenance, ATC fees etc are based on flight hours. Those expenses scale up with the size of the plane. Planes are flown for 8 to 12 hours a day, with bigger numbers being preferable.

Aside: Paying for planes

You won't be shelling out $250 million for a 747. The list prices were always aspirational, just like in enterprise software sales. The planes become a monthly payment, with lessors handling turning the big price into smaller monthly ones.

Based on this posting you can get a rough idea of what it cost for new aircraft in 2019. The prices go a lot lower for used/older. The number of seats varies by airline (eg business class seats take more space, there may be more or less galley space depending on flight lengths, there are denser slimline seats and less dense more comfortable thicker seats). Each aircraft also has sub-models offering incremental seating capacity at an incremental price.

Aircraft Seats Price Monthly
A320 150 $44M $330K
B737 150 $47M $285K
A330 250 $82M $640K
B787 250 $119M $1M
A350 325 $148M $1.1M
B777 350 $155M $1.3M
A380 450 $230M $1.7M

With a software lens, it is also an enterprise sale. When someone spends tens of millions per plane, and usually buys many of them, there is a complex sales process. There are even legendary salesmen.

Making money

This is a giant optimization problem that plays out over months and years. You have to figure out what tradeoffs to make, and constantly update them while your competition do the same. Airlines have staff for which this is their job.

"Tightness"

You could schedule flights and turnaround time for the duration they usually take. But any delay then affects operations later in the day since aircraft and crews aren't where they should be causing cascading problems. Making things looser by adding padding gives more buffer should problems happen, but then those same planes and crews aren't making you any money. Worst case you may end up doing 3 flights a day with planes when you could have done 4, and your competitors may be doing 4, making a third more revenue and providing a better schedule.

Tightening things up is a great way of making the business more efficient, until events exceed your spare capacity (time, crews, planes, parts etc). That usually results in cancellations, irate passengers, and negative media coverage. The spare capacity has a cost too, especially as it isn't used most of the time.

Routes x Frequency

You want to serve as many places as possible to have a broad customer base. They won't want to split trips across multiple airlines. Travellers that are willing to pay more for tickets (eg business) also want more frequency so less of their time is spent waiting.

A common approach is to use smaller aircraft to feed passengers to larger hubs where they can be combined onto larger aircraft. But travellers willing to pay more want direct flights.

Fleet complexity

You can get aircraft for virtually any number of seats (eg 20 seat increments from 70 all the way to 550). That means you could operate the perfectly sized aircraft on each flight. Crew are certified for certain aircraft models, maintenance varies, engines vary and overall you become less able to make changes.

Some airlines avoid the complexity by only operating one type of aircraft which makes it far easier to move crews, maintenance, spares etc around as needed. Others embrace the complexity by being able to put the perfect aircraft on each route.

Fleet age

New aircraft are the most expensive to pay for and you'll have to work them hard to cover that. You do get to customize the cabin easily, making for a better onboard experience. Maintenance is also a lot less. (Replacing the worn out cabin in a 550 seat A380 costs about the same as a new 150 seat B737.)

Older aircraft are a lot cheaper, so it is easier to fly them only when it is worth it. But you'll have a more tired cabin. Maintenance costs also go up, and reliability will go down (a little). They will cost more to fly due to being less fuel efficient.

You'll notice some airlines that brag of youthful fleet get rid of planes at about 6 years old. That is when a heavy maintenance check (D Check) is done that involves taking almost the entire plane apart, checking everything, and putting it back together.

Different offerings

You will not succeed if you charge every passenger the same amount. The standard is to charge more the closer to departure. It is common to have different seating classes, but you need to get the ratios useful for the routes aircraft operate - eg you want business and economy class to be full, not just one and flying empty seats for the other.

The easiest is charging for things that don't require changing the aircraft, like food, priority boarding, wifi, baggage etc.

Outsourcing

You will never be able to handle everything yourself. For example if you operate one flight a day to an airport, then it won't make sense to have full time check in staff, full time maintenance, full time luggage handlers, full time cleaning staff etc.

Unless you have a lot of a certain aircraft, it won't make sense to do heavy maintenance yourself.

But outsourcing is more expensive - you are helping another company make money. The airlines outsource a lot of things to each other.

Bonus: Freight
A silver lining is carrying freight in the hold of passenger aircraft. About 90% of air freight used to be carried by passenger aircraft. They already fly where people go and have a timely schedule, so putting unused baggage space to work is pure gravy. It can also be what makes a flight that doesn't have a full passenger load still be profitable.

Making planes

Simple: You spend vast quantities of people, money, and time. You will also outsource a lot. In return you will get back more than you spent, eventually, if the aircraft programme is successful.

There is a lot involved - this series covers it, and it is only in part 17 where you are actually designing an aircraft.

Lines of code is a useful but very imperfect metric for software. (It does correlate with effort, complexity, bugs, functionality etc though). The equivalent for aircraft is weight, and that is how aircraft size is often measured. Weight has to be added to carry fuel (how far you fly), to contain passenger seats, and for aircraft elements like wings, landing gear, pressure vessel, catering etc. And more weight means more expensive to manufacture, design, purchase and operate.

The most important part is the manufacturing stage. The more you do something the better you get at it, improving efficiency. The standard way of measuring this is to compare the cost to produce unit number n with unit number 2n - for example 10 vs 20, 50 vs 100, 500 vs 1,000. Aircraft manufacturing is around 77%.

An example were estimates the first Boeing 787 cost $2 billion to make. The machines had to be made, machines to make those machines, staff trained, procedures worked out, mistakes detected and prevented in the future etc. It would take close to 1,000 planes manufactured at that 77% improvement before the manufacturing cost meets the sale price listed earlier!

It is a careful choice of how much of the design and manufacturing to outsource. It isn't feasible to do all of it yourself. Outsourcing to specialists reduces your effort, reduces your control, but they also expect to get more of the rewards. The Boeing 787 programme tried significantly increasing the amount of outsourcing, and is a good case study.

Aside: Engines

Engines are purchased separately from the aircraft. There is no standard fitting between the aircraft and the engine, and an airframe + engine combination is what is certified. You are stuck with the same engine model for the lifetime of a particular airframe.

The airlines prefer as much engine choice as possible, while engine manufacturers prefer as little competition as possible. They also have large investments to pay back. For example a deal with Boeing was made by GE to be an exclusive engine supplier for the Boeing 777-300ER model.

How does a new aircraft work?

To interest the airlines, you'll need to have a 15% fuel consumption improvement over what is currently available. Much of that improvment will come from improved engines, while the rest comes from improved materials (especially lighter ones) and aerodynamic tweaks (designing a new wing is very expensive and effective). It will however require actually designing and building the airframe and engines before the exact numbers are found.

Just like version 1.0 software will have "issues", the first aircraft off the production line will too, usually being overweight. (That reduces payload & range, and increases fuel consumption.) There is usually some sort of performance guarantee.

Those initial aircraft are going to have the most teething issues. And they are going to cost you the most to make, after having spent billions of dollars and many years. They will also have lower second hand values. That means launch customers will strike a hard bargain for the aircraft that cost you the most to make!

Iterating

Bugs and small improvements are going to be found. Service bulletins (improvements) and airworthiness directives (affecting safety) are issued. There is careful tracking of each airframe since they could be implemented early during production, or later during maintenance.

A group of improvements to the airframe and engines can be bundled together into a "performance improvement package" - good for a percent or two in reduced fuel consumption. That makes for an easy upsell to customers whose aircraft haven't been manufactured yet. It is rarely sensible to retrofit existing airframes.

The aircraft manufacturer is now in a good position to make some good money. Every successful aircraft has been stretched - putting an additional fuselage frames ahead and behind the wing (to maintain center of gravity) and making space for a few more seats. The goal is keep everything else similar - avoiding new crew training, different maintenance etc. The airlines like it too - if you are flying a route with 200 seats that you routinely fill then the same plane slightly longer and slightly heavier with the same crew and 220 seats makes things easy. As an example wikipedia lists the 747 derivatives doing just that.

Semver

While software has it's versioning, aircraft have a different convention. Using the Boeing 747 as an example:

747: Refers to all aircraft of this family

747-100: The first model produced. The second was 747-200 etc. It isn't always the case that the model starts at -100 - eg if the second is expected to be a smaller aircraft then models may start at -200 with a -100 coming later.

747-436: While the model is conventionally referred to as the -400, they are different for each customer. The -436 is what British Airways had because of their specific engine and other choices like how the cabin is configured. There needs to be plumbing for toilets, electrical power for the galleys, and often choices about space being used for bags or additional fuel tanks. One documentary I saw years ago explained how customers could choose whether the clipboard clasp on the captains controls could be at the side or on top. When spending millions, the customer gets to decide!

Rewriting from scratch

By far the easiest thing to do is keep tweaking existing models. It costs about $2bn and 3 years to update and certify a new engine, and you may even be able to get the engine manufacturer to pay for that. The Boeing 737 has been going since 1968! That compares to the $20bn and 10 years for a clean sheet design, if everything goes well.

Eventually it gets too difficult - making the airframe longer becomes impractical, or needs longer landing gear which needs larger landing gear bays which forces all the other belly components to move. The efficiency improvements are harder to come by since you've done it several times. Rewriting from scratch will fix all these and more, but you won't know for 10 years. It is a difficult complex decision, just as with software.

Good sources

Leeham News and Analysis
Excellent coverage of airlines and manufacturers, with good in depth analysis.
Skyships Eng
Wikipedia has good textual pages for aircraft. This Youtube channel has discussion and video of commercial aircraft history and operations.
Cranky Flier
Covers airline operations. During the pandemic Cranky has shown how the airlines kept updating their schedules and routes. There are also interviews with airline executives, and airport operators.
The Aviation Herald
Covers daily operational incidents world wide. You get an idea of how often there are bird strikes, engine issues, tail strikes etc happen (about 4 a day).

Category: misc – Tags: aviation


Metric Won!

It turns out the metric system has completely won, but some just haven't realised it yet! If you are in the US, and need a definitive ruling on distance, weight etc then you'd think that somewhere there is the golden definition of an inch, a gallon, a pound and so on. There is. They are defined in terms of the metric system.

To quote at 8m45s in:

The most ridiculous thing about all of this? Every single one of these imperial measurements are legally defined by the metric system. America is already using the metric system, and most of the population is oblivious to is.

The imperial system wasn't even sensible. The US uses feet, but also uses surveying feet which are finally becoming one. US and UK gallons remain gratuitously different, so miles per gallons don't translate.

The final frontier are recipes randomly switching amongst volumes and weights with imperial units (quick: what is the weight difference between a fluid ounce and an ounce of water or butter?)

Category: misc – Tags: metric


A decade of hindsight

I wrote a bunch of stuff over the last 10 years. Now that we know what happened, this is my look back. Followup is most recent posts first, and then getting older.

History podcasts did well, and keep getting better.

I liked the Casio Smartwatch, but WearOS and its apps aren't getting much development. (It was particularly frustrating when even Google didn't bother to keep their apps working, and some third party ones just stopped working one day.) It turns out the Apple Watch is also 50m water resistant, will also run for 30 days in power reserve mode, and can also be charged without taking the watch off. Those were the important base features of the Casio to me. Apple watch also has a 1,000 nit display (ie sunlight readable), and you get microphone, speakers, and NFC. Plus the rectangular display is better for those like me who prefer digital watchfaces with lots of information. Even Google's apps work better. I switched.

I keep wishing Emacs well. Language servers have made development environments easier. In the end I did abandon Atom in favour of Visual Studio Code. While vscode doesn't have tramp mode, the remote development is good enough.

I was very wrong about Mario Kart 8. It is a lot of fun, and Nintendo fixed many of the Wii version issues. We try the Wii version again every now and then, and it seems less fun than we remembered.

I had to switch from Nikola to Pelican. The main feature of Pelican is a far slower development pace, and not sucking in lots of dependencies. Nikola went full tilt adding many features quickly, but that made it hard to run infrequently since everything would be a lot harder to update. Additionally every time I ran it, there was a blizzard of messages about deprecations and configuration changes.

Support is still a problem. It is still usually treated as a cost centre, with incentives to do as little as possible. It is easier to support smaller numbers of customers who have paid more for a product, but offering supoort to large numbers of people cheaply doesn't seem to be done by anyone.

SSL was fixed by Lets Encrypt.

RSS is still around, but not as mainstream as the days of Google Reader. I still use it.

Self driving cars are still just around the corner, while there is more evidence of just how bad human drivers are.

I still have trouble with voice recognition. Most of the services do get it mostly right now, but when they get it wrong it is very wrong. Any other humans in the room usually also burst into laughter, due to what the service did. For example I may ask for a temperature conversion, and instead the service will start reading out some obscure fact.

Category: misc


Exit Review: Python 2 (and some related thoughts)

Python 2 has come to an end. I ported the last of my personal scripts to Python 3 a few months ago.

Perhaps the greatest feature of Python 2 was that after the first few releases, it stayed stable. Code ran and worked. New releases didn't break anything. It was predictable. And existing Python 2 code won't break for a long time.

The end of Python 2 has led to the end of that stability, which isn't a bad thing. Python 3 is now competing across a broader ecosystem of languages and environments trying to improve developer and runtime efficiency. Great!

I did see a quote that Python is generally the second best solution to any problem. That is a good summary, and shows why Python is so useful when you need to solve many different problems. It is also my review of Python 2.

So let's have some musings ...

Python has had poor timing. The first Python release (1994) was when unicode was being developed, so the second major Python version (2000) had to bolt on unicode support. But if it had waited a few more years, then things could have been simpler by going straight to utf8 (see also PEP 0538).

Every language has been adding async with Python 3 (2008) increasing support with each minor release. However like most other languages, functions ended up coloured. This will end up solved, almost certainly by having the runtime automagically doing the right thing.

Python 3 made a big mistake with the 2to3 tool. It works exactly as described. But it had the unfortunate effect of maintainers keeping their code in Python 2, and using that to make releases that supported both Python 2 and 3. The counter-example is javascript where tools provide the most recent syntax and transpiling to support older versions. Hopefully future Python migration tools will follow the same pattern so that code can be maintained in the most recent release, and transpiled to support older versions. This should also be the case for using the C API.

The CPython C API is quite nice for a C based object API. Even the internal objects use it. It followed the standard pattern of the time with an object (structure) pointer and methods taking it as a parameter. There are also macros for "optimised access". But this style makes changing underlying implementation details difficult, as alternate Python interpeter implementations have found out. If for example a handle based API was used instead, then it would have been slower due to an indirection, but allow easier changing of implementation details.

Another mistake was not namespacing the third party package repository PyPI. Others have made the same mistake. For example when SourceForge was a thing, they did not use namespacing so the urls were sf.net/projectname - which then led to issues over who legitimately owned projectname. Github added namespaces so the urls are github.com/user/projectname. (user can also be an organization.) This means the same projectname can exist many times over. That makes forking really easy, and is perhaps one of the most important software freedoms.

Using NPM as an example, this is the only package that can be named database. It hasn't been updated in 6 years. On PyPI this is apsw and hasn't been updated in 5 years. (I am the apsw author updating it about quarterly but not the publisher on PyPI for reasons.) Go does use namespacing. A single namespace prevents forks (under the same name) and also makes name squatting very easy. Hopefully Python will figure out a nice solution.

Category: misc – Tags: exit review, python

Contact me