I recently implemented Unicode Technical Report #29 which covers how to turn "text" into "letters", words, and sentences. I also ended up implementing Technical Report #14 which covers line breaking - good places to start text on the next line when the current line is getting too long.
My guess is that less than 10 developers have fully implemented both of these, and I have now joined that very small club.
How hard could it be?
Unicode aims to have one way of representing human text for computers. Before its widespread adoption, different computer systems and even programs had completely different and incompatible ways of representing text, and it was a mess to read, write, print, copy, and back up text.
This is an extraordinarily hard problem to solve. Are upper and lower case A the same letter? What about bold, and italic? What about rotated? Fonts? Is ß a different letter or just a different way of drawing two lower case s? Ligatures? Is a lower case e in French without an accent the same as e in English? Is Roman numeral ⅰ different from i? That barely scratches the surface of Western Europe, and now throw in the whole world with a rich diversity and history of writing. Thankfully experts have been working on this for several decades.
Codepoints
Unicode assigned a number (codepoint) per "letter". That doesn't work on its own because some writing systems use a base letter (eg a consonant) and then add marks to it (eg vowels), and having one codepoint for every combination would require an enormous number of codepoints.
This led to combining codepoints that modify a base, allowing accents, vowels, and other marks to be added to represent human writing.
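A quick Python illustration of the difference: é can be one precomposed codepoint or a base letter followed by a combining accent. They display identically but are different codepoint sequences (normalisation can convert between them).

```python
import unicodedata

precomposed = "\u00E9"     # é as a single codepoint
combined    = "e\u0301"    # letter-e followed by combining-acute-accent

print(precomposed == combined)                                # False
print(len(precomposed), len(combined))                        # 1 2
print(unicodedata.normalize("NFC", combined) == precomposed)  # True
```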
The fateful decision
There were two possible ways of ordering these codepoints - for example letter-e and combining-acute-accent
Modifier first
combining-acute-accent then letter-e
Base first
letter-e then combining-acute-accent
If modifier first had been chosen, then each letter as perceived by a human (ie with all the modifiers applied) would be easy to determine by computer code - accumulate modifiers until you find a base. You now have one user perceived letter (aka grapheme cluster).
Base first was chosen. That means code sees a codepoint, but has to keep looking ahead to see if there are more modifiers that apply to the base. It gets way more complicated with zero width joiners, variation selectors, regional indicators, and Indic syllabic categories to name a few.
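To make the look ahead concrete, here is a deliberately naive Python sketch of base-first clustering. It only handles combining marks (via unicodedata) and ignores ZWJ, regional indicators, Hangul, Indic categories and the rest of the rules - just enough to show the forward scanning that base-first forces on you.

```python
import unicodedata

def naive_clusters(text):
    """Much simplified clustering: a base codepoint plus any following
    combining marks.  Real TR29 grapheme clusters need many more rules."""
    i = 0
    while i < len(text):
        start = i
        i += 1
        # look ahead: keep consuming codepoints that combine with the base
        while i < len(text) and unicodedata.combining(text[i]):
            i += 1
        yield text[start:i]

print(list(naive_clusters("e\u0301a\u0300b")))  # ['é', 'à', 'b']
```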
The consequences
Because of the complexity, most programming languages just pretend that each codepoint is one "letter" taking one unit of storage, but often do worse (well worth a read), conflating byte encoding, grapheme clusters, and indexing. They then require third party libraries to get words, sentences, and line breaking.
I've been adding support for SQLite's full text search to APSW (my Python SQLite wrapper) and it became very clear very quickly that doing things properly was necessary.
SQLite has a unicode61 tokenizer but it works on individual codepoints and their categories on a version of Unicode from 2012 (to remain stable, and Lite). It doesn't implement any rules for words other than whitespace and punctuation which makes it unsuitable for the languages used by billions of people. It also doesn't handle regional indicators, variation selectors, zero width joiners, some emoji etc.
There is a standard International Components for Unicode library available that deals with all the Unicode rules, text segmentation, locale specific customisation, message generation, and localization. It is available as a 5MB Python binding, to the 5MB of code and 30MB of tables making up ICU. ICU isn't installed on every platform, and when it is you have no control over the version.
The ICU solution is also impractical to distribute for cross platform libraries like APSW, so I decided to implement the rules. My final self contained library was 500kb of code and data with no platform dependencies.
Lessons learned
There are three sets of rules covering grapheme cluster, words, and sentences, and another set for line breaking.
I ended up rewriting/refactoring my code six times, learning a lot as a result.
No errors
It isn't immediately apparent, but there is no sequence of codepoints that can be considered an error. No matter how absurd a sequence is supplied, the code has to handle it, even when multiple rules apply and contradict each other.
Test data
Unicode publishes test data of around 2,000 tests each for grapheme cluster, word, and sentence breaks. This isn't perfect as I had cases of bugs being present but still passing all the tests. Without those tests, there is zero chance I could have written correct code. I was often puzzling over test failures and the list of rules to work out exactly which rules were intended to apply. I wish they would add even more tests with longer sequences and even more combinations that hit multiple rules.
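The test files share a simple format: hex codepoints separated by ÷ (a break is allowed here) and × (no break), with an explanatory comment after #. A minimal parser, sketched in Python, looks something like this:

```python
def parse_break_tests(path):
    """Parse a Unicode segmentation test file (eg GraphemeBreakTest.txt).
    Each line alternates break markers and hex codepoints:
        ÷ 0061 × 0308 ÷
    Yields (codepoints, break_offsets) pairs."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()
            if not line:
                continue
            codepoints, breaks = [], []
            for token in line.split():
                if token == "÷":
                    breaks.append(len(codepoints))
                elif token == "×":
                    pass  # no break between the surrounding codepoints
                else:
                    codepoints.append(int(token, 16))
            yield codepoints, breaks
```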
Break points
Break points (offsets into the text) turned out to be an excellent way of reporting boundary locations. You do then need an analysis step on the segments between them - eg does a segment contain letters or numerals if you are looking for words.
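As a rough sketch of what I mean (the offsets interface here is illustrative, not my actual API): the segmenter just reports break offsets, and a separate cheap pass decides which of the resulting segments count as words.

```python
import unicodedata

def words(text, break_offsets):
    """break_offsets is assumed to start at 0 and end at len(text)."""
    for start, end in zip(break_offsets, break_offsets[1:]):
        segment = text[start:end]
        # keep segments containing at least one letter or numeral
        if any(unicodedata.category(c)[0] in ("L", "N") for c in segment):
            yield start, end, segment
```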
State machines
It looks like you can do a state machine, and that is what you see in the Rust implementation as well as the Python grapheme package (abandoned?). It is possible for grapheme clusters, but not for words and sentences. Adding some adaptive code could perhaps work. In the end I found that you end up with far too many states (word segmentation has 25 different rules), and that rules can contradict. For example you should break after LF but not before ZWJ, so what happens if they are in sequence? Adaptive code ends up having to do look ahead and look behind (either reversing rules or adding more state).
The biggest problem with a state machine is it becomes very hard to map the code and states back to the series of rules. And as the rules get adjusted over time, it will become harder to merge the changes in.
What worked for me
I did my initial exploration in Python. This was far quicker to iterate, debug, update data structures etc. At the end I then converted to C for significantly improved performance, but starting in C would have taken considerably longer.
The code matches the rules in the same order and expressed the same way as the specification. This makes it easier to verify they match, and allows for easier changes as rules change over time. (Complex state machines make that a lot harder.)
Use look ahead for matching (I did briefly try look behind). You can mark your current spot, and then advance trying to match a rule on succeeding codepoints. If a rule match fails, revert back to the mark, and try the next rule. This is in theory less efficient, but in practice little real world text is like this.
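In outline (with made-up rule callables, not my actual code), the mark/advance/revert loop looks like this:

```python
def breaks(codepoints, rules):
    """Yield break offsets.  Each rule is tried in specification order,
    starting from the marked position; a rule either returns the offset
    of the next break (having matched some codepoints) or None."""
    pos = 0
    while pos < len(codepoints):
        mark = pos
        for rule in rules:
            next_break = rule(codepoints, mark)
            if next_break is not None:
                pos = next_break
                break
            # rule failed: we are still at the mark, try the next rule
        else:
            # no rule matched: fall back to breaking after this codepoint
            pos = mark + 1
        yield pos
```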
I had to implement more Unicode tables. Some rules want information outside of that specified in the TR14/29 tables. For example the rules want to know if a codepoint is extended pictographic, or in a particular category. While some of the information may be available separately in other platform libraries, it may not be the same Unicode version.
The tables and rules look like they map onto enumerations. However they work far better as a bitset. Most tables fit within 32 bits, although some require 64.
You can combine all the necessary data into a single bitset. For example the grapheme cluster rules also want to know the Indic syllabic category, which is a separate table. I could combine that information with the grapheme cluster table in the same bitset, avoiding a separate lookup.
Bitsets also allow code like the following:
if char & (L | V | LV | LVT): ...
and:
MidNumLetQ = (MidNumLet | Single_Quote)
if char & MidNumLetQ: ...
... instead of:
if char == L or char == V or char == LV or ...
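For instance (the bit assignments here are invented for illustration; the property names are from TR29), each property value gets its own bit, derived classes are ORs of their members, and membership tests become a single mask operation:

```python
# each property value gets its own bit
L            = 1 << 0   # Hangul syllable type L
V            = 1 << 1
LV           = 1 << 2
LVT          = 1 << 3
MidNumLet    = 1 << 4
Single_Quote = 1 << 5

# derived classes are just ORs of their members
MidNumLetQ = MidNumLet | Single_Quote

def is_hangul_vowel_or_syllable(props):
    # one mask test instead of a chain of equality comparisons
    return bool(props & (L | V | LV | LVT))
```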
The various tables end up as generated code. (Every project I looked at used a Python based tool to do that.) I put a block before each table in a comment giving statistics on how popular each value was, how many lines there were in the table, and other similar information. This made it a lot easier to get an overview - the tables have thousands of rows - and also helped verify that code changes had the expected effect.
It was easy to add other tables. Because I also have terminal output I could add width information, and even a table for which Unicode version a codepoint was added.
Performance
Performance matters because these are used for full text indexing, which could be hundreds of megabytes if not gigabytes of text. There is no point in doing a C Python extension unless it performs well. So here is a benchmark. The source text is the UN Declaration of Human Rights which has been translated into 300 languages. All 300 are concatenated together producing a 5.5MB file. 60 codepoints used in many of the TR14/29 tests are combined and repeatedly shuffled with the source text, and appended until there are 50 million codepoints.
For each library, how long it takes to process that text and return each segment is measured, producing a codepoints per second result. Note that this is measuring both how long it takes to find each break location, as well as how long Python takes to return that substring/segment of text.
ie more codepoints per second is better.
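The measurement loop is roughly the following sketch (the segmenter interface, yielding (start, end) offsets, is hypothetical):

```python
import time

def codepoints_per_second(segmenter, text):
    """Time finding every break and extracting every segment,
    returning (codepoints per second, segment count)."""
    begin = time.perf_counter()
    segments = 0
    for start, end in segmenter(text):
        _ = text[start:end]   # include the cost of returning the segment
        segments += 1
    elapsed = time.perf_counter() - begin
    return len(text) / elapsed, segments
```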
Benchmarking apsw.unicode   unicode version 15.1
  grapheme   codepoints per second:  5,122,888   segments: 49,155,642
  word       codepoints per second: 13,953,370   segments:  9,834,540
  sentence   codepoints per second: 93,699,093   segments:    909,603
  line       codepoints per second: 16,840,718   segments:  9,548,987

Benchmarking uniseg   unicode version 15.0.0
  grapheme   codepoints per second:    567,514   segments: 49,149,753
  word       codepoints per second:    570,943   segments: 23,617,670
  sentence   EXCEPTION KeyError('OTHER')
  line       codepoints per second:    528,677   segments:  9,545,722

Benchmarking grapheme   unicode version 13.0.0
  grapheme   codepoints per second:  1,777,689   segments: 49,163,151

Benchmarking pyicu   unicode version 15.1
  grapheme   codepoints per second:  3,778,235   segments: 49,155,642
  word       codepoints per second:  8,375,000   segments: 20,050,729
  sentence   codepoints per second: 98,976,577   segments:    910,041
  line       codepoints per second: 15,864,217   segments:  9,531,173
Here are the size measurements adding together Python bytecode and any stripped shared libraries.
32kb   grapheme
144kb  uniseg
600kb  apsw.unicode (can be 200kb smaller)
40mb   pyicu
Table unrolling
The tables end up with each row containing three integer fields:
- Start codepoint
- End codepoint
- Category
Everybody stores them sorted, and then uses a binary search to find the row matching a codepoint. There are 2 to 4 thousand rows in each table meaning up to 12 comparisons are needed to find a particular row, which is negligible to a CPU.
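In Python terms the conventional lookup is roughly this (the rows shown are invented for illustration):

```python
import bisect

# rows of (start codepoint, end codepoint, category), sorted by start
TABLE = [
    (0x0000, 0x0040, "Other"),
    (0x0041, 0x005A, "ALetter"),
    (0x0061, 0x007A, "ALetter"),
]
STARTS = [row[0] for row in TABLE]

def category(codepoint):
    # binary search for the last row starting at or before the codepoint
    i = bisect.bisect_right(STARTS, codepoint) - 1
    if i >= 0 and TABLE[i][0] <= codepoint <= TABLE[i][1]:
        return TABLE[i][2]
    return "Other"
```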
The contents of the tables are known at compile time, and do not change. That led me to thinking that instead of generating tables, I could generate the comparisons instead. (This is similar to loop unrolling).
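A sketch of such a generator in Python, recursively splitting the sorted rows and emitting nested comparisons in C-like form (simplified; the rows and category names are placeholders):

```python
def emit(rows, indent="    "):
    """rows: sorted list of (start, end, category) tuples.
    Returns nested if/else comparisons instead of a data table."""
    if len(rows) == 1:
        start, end, cat = rows[0]
        return (f"{indent}if (0x{start:04X} <= c && c <= 0x{end:04X}) "
                f"return {cat};\n")
    mid = len(rows) // 2
    out = f"{indent}if (c < 0x{rows[mid][0]:04X}) {{\n"
    out += emit(rows[:mid], indent + "    ")
    out += f"{indent}}} else {{\n"
    out += emit(rows[mid:], indent + "    ")
    out += f"{indent}}}\n"
    return out

print(emit([(0x00, 0x40, "Other"), (0x41, 0x5A, "ALetter"),
            (0x61, 0x7A, "ALetter")]))
```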
The resulting compiled code turned out to be smaller than the corresponding data tables, so I went with that. (It is even branch-predictor friendly, which tables are not.) The densest compilation results in code about 60% the size of the tables, while 16 byte aligned is around 95% (but with lots of padding nops <https://en.wikipedia.org/wiki/NOP_(code)>).
msvc defaults to densest and shaves 200kb off the binary size, while clang and gcc both do alignment. Examining the assembly code showed that it matched the comparisons.