Posts in this category

Sun, 05 Mar 2017

Perl 6 By Example: A Unicode Search Tool

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).

Every so often I have to identify or research some Unicode characters. There's a tool called uni in the Perl 5 distribution App::Uni.

Let's reimplement its basic functionality in a few lines of Perl 6 code and use that as an occasion to talk about Unicode support in Perl 6.

If you give it one character on the command line, it prints out a description of the character:

$ uni 🕐
🕐 - U+1f550 - CLOCK FACE ONE OCLOCK

If you give it a longer string instead, it searches in the list of Unicode character names and prints out the same information for each character whose description matches the search string:

$ uni third|head -n5
⅓ - U+02153 - VULGAR FRACTION ONE THIRD
⅔ - U+02154 - VULGAR FRACTION TWO THIRDS
↉ - U+02189 - VULGAR FRACTION ZERO THIRDS
㆛ - U+0319b - IDEOGRAPHIC ANNOTATION THIRD MARK
𐄺 - U+1013a - AEGEAN WEIGHT THIRD SUBUNIT

Each line corresponds to what Unicode calls a "code point", which is usually a character on its own, but occasionally also something like a U+00300 - COMBINING GRAVE ACCENT, which, combined with a a - U+00061 - LATIN SMALL LETTER A makes the character à.

Perl 6 offers a method uniname in both the classes Str and Int that produces the Unicode code point name for a given character, either in its direct character form, or in the form the code point number. With that, the first part of uni's desired functionality:

#!/usr/bin/env perl6

use v6;

sub format-codepoint(Int $codepoint) {
    sprintf "%s - U+%05x - %s\n",
        $codepoint.chr,
        $codepoint,
        $codepoint.uniname;
}

multi sub MAIN(Str $x where .chars == 1) {
    print format-codepoint($x.ord);
}

Let's look at it in action:

$ uni ø
ø - U+000f8 - LATIN SMALL LETTER O WITH STROKE

The chr method turns a code point number into the character and ord is the reverse, in other words: from character to code point number.

The second part, searching in all Unicode character names, works by brute-force enumerating all possible characters and searching through their uniname:

multi sub MAIN($search is copy) {
    $search.=uc;
    for 1..0x10FFFF -> $codepoint {
        if $codepoint.uniname.contains($search) {
            print format-codepoint($codepoint);
        }
    }
}

Since all character names are in upper case, the search term is first converted to upper case with $search.=uc, which is short for $search = $search.uc. By default, parameters are read only, which is why its declaration here uses is copy to prevent that.

Instead of this rather imperative style, we can also formulate it in a more functional style. We could think of it as a list of all characters, which we whittle down to those characters that interest us, to finally format them the way we want:

multi sub MAIN($search is copy) {
    $search.=uc;
    print (1..0x10FFFF).grep(*.uniname.contains($search))
                       .map(&format-codepoint)
                       .join;
}

To make it easier to identify (rather than search for) a string of more than one character, an explicit option can help disambiguate:

multi sub MAIN($x, Bool :$identify!) {
    print $x.ords.map(&format-codepoint).join;
}

Str.ords returns the list of code points that make up the string. With this multi candidate of sub MAIN in place, we can do something like

$ uni --identify øre
ø - U+000f8 - LATIN SMALL LETTER O WITH STROKE
r - U+00072 - LATIN SMALL LETTER R
e - U+00065 - LATIN SMALL LETTER E

Code Points, Grapheme Clusters and Bytes

As alluded to above, not all code points are fully-fledged characters on their own. Or put another way, some things that we visually identify as a single character are actually made up of several code points. Unicode calls these sequences of one base character and potentially several combining characters as a grapheme cluster.

Strings in Perl 6 are based on these grapheme clusters. If you get a list of characters in string with $str.comb, or extract a substring with $str.substr(0, 4), match a regex against a string, determine the length, or do any other operation on a string, the unit is always the grapheme cluster. This best fits our intuitive understanding of what a character is and avoids accidentally tearing apart a logical character through a substr, comb or similar operation:

my $s = "ø\c[COMBINING TILDE]";
say $s;         # ø̃
say $s.chars;   # 1

The Uni type is akin to a string and represents a sequence of codepoints. It is useful in edge cases, but doesn't support the same wealth of operations as Str. The typical way to go from Str to a Uni value is to use one of the NFC, NFD, NFKC, or NFKD methods, which yield a Uni value in the normalization form of the same name.

Below the Uni level you can also represent strings as bytes by choosing an encoding. If you want to get from string to the byte level, call the encode method:

my $bytes = 'Perl 6'.encode('UTF-8');

UTF-8 is the default encoding and also the one Perl 6 assumes when reading source files. The result is something that does the Blob role; you can access individual bytes with positional indexing, such as $bytes[0]. The decode method helps you to convert a Blob to a Str.

Numbers

Number literals in Perl 6 aren't limited to the Arabic digits we are so used to in the English speaking part of the world. All Unicode code points that have the Decimal_Number (short Nd) property are allowed, so you can for example use Bengali digits:

say ৪২;             # 42

The same holds true for string to number conversions:

say "৪২".Int;       # 42

For other numeric code points you can use the unival method to obtain its numeric value:

say "\c[TIBETAN DIGIT HALF ZERO]".unival;

which produces the output -0.5 and also illustrates how to use a codepoint by name inside a string literal.

Other Unicode Properties

The uniprop method in type Str returns the general category by default:

say "ø".uniprop;                            # Ll
say "\c[TIBETAN DIGIT HALF ZERO]".uniprop;  # No

The return value needs some Unicode knowledge in order to make sense of it, or one could read Unicode's Technical Report 44 for the gory details. Ll stands for Letter_Lowercase, No is Other_Number. This is what Unicode calls the General Category, but you can ask the uniprop (or uniprop-bool method if you're only interested in a boolean result) for other properties as well:

say "a".uniprop-bool('ASCII_Hex_Digit');    # True
say "ü".uniprop-bool('Numeric_Type');       # False
say ".".uniprop("Word_Break");              # MidNumLet

Collation

Sorting strings starts to become complicated when you're not limited to ASCII characters. Perl 6's sort method uses the cmp infix operator, which does a pretty standard lexicographic comparison based on the codepoint number.

If you need to use a more sophisticated collation algorithm, Rakudo 2017.02 and newer offer the Unicode Collation Algorithm as an experimental feature:

my @list = <a ö ä Ä o ø>;
say @list.sort;                     # (a o Ä ä ö ø)

use experimental :collation;
say @list.collate;                  # (a ä Ä o ö ø)
$*COLLATION.set(:tertiary(False));
say @list.collate;                  # (a Ä ä o ö ø)

The default sort considers any character with diacritics to be larger than ASCII characters, because that's how they appear in the code point list. On the other hand, collate knows that characters with diacritics belong directly after their base character, which is not perfect in every language, but internally a good compromise.

For Latin-based scripts, the primary sorting criteria is alphabetic, the secondary diacritics, and the third is case. $*COLLATION.set(:tertiary(False)) thus makes .collate ignore case, so it doesn't force lower case characters to come before upper case characters anymore.

At the time of writing, language specification of collation is not yet implemented.

Summary

Perl 6 takes languages other than English very seriously, and goes to great lengths to facilitate working with them and the characters they use.

This includes basing strings on grapheme clusters rather than code points, support for non-Arabic digits in numbers, and access to large parts of Unicode database through built-in methods.

[/perl-6] Permanent link

Categories