Sun, 05 Mar 2017
Perl 6 By Example: A Unicode Search Tool
This blog post is part of my ongoing project to write a book about Perl 6.
If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).
Every so often I have to identify or research some Unicode characters. There's a tool called uni in the Perl 5 distribution App::Uni.
Let's reimplement its basic functionality in a few lines of Perl 6 code and use that as an occasion to talk about Unicode support in Perl 6.
If you give it one character on the command line, it prints out a description of the character:
$ uni 🕐
🕐 - U+1f550 - CLOCK FACE ONE OCLOCK
If you give it a longer string instead, it searches in the list of Unicode character names and prints out the same information for each character whose description matches the search string:
$ uni third|head -n5
⅓ - U+02153 - VULGAR FRACTION ONE THIRD
⅔ - U+02154 - VULGAR FRACTION TWO THIRDS
↉ - U+02189 - VULGAR FRACTION ZERO THIRDS
㆛ - U+0319b - IDEOGRAPHIC ANNOTATION THIRD MARK
𐄺 - U+1013a - AEGEAN WEIGHT THIRD SUBUNIT
Each line corresponds to what Unicode calls a "code point", which is usually a character on its own, but occasionally also something like U+00300 - COMBINING GRAVE ACCENT, which, combined with the letter a (U+00061 - LATIN SMALL LETTER A), makes the character à.
Perl 6 offers a method uniname in both the classes Str and Int that produces the Unicode code point name for a given character, either in its direct character form, or in the form of the code point number. With that, we can implement the first part of uni's desired functionality:
#!/usr/bin/env perl6
use v6;

sub format-codepoint(Int $codepoint) {
    sprintf "%s - U+%05x - %s\n",
        $codepoint.chr,
        $codepoint,
        $codepoint.uniname;
}

multi sub MAIN(Str $x where .chars == 1) {
    print format-codepoint($x.ord);
}
Let's look at it in action:
$ uni ø
ø - U+000f8 - LATIN SMALL LETTER O WITH STROKE
The chr method turns a code point number into the character and ord is the reverse, in other words: from character to code point number.
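As a small sketch of that round trip (the variable name is chosen only for illustration), the following converts a character to its code point number and back, and formats the number the same way format-codepoint does:

```perl6
use v6;

# Round trip between a character and its code point number.
my $codepoint = 'ø'.ord;
say $codepoint;                        # 248
say $codepoint.chr;                    # ø
say $codepoint.fmt('U+%05x');          # U+000f8
say $codepoint.chr.ord == $codepoint;  # True
```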
The second part, searching in all Unicode character names, works by brute-force enumerating all possible characters and searching through their uniname:
multi sub MAIN($search is copy) {
    $search.=uc;
    for 1..0x10FFFF -> $codepoint {
        if $codepoint.uniname.contains($search) {
            print format-codepoint($codepoint);
        }
    }
}
Since all character names are in upper case, the search term is first converted to upper case with $search.=uc, which is short for $search = $search.uc. By default, parameters are read-only, which is why the declaration here uses is copy: it gives the routine its own modifiable copy of the argument.
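To see the effect of is copy in isolation, here is a minimal sketch. The sub shout is a hypothetical helper invented for this illustration; it is not part of the uni tool:

```perl6
use v6;

# Hypothetical helper, for illustration only.
sub shout($text is copy) {
    $text .= uc;    # allowed, because $text is a private copy
    return $text;
}

my $original = 'third';
say shout($original);   # THIRD
say $original;          # third
```

The caller's variable stays untouched, since only the copy inside the routine is modified.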
Instead of this rather imperative style, we can also formulate it in a more functional style. We could think of it as a list of all characters, which we whittle down to those characters that interest us, to finally format them the way we want:
multi sub MAIN($search is copy) {
    $search.=uc;
    print (1..0x10FFFF).grep(*.uniname.contains($search))
                       .map(&format-codepoint)
                       .join;
}
To make it easier to identify (rather than search for) a string of more than one character, an explicit option can help disambiguate:
multi sub MAIN($x, Bool :$identify!) {
    print $x.ords.map(&format-codepoint).join;
}
Str.ords returns the list of code points that make up the string. With this multi candidate of sub MAIN in place, we can do something like:
$ uni --identify øre
ø - U+000f8 - LATIN SMALL LETTER O WITH STROKE
r - U+00072 - LATIN SMALL LETTER R
e - U+00065 - LATIN SMALL LETTER E
Code Points, Grapheme Clusters and Bytes
As alluded to above, not all code points are fully-fledged characters on their own. Or put another way, some things that we visually identify as a single character are actually made up of several code points. Unicode calls such a sequence of one base character and potentially several combining characters a grapheme cluster.
Strings in Perl 6 are based on these grapheme clusters. If you get a list of characters in a string with $str.comb, or extract a substring with $str.substr(0, 4), match a regex against a string, determine the length, or do any other operation on a string, the unit is always the grapheme cluster. This best fits our intuitive understanding of what a character is and avoids accidentally tearing apart a logical character through a substr, comb or similar operation:
my $s = "ø\c[COMBINING TILDE]";
say $s; # ø̃
say $s.chars; # 1
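A slightly longer sketch along the same lines, using a made-up three-character test string, suggests that substr and flip keep the base character and its combining mark together. The code points are printed instead of the characters so that the result does not depend on how a terminal renders combining marks:

```perl6
use v6;

# x, then ø plus a combining tilde (one grapheme cluster), then y
my $s = "xø\c[COMBINING TILDE]y";
say $s.chars;               # 3
say $s.comb.elems;          # 3
say $s.substr(1, 1).ords;   # (248 771)
say $s.flip.ords;           # (121 248 771 120)
```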
The Uni type is akin to a string and represents a sequence of codepoints. It is useful in edge cases, but doesn't support the same wealth of operations as Str. The typical way to go from Str to a Uni value is to use one of the NFC, NFD, NFKC, or NFKD methods, which yield a Uni value in the normalization form of the same name.
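As a brief sketch of the difference between the composed and decomposed forms, the following takes the a plus COMBINING GRAVE ACCENT example from earlier and lists its code points in NFC and NFD:

```perl6
use v6;

my $s = "a\c[COMBINING GRAVE ACCENT]";
say $s.chars;       # 1
say $s.NFC.list;    # (224)
say $s.NFD.list;    # (97 768)
say $s.NFD.codes;   # 2
```

In NFC the pair is composed into the single code point U+000E0, while NFD keeps the base letter and the combining accent separate.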
Below the Uni level you can also represent strings as bytes by choosing an encoding. If you want to get from string to the byte level, call the encode method:
my $bytes = 'Perl 6'.encode('UTF-8');
UTF-8 is the default encoding and also the one Perl 6 assumes when reading source files. The result is something that does the Blob role; you can access individual bytes with positional indexing, such as $bytes[0]. The decode method helps you to convert a Blob to a Str.
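A minimal round-trip sketch, using a single non-ASCII character so that the difference between characters and bytes becomes visible:

```perl6
use v6;

my $bytes = 'ø'.encode('UTF-8');
say $bytes.elems;             # 2
say $bytes.list;              # (195 184)
say $bytes.decode('UTF-8');   # ø
```

The one-character string becomes two bytes in UTF-8, and decode restores the original string.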
Numbers
Number literals in Perl 6 aren't limited to the Arabic digits we are so used to in the English-speaking part of the world. All Unicode code points that have the Decimal_Number (short Nd) property are allowed, so you can for example use Bengali digits:
say ৪২; # 42
The same holds true for string to number conversions:
say "৪২".Int; # 42
For other numeric code points you can use the unival method to obtain their numeric value:
say "\c[TIBETAN DIGIT HALF ZERO]".unival;
which produces the output -0.5 and also illustrates how to use a codepoint by name inside a string literal.
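A few more characters with a numeric value, as a quick sketch (the characters are picked only for illustration):

```perl6
use v6;

say "½".unival;                         # 0.5
say "\c[ROMAN NUMERAL EIGHT]".unival;   # 8
say "৪".unival;                         # 4
```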
Other Unicode Properties
The uniprop method in type Str returns the general category by default:
say "ø".uniprop; # Ll
say "\c[TIBETAN DIGIT HALF ZERO]".uniprop; # No
The return value needs some Unicode knowledge in order to make sense of it, or one could read Unicode's Technical Report 44 for the gory details. Ll stands for Lowercase_Letter, No is Other_Number. This is what Unicode calls the General Category, but you can ask the uniprop method (or the uniprop-bool method, if you're only interested in a boolean result) for other properties as well:
say "a".uniprop-bool('ASCII_Hex_Digit'); # True
say "ü".uniprop-bool('Numeric_Type'); # False
say ".".uniprop("Word_Break"); # MidNumLet
Collation
Sorting strings starts to become complicated when you're not limited to ASCII characters. Perl 6's sort method uses the cmp infix operator, which does a pretty standard lexicographic comparison based on the codepoint number.
If you need to use a more sophisticated collation algorithm, Rakudo 2017.02 and newer offer the Unicode Collation Algorithm as an experimental feature:
my @list = <a ö ä Ä o ø>;
say @list.sort; # (a o Ä ä ö ø)
use experimental :collation;
say @list.collate; # (a ä Ä o ö ø)
$*COLLATION.set(:tertiary(False));
say @list.collate; # (a Ä ä o ö ø)
The default sort considers any character with diacritics to be larger than ASCII characters, because that's how they appear in the code point list. On the other hand, collate knows that characters with diacritics belong directly after their base character, which is not perfect in every language, but internationally a good compromise.
For Latin-based scripts, the primary sorting criterion is alphabetic, the secondary is diacritics, and the third is case. $*COLLATION.set(:tertiary(False)) thus makes .collate ignore case, so it doesn't force lower case characters to come before upper case characters anymore.
At the time of writing, language specification of collation is not yet implemented.
Summary
Perl 6 takes languages other than English very seriously, and goes to great lengths to facilitate working with them and the characters they use.
This includes basing strings on grapheme clusters rather than code points, support for non-Arabic digits in numbers, and access to large parts of the Unicode database through built-in methods.