Thu, 02 Jul 2009
Strings and Buffers
Subtitled "The Zen of not messing up your strings".
Handling non-ASCII strings in Perl 5 is a real pain, because there are no
separate types for binary data and text strings. Usually the operation
provides a context of either binary or string processing, but functions like
length don't, so the answer depends on internal
representations about which the programmer should never have to care.
In the Perl 6 language design we decided not to repeat that mistake. Since
strings are objects like everything else, it's easy to invent new types. So in
essence we have two types relevant for our discussion, Str and
Buf.
Str
A Str is notionally a sequence of characters, or a text
string. There's no character encoding attached to it, and while it is surely
stored in a specific encoding scheme internally, that's nothing the programmer
needs to care about.
A Str exists on at least two levels: the codepoint level and the
grapheme level. A codepoint is anything to which the Unicode Consortium has
assigned a number and a name, like U+0065 LATIN SMALL LETTER E
or U+0300 COMBINING GRAVE ACCENT. A grapheme is either a
codepoint or a sequence of codepoints that are visually represented together,
for example the two codepoints mentioned before would be printed as a single
grapheme è.
The default level is grapheme, because that's closest to how humans usually
think of characters and text. Specific operations can override the default
abstraction level, or it can be adjusted by pragmas like use
codes;.
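The difference between the two levels can be sketched in running code. Since no Perl 6 compiler implements all of this yet, the sketch below uses Python 3 purely as a stand-in: its str type counts codepoints. The helper names are hypothetical, and treating every non-combining codepoint as the start of a grapheme is a simplifying assumption, not the full Unicode segmentation rules.

```python
import unicodedata

def count_codepoints(text: str) -> int:
    """Length at codepoint level (what Python's len() reports for str)."""
    return len(text)

def count_graphemes_approx(text: str) -> int:
    """Rough grapheme count: skip codepoints that are combining marks.

    Assumption: this ignores the full Unicode segmentation rules
    (emoji sequences, Hangul jamo, and so on).
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# U+0065 LATIN SMALL LETTER E followed by U+0300 COMBINING GRAVE ACCENT
e_grave = "\u0065\u0300"
names = [unicodedata.name(ch) for ch in e_grave]

print(count_codepoints(e_grave), count_graphemes_approx(e_grave), names)
```

The same string has two different lengths depending on the level you ask about, which is why the default level matters.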
Buf
Of course you can also handle binary data in Perl 6. Such data is stored in
objects of type Buf. Notionally a Buf is a list of integers of a
fixed size. It has subtypes for common sizes: buf8 is a sequence
of unsigned bytes, while buf16 and buf32 store unsigned 16-bit
and 32-bit integers.
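To make "a list of integers of a fixed size" concrete, here is a rough analogy in Python 3 (not Perl 6) using the standard struct module. The function names are hypothetical, and the little-endian byte order is an assumption made only for this sketch; nothing above prescribes one.

```python
import struct

# struct format codes for unsigned integers of 8, 16 and 32 bits
_CODES = {8: "B", 16: "H", 32: "I"}

def pack_unsigned(values, bits):
    """Pack integers as little-endian unsigned values of the given width."""
    return struct.pack(f"<{len(values)}{_CODES[bits]}", *values)

def fits(value, bits):
    """True if value can be stored as an unsigned integer of that width."""
    try:
        pack_unsigned([value], bits)
        return True
    except struct.error:
        return False

as_buf8 = pack_unsigned([1, 2, 255], 8)            # 3 elements, 1 byte each
as_buf16 = pack_unsigned([1, 2, 65535], 16)        # 3 elements, 2 bytes each
as_buf32 = pack_unsigned([1, 2, 4294967295], 32)   # 3 elements, 4 bytes each

print(len(as_buf8), len(as_buf16), len(as_buf32))
print(fits(255, 8), fits(256, 8))
```

The element width is part of the type: 256 is a perfectly good 16-bit value but cannot be stored in a sequence of unsigned bytes.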
When you communicate with anything outside of Perl 6, you'll need
Buf objects for that, because files and terminals only understand
byte streams, not character streams.
There is also a different kind of Buf which enforces a
specific encoding; for example, utf8 can only hold byte sequences
which can be interpreted as UTF-8. These are not strictly necessary, but
provide a nice, convenient interface for some operations.
Conversion
Conversion from Str to Buf is called encoding; the other way round is
called decoding. For example "møøse" is a Str, and
"møøse".encode('Latin-1') returns a Buf, more specifically a
buf8.
On the other hand if you read some bytes from a socket and want to treat
the result as a text string, you decode it: my $str =
$buf.decode('UTF-16LE').
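The round trip looks much the same in any language that separates text from bytes. The following is a sketch in Python 3, not Perl 6; the variable names are made up for illustration, and the "bytes from a socket" are simulated by encoding a string first.

```python
text = "møøse"

# Encoding: text -> bytes. The result depends on the chosen encoding.
latin1_bytes = text.encode("latin-1")
utf8_bytes = text.encode("utf-8")

# Pretend these bytes arrived from a socket as UTF-16LE.
from_socket = text.encode("utf-16-le")

# Decoding: bytes -> text. You must name the encoding the bytes are in.
decoded = from_socket.decode("utf-16-le")

print(len(text), len(latin1_bytes), len(utf8_bytes), len(from_socket))
print(decoded == text)
```

One five-character string yields byte sequences of three different lengths, which is exactly why a length function that doesn't know whether it is looking at text or bytes can't give a meaningful answer.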
Mixing a Str and a Buf that doesn't know about its own encoding in an operation like concatenation or comparison throws an exception, because those are the situations in which most Perl 5 programs mess up strings beyond all repair.
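Python 3 follows a similar policy, so it can serve as a runnable illustration of the idea (this is Python behaviour, not a statement about any Perl 6 implementation). The function names are hypothetical. Note one difference: Python refuses concatenation and ordering comparisons between str and bytes, but an equality test simply returns false instead of raising.

```python
def concat_or_error(text, raw):
    """Try to concatenate text and bytes directly; report the refusal."""
    try:
        return text + raw
    except TypeError as exc:
        return "refused: " + type(exc).__name__

def less_than_or_error(text, raw):
    """Try an ordering comparison between text and bytes."""
    try:
        return text < raw
    except TypeError as exc:
        return "refused: " + type(exc).__name__

def concat_decoded(text, raw, encoding):
    """The explicit way: decode the bytes first, then concatenate text."""
    return text + raw.decode(encoding)

print(concat_or_error("Elk: ", b"m\xf8\xf8se"))
print(less_than_or_error("abc", b"abd"))
print(concat_decoded("Elk: ", b"m\xf8\xf8se", "latin-1"))
```

The error forces you to say which encoding the bytes are in, instead of silently guessing one.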
IO
If you read the above, you might think that printing a Str to standard output is an error because the string doesn't know its encoding, so it can't be represented as a byte stream. That's only half the truth: the output handle can also know its own encoding.
When you open a file, you can either specify that it's opened as a binary
file, or you specify an encoding. In the former case reading from the handle
returns Bufs, in the latter Strs.
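The two ways of opening a file can be sketched like this, again in Python 3 as a stand-in for the Perl 6 design. The helper name and the temporary file name are hypothetical.

```python
import os
import tempfile

def read_both_ways(payload, encoding="utf-8"):
    """Write payload to a temporary file, then read it back twice:
    once through a binary handle, once through a handle with an encoding."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        # newline="" switches off line-ending translation on this handle
        with open(path, "w", encoding=encoding, newline="") as out:
            out.write(payload)
        with open(path, "rb") as binary_handle:
            raw = binary_handle.read()    # bytes, no decoding applied
        with open(path, "r", encoding=encoding, newline="") as text_handle:
            text = text_handle.read()     # str, decoded by the handle
    return raw, text

raw, text = read_both_ways("møøse")
print(type(raw).__name__, type(text).__name__)
```

The file on disk is the same in both cases; only the handle decides whether you get bytes or text back.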
For the sake of convenience a pseudo-encoding called Unicode
exists. (Yes, we know that Unicode defines a character repertoire, not a
character encoding.) If there's a byte order mark (BOM for short) at the start of
the stream, it is used to determine the encoding. If not, a very simple
autodetection is used: if the file is obviously UTF-16LE, UTF-16BE or UTF-32
(detectable by the position of the zero bytes when encoding ASCII characters),
then the detected encoding is used; otherwise UTF-8 is used as a fallback.
This autodetection scheme is the default.
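A detection scheme of this kind might look roughly like the following Python 3 sketch. It is not the specified Perl 6 algorithm: the function names are hypothetical, and looking only at the first four bytes for the zero-byte pattern is my own simplifying assumption.

```python
import codecs

# BOMs, longest first: the UTF-32LE BOM begins with the UTF-16LE BOM.
# The codec names on the right consume the BOM while decoding.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_encoding(data):
    """Guess a codec name from a BOM or from zero bytes around ASCII."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # Assumption: the stream starts with an ASCII character.
    if len(data) >= 4:
        if data[0] != 0 and data[1:4] == b"\x00\x00\x00":
            return "utf-32-le"
        if data[0:3] == b"\x00\x00\x00" and data[3] != 0:
            return "utf-32-be"
    if len(data) >= 2:
        if data[0] != 0 and data[1] == 0:
            return "utf-16-le"
        if data[0] == 0 and data[1] != 0:
            return "utf-16-be"
    return "utf-8"  # fallback

def decode_auto(data):
    """Decode bytes using the guessed encoding."""
    return data.decode(sniff_encoding(data))

print(sniff_encoding("hi".encode("utf-16-le")))
print(decode_auto(codecs.BOM_UTF8 + "møøse".encode("utf-8")))
```

The heuristic only works when the text begins with ASCII, which is why it is a convenience default and not a replacement for stating the encoding explicitly.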
Conclusions
We learned from experience that cramming too many different semantics into a single data type is harmful. So now byte streams and text streams have different data types, with a clean interface for converting back and forth.
The specification is not yet set in stone, and no compiler implements the Buf type yet, but it is already planned for Rakudo.
Since this is an important topic to me I will continue to nag the implementors and language designers about it, and write tests to ensure a solid implementation.