Character Encodings in Perl

This article describes the different character encodings, how they may lead to problems, and how they can be handled in Perl programs.

German and French versions exist too.

Introduction

It happens far too often: a program works fine with Latin characters, but it produces weird, unreadable characters as soon as it has to process other characters like Chinese or Japanese characters or modified Latin characters like the German umlauts Ä, Ö etc. or the Scandinavian characters å and Ø.

ASCII

To understand the root of the problem you have to understand how "normal" Latin characters and other characters (the ones that cause problems) are stored.

It all began in the year 1963 with ASCII, the "American Standard Code for Information Interchange". It maps 128 characters to the numbers from 0 to 127, which can be encoded with 7 bits.

Since a byte contains 8 bits, the first, "most significant" bit in ASCII characters is always zero.

The standard defines the Latin letters a to z in both upper and lower case, the Arabic digits 0 to 9, whitespace like "blank" and "carriage return", a few control characters and a few special signs like %, $ and so on.

Characters that aren't essential in the day to day life of an American citizen are not defined in ASCII, like Cyrillic letters, "decorated" Latin characters, Greek characters and so on.

Other Character Encodings

When people started to use computers in other countries, other characters needed to be encoded. In the European countries ASCII was reused, and the 128 unused numbers per byte were used for the locally needed characters.

In Western Europe the character encoding was called "Latin 1", and later standardized as ISO-8859-1. Latin 2 was used in central Europe and so on.

In each of the Latin-* charsets the first 128 characters are identical to ASCII, so they can be viewed as ASCII extensions. The second 128 byte values are each mapped to characters needed in the regions where these character sets were used.
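
As a minimal sketch of what this means in practice, the following snippet uses the core module Encode to decode one and the same byte under two different Latin-* charsets. The byte value 0xA3 is chosen purely for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One and the same byte from the upper half (values 128 to 255)
my $byte = "\xA3";

# Interpreted as Latin-1 it is the pound sign, U+00A3
my $latin1_cp = ord(decode('iso-8859-1', $byte));

# Interpreted as Latin-2 it is LATIN CAPITAL LETTER L WITH STROKE, U+0141
my $latin2_cp = ord(decode('iso-8859-2', $byte));

# ASCII bytes mean the same thing in both charsets
my $ascii_same = decode('iso-8859-1', 'A') eq decode('iso-8859-2', 'A');

printf "Latin-1: U+%04X, Latin-2: U+%04X\n", $latin1_cp, $latin2_cp;
```

Without knowing which charset was used to write a byte, there is no way to tell which of the two characters was meant.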

In other parts of the world other character encodings were developed, like EUC-CN in China and Shift-JIS in Japan.

These local charsets are very limited. When the Euro was introduced (as an accounting currency in 1999, with coins and notes following in 2002), many European countries had a currency whose symbol couldn't be expressed in the traditional character encodings.

Unicode

The charsets mentioned so far can encode only a small part of all possible characters, which makes it nearly impossible to create documents that contain letters from different scripts.

In an attempt to unify all scripts into a single writing system, the Unicode consortium was created, and it started to collect all known characters, and assign a unique number to each, called a "codepoint".

The codepoint is usually written as a four to six digit hex number, like U+0041. The corresponding name is LATIN CAPITAL LETTER A.

Apart from letters and other "base characters", there are also accents and decorations like COMBINING ACUTE ACCENT, which can be added to a base character.

If a base char is followed by one or more of these marking characters, this compound forms a logical character called "grapheme".

Note that many pre-composed graphemes exist for characters that are defined in other character sets, and these pre-composed forms are typically better supported by current software than the equivalent written as base character and combining mark.
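
The difference between codepoints and graphemes can be made visible with a short sketch. It writes "ä" once as a pre-composed codepoint and once as a base character plus COMBINING DIAERESIS; the variable names are only illustrative.

```perl
use strict;
use warnings;

# "ä" written in two ways, using escapes so the source file stays ASCII
my $precomposed = "\x{E4}";      # LATIN SMALL LETTER A WITH DIAERESIS
my $combined    = "a\x{308}";    # LATIN SMALL LETTER A + COMBINING DIAERESIS

# length() counts codepoints
my $cp_pre  = length $precomposed;
my $cp_comb = length $combined;

# \X in a regex matches one grapheme, so counting matches counts graphemes
my $g_pre  = () = $precomposed =~ /\X/g;
my $g_comb = () = $combined    =~ /\X/g;

print "precomposed: $cp_pre codepoint(s), $g_pre grapheme(s)\n";
print "combined:    $cp_comb codepoint(s), $g_comb grapheme(s)\n";
```

Both strings form a single grapheme, but only the pre-composed one is a single codepoint.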

Unicode Transformation Formats

The concept of Unicode codepoints and graphemes is completely independent of the encoding.

There are different ways to encode these codepoints, and these mappings from codepoints to bytes are called "Unicode Transformation Formats". The best known is UTF-8, a byte-based format that is identical to ASCII for the first 128 codepoints and uses sequences of two to four bytes for all others. In Perl land there is also a lax version called UTF8 (without the hyphen). The Perl module Encode distinguishes these versions.

Windows mostly uses UTF-16, which uses at least two bytes per codepoint; for very high codepoints it uses four bytes. There are two variants of UTF-16, which are marked with the suffix -LE for "little endian" and -BE for "big endian" (see Endianness).

UTF-32 encodes every codepoint in 4 bytes. It is the only fixed width encoding that can implement the whole Unicode range.

Codepoint	Char	ASCII	UTF-8	Latin-1	ISO-8859-15	UTF-16
U+0041	A	0x41	0x41	0x41	0x41	0x00 0x41
U+00c4	Ä	-	0xc3 0x84	0xc4	0xc4	0x00 0xc4
U+20AC	€	-	0xe2 0x82 0xac	-	0xa4	0x20 0xac
U+c218	수	-	0xec 0x88 0x98	-	-	0xc2 0x18

(The letter in the last line is the Hangul syllable SU, and your browser will only display it correctly if you have the appropriate Asian fonts installed.)

Unicode defines a character repertoire of codepoints and their properties. Character encodings like UTF-8 and UTF-16 define a way to write them as a short sequence of bytes.
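
The byte sequences in the table above can be reproduced with the core module Encode. The following is a small sketch; the helper name hex_bytes is made up for this example, and UTF-16BE is used because the table shows big-endian byte order without a byte order mark.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the encoded bytes of a single codepoint as a hex string
sub hex_bytes {
    my ($encoding, $codepoint) = @_;
    my $bytes = encode($encoding, chr($codepoint));
    return join ' ', map { sprintf('0x%02x', ord($_)) } split //, $bytes;
}

# The four codepoints from the table: A, Ä, Euro sign, Hangul syllable SU
for my $cp (0x41, 0xC4, 0x20AC, 0xC218) {
    printf "U+%04X  UTF-8: %-14s  UTF-16BE: %s\n",
        $cp, hex_bytes('UTF-8', $cp), hex_bytes('UTF-16BE', $cp);
}
```

The same codepoint yields different byte sequences of different lengths, depending on the transformation format.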

Perl 5 and Character Encodings

Perl strings can either be used to hold text strings or binary data. Given a string, you generally have no way of finding out whether it holds text or binary data - you have to keep track of it yourself.

Interaction with the environment (like reading data from STDIN or a file, or printing it) treats strings as binary data. The same holds true for the return value of many built-in functions (like gethostbyname) and special variables that carry information to your program (%ENV and @ARGV).

Other builtin functions that deal with text (like uc and lc and regular expressions) treat strings as text, or more accurately as a list of Codepoints.

With the function decode in the module Encode you decode binary strings to make sure that the text handling functions work correctly.

All text operations should work on strings that have been decoded by Encode::decode (or in other ways described below). Otherwise the text processing functions assume that the string is stored as Latin-1, which will yield incorrect results for any other encoding.

Note that cmp only compares non-ASCII chars by codepoint number, which might give unexpected results. In general the ordering is language dependent, so you need use locale in effect to sort strings according to the rules of a natural language. For example, in German the desired ordering is 'a' lt 'ä' and 'ä' lt 'b', whereas comparison by codepoint number gives 'ä' gt 'b'.
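
The codepoint ordering of cmp can be seen in a short sketch. Because the effect of use locale depends on which locales are installed on a machine, this example swaps in a different technique: the core module Unicode::Collate, which applies the default Unicode collation rules instead of the rules of one particular locale.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Decoded text strings; \x{E4} is "ä"
my @words = ("b", "\x{E4}", "a");

# Plain sort uses cmp, i.e. codepoint order: a, b, ä
my @by_codepoint = sort @words;

# Unicode::Collate sorts by the default Unicode collation rules: a, ä, b
my @collated = Unicode::Collate->new->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "codepoint order: @by_codepoint\n";
print "collated order:  @collated\n";
```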

#!/usr/bin/perl
use warnings;
use strict;
use Encode qw(encode decode);

my $enc = 'utf-8'; # This script is stored as UTF-8
my $byte_str = "Ä\n";

# Byte strings:
print lc $byte_str; # prints 'Ä', lc didn't have any effect

# Text strings:
my $text_str = decode($enc, $byte_str);
$text_str = lc $text_str;
print encode($enc, $text_str); # prints 'ä', lc worked as expected

It is highly recommended to convert all input to text strings, then work with the text strings, and only convert them back to byte strings on output or storing.

Otherwise, you can get confused very fast, and lose track of which strings are byte strings, and which ones are text strings.

Perl offers IO layers, which are easy mechanisms to make these conversions automatically, either globally or per file handle.

# IO layer: $handle now decodes all strings upon reading
open my $handle, '<:encoding(UTF-8)', $file
    or die "Cannot open $file: $!";

# same, in two steps
open my $handle, '<', $file
    or die "Cannot open $file: $!";
binmode $handle, ':encoding(UTF-8)';

# each open() automatically uses :encoding(iso-8859-1)
use open ':encoding(iso-8859-1)';

# All string literals in the script are interpreted as text strings:
use utf8;
# (assumes the script to be stored in UTF-8)

# Get the current locale from the environment, and let STDOUT
# convert to that encoding:
use PerlIO::locale;
binmode STDOUT, ':locale';

# all I/O with current locale:
use open ':locale';
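
To see what an IO layer actually does, here is a self-contained sketch that writes a text string through an encoding layer and reads the file back twice, once raw and once decoded. It uses a temporary file from the core module File::Temp so that it does not touch any real data.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $text = "\x{C4}\x{20AC}\n";   # "Ä", the Euro sign and a newline, as a text string

# Write through an encoding layer: text string in, UTF-8 bytes on disk
my ($out, $filename) = tempfile(UNLINK => 1);
binmode $out, ':encoding(UTF-8)';
print {$out} $text;
close $out or die "Cannot close $filename: $!";

# Read back without a decoding layer: we get the raw bytes
open my $raw, '<:raw', $filename or die "Cannot open $filename: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Read back with a decoding layer: we get the text string again
open my $in, '<:encoding(UTF-8)', $filename or die "Cannot open $filename: $!";
my $decoded = do { local $/; <$in> };
close $in;

printf "bytes on disk: %d, characters after decoding: %d\n",
    length($bytes), length($decoded);
```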

Care should be taken with the input layer :utf8, which often pops up in example code and old documentation: it assumes the input to be in valid UTF-8, and you have no way of knowing in your program if that was actually the case. If not, it's a source of subtle security holes, see this article on perlmonks.org for details. Don't ever use it as an input layer, use :encoding(UTF-8) instead.
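
When the input is already in a byte string rather than coming from a file handle, the same strictness can be had from Encode::decode with a CHECK argument. The following is a sketch; the helper name decode_utf8_strict is made up for this example.

```perl
use strict;
use warnings;
use Encode ();

# Decode UTF-8 strictly; returns undef if the input is malformed
sub decode_utf8_strict {
    my ($bytes) = @_;
    my $text = eval {
        # FB_CROAK makes decode() die on invalid input,
        # LEAVE_SRC keeps decode() from modifying $bytes
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;
}

my $good = decode_utf8_strict("\xC3\xA4");   # valid encoding of "ä"
my $bad  = decode_utf8_strict("\xC3\x28");   # 0xC3 needs a continuation byte

print defined $good ? "valid input decoded\n"    : "valid input rejected\n";
print defined $bad  ? "invalid input accepted\n" : "invalid input rejected\n";
```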

The pragma utf8 also allows you to use non-ASCII chars in identifiers such as variable names. But beware, don't do this for package and module names; it might not work well. Also, consider that not everybody has a keyboard that allows easy typing of non-ASCII characters, so you make maintenance of your code much harder if you use them in your code.

Testing your Environment

You can use the following short script to test your terminal, locales and fonts. It is very European centric, but you should be able to modify it to use the character encodings that are normally used where you live.

#!/usr/bin/perl
use warnings;
use strict;
use Encode;

my @charsets = qw(utf-8 latin1 iso-8859-15 utf-16);

# some non-ASCII codepoints:
my $test = 'Ue: ' . chr(220) .'; Euro: '. chr(8364) . "\n";

for (@charsets){
    print "$_: " . encode($_, $test);
}

If you run this program in a terminal, only one line will be displayed correctly, and its first column is the character encoding of your terminal.

The Euro sign € isn't in Latin-1, so if your terminal has that encoding, the Euro sign won't be displayed correctly.

Windows terminals mostly use cp* encodings, for example cp850 or cp858 (only available in new versions of Encode) for German Windows installations. The rest of the operating environment uses Windows-* encodings, for example Windows-1252 for a number of Western European localizations. Encode->encodings(":all"); returns a list of all available encodings.

Troubleshooting

"Wide Character in print"

Sometimes you might see the Wide character in print warning.

This means that you tried to use decoded string data in a context where it only makes sense to have binary data, in this case printing it. You can make the warning go away by using an appropriate output layer, or by piping the offending string through Encode::encode first.
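
Both the warning and its cure can be reproduced without touching the terminal. The sketch below prints to in-memory file handles and collects warnings with a local __WARN__ handler; the helper name print_and_capture is made up for this example.

```perl
use strict;
use warnings;

# Print $text to an in-memory file handle opened with $mode,
# and return the bytes written plus any warnings that were raised
sub print_and_capture {
    my ($mode, $text) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $buffer = '';
    open my $fh, $mode, \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return ($buffer, @warnings);
}

my $euro = "\x{20AC}";   # a decoded text string holding the Euro sign

# Without an output layer Perl warns about the wide character
my ($raw_bytes, @raw_warnings) = print_and_capture('>', $euro);

# With an encoding layer the string is encoded on output, without a warning
my ($enc_bytes, @enc_warnings) = print_and_capture('>:encoding(UTF-8)', $euro);

print "without layer: ", scalar(@raw_warnings), " warning(s)\n";
print "with layer:    ", scalar(@enc_warnings), " warning(s)\n";
```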

Inspecting Strings

Sometimes you want to inspect if a string from an unknown source has already been decoded. Since Perl has no separate data types for binary strings and decoded strings, you can't do that reliably.

But there is a way to guess the answer by using the module Devel::Peek:

use Devel::Peek;
use Encode;
my $str = "ä";
Dump $str;
$str = decode("utf-8", $str);
Dump $str;
Dump encode('latin1', $str);

__END__
SV = PV(0x814fb00) at 0x814f678
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK)
PV = 0x81654f8 "\303\244"\0
CUR = 2
LEN = 4

SV = PV(0x814fb00) at 0x814f678
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
PV = 0x817fcf8 "\303\244"\0 [UTF8 "\x{e4}"]
CUR = 2
LEN = 4

SV = PV(0x814fb00) at 0x81b7f94
REFCNT = 1
FLAGS = (TEMP,POK,pPOK)
PV = 0x8203868 "\344"\0
CUR = 1
LEN = 4

The string UTF8 in the line starting with FLAGS = shows that the string has been decoded already. The line starting with PV = holds the bytes, and in brackets the codepoints.

But there is a big caveat: Just because the UTF8 flag isn't present doesn't mean that the text string hasn't been decoded. Perl uses either Latin-1 or UTF-8 internally to store strings, and the presence of this flag indicates which one is used.

That also implies that if your program is written in Perl only (and has no XS components) it is almost certainly an error to rely on the presence or absence of that flag. You shouldn't care how perl stores its strings anyway.
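
A short sketch shows why the flag says nothing about the meaning of a string. It takes one string, changes only its internal representation with utf8::upgrade, and compares the two copies.

```perl
use strict;
use warnings;

my $downgraded = "\xE4";     # "ä", a single character with codepoint 0xE4
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);    # same string, now stored internally as UTF-8

# The internal flag differs ...
my $flag_down = utf8::is_utf8($downgraded) ? 1 : 0;
my $flag_up   = utf8::is_utf8($upgraded)   ? 1 : 0;

# ... but as far as Perl code is concerned the strings are identical
my $equal       = $downgraded eq $upgraded;
my $same_length = length($downgraded) == length($upgraded);

print "flags: $flag_down / $flag_up, equal: ", ($equal ? "yes" : "no"), "\n";
```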

Buggy Modules

Buggy modules are a common source of errors. The pragma encoding looks very tempting:

# automatic conversion to and from the current locale
use encoding ':locale';

But under the effect of use encoding, some AUTOLOAD functions stop working, and the module isn't thread safe.

Character Encodings in the WWW

When you write a CGI script you have to choose a character encoding, print all your data in that encoding, and declare it in the HTTP headers.

For most applications, UTF-8 is a good choice, since you can encode arbitrary Unicode codepoints with it. At the same time, English text (and text in most other European languages) is encoded very efficiently.

HTTP offers the Accept-Charset header, in which the client can tell the server which character encodings it can handle. But if you stick to the common encodings like UTF-8 or Latin-1, nearly all user agents will understand it, so it isn't really necessary to check that header.

HTTP headers themselves are strictly ASCII only, so all information that is sent in the HTTP header (including cookies and URLs) needs to be encoded to ASCII if non-ASCII characters are used.
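
For URLs, the usual way to get from text to ASCII is percent-encoding of the UTF-8 bytes. The following is a minimal sketch of that technique; the function name uri_escape_text is made up for this example, and the choice of UTF-8 as the underlying encoding is an assumption that has to match what the receiving side expects.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Percent-encode a text string for use in a URL:
# encode to UTF-8 bytes first, then escape every byte that is
# not an unreserved ASCII character (letters, digits, - . _ ~)
sub uri_escape_text {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

my $escaped = uri_escape_text("K\x{F6}ln 5\x{20AC}");   # "Köln 5€"
print "$escaped\n";
```

The result contains only ASCII characters, so it can safely be put into a header.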

For HTML files the header typically looks like this: Content-Type: text/html; charset=UTF-8. If you send such a header, you only have to escape those characters that have a special meaning in HTML: <, >, & and, in attributes, ".
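
A minimal sketch of such an escaping routine could look like this. The names escape_html and %html_entity are made up for this example; non-ASCII characters pass through unchanged, because the charset declared in the header takes care of them.

```perl
use strict;
use warnings;

# The four characters with a special meaning in HTML
my %html_entity = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
);

sub escape_html {
    my ($text) = @_;
    $text =~ s/([&<>"])/$html_entity{$1}/g;
    return $text;
}

my $escaped = escape_html('a < b & "c"');
print "$escaped\n";
```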

Special care must be taken when reading POST or GET parameters with the function param in the module CGI. Older versions (prior to 3.29) always returned byte strings, newer versions return text strings if charset("UTF-8") has been called before, and byte strings otherwise.

CGI.pm also doesn't support character encodings other than UTF-8. Therefore you should not use the charset routine, and instead explicitly decode the parameter strings yourself.
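
Decoding the parameters yourself can be as simple as the following sketch. To keep it self-contained it does not load CGI.pm; the hash passed to the made-up helper decode_params stands in for the byte strings that param would return, and UTF-8 is assumed as the encoding the browser used.

```perl
use strict;
use warnings;
use Encode ();

# Decode every value of a parameter hash from UTF-8 bytes to text strings.
# The input hash is a stand-in for byte strings obtained from the CGI layer.
sub decode_params {
    my (%raw) = @_;
    my %text;
    for my $name (keys %raw) {
        $text{$name} = Encode::decode('UTF-8', $raw{$name});
    }
    return %text;
}

my %params = decode_params(city => "K\xC3\xB6ln");   # UTF-8 bytes of "Köln"
printf "city has %d characters\n", length($params{city});
```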

To ensure that form contents in the browser are sent with a known charset, you can add the accept-charset attribute to the <form> tag.

<form method="post" accept-charset="utf-8" action="/script.pl">

If you use a template system, you should take care to choose one that knows how to handle character encodings. Good examples are Template::Alloy, HTML::Template::Compiled (since version 0.90 with the open_mode option), or Template Toolkit (with the ENCODING option in the constructor and an IO layer in the process method).

Modules

There is a plethora of Perl modules out there that handle text, so here are only a few notable ones, and what you have to do to make them Unicode-aware:

LWP::UserAgent and WWW::Mechanize

Use $response->decoded_content instead of just $response->content. That way the character encoding information sent in the HTTP response header is used to decode the body of the response.

DBI

DBI leaves handling of character encodings to the DBD:: (driver) modules, so what you have to do depends on which database backend you are using. What most of them have in common is that UTF-8 is better supported than other encodings.

For MySQL and DBD::mysql, pass the mysql_enable_utf8 => 1 option to the DBI->connect call.

For PostgreSQL and DBD::Pg, set the pg_enable_utf8 attribute to 1.

For SQLite and DBD::SQLite, set the sqlite_unicode attribute to 1.

Advanced Topics

With the basic charset and Perl knowledge you can get quite far. For example, you can make a web application "Unicode safe", i.e. you can take care that all possible user inputs are displayed correctly, in any script the user happens to use.

But that's not all there is to know on the topic. For example, the Unicode standard allows different ways to compose some characters, so you need to "normalize" them before you can compare two strings. You can read more about that in the Unicode normalization FAQ.
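
The effect of normalization on string comparison can be sketched with the core module Unicode::Normalize. The example compares a pre-composed "ä" with the equivalent written as base character plus combining mark.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $precomposed = "\x{E4}";     # ä as a single codepoint
my $combined    = "a\x{308}";   # a followed by COMBINING DIAERESIS

# Without normalization the two strings are different
my $equal_raw = ($precomposed eq $combined) ? 1 : 0;

# After normalizing both to the same form they compare equal
my $equal_nfc = (NFC($precomposed) eq NFC($combined)) ? 1 : 0;
my $equal_nfd = (NFD($precomposed) eq NFD($combined)) ? 1 : 0;

print "raw: $equal_raw, NFC: $equal_nfc, NFD: $equal_nfd\n";
```

Which normalization form to use matters less than applying the same one to both sides of a comparison.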

To implement country-specific behaviour in programs, you should take a look at the locales system. For example, in Turkish the lower case of the capital letter I (lc 'I') is ı, U+0131 LATIN SMALL LETTER DOTLESS I, while the upper case of i is İ, U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE.

A good place to start reading about locales is perldoc perllocale.

Philosophy

Many programmers who are confronted with encoding issues first react with "But shouldn't it just work?". Yes, it should just work. But too many systems are broken by design regarding character sets and encodings.

Broken by Design

"Broken by Design" most of the time means that a document format, an API or a protocol allows multiple encodings, without a normative way of transporting and storing that encoding information out of band.

A classical example is the Internet Relay Chat (IRC), which specifies that a character is one byte, but not which character encoding is used. This worked well in the Latin-1 days, but was bound to fail as soon as people from different continents started to use it.

Currently, many IRC clients try to autodetect character encodings, and recode the text to what the user configured. This works quite well in some cases, but produces really ugly results where it doesn't work.

Another Example: XML

The Extensible Markup Language, commonly known by its abbreviation XML, lets you specify the character encoding inside the file:

<?xml version="1.0" encoding="UTF-8" ?>

There are two reasons why this is insufficient:

The encoding information is optional. The specification clearly states that the encoding must be UTF-8 if the encoding information is absent, but sadly many tool authors don't seem to know that, and emit Latin-1. (This is of course only partly the fault of the specification.)
Any XML parser first has to autodetect the encoding to be able to parse the encoding information.

The second point is really important. You'd guess "Ah, that's no problem, the preamble is just ASCII" - but many encodings are ASCII-incompatible even for the first 128 codepoints (for example UTF-7, UCS-2 and UTF-16).

So although the encoding information is available, the parser first has to guess nearly correctly to extract it.

The appendix to the XML specification contains a detection algorithm that can handle all common cases, but for example lacks UTF-7 support.
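
To give an idea of how such a detection works, here is a simplified sketch that looks only at the first four bytes of a document. It covers just a subset of the cases (byte order marks, and the layout of the leading "<?" when no byte order mark is present); the function name guess_xml_encoding and the returned labels are made up for this example.

```perl
use strict;
use warnings;

# Guess the encoding family of an XML document from its first four bytes.
# Returns a label, or undef if none of the known patterns matches.
sub guess_xml_encoding {
    my ($bytes) = @_;
    my $head = substr $bytes, 0, 4;

    # Byte order marks; the 4-byte ones must be tested before the 2-byte ones
    return 'UTF-32BE' if $head =~ /^\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $head =~ /^\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $head =~ /^\xEF\xBB\xBF/;
    return 'UTF-16BE' if $head =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $head =~ /^\xFF\xFE/;

    # No byte order mark: look at how the leading "<?" is laid out
    return 'UTF-16BE'         if $head eq "\x00<\x00?";
    return 'UTF-16LE'         if $head eq "<\x00?\x00";
    return 'ASCII-compatible' if $head eq '<?xm';

    return;
}

my $guess = guess_xml_encoding('<?xml version="1.0" encoding="UTF-8" ?>');
print "guess: $guess\n";
```

In the ASCII-compatible case the parser still has to read the encoding declaration to find out which encoding is really in use.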

How to Do it Right: Out-of-band Signaling

The XML example above demonstrates that a file format can't carry encoding information in the file itself, unless you specify a way to carry that encoding information on the byte level, independently of the encoding of the rest of the file.

A possible workaround could have been to specify that the first line of any XML file has to be ASCII encoded, and the rest of the file is in the encoding that is specified in that first line. But it's an ugly workaround: a normal text editor would display the first line completely wrong if the file is in an ASCII-incompatible encoding. Of course it's also incompatible with the current XML specification, and would require a new, incompatible specification, which would in turn break all existing applications.

So how to do it right, then?

The answer is quite simple: Every system that works with text data has to either store meta data separately, or store everything in a uniform encoding.

It is tempting to store everything in the same encoding, and it works quite well on a local machine, but you can't expect everyone to agree on one single encoding, so all data exchange still has to carry encoding information. And usually you want to store original files (for fear of data loss), so you have to keep that encoding information somewhere.

This observation should have a huge impact on the computing world: all file systems should allow you to store encoding information as meta data, and easily retrieve that meta data. The same should hold true for file names, and programming languages (at least those that want to take the pain away from their users) should transparently transport that meta information, and take care of all encoding issues.

Then it could just work.

Useful Tools

gucharmap, the Gnome Unicode character map.
A UTF-8 dumper that shows you the name of non-ASCII characters.
hexdump never lies (on Debian it's in the bsdmainutils package).
iconv converts text files from one character encoding to another.

Acknowledgments

This article is a translation of a German article of mine written for $foo-Magazin 01/2008, a German Perl magazine. It has been enhanced and corrected since then.

Many thanks go to Juerd Waalboer, who pointed out many smaller and a few not-so-small errors in earlier versions of this article, and contributed greatly to my understanding of Perl's string handling.

I'd also like to thank ELISHEVA for suggesting many improvements to both grammar and spelling.

I'd like to acknowledge insightful discussions with: