"Kanji" is main characters for Japanese which typically have meanings and two sounds. The total number of Kanji usually used is over 3,000. Kanji was originated from Chinese characters and had been modified and simplified in Japan for a long time.
There are also about 80 characters, so called "Hiragana", each expresses just one sound and has a soft shape. Kanji is mainly used for nouns and beginning portion of verbs while Hiragana is used for other parts including last portion of verbs. Japanese sentences typically consist of Kanji in 30% and Hiragana in 70%. There is one more character set, called "Katakana", which is another notation of Hiragana. Katakana has exactly same sound of Hiragana and a little hard shape and is used to express exported words from other countries based on their sounds.
I describe one example of the history of struggle over non-alphabetical character sets in messaging systems.
13.1 Email and localization
13.2 The appearance of MIME
13.3 The concept of canonicalization
The Email specification, RFC 822, was defined in 1982 with the hope of ensuring interoperability. Since Email grew up in America, its header and body could not contain character sets other than US-ASCII.

This is, however, very inconvenient for people whose language is not English. So, without waiting for an official extension, people in various countries extended RFC 822 messages to contain non-English characters from their native languages.
In Europe, Latin 1, which represents umlaut (accented) characters with 8-bit words, came into use. Latin 1 is also known as ISO-8859-1.
In Japan, there are three major encodings: (1) JIS code, which represents each character with two 7-bit bytes; (2) EUC code, which uses two 8-bit bytes and is common on UNIX; and (3) SJIS, which uses two 8-bit bytes and is common on PCs. The pioneers of JUNET, the antecedent of the Japanese Internet, chose a mechanism that switches between ASCII and JIS with ESC sequences, the so-called JUNET code, for transport.

JUNET code is also known as ISO-2022-JP. With JUNET code, we can tell which character sets are in use, in addition to switching between them.
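As a hedged illustration (not part of the original text), Python's built-in iso-2022-jp codec shows the ESC-sequence switching described above:

```python
# A minimal sketch of the ASCII/JIS switching mechanism in JUNET code
# (ISO-2022-JP), using Python's built-in codec.
text = "ASCII and 日本語 mixed"
encoded = text.encode("iso-2022-jp")

# ESC $ B switches to the JIS X 0208 character set,
# ESC ( B switches back to ASCII.
assert b"\x1b$B" in encoded
assert b"\x1b(B" in encoded

# The whole stream stays 7-bit, which is why it fits RFC 822 syntax.
assert max(encoded) < 0x80
```

The escape sequences themselves announce which character set follows, so a reader of the byte stream always knows what it is looking at.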
Extensions such as Latin 1 and JUNET code are agreements within a region. In the context of RFC 822, you are compelled to use English to send a message across regions.

RFC 822 is ambiguous enough that people misunderstand it to allow JUNET code in the header and body, since JUNET code is 7-bit. The following is probably a good way to blow away that misunderstanding: "RFC 822 defines the syntax of the header and body to be 7-bit, and the semantics of the header and body to be US-ASCII." JUNET code is syntactically legal, but its semantics are illegal.
To satisfy users' desires, such as the transport of pictures and audio, and to bridge the localized variants of RFC 822, MIME was defined in 1992. With MIME, a character-set parameter can be specified. Since JUNET code is called ISO-2022-JP, a Japanese message looks as follows:
Content-Type: Text/Plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit

Japanese text.
Is this charset parameter useful? Absolutely! The charset parameter tells the user interface exactly which character set is in use. Suppose a Norwegian user sends a message in ISO-8859-1 to a Japanese user. If the receiving interface supports ISO-8859-1, there is no problem displaying the body. Otherwise, the interface can safely ignore the body. Mew makes use of the charset parameter to convert messages to Mule's internal representation.
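As a sketch of how a receiving interface can use the charset parameter (assuming Python's standard email module, not how Mew itself is implemented):

```python
from email import message_from_bytes

# A hypothetical ISO-8859-1 message from our Norwegian sender.
raw = (b"Content-Type: text/plain; charset=iso-8859-1\r\n"
       b"Content-Transfer-Encoding: 8bit\r\n"
       b"\r\n"
       b"gr\xfcn\r\n")  # "gruen" with an umlaut, in Latin 1

msg = message_from_bytes(raw)
charset = msg.get_content_charset()  # "iso-8859-1"
if charset is not None:
    body = msg.get_payload(decode=True).decode(charset)
else:
    body = None  # unknown charset: safely ignore the body
```

Without the charset label, the receiver would have to guess whether those 8-bit bytes are Latin 1, EUC, SJIS, or something else entirely.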
Some people say: "If we use ISO-2022-JP-2, which is upward compatible with ISO-2022-JP and can handle numerous character sets, the charset parameter is unnecessary, since ISO-2022-JP-2 itself contains information about the character set." Maybe, just maybe, such people don't understand MIME.

Ideally speaking, this assessment is correct. But MIME takes a practical stance. MIME does not assume that everyone in the world will start using ISO-2022-JP-2 tomorrow. Moreover, MIME is designed to be robust against unstable transfer programs: not every transfer program in the world is well implemented, and not every site can afford rich resources. If you were asked to use UNICODE as of today, how would you feel?

MIME provides the charset parameter to bridge the numerous localized regions. The only additional procedure under MIME is to label the charset parameter, and we can keep using ISO-2022-JP as we used to. If you wish to make ISO-2022-JP-2 an Internet standard, you should make an effort to spread the region where ISO-2022-JP-2 is used by default. So ISO-2022-JP-2 and MIME are not inconsistent; rather, ISO-2022-JP-2 can make the most of MIME to spread itself widely. Of course, the name "charset" is not really proper for character-switching mechanisms such as ISO-2022-xx.
With MIME, you can encode a non-ASCII character set and insert it into the header. This scheme prevents errors in Email transfer programs and makes it possible to convey non-ASCII strings in the header. We don't have to say "Do not use Japanese in Subject:" anymore!
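For illustration, here is what that header-encoding scheme (RFC 2047 encoded-words) looks like with Python's standard library; this is a sketch, not part of the original text:

```python
from email.header import Header, decode_header

# Encode a Japanese Subject: value as an RFC 2047 encoded-word.
subject = Header("日本語", charset="iso-2022-jp").encode()

# The result is pure ASCII, of the form =?iso-2022-jp?b?...?=,
# so old transfer programs pass it through untouched.
assert subject.startswith("=?iso-2022-jp?")
assert max(subject.encode("ascii")) < 0x80

# The receiver reverses the encoding with decode_header().
raw, cs = decode_header(subject)[0]
assert raw.decode(cs) == "日本語"
```

The charset name travels inside the encoded-word itself, so the header carries the same labeling information as the charset parameter does for the body.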
MIME is not a specification that prohibits localized RFC 822. So MIME interfaces are supposed to act as follows:
Viewing
Composing
If you store messages in a spool or folders after converting them from ISO-2022-JP to EUC-Japan, please don't do so blindly. You should check the charset first, and convert only ISO-2022-JP messages to EUC-Japan.
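A sketch of that advice, assuming Python's standard email module (the function name is my own):

```python
from email import message_from_bytes

def body_for_storage(raw: bytes) -> bytes:
    """Convert the body to EUC-Japan only when it really is ISO-2022-JP."""
    msg = message_from_bytes(raw)
    if msg.get_content_charset() == "iso-2022-jp":
        body = msg.get_payload(decode=True).decode("iso-2022-jp")
        return body.encode("euc-jp")
    # Leave Latin 1 and everything else untouched.
    return msg.get_payload(decode=True)
```

Blindly converting every body would corrupt a Latin 1 message, since its 8-bit bytes are not valid ISO-2022-JP.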
Insertion of non-ASCII strings into the header is one of MIME's features, but in fact the MIME-Version: field is not necessary for it.
Unfortunately, each computer in the world represents data in its own format. The following are the end-of-line conventions used in major OSes:

UNIX         LF
Mac          CR
DOS/Windows  CRLF
As you know, if there is no agreement on end-of-line, text cannot be transferred between these OSes safely. RFC 822 defines that end-of-line is to be transformed into CRLF. This kind of format conversion is called canonicalization. Converting SJIS and EUC-Japan to JUNET code is also a kind of canonicalization.
OK, let's think about encryption and signatures with PGP. Suppose a Mac user signs text whose line breaks are CR, then sends it to a UNIX user. If the UNIX user transforms the line breaks to LF and then verifies the signature, the verification obviously fails. You can thus see that canonicalization is necessary.

When you encrypt or sign text with PGP, first convert it to ISO-2022-JP, then transform its end-of-lines into CRLF.
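The two steps above can be sketched like this (a hypothetical helper, assuming the PGP layer then takes the canonical bytes):

```python
def canonicalize(text: str) -> bytes:
    """Convert to ISO-2022-JP, then normalize every end-of-line to CRLF."""
    # Unify CR, LF, and CRLF line breaks first, then emit CRLF.
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return b"\r\n".join(line.encode("iso-2022-jp")
                        for line in unified.split("\n"))

# A Mac (CR) text and a UNIX (LF) text canonicalize to the same bytes,
# so a signature made on one machine verifies on the other.
assert canonicalize("署名\rtext") == canonicalize("署名\ntext")
```

Encoding each line separately also guarantees that every line ends back in ASCII, as ISO-2022-JP requires.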
This document was generated by XEmacs shared group account on December 19, 2009
using texi2html 1.65.