計算機とその周辺: What I Talk About When I Talk About Computers: Working with XML (Part 12) [Unicode]

Continuing the workflow.

 ** 2.6 Encoding Schemes

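 The previous section, on encoding forms, was about how text is handled
 inside a single computer. When data is exchanged between computers,
 things go wrong unless details such as the order of bytes are agreed
 on in advance. Unicode encoding schemes are about exactly that.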

 Byte Order: Current computers are either big-endian or little-endian.
 This matters for UTF-16 and UTF-32, whose code units are wider than
 one byte.

 Now, the relationship is:

 character encoding scheme =
   character encoding form +
   a method of serializing code units into bytes
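
 To make that relationship concrete, here is a minimal Python sketch.
 The helper names utf16_code_units and serialize are my own, not from
 the standard. The first produces the code units of the UTF-16
 encoding form; the second applies a byte order, which is the part an
 encoding scheme adds.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units (16-bit integers) for text."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            units.append(cp)
        else:
            # Supplementary code point: split into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
    return units

def serialize(units, big_endian):
    """Serialize 16-bit code units into bytes in the given byte order."""
    fmt = (">" if big_endian else "<") + "H" * len(units)
    return struct.pack(fmt, *units)

units = utf16_code_units("A\u3042")  # 'A' and HIRAGANA LETTER A
print([hex(u) for u in units])                      # ['0x41', '0x3042']
print(serialize(units, big_endian=True).hex(" "))   # 00 41 30 42
print(serialize(units, big_endian=False).hex(" "))  # 41 00 42 30
```

 The code units are the same in both cases; only the byte sequence
 differs.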

 The Unicode Standard also defines a byte order mark (BOM), which is
 placed at the start of the data to indicate the byte order.

 If the application already has a mechanism that handles byte order
 properly, it does not have to use a character encoding scheme or a
 BOM.

 List of encoding schemes
 ---
 Encoding Scheme  Endian Order                 BOM Allowed?
 UTF-8            N/A                          yes
 UTF-16           Big-endian or little-endian  yes
 UTF-16BE         Big-endian                   no
 UTF-16LE         Little-endian                no
 UTF-32           Big-endian or little-endian  yes
 UTF-32BE         Big-endian                   no
 UTF-32LE         Little-endian                no
 ---
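
 As a rough check of the table, the following sketch uses Python's
 built-in codecs. The mapping of codec names ("utf-16", "utf-16-be",
 "utf-16-le") to the schemes above is my reading of how CPython
 behaves, not something stated in the standard.

```python
import codecs

text = "A"  # U+0041

# The BE/LE codecs fix the byte order and write no BOM.
print(text.encode("utf-16-be").hex(" "))  # 00 41
print(text.encode("utf-16-le").hex(" "))  # 41 00

# The plain "utf-16" codec writes a BOM first, then the code units
# in the byte order of the machine running the code.
with_bom = text.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
print(has_bom)  # True

# When decoding, "utf-16" reads the BOM to pick the byte order.
be = codecs.BOM_UTF16_BE + b"\x00A"
le = codecs.BOM_UTF16_LE + b"A\x00"
print(be.decode("utf-16") == le.decode("utf-16") == "A")  # True

# "utf-16-be" does not treat a leading FE FF as a BOM: it stays in
# the decoded text as U+FEFF.
kept = be.decode("utf-16-be")
print([f"U+{ord(c):04X}" for c in kept])  # ['U+FEFF', 'U+0041']
```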

 A BOM in UTF-8 is not recommended, but it is sometimes used as a
 signature to identify data as UTF-8.
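
 A BOM check can be sketched as below. The function name sniff_bom and
 its return format are hypothetical. Note the ordering assumption: the
 UTF-32 little-endian BOM (FF FE 00 00) begins with the UTF-16
 little-endian BOM (FF FE), so the longer signatures are tried first.
 That means UTF-16 little-endian data starting with U+0000 right after
 its BOM would be misjudged; this is only a heuristic.

```python
import codecs

# (BOM bytes, encoding form, byte order); longer BOMs come first.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32", "big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32", "little-endian"),
    (codecs.BOM_UTF8, "UTF-8", None),
    (codecs.BOM_UTF16_BE, "UTF-16", "big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16", "little-endian"),
]

def sniff_bom(data):
    """Return (form, byte_order) if data starts with a BOM, else None."""
    for bom, form, order in _BOMS:
        if data.startswith(bom):
            return form, order
    return None

print(sniff_bom(b"\xef\xbb\xbfabc"))   # ('UTF-8', None)
print(sniff_bom(b"\xff\xfea\x00"))     # ('UTF-16', 'little-endian')
print(sniff_bom(b"abc"))               # None

# Python's "utf-8-sig" codec strips the UTF-8 signature if present;
# plain "utf-8" keeps it in the text as U+FEFF.
print(b"\xef\xbb\xbfabc".decode("utf-8-sig") == "abc")    # True
print(b"\xef\xbb\xbfabc".decode("utf-8") == "\ufeffabc")  # True
```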

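 Encoding Scheme Versus Encoding Form:
 The two share the same names but are different things. An encoding
 form is what is used for in-memory representation, APIs, and the
 like. In that setting everything is self-contained within the host
 computer's own processing, so how the data is serialized into bytes
 is of no concern. An encoding scheme is what is used when thinking
 about streaming I/O, file storage, and the like, where byte order
 matters.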

 IANA maintains a registry of charset names, and what is registered
 there are encoding schemes. Note, however, that an IANA charset and
 a Unicode encoding scheme are not conceptually identical.


 ** 2.7 Unicode Strings

 A Unicode string data type is a sequence of code units.
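
 Since the length of such a string depends on the code unit size, a
 small sketch may help. Python's own str is indexed by code point, so
 the code units here are obtained by encoding; the helper name
 code_unit_count is hypothetical.

```python
def code_unit_count(text, form):
    """Count code units of text in 'utf-8', 'utf-16', or 'utf-32'."""
    widths = {"utf-8": 1, "utf-16": 2, "utf-32": 4}
    # Use the BE codecs for UTF-16/32 so that no BOM is counted.
    codec = form if form == "utf-8" else form + "-be"
    return len(text.encode(codec)) // widths[form]

s = "a\u3042\U0001F600"  # 3 code points: BMP, BMP, supplementary
for form in ("utf-8", "utf-16", "utf-32"):
    print(form, code_unit_count(s, form))
# utf-8 8
# utf-16 4
# utf-32 3
```

 The same three code points take 8, 4, or 3 code units depending on
 the encoding form.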

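 Honestly, I cannot make sense of the part of this section about
 implementations in each programming environment. It seems to explain
 things in terms of concepts such as well-formed UTF-16, which
 apparently appear in Section 3.9, but I do not know what those are.
 Also, isn't this the first appearance of isolated surrogates? Yet
 there is no explanation.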
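
 For reference, here is a sketch of how I would expect these terms to
 work, under the assumption that an isolated surrogate is a surrogate
 code unit (0xD800 to 0xDFFF) that is not part of a high/low pair, and
 that well-formed UTF-16 contains none. This is not taken from Section
 3.9; the function name is hypothetical.

```python
def is_well_formed_utf16(units):
    """Return True if the 16-bit code units have no isolated surrogate."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        i += 1
    return True

print(is_well_formed_utf16([0x41, 0xD83D, 0xDE00]))  # True
print(is_well_formed_utf16([0x41, 0xD83D]))          # False
print(is_well_formed_utf16([0xDE00, 0xD83D]))        # False
```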

 Well, at least I learned that calling Unicode strings "rods" is not
 terminology from The Unicode Standard, so I will leave it at that.


 ** 2.8 Unicode Allocation

 For this section I only note the concepts.

 Plane:

 The Unicode code space can be thought of as divided into chunks of
 64K (65,536) code points, and each such chunk is called a plane.

 ---
 Basic Multilingual Plane:
 BMP or Plane 0

 Supplementary Multilingual Plane:
 SMP or Plane 1

 Supplementary Ideographic Plane:
 SIP or Plane 2

 Supplementary Special-purpose Plane:
 SSP or Plane 14
 
 Private Use Planes:
 Planes 15 and 16
 ---
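
 Because each plane holds 64K code points, the plane number is just
 the code point shifted right by 16 bits. A minimal sketch, with the
 hypothetical names plane_of and PLANE_NAMES:

```python
PLANE_NAMES = {
    0: "BMP",
    1: "SMP",
    2: "SIP",
    14: "SSP",
    15: "Private Use",
    16: "Private Use",
}

def plane_of(code_point):
    """Return the plane number (0 to 16) of a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16  # 0x10000 code points per plane

for cp in (0x3042, 0x1F600, 0x20000, 0xE0001, 0x10FFFF):
    p = plane_of(cp)
    print(f"U+{cp:04X} plane {p} {PLANE_NAMES.get(p, '')}")
# U+3042 plane 0 BMP
# U+1F600 plane 1 SMP
# U+20000 plane 2 SIP
# U+E0001 plane 14 SSP
# U+10FFFF plane 16 Private Use
```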

 ** 2.9 Details of Allocation

 Skipped. Better to look at the figures and tables.

 
 ** 2.10 Writing Direction

 Skipped; not relevant to my current interest.

 
 ** 2.11 Combining Characters

 Skipped; not relevant to my current interest.

 
 ** 2.12 Equivalent Sequences and Normalization

 Skipped; not relevant to my current interest.

 
 ** 2.13 Special Characters and Noncharacters

 Skipped; not relevant to my current interest.

 
 ** 2.14 Conforming to the Unicode Standard

 The definition of conformance is given in Chapter 3. This section
 lists topics that illustrate what is conformant and what is not.

 ---
 It treats characters according to the specified Unicode encoding
 form.

 The byte sequence (20 20) is:
   in UTF-16, U+2020 (dagger)
   in UTF-8, (U+0020 U+0020) (two spaces)

 It interprets characters according to the identities, properties,
 and rules defined for them in this standard.

 Well, exactly what it says.
 ---
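
 The (20 20) example can be reproduced with Python's built-in codecs.
 This is only an illustration; "utf-16-be" is chosen arbitrarily,
 since both bytes are equal and either byte order gives the same
 result here.

```python
data = bytes([0x20, 0x20])

as_utf16 = data.decode("utf-16-be")
as_utf8 = data.decode("utf-8")

print([f"U+{ord(c):04X}" for c in as_utf16])  # ['U+2020']
print([f"U+{ord(c):04X}" for c in as_utf8])   # ['U+0020', 'U+0020']
```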

 *** Unacceptable Behavior

 ---
 To use unassigned codes.

 To corrupt unsupported characters.

 To remove or alter uninterpreted code points in text that purports
 to be unmodified.
 ---

 *** Acceptable Behavior

 ---
 To support only a subset of the Unicode characters.

 To transform data knowingly.

 To build higher-level protocols on the character set.

 To define private-use characters.

 To not support the Bidirectional Algorithm or character shaping in
 implementations that do not support complex scripts, such as
 Arabic and Devanagari.

 To not support the Bidirectional Algorithm or character shaping in
 implementations that do not display characters, as, for example,
 on servers or in programs that simply parse or transcode text, such
 as an XML parser.
 ---

 For now, I have read the parts of Chapter 2 that looked relevant.
 I have a rough picture of the basic concepts of Unicode.

 Back to the cxml-dom problem.
Slowly but steadily.
Sunday, February 1, 2009