Computers and Their Surroundings: What I Talk About When I Talk About Computers: unicode

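Sunday, August 30, 2009

Logic symbols are misaligned in the terminal (3)

I read a little of the Unicode and Emacs specifications.

Earlier I wrote that East Asian ambiguous width characters were the root cause, but that was a bit sloppy.

The reason, obvious as it is, is that logic symbols do not belong to an East Asian category.

After experimenting a little with Emacs in a terminal, I found the following. Suppose the rule is that, among the mathematical symbols and their supplemental blocks, the characters with 'small' in their names are half width. Then the problem arises simply because neither Emacs nor the font (X?) conforms precisely to the Unicode standard.

If there is no such rule, then the root cause is the same as in the East Asian case.

I can work around this ad hoc with a keymap or abbrevs, but it bothers me not to fix it at the root. Still, proper support, curses included, looks like it will take time...

Maybe it is about time to move to a Gtk build of Emacs.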

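Monday, August 17, 2009

Logic symbols are misaligned in the terminal (2)

I looked it up on the Web. In Unicode terms, the root of the problem
seems to be

East Asian ambiguous width characters

The state of support in Emacs is covered in detail in the thread
starting at

[mule-ja-2009:09575]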
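
To check which width class a given symbol actually has, the East_Asian_Width property can be queried directly. The following is a minimal sketch in Python, not something from my Emacs setup; the helper name width_class and the sample symbols are my own picks for illustration. A result of "A" means ambiguous, which is exactly the case where a terminal and an editor may disagree about the width.

```python
import unicodedata


def width_class(ch):
    """Return the East_Asian_Width property value of one character."""
    return unicodedata.east_asian_width(ch)


if __name__ == "__main__":
    # A few logic symbols, plus an ASCII letter and a kanji for comparison.
    for ch in "∀∃∧∨A気":
        print("U+%04X %-2s %s" % (ord(ch), width_class(ch), unicodedata.name(ch)))
```

Symbols reported as "A" are the ones worth looking at first when columns drift.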

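Monday, February 2, 2009

Working with XML (Part 15) [Unicode][Common Lisp]


 Continuing the workflow. The context: I want to understand
 closure-xml, but I don't know what the term "runes" means. So I read
 the source.

 closure-common.asd :

 It uses char-code-limit and the like to determine how far the
 implementation supports Unicode internally. A feature is added
 according to the result.

 ---
 No Unicode support
   :rune-is-integer
 Unicode (UTF-16) support?
   Seems to support Unicode, but handles surrogate pairs oddly.
   :rune-is-character
 Unicode (UTF-16) support
   :rune-is-utf-16
   :rune-is-character
 ---

 Then,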

 #-rune-is-character
 (format t "~&;;; Building Closure with (UNSIGNED-BYTE 16) RUNES~%")

 #+rune-is-character
 (format t "~&;;; Building Closure with CHARACTER RUNES~%") 

 That seems to be the idea. The source files listed in defsystem also
 change according to these features.

---
(defsystem :closure-common
    :default-component-class closure-source-file
    :serial t
    :components
    ((:file "package")
     (:file "definline")
     (:file runes
            :pathname
             #-rune-is-character "runes"
             #+rune-is-character "characters")
     #+rune-is-integer (:file "utf8")
     (:file "syntax")
     #-x&y-streams-are-stream (:file "encodings")
     #-x&y-streams-are-stream (:file "encodings-data")
     #-x&y-streams-are-stream (:file "xstream")
     #-x&y-streams-are-stream (:file "ystream")
     #+x&y-streams-are-stream (:file #+scl "stream-scl")
     (:file "hax"))
    :depends-on (#-scl :trivial-gray-streams
         #+rune-is-character :babel))
---

 Incidentally, ACL falls under

 Unicode (UTF-16) support
   :rune-is-utf-16
   :rune-is-character

 Next, a comparison of characters.lisp and runes.lisp.

 First, the types.

--- characters.lisp --- 
(deftype rune () #-lispworks 'character #+lispworks 'lw:simple-char)
(deftype rod () '(vector rune))
(deftype simple-rod () '(simple-array rune))
---  

--- runes.lisp ---  
(deftype rune () '(unsigned-byte 16))
(deftype rod () '(array rune (*)))
(deftype simple-rod () '(simple-array rune (*)))
---   

 In characters.lisp the types are aliases for the implementation's
 own types. In runes.lisp a rune is a 16-bit unsigned integer.

 Now the functions.
 
--- characters.lisp --- 
(definline digit-rune-p (char &optional (radix 10))
  (digit-char-p char radix))
---

--- runes.lisp --- 
(definline digit-rune-p (char &optional (radix 10))
  (cond ((<= #.(char-code #\0) char #.(char-code #\9))
         (and (< (- char #.(char-code #\0)) radix)
              (- char #.(char-code #\0))))
        ((<= #.(char-code #\A) char #.(char-code #\Z))
         (and (< (- char #.(char-code #\A) -10) radix)
              (- char #.(char-code #\A) -10)))
        ((<= #.(char-code #\a) char #.(char-code #\z))
         (and (< (- char #.(char-code #\a) -10) radix)
              (- char #.(char-code #\a) -10))) ))
---

 Here too, characters.lisp is just an alias for what the
 implementation provides, while runes.lisp implements the logic itself.
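
 To make sure I read the integer version correctly, here is a rough Python transcription of the runes.lisp logic. It is only a sketch: the name digit_rune_p is borrowed from the Lisp source, and None stands in for NIL. The rune is a plain integer code unit, so all the work is integer comparison.

```python
def digit_rune_p(rune, radix=10):
    """Return the digit weight of an integer code unit, or None.

    Mirrors the three branches of the runes.lisp version:
    '0'..'9', 'A'..'Z' and 'a'..'z', each checked against the radix.
    """
    if ord("0") <= rune <= ord("9"):
        weight = rune - ord("0")
    elif ord("A") <= rune <= ord("Z"):
        weight = rune - ord("A") + 10
    elif ord("a") <= rune <= ord("z"):
        weight = rune - ord("a") + 10
    else:
        return None
    # A digit only counts if it is valid in the given radix.
    return weight if weight < radix else None
```

 Note that the digit 0 yields the weight 0, so callers have to compare against None rather than test truthiness.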

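 Now, functions with a utf8 prefix come up from time to time in the
 cxml documentation. What are they?

 First, defsystem contains

      #+rune-is-integer (:file "utf8")

 so they are presumably for implementations without Unicode support.
 Next I look at utf8.lisp.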

--- 
(deftype rune () 'character)
(deftype rod () '(vector rune))
(deftype simple-rod () '(simple-array rune))
---

 I see. runes.lisp emulates UTF-16 inside cxml, while utf8.lisp
 emulates UTF-8.

 With this I may now have the basics for deciphering the cxml
 documentation.

Slow and steady.

Sunday, February 1, 2009

Working with XML (Part 14) [Unicode][Common Lisp]


 Continuing the workflow.

 Before digging into the source, I check the encodings used in my
 environment outside of Lisp.

 ** Mac OS X Leopard

 http://images.apple.com/jp/macosx/pdf/L355785A_UNIX_TB_J.pdf

 -----
 OS internals (wherever that is): UTF-32 [probably NFD, I guess]
 File system (HFS+): UTF-16 [NFD]
 Applications: depends on the application
   Terminal.app: selectable. As Unicode, though, it is UTF-8 [NFD] (UTF-8-mac).
 -----

 I'll look into Ubuntu separately.

 
 ** Allegro Common Lisp

 http://www.franz.com/support/documentation/8.1/doc/iacl.htm

 -----
 internal representation: UTF-16
 external formats:
  Name            Nicknames                      Comments
  ----            ---------                      --------
  :latin1         :ascii, :8-bit, :iso8859-1, t
  :1250                                          For MS Windows
  :1251                                          For MS Windows
  :1252                                          For MS Windows
  :1253                                          For MS Windows
  :1254                                          For MS Windows
  :1255                                          For MS Windows
  :1256                                          For MS Windows
  :1257                                          For MS Windows
  :1258                                          For MS Windows
  :iso8859-2     :latin-2, :latin2
  :iso8859-3     :latin-3, :latin3
  :iso8859-4     :latin-4, :latin4
  :iso8859-5     :latin-5, :latin5
  :iso8859-6     :latin-6, :latin6
  :iso8859-7     :latin-7, :latin7
  :iso8859-8     :latin-8, :latin8
  :iso8859-9     :latin-9, :latin9
  :iso8859-14    :latin-14, :latin14
  :iso8859-15    :latin-15, :latin15
  :koi8-r

  :emacs-mule                                    For eli
  :octets                                        For uncovered octets
  :void

  :utf8           :utf-8
  :big5
  :gb2312
  :euc            :ujis
  :874                                           For MS Windows
  :932                                           For MS Windows
  :936                                           For MS Windows
  :949                                           For MS Windows
  :950                                           For MS Windows
  :jis
  :shiftjis

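  :unicode                                       Handles a BOM (little-endian when there is no BOM)

  And so on. There seem to be many more.

 Now, the relationship between the internal representation and
 external formats.

 First, the REPL. A name defined in the Unicode Standard can be used
 in #\[name], with spaces replaced by underscores. Characters without
 a name can apparently be used as they are.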
 
--- 
CL-USER(7): #\latin_capital_letter_l_with_stroke
#\Ł
CL-USER(8): (code-char #x0141)
#\Ł
CL-USER(9): #\気
#\気
--- 
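
 For comparison outside of ACL, the same name-based lookup can be sketched in Python with the standard unicodedata module. The helper char_from_lisp_name is hypothetical; it only converts the underscore spelling used at the ACL prompt back into the official character name.

```python
import unicodedata


def char_from_lisp_name(name):
    """Look up a character from an ACL-style name (underscores for spaces)."""
    return unicodedata.lookup(name.replace("_", " ").upper())


if __name__ == "__main__":
    print(char_from_lisp_name("latin_capital_letter_l_with_stroke"))
    print(chr(0x0141))
    # Kanji do have names too, derived from their code points.
    print(unicodedata.name("気"))
```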

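 In fact I am doing this through eli, so I suspect it goes smoothly
 thanks to eli. eli talks to Lisp through external formats, so strictly
 speaking this is not purely about the REPL (whatever a pure REPL is).

 Now, ACL converts automatically for streams and foreign-function
 calls. The conversion here is between the internal representation
 and the external formats.

 So how are external formats specified?

 First, there is *locale*.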

--- 
CL-USER(6): *locale*
#<locale "ja_JP" [:UTF8-BASE] @ #x1000695502>
--- 

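 This is what gets used by default. Next, an external format can be
 specified when opening an individual stream. In other words, ACL uses
 the implementation-dependent external file format as the place to
 specify it.

 Stream operations such as read-char and write-char behave according
 to these external-format settings.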
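
 The same split between an internal representation and a per-stream external format exists in other languages. As a sketch in Python (this is not ACL's API; the two helper names are made up), a character stream wraps a byte stream and the encoding argument plays the role of the external format:

```python
import io


def write_with_external_format(text, encoding):
    """Write text through a character stream and return the raw bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="")
    stream.write(text)
    stream.flush()  # push buffered characters down to the byte stream
    return raw.getvalue()


def read_with_external_format(data, encoding):
    """Read bytes back into text through a character stream."""
    stream = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return stream.read()


if __name__ == "__main__":
    for enc in ("utf-8", "euc_jp", "shift_jis"):
        print(enc, write_with_external_format("気", enc).hex())
```

 The program only ever sees characters; which bytes appear on the outside is decided by the stream's setting.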

 To find out a stream's external-format, inspect is one way.

---
CL-USER(12): (inspect *standard-output*)
A TENURED TERMINAL-SIMPLE-STREAM @ #x1000245e42 = #<TERMINAL-SIMPLE-STREAM [initial terminal io] fd 0/1 @ #x1000245e42>
   0 Class --------> #<STANDARD-CLASS TERMINAL-SIMPLE-STREAM>
   1 J-WRITE-BYTE -> #<Function DUAL-CHANNEL-WRITE-BYTE>
   ... 
  16 EXTERNAL-FORMAT -> The symbol :EMACS-MULE
   ...
  36 SRC-POSITION-TABLE -> The symbol NIL
[1i] CL-USER(13):  
---
 -----

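 I was going to dig into the source at this point, but closure-common
 seems to do all sorts of things inside its .asd file.

 I'll review ASDF first, or review it as I go.

Slow and steady.

Working with XML (Part 13) [Unicode][Common Lisp]

Continuing the workflow.


 Looking at the cxml site, the SAX section seems to be the foundation
 for the explanations of the other features. Well, building a DOM uses
 SAX too. So I read cxml's SAX section.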

 http://common-lisp.net/project/cxml/sax.html

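 Hmm. What is SAX in the first place? I've lost track. What nags at
 me is this: wasn't SAX a parser? I thought the structure was that
 events are raised as the XML document is read, and that you define
 callbacks for those events. But the cxml documentation reads as if
 users can generate SAX events themselves and use them to construct
 documents.

 So I look up SAX itself.

 See http://www.saxproject.org/

 ---
 ** It is not formally standardized; it is a de facto standard.

 ** It was originally an API for Java. As of version 2.0.1, versions
 exist for various languages.

 ** The SAX specification is the Java implementation itself. So what
 counts as SAX in another language is up to whoever implements it
 there. Wow.

 ** I look at the Java API. I don't know Java well, but my
 understanding of SAX does not seem to be wrong. (I'm not certain.)
 ---

 So if cxml's sax lets users generate sax events, that is presumably a
 feature peculiar to SAX as cxml does it.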
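
 To pin down the two directions, here is a small sketch using Python's standard xml.sax module, not cxml. The class and function names are my own. parse_events is the classic direction, a parser raising events into callbacks. generate_by_hand is the other direction, where the program calls the event methods itself and a handler turns them into a document.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class EventLogger(xml.sax.ContentHandler):
    """Callbacks that simply record the element events they receive."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append("start:" + name)

    def endElement(self, name):
        self.events.append("end:" + name)


def parse_events(xml_bytes):
    """Parser direction: the document drives the callbacks."""
    handler = EventLogger()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events


def generate_by_hand():
    """Generator direction: the program emits the events itself."""
    out = io.StringIO()
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    gen.startElement("a", {})
    gen.characters("hi")
    gen.endElement("a")
    gen.endDocument()
    return out.getvalue()
```

 So user-generated events are at least not unique to cxml; a handler does not care who is calling it.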

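 Back to cxml's sax.

 I read the description of cxml:parse, but frankly I have no idea what
 it is saying.

 I read the source.

 Ah, I'm starting to get it. A rune is an abstraction that absorbs the
 differences in how each implementation handles Unicode. So what a
 rune is seems to be chosen per implementation at compile time. I've
 run out of energy for today, so next time I'll read the source closely
 to see whether this understanding is really right.

Slow and steady.

Working with XML (Part 12) [Unicode]

Continuing the workflow.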


 ** 2.6 Encoding Schemes

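 The encoding forms of the previous section are about how text is
 handled inside an individual computer. When data is exchanged between
 computers, things such as the order of bytes have to be agreed on or
 it won't work. Unicode encoding schemes are about that.

 Byte order: today's computers are either big-endian or little-endian.
 This matters for UTF-16 and UTF-32.

 In short,

 character encoding scheme =
   character encoding form +
   a method of serializing code units into bytes

 The Unicode Standard also defines the byte order mark (BOM), which is
 placed at the head of the data to indicate the byte order.

 If the application has its own mechanism for handling byte order
 properly, it need not use a character encoding scheme or a BOM.

 List of encoding schemes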
 ---
 Encoding Scheme  Endian Order                 BOM Allowed?
 UTF-8            N/A                          yes
 UTF-16           Big-endian or little-endian  yes
 UTF-16BE         Big-endian                   no
 UTF-16LE         Little-endian                no
 UTF-32           Big-endian or little-endian  yes
 UTF-32BE         Big-endian                   no
 UTF-32LE         Little-endian                no
 ---

 A BOM in UTF-8 is not recommended, but it is sometimes used to
 identify data as UTF-8.
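
 The table is easy to confirm with Python's built-in codecs. This is only a sketch; the function name serialize and the dictionary keys are mine. The codec names "utf-16-be", "utf-16-le", "utf-16" and "utf-8-sig" are the standard Python ones, and "utf-16" without a suffix writes a BOM in the machine's native byte order.

```python
import codecs


def serialize(text):
    """Serialize text under several encoding schemes and return the bytes."""
    return {
        "UTF-16BE": text.encode("utf-16-be"),       # no BOM, big-endian
        "UTF-16LE": text.encode("utf-16-le"),       # no BOM, little-endian
        "UTF-16": text.encode("utf-16"),            # BOM + native byte order
        "UTF-8 with BOM": text.encode("utf-8-sig"),  # EF BB BF + UTF-8
    }


if __name__ == "__main__":
    for scheme, data in serialize("A").items():
        print("%-15s %s" % (scheme, data.hex()))
    print("UTF-8 BOM:", codecs.BOM_UTF8.hex())
```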

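 Encoding Scheme Versus Encoding Form:
 The two use the same names but are different things. An encoding form
 is what is used for in-memory representation, APIs, and so on. There
 everything is self-contained within the host computer's processing,
 so how the data is serialized into bytes is of no concern. An encoding
 scheme is what is used when thinking about streaming I/O, file
 storage, and the like, where byte order matters.

 IANA maintains charset names, and what is registered there are
 encoding schemes. Note, however, that there are conceptual differences
 between IANA charsets and Unicode encoding schemes.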


 ** 2.7 Unicode Strings

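 A Unicode string data type is a sequence of code units.

 Frankly, I can't follow the part of this section about
 implementations in each programming environment. It seems to explain
 things using concepts such as well-formed UTF-16, which apparently
 appear in Section 3.9, but I don't know what those are. Also, isn't
 this the first appearance of isolated surrogates? Yet there is no
 explanation.

 Well, at least I learned that calling Unicode strings rods is not
 terminology from The Unicode Standard, so that will do.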
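
 As far as I can tell, the idea can be sketched like this: a 16-bit string is just a sequence of code units, and it is well-formed UTF-16 only if every surrogate code unit is part of a proper high/low pair. A surrogate that is not part of such a pair is an isolated surrogate. The following Python is my own illustration under that reading, with hypothetical function names.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text (isolated surrogates allowed)."""
    data = text.encode("utf-16-be", "surrogatepass")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def is_well_formed_utf16(units):
    """True if every surrogate code unit belongs to a high/low pair."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:  # high surrogate: needs a low one next
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= unit <= 0xDFFF:  # low surrogate with no high before it
            return False
        i += 1
    return True
```

 So a string type can hold [0xD800] as data, but that sequence is not well-formed UTF-16.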


 ** 2.8 Unicode Allocation

 For this section I only note the concepts.

 Plane:

 The Unicode codespace can be thought of as divided into chunks of
 64K code points; each such chunk of 64K code points is called a plane.

 ---
 Basic Multilingual Plane:
 BMP or Plane 0

 Supplementary Multilingual Plane:
 SMP or Plane 1

 Supplementary Ideographic Plane:
 SIP or Plane 2

 Supplementary Special-purpose Plane:
 SSP or Plane 14
 
 Private Use Planes:
 Planes 15 and 16
 ---
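
 The plane of a code point is simply its value divided by 64K. A small Python sketch, with constant and function names of my own choosing, also confirms that there are 17 planes in total:

```python
PLANE_SIZE = 0x10000        # 64K code points per plane
MAX_CODE_POINT = 0x10FFFF   # last code point of the codespace


def plane_of(ch):
    """Return the plane number (0..16) that a character belongs to."""
    return ord(ch) // PLANE_SIZE


if __name__ == "__main__":
    print("planes:", (MAX_CODE_POINT + 1) // PLANE_SIZE)
    for ch in ("A", "\U00010384", "\U000201DF"):
        print("U+%04X is in plane %d" % (ord(ch), plane_of(ch)))
```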

 ** 2.9 Details of Allocation

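 Skipped. Best to look at the charts.


 ** 2.10 Writing Direction

 Skipped, as it is unrelated to my current interest.


 ** 2.11 Combining Characters

 Skipped, as it is unrelated to my current interest.


 ** 2.12 Equivalent Sequences and Normalization

 Skipped, as it is unrelated to my current interest.


 ** 2.13 Special Characters and Noncharacters

 Skipped, as it is unrelated to my current interest.


 ** 2.14 Conforming to the Unicode Standard

 Conformance itself is defined in Chapter 3. This section lists topics
 that show what is conformant and what is not.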

 ---
 It treats characters according to the specified Unicode encoding
 form.

 The byte sequence (20 20) is
   U+2020 (dagger) in UTF-16
   (U+0020 U+0020) (two spaces) in UTF-8

 It interprets characters according to the identities, properties,
 and rules defined for them in this standard.

 Well, exactly what it says.
 ---
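
 The (20 20) example can be checked directly. A quick Python sketch (the function name interpret is mine) decodes the same two bytes under both encoding forms:

```python
def interpret(data):
    """Decode the same bytes as UTF-16 (big-endian) and as UTF-8."""
    return {
        "UTF-16BE": data.decode("utf-16-be"),
        "UTF-8": data.decode("utf-8"),
    }


if __name__ == "__main__":
    for form, text in interpret(bytes([0x20, 0x20])).items():
        print(form, ["U+%04X" % ord(ch) for ch in text])
```

 Since both bytes are equal, the little-endian reading gives U+2020 as well.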

 *** Unacceptable Behavior

 ---
 To use unassigned codes.

 To corrupt unsupported characters.

 To remove or alter uninterpreted code points in text that purports
 to be unmodified.
 ---

 *** Acceptable Behavior

 ---
 To support only a subset of the Unicode characters.

 To transform data knowingly.

 To build higher-level protocols on the character set.

 To define private-use characters.

 To not support the Bidirectional Algorithm or character shaping in
 implementations that do not support complex scripts, such as
 Arabic and Devanagari.

 To not support the Bidirectional Algorithm or character shaping in
 implementations that do not display characters, as, for example,
 on servers or in programs that simply parse or transcode text, such
 as an XML parser.
 ---

 That covers the parts of Chapter 2 that seemed relevant for now.
 I have a picture of the basic concepts of Unicode.

 Back to the cxml-dom problem.

Slow and steady.

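Saturday, January 31, 2009

Working with XML (Part 11) [Common Lisp][Unicode]

Continuing the workflow.


 Starting from cxml-stp.

 While I'm at it, I go back and check cxml's DOM:

 http://common-lisp.net/project/cxml/dom.html

 The IDL/Lisp mapping is something defined by the OMG:

 http://www.omg.org/docs/formal/00-06-02.pdf

 Franz apparently helped write it.

 What cxml-stp argues is this:

 ** DOM is specified using IDL.
 ** The IDL/Lisp mapping is standardized by the OMG.
 ** That does not mean a DOM/Lisp mapping is standardized.
 ** So it does not follow the IDL/Lisp mapping and maps things its own way.

 Fair enough.

 What does "characters/strings instead of runes/rods" mean?

 I look into it. It seems to be something for handling Unicode on Lisp
 implementations that do not support Unicode.

 I look further. They seem to be defined in the closure-common package.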

 --- runes.lisp ---
 (deftype rune () '(unsigned-byte 16))
 (deftype rod () '(array rune (*)))
 (deftype simple-rod () '(simple-array rune (*)))
 ------------------

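 It is clearly something for handling Unicode.

 rune presumably comes straight from the English word (a runic letter,
 a spell). And rod? A string is a cord, so as its counterpart perhaps
 a rod in the sense of a cane or whip?

 I get the feeling I don't understand Unicode or UTF-8 accurately. My
 understanding was that Unicode is an umbrella standard containing the
 definition of a character set and the definitions of several
 encodings, and that UTF-8 is one of those encodings, originating in
 Plan 9. But here the split is that whatever supports Unicode uses
 rune and whatever does not uses utf-8. Hmm. Evidently I don't
 understand what "supporting Unicode" means.

 I recall that character-related matters were covered in the book
 今夜わかるメールプロトコル. Found it. I summarize it here, mixing in my
 own additions and research.

 ** Character ((graphic) character):

 A sign used by social convention to write down language.

 Examples: the alphabet, digits, Greek letters, kanji, hiragana, katakana.

 ** Character set:

 A set of characters. To make the vast number of characters easier to
 handle on a computer, a character set often collects only the
 frequently used ones. As the word "coded" in the examples below
 shows, they include not only the definition of a character set but
 also the definition of the character code described in the next item.
 What "character set" refers to varies between standards bodies,
 however.

 Examples:
 ISO/IEC 8859-1:1998
 Information technology - 8-bit single-byte coded graphic character
 sets - Part 1: Latin alphabet No.1

 JIS X 0208
 7-bit and 8-bit double byte coded KANJI sets for information interchange

 ** Character code

 A method of associating the members of a character set with numbers.
 Also called a character encoding scheme. What "character code" refers
 to varies between standards bodies, however.

 Example:
 The JIS X 0208 character set is in practice used with three character
 codes: ISO-2022-JP, EUC-JP, and Shift_JIS.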
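
 To see that these are three different codes for one and the same character set, I can encode a single kanji with each of them. A sketch in Python follows; encode_all is a made-up helper, while iso2022_jp, euc_jp and shift_jis are Python's standard codec names.

```python
CODECS = ("iso2022_jp", "euc_jp", "shift_jis")


def encode_all(text):
    """Encode the same text with three character codes for JIS X 0208."""
    return {name: text.encode(name) for name in CODECS}


if __name__ == "__main__":
    for name, data in encode_all("気").items():
        print("%-10s %s" % (name, data.hex()))
```

 The character is the same each time; only the byte representation differs. ISO-2022-JP additionally wraps the two bytes in escape sequences that switch character sets.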

 So then, what about Unicode?

 ** Unicode is a standard developed and maintained by the Unicode
 Consortium.

 ** It is specified in a document called The Unicode Standard.

 ** The Unicode Standard defines a name and a code point for each
 character. A code point is a number.

 ** The latest version at present is 5.1.0.

 ** Version 5.0 is compatible with ISO/IEC 10646.

 ** 5.1.0 is published on the Web, but its PDFs cannot be printed. 5.0
 is also published as a book.

 ** The Unicode Standard is a character encoding standard.
 From here on, unless noted otherwise, I am talking about The Unicode
 Standard version 5.1.0.

 ** It defines three encoding forms: a 32-bit form (UTF-32), a 16-bit
 form (UTF-16), and an 8-bit form (UTF-8). (Huh? So my understanding
 of UTF-8 and Unicode was not wrong after all.)

 ** The total number of code points is 1,114,112. (Over a million.)

 ** The first 65,536 code points are called the Basic Multilingual Plane (BMP).

 ** Code points take values from 0 to 0x10FFFF.

 ** When a computer processes text, the elements of that text can be
 represented and handled in many ways. Here a text element is taken to
 be a code point. In that sense code points are called encoded
 characters.

 ** That is roughly Chapter 1 Introduction.

 ** Next, Chapter 2 General Structure.

 ** Outline of Chapter 2.

 *** Sorts out the nature of text representation and text processing.

 *** Introduces the Unicode Design Principles.

 *** Introduces the Unicode character encoding model, which brings in
 the concepts of character, code point, and encoding form together
 with their relationships. These make it possible to explain UTF-8,
 UTF-16, and UTF-32, along with the advantages and disadvantages of
 each.

 *** Introduces the Unicode codespace.

 *** Explains the topic of writing direction.

 *** Explains equivalent sequences and normalization.

 *** Gives a rough explanation of conformance to the Unicode Standard.
 (Oh, maybe this is where I find out what "supporting Unicode" means.)

 ** 2.1 Architectural Context

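 A character code standard does not in itself constitute a
 text-processing application; it is a component for building useful
 applications. Being a component, it is put to many uses. It is
 therefore impossible to satisfy the requirements of every use, and
 the design has to strike a reasonable balance among them.


 *** Basic Text Processes

 Most computers first provide low-level text-processing functions, and
 a variety of text processing is built on top of them. This low-level
 part is called the Basic Text Processes.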

 **** Rendering characters visible
 
 **** Breaking lines while rendering
 
 **** Modifying appearance, such as point size, kerning,
 underlining, slant and weight

 **** Determining units such as "word" and "sentence"

 **** Interacting with users in processes such as selecting and
 highlighting text

 **** Accepting keyboard input and editing stored text through
 insertion and deletion

 **** Comparing text in operations such as in searching or
 determining the sort order of two strings

 **** Analyzing text content in operations such as spell-checking,
 hyphenation, and parsing morphology

 **** Treating text as bulk data for operations such as compressing
 and decompressing, truncating, transmitting, and receiving

 
 *** Text Elements, Characters, and Text Processes

 What counts as a text element is inseparable from the question "in
 which text process?" There is no universal definition of text
 elements, that is, one independent of the text process.

 For example, English "A" and "a" are different things in rendering
 but are treated as the same when searching for a word. In German, the
 letter combination "ck" is a text element for hyphenation, where it
 is split as "k-k", but it is not a text element for sorting.

 Also, when spell-checking English, the "fox" in "the quick brown
 fox," is the text element.

 What Unicode defines as a character encoding standard is not text
 elements but the more atomic characters, and the code points that
 identify them. These are called assigned characters.

 There are four ways of forming text elements from characters.

 **** Composite

 **** Collation Unit

 **** Syllable (the letters that spell a syllable)

 **** Word

 *** Text Processes and Encoding

 The Unicode Standard is designed so as not to depend on particular
 basic text-processing algorithms.

 ** 2.2 Unicode Design Principles

 ---
 Universality: The Unicode Standard provides a single, universal
 repertoire.

 Efficiency: Unicode text is simple to parse and process.

 Characters, not glyphs: The Unicode Standard encodes characters,
 not glyphs.

 Semantics: Characters have well-defined semantics.

 Plain text: Unicode characters represent plain text.

 Logical order: The default for memory representation is logical
 order.

 Unification: The Unicode Standard unifies duplicate characters
 within scripts across languages.

 Dynamic composition: Accented forms can be dynamically composed.

 Stability: Characters, once assigned, cannot be reassigned and key
 properties are immutable.

 Convertibility: Accurate convertibility is guaranteed between the
 Unicode Standard and other widely accepted standards.
 ---

 I dig deeper only into the ones that interest me now.

 *** Characters, Not Glyphs

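 Characters: abstract representations of the smallest components of
 written language, limited to those that carry semantic value. Typical
 examples are letters, punctuation, and signs. A grouping of the
 letters used in natural languages is called a script. A letter may
 belong to different scripts, and may be identical in both meaning and
 appearance across them. Even so, Unicode treats them as different
 characters. The classification by script takes priority.

 A glyph is, in short, the appearance. Note, though, that a glyph can
 belong not only to a single letter but also to a run of letters. For
 example, in English cursive the glyphs differ between writing "fox"
 as a word and writing "f", "o", "x" separately.

 A font is the mapping from Unicode characters to glyphs in the
 rendering process.

 *** Semantics

 Unicode does not define semantics through a character's name or its
 position in the code table. That is, it does not define them
 implicitly; it defines them explicitly through character properties.
 The collection of these is called the Unicode Character Database.
 Algorithms for parsing, sorting, and so on are built on the
 information in this database.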
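
 As an illustration of semantics coming from properties rather than from names or positions, here is a sketch that reads a few properties out of the character database through Python's unicodedata module. The function name describe and the choice of properties are mine.

```python
import unicodedata


def describe(ch):
    """Collect a few explicitly defined properties of one character."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),  # e.g. Lu, Nd
        "digit": unicodedata.digit(ch, None),  # numeric value, if a digit
    }


if __name__ == "__main__":
    # An ASCII letter, an ASCII digit, and ARABIC-INDIC DIGIT THREE.
    for ch in ("A", "3", "\u0663"):
        print(describe(ch))
```

 A parser that wants to know whether U+0663 is a digit, and which one, asks the database; it does not guess from the code point's position.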

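 When programming applications the Unicode way, the semantics in this
 database sit at the core, and the code set independent model for i18n
 becomes unnecessary. The code set independent model takes the view
 that the semantics of a byte stream differ per code set, and so has a
 mechanism that, when handling a byte stream, selectively applies the
 semantics defined individually for each code set. With Unicode this
 is unnecessary. Of course, it is also possible to treat UTF-8, say,
 as merely one more code set and handle it within the code set
 independent model.

 *** Unification

 In Unicode the concept of scripts and the concept of languages are
 separate. As written above, if the same letter belongs to different
 scripts, it is encoded as different characters even when its
 semantics and appearance are the same. Letters used in different
 languages, on the other hand, are unified into a single character if
 they can be gathered into the same script.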
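
 A concrete case: the capital letters that look like "A" in the Latin, Greek, and Cyrillic scripts are three separate characters, while the Latin "A" used in English, French, or German is one character. A small Python sketch (names of my own choosing) prints the three look-alikes:

```python
import unicodedata

# Latin A, Greek Alpha, Cyrillic A: same shape, three scripts.
LOOKALIKES = "\u0041\u0391\u0410"


def names(chars):
    """Return the Unicode character name of each character."""
    return [unicodedata.name(ch) for ch in chars]


if __name__ == "__main__":
    for ch, name in zip(LOOKALIKES, names(LOOKALIKES)):
        print("U+%04X %s" % (ord(ch), name))
```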

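 ** 2.3 Compatibility Characters

 Omitted, as it is unrelated to my current interest.

 ** 2.4 Code Points and Characters

 The abstract concept of characters is represented concretely as
 numbers inside a computer.

 The range of those numbers is called the codespace. A single number
 within it is called a code point.

 Abstract characters and encoded characters do not correspond one to
 one. For example, the abstract character "A with a ring above" can be
 represented in the following three ways.

 code point : 00C5
 code point : 212B
 code point : 0041 + code point : 030A

 The last one is a dynamic composition of the code point for the
 letter A and the code point for the combining ring above.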
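
 That the three are the same abstract character can be checked with normalization (which I skipped in 2.12, so this is only a sketch). In Python, normalizing each of the three representations to NFC gives the same single code point; the names FORMS and nfc are mine.

```python
import unicodedata

# The three representations of "A with a ring above" listed above.
FORMS = ["\u00C5", "\u212B", "\u0041\u030A"]


def nfc(text):
    """Normalize text to Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    for form in FORMS:
        before = " ".join("U+%04X" % ord(ch) for ch in form)
        after = " ".join("U+%04X" % ord(ch) for ch in nfc(form))
        print("%-13s -> %s" % (before, after))
```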

 The notation for a Unicode code point is "U+" followed by a hexadecimal number.

 U+0061  LATIN SMALL LETTER A
 U+201DF CJK UNIFIED IDEOGRAPH-201DF

 *** Types of Code Points

 Code points can be categorized from several viewpoints.

 The basic classification is as follows.

 ---
 Graphic:
 Letter, mark, number, punctuation, symbol, and spaces
 Assigned to abstract character

 Format:
 Invisible but affects neighboring characters; includes
 line/paragraph separators
 Assigned to abstract character

 Control:
 Usage defined by protocols or standards outside the Unicode
 Standard
 Assigned to abstract character

 Private-use:
 Usage defined by private agreement outside the Unicode standard
 Assigned to abstract character

 Surrogate:
 Permanently reserved for UTF-16; restricted interchange
 Not assigned to abstract character

 Noncharacter:
 Permanently reserved for internal usage; restricted interchange
 Not assigned to abstract character

 Reserved:
 Reserved for future assignment; restricted interchange
 Not assigned to abstract character
 ---

 ** 2.5 Encoding Forms

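 Computers handle numbers not as mathematical numbers per se but in
 fixed-size units, such as bytes or 32-bit words. A character encoding
 has to be designed in line with that reality.

 On a computer, integers are implemented as particular code units.
 Code units are usually 8-bit, 16-bit, or 32-bit.

 So the Unicode encoding forms specify how each code point (an
 integer) is represented as a sequence of code units.

 The Unicode Standard provides the following three encoding forms,
 corresponding to 8-bit, 16-bit, and 32-bit code units.

 UTF-8, UTF-16, UTF-32

 UTF stands for Unicode (or UCS) Transformation Format.

 All of these are equivalent in expressive power and can be converted
 to one another without loss.

 *** Non-overlap

 CP932 and the like overlap, but Unicode does not. This is convenient.

 What this means: CP932 represents a character with 1 byte or 2 bytes.
 The leading byte of a 2-byte character never collides with a 1-byte
 character, but a 1-byte character and the trailing byte of a 2-byte
 character can collide as numbers. This is a nuisance when, for
 example, searching for the 1-byte character involved in the
 collision.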
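
 The well-known instance is the katakana ソ, whose second byte in CP932 is 0x5C, the same value as the 1-byte backslash. A byte-level search for a backslash therefore finds a false hit. The following Python sketch shows it; find_byte is a hypothetical helper, and cp932 is Python's standard codec name.

```python
def find_byte(text, encoding, needle):
    """Encode text and return the offset of needle in the bytes, or -1."""
    return text.encode(encoding).find(needle)


if __name__ == "__main__":
    print("cp932:", "ソ".encode("cp932").hex())
    print("utf-8:", "ソ".encode("utf-8").hex())
    print("backslash in cp932 bytes at:", find_byte("ソ", "cp932", b"\\"))
    print("backslash in utf-8 bytes at:", find_byte("ソ", "utf-8", b"\\"))
```

 In UTF-8 every byte of a multi-byte character has its high bit set, so a 1-byte ASCII value can never appear inside one.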

 With the Unicode encoding forms there is no such worry. Lead, trail,
 single, whatever: the numbers that represent something never overlap.

 *** Conformance

 "The Unicode Consortium fully endorses the use of any of the three
 Unicode encoding forms as a conformant way of implementing the
 Unicode Standard. It is important not to fall into the trap of
 trying to distinguish "UTF-8 versus Unicode," for example."

 Huh? Then isn't the description in cxml-dom strange?
 Was my original understanding right after all?
 I'll read a bit further into Unicode before deciding.

 *** Examples.

 A, Ω, 語, and one unusual character (U+10384)

 UTF-32 00000041    000003A9    00008A9E    00010384
 UTF-16     0041        03A9        8A9E   D800 DF84
 UTF-8        41       CE A9    E8 AA 9E F0 90 8E 84
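
 The table can be reproduced with Python's built-in codecs. This is just a checking sketch; TEXT and code_units_hex are names I made up, and the big-endian codec variants are used so that the hex digits read in the same order as in the table.

```python
# The four characters of the example: U+0041, U+03A9, U+8A9E, U+10384.
TEXT = "\u0041\u03A9\u8A9E\U00010384"


def code_units_hex(text, encoding, unit_bytes):
    """Encode text and return its code units as upper-case hex strings."""
    data = text.encode(encoding)
    return [data[i:i + unit_bytes].hex().upper()
            for i in range(0, len(data), unit_bytes)]


if __name__ == "__main__":
    print("UTF-32", code_units_hex(TEXT, "utf-32-be", 4))
    print("UTF-16", code_units_hex(TEXT, "utf-16-be", 2))
    print("UTF-8 ", code_units_hex(TEXT, "utf-8", 1))
```

 Decoding any of the three byte strings gives back the same four characters, which is the lossless conversion mentioned above.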

 *** UTF-32

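 The simplest. Code points and code units correspond one to one.
 The range is 0..0x10FFFF, which needless to say is the same as that
 of code points. So the code point notation U+0000..U+10FFFF is itself
 the encoded value.


 *** UTF-16

 U+0000..U+FFFF (the BMP) are each represented by a single 16-bit code
 unit. U+10000..U+10FFFF are represented by two 16-bit code units.
 Such a pair is called a surrogate pair. The range allocated to
 surrogate pairs is kept apart from the range used for single-unit
 encodings. This is what achieves non-overlap.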
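
 The arithmetic behind a surrogate pair is short enough to write out. A sketch in Python, with function names of my own: subtract 0x10000, then put the upper 10 bits into the high surrogate (starting at 0xD800) and the lower 10 bits into the low surrogate (starting at 0xDC00).

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x10384)
    print("U+10384 -> %04X %04X" % (high, low))
```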


 *** UTF-8

 U+0000..U+007F: same as the ASCII code points (0x00..0x7F)

 For the rest, Table 3-6 seems the best thing to look at.
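
 The bit distribution can also be written as code. The following Python function is my own sketch of a UTF-8 encoder for a single code point, using the usual 1-, 2-, 3-, and 4-byte patterns; it is checked against Python's built-in encoder rather than taken from the standard's text.

```python
def utf8_encode(code_point):
    """Encode one code point as UTF-8 bytes."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x80:                      # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),   # 11110xxx + three 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


if __name__ == "__main__":
    for cp in (0x41, 0x3A9, 0x8A9E, 0x10384):
        print("U+%04X -> %s" % (cp, utf8_encode(cp).hex().upper()))
```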

Hmm. That's it for now. Next time, 2.6 Encoding Schemes.
This is turning into a Unicode study session, but never mind.

Slow and steady.
