計算機とその周辺: What I Talk About When I Talk About Computers: [Subversion] How Subversion Handles Encodings

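Writing even this much of a summary took reading a fair amount of source code and a lot of thrashing about.
It is painful when there is no design documentation other than the source.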

How Subversion handles encodings
--------------------------------

The foundation of Subversion's character-encoding handling is the following function:

static svn_error_t *
convert_cstring(const char **dest,
                const char *src,
                xlate_handle_node_t *node,
                apr_pool_t *pool);

In this signature, xlate_handle_node_t *node holds the
apr_xlate_t *handle that defines the character-encoding
conversion. Incidentally, apr_xlate is APR's I18N translation library.

typedef struct xlate_handle_node_t {
  apr_xlate_t *handle;
  /* FALSE if the handle is not valid, since its pool is being
     destroyed. */
  svn_boolean_t valid;
  /* The name of a char encoding or APR_LOCALE_CHARSET. */
  const char *frompage, *topage;
  struct xlate_handle_node_t *next;
} xlate_handle_node_t;

The handle looks like this.

struct apr_xlate_t {
    apr_pool_t *pool;
    char *frompage;
    char *topage;
    char *sbcs_table;
    iconv_t ich;
};

Here char *frompage specifies the source encoding and
char *topage specifies the destination encoding.

Now, the node that holds the handle is created by get_ntou_xlate_handle_node.

static svn_error_t *
get_ntou_xlate_handle_node(xlate_handle_node_t **ret, apr_pool_t *pool)
{
  return get_xlate_handle_node(ret, SVN_APR_UTF8_CHARSET,
                               SVN_APR_LOCALE_CHARSET,
                               SVN_UTF_NTOU_XLATE_HANDLE, pool);
}

This is a wrapper; the real work is done by the following.

static svn_error_t *
get_xlate_handle_node(xlate_handle_node_t **ret,
                      const char *topage, const char *frompage,
                      const char *userdata_key,
                      apr_pool_t *pool);

Inside it, the handle is opened first:

apr_xlate_open(&handle, topage, frompage, pool);

  How does apr_xlate_open build the contents of the
  handle? It does this:

  new->ich = iconv_open(topage, frompage);

  In other words, it ultimately obtains a converter from
  iconv, selected by the combination of topage and frompage.


Next, xlate_handle_node_t **ret is initialized:

  *ret = apr_palloc(pool, sizeof(xlate_handle_node_t));
  (*ret)->handle = handle;
  (*ret)->valid = TRUE;
  (*ret)->frompage = ((frompage != SVN_APR_LOCALE_CHARSET)
                      ? apr_pstrdup(pool, frompage) : frompage);
  (*ret)->topage = ((topage != SVN_APR_LOCALE_CHARSET)
                    ? apr_pstrdup(pool, topage) : topage);
  (*ret)->next = NULL;

That fills in the node.

The handle member of the xlate_handle_node_t object
created here is then used as the convset argument of
apr_xlate_conv_buffer.


APU_DECLARE(apr_status_t) apr_xlate_conv_buffer(apr_xlate_t *convset,
                                                const char *inbuf,
                                                apr_size_t *inbytes_left,
                                                char *outbuf,
                                                apr_size_t *outbytes_left);

What actually performs the encoding conversion inside this
function is iconv. (Win32 has no iconv, however, so there
the bundled apr_iconv is used instead.)

translated = iconv(convset->ich, (ICONV_INBUF_TYPE)&inbufptr,
                   inbytes_left, &outbufptr, outbytes_left);

Now,

we know what is inside convert_cstring: it is iconv.
Next, let us check how its topage and frompage are supplied.
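
As a way of checking my own understanding of the iconv layer, here is a minimal sketch of the same open/convert/close sequence using plain iconv, without APR or Subversion. The helper name convert_bytes and its return convention are my own inventions for illustration; the only real API used is iconv_open / iconv / iconv_close. Note that iconv_open takes the destination encoding first, the same order as apr_xlate_open(&handle, topage, frompage, pool).

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Subversion or APR).
   Convert SRCLEN bytes at SRC from encoding FROMPAGE to encoding TOPAGE,
   storing a NUL-terminated result in DST (capacity DSTLEN bytes).
   Returns 0 on success and -1 on any failure. */
static int
convert_bytes(const char *topage, const char *frompage,
              const char *src, size_t srclen,
              char *dst, size_t dstlen)
{
  iconv_t cd;
  char *in = (char *) src;   /* iconv() wants char **, input is not modified */
  char *out = dst;
  size_t inleft = srclen;
  size_t outleft;
  size_t rc;

  if (dstlen == 0)
    return -1;
  outleft = dstlen - 1;      /* keep one byte for the terminating NUL */

  /* Destination first, source second. */
  cd = iconv_open(topage, frompage);
  if (cd == (iconv_t) -1)
    return -1;

  rc = iconv(cd, &in, &inleft, &out, &outleft);
  if (rc != (size_t) -1)
    /* Flush any pending shift state (a no-op for UTF-8 output). */
    rc = iconv(cd, NULL, NULL, &out, &outleft);
  iconv_close(cd);

  if (rc == (size_t) -1 || inleft != 0)
    return -1;

  *out = '\0';
  return 0;
}
```

Calling convert_bytes("UTF-8", "UTF-16LE", "\x4c\x30", 2, buf, sizeof(buf)) should leave the three bytes e3 81 8c in buf. On systems where iconv lives in a separate library, linking may need -liconv.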

The representative callers are the following two functions.

svn_error_t *
svn_utf_cstring_to_utf8(const char **dest,
                        const char *src,
                        apr_pool_t *pool)
{
  xlate_handle_node_t *node;
  svn_error_t *err;

  SVN_ERR(get_ntou_xlate_handle_node(&node, pool));
  err = convert_cstring(dest, src, node, pool);
  put_xlate_handle_node(node, SVN_UTF_NTOU_XLATE_HANDLE, pool);
  SVN_ERR(err);
  SVN_ERR(check_cstring_utf8(*dest, pool));

  return SVN_NO_ERROR;
}

svn_error_t *
svn_utf_cstring_from_utf8(const char **dest,
                          const char *src,
                          apr_pool_t *pool)
{
  xlate_handle_node_t *node;
  svn_error_t *err;

  SVN_ERR(check_utf8(src, strlen(src), pool));

  SVN_ERR(get_uton_xlate_handle_node(&node, pool));
  err = convert_cstring(dest, src, node, pool);
  put_xlate_handle_node(node, SVN_UTF_UTON_XLATE_HANDLE, pool);

  return err;
}

These two are almost symmetric.

They are called from, for example, the svn_path_* functions.


svn_error_t *
svn_path_cstring_to_utf8(const char **path_utf8,
                         const char *path_apr,
                         apr_pool_t *pool)
{
  svn_boolean_t path_is_utf8;
  SVN_ERR(get_path_encoding(&path_is_utf8, pool));
  if (path_is_utf8)
    {
      *path_utf8 = apr_pstrdup(pool, path_apr);
      return SVN_NO_ERROR;
    }
  else
    return svn_utf_cstring_to_utf8(path_utf8, path_apr, pool);
}

svn_error_t *
svn_path_cstring_from_utf8(const char **path_apr,
                           const char *path_utf8,
                           apr_pool_t *pool)
{
  svn_boolean_t path_is_utf8;
  SVN_ERR(get_path_encoding(&path_is_utf8, pool));
  if (path_is_utf8)
    {
      *path_apr = apr_pstrdup(pool, path_utf8);
      return SVN_NO_ERROR;
    }
  else
    return svn_utf_cstring_from_utf8(path_apr, path_utf8, pool);
}

What matters here is that the behavior changes depending
on whether APR's internal processing is UTF-8.
(Incidentally, svn_path_cstring_to_utf8 is where the
Core Foundation patch that makes svn work properly on
OS X is applied.)

get_path_encoding(&path_is_utf8, pool)

This call does not ask about the encoding of the given
path_*; it asks what APR's internal encoding is.

When that is UTF-8:

svn_path_cstring_to_utf8 just does
  *path_utf8 = apr_pstrdup(pool, path_apr);

svn_path_cstring_from_utf8 just does
  *path_apr = apr_pstrdup(pool, path_utf8);

When it is not UTF-8:

svn_path_cstring_to_utf8 calls
  svn_utf_cstring_to_utf8.

svn_path_cstring_from_utf8 calls
  svn_utf_cstring_from_utf8.

That is how it works. In other words,

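  * APR's internal encoding being UTF-8 is taken as
    evidence that the encoding of the environment
    (the OS and so on) is UTF-8.

  * Inside SVN, the encoding is UTF-8.

  * However, when APR judges the environment to be UTF-8,
    UTF-8 byte sequences handed in from outside are taken
    in without conversion. When APR judges the environment
    not to be UTF-8, they are converted by
    svn_utf_cstring_* and friends before being taken in.

That is the picture. Put another way,

  * There are two ways a string gets taken in as UTF-8,
    but neither cares about NFD/NFC. Whether file names
    and path strings are held in NFD or NFC is not
    specified anywhere inside Subversion; it depends on
    the environment in which Subversion is used.

So that is how it is.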
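
To make the NFD/NFC point concrete: both forms are valid UTF-8, so a UTF-8 validity check passes either one, yet they are different byte strings. The sketch below is my own illustration, not Subversion code. It uses a deliberately narrow heuristic that only looks for the combining kana sound marks U+3099 and U+309A (bytes e3 82 99 and e3 82 9a), which is enough to tell a decomposed "が" from a precomposed one. The function name is hypothetical, and it is not a general normalization check.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return 1 if the UTF-8 string S contains U+3099 or
   U+309A (combining voiced / semi-voiced sound mark), otherwise 0.
   A decomposed (NFD) "ga" is U+304B U+3099 and therefore matches;
   a precomposed (NFC) "ga" is the single code point U+304C and does not. */
static int
has_combining_kana_mark(const char *s)
{
  const unsigned char *p = (const unsigned char *) s;
  size_t len = strlen(s);
  size_t i;

  for (i = 0; i + 2 < len; i++)
    if (p[i] == 0xe3 && p[i + 1] == 0x82
        && (p[i + 2] == 0x99 || p[i + 2] == 0x9a))
      return 1;
  return 0;
}
```

With the NFC name "\xe3\x81\x8c.txt" this returns 0, and with the NFD name "\xe3\x81\x8b\xe3\x82\x99.txt" it returns 1. A plain strcmp of the two names is non-zero, which is all that any byte-wise comparison inside Subversion would see.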



Next: how does the svn client add a new file to the
working directory? Let us trace how the path name is
handled along the way.



First, the body of the svn add command is the following function.

/* This implements the `svn_opt_subcommand_t' interface. */
svn_error_t *
svn_cl__add(apr_getopt_t *os,
            void *baton,
            apr_pool_t *pool);

It does various things, but for the strings we care
about here,

  apr_array_header_t *targets;

is the structure that matters. It is built by

  SVN_ERR(svn_cl__args_to_target_array_print_reserved(&targets, os,
                                                      opt_state->targets, 
                                                      pool));

and then, in

  for (i = 0; i < targets->nelts; i++)
    {
      const char *target = APR_ARRAY_IDX(targets, i, const char *);

      svn_pool_clear(subpool);
      SVN_ERR(svn_cl__check_cancel(ctx->cancel_baton));
      SVN_ERR(svn_cl__try
              (svn_client_add4(target,
                               opt_state->depth,
                               opt_state->force, opt_state->no_ignore,
                               opt_state->parents, ctx, subpool),
               NULL, opt_state->quiet,
               SVN_ERR_ENTRY_EXISTS,
               SVN_ERR_WC_PATH_NOT_FOUND,
               SVN_NO_ERROR));
    }

svn_client_add4 is called to add the information to
entries.

Now let us check how targets is built.


svn_error_t *
svn_cl__args_to_target_array_print_reserved(apr_array_header_t **targets,
                                            apr_getopt_t *os,
                                            apr_array_header_t *known_targets,
                                            apr_pool_t
                                            *pool);

is a wrapper around

svn_opt_args_to_target_array3(targets, os,
                              known_targets, pool);

Let us look at svn_opt_args_to_target_array3. As one
would hope, it has comments about encoding, so I reproduce
it as is. The starred comment blocks are my annotations.


svn_error_t *
svn_opt_args_to_target_array3(apr_array_header_t **targets_p,
                              apr_getopt_t *os,
                              apr_array_header_t *known_targets,
                              apr_pool_t *pool)
{
  int i;
  svn_error_t *err = SVN_NO_ERROR;
  apr_array_header_t *input_targets =
    apr_array_make(pool, DEFAULT_ARRAY_SIZE, sizeof(const char *));
  apr_array_header_t *output_targets =
    apr_array_make(pool, DEFAULT_ARRAY_SIZE, sizeof(const char *));

  /* Step 1:  create a master array of targets that are in UTF-8
     encoding, and come from concatenating the targets left by apr_getopt,
     plus any extra targets (e.g., from the --targets switch.) */

  for (; os->ind < os->argc; os->ind++)
    {
      /* The apr_getopt targets are still in native encoding. */
      const char *raw_target = os->argv[os->ind];
      SVN_ERR(svn_utf_cstring_to_utf8
      /* *****************************************
         A UTF-8 conversion is applied here (though it
         may turn out to be a no-op). The result is
         guaranteed to be UTF-8.
        *****************************************/
              ((const char **) apr_array_push(input_targets),
               raw_target, pool));
    }

  if (known_targets)
    {
      for (i = 0; i < known_targets->nelts; i++)
        {
          /* The --targets array have already been converted to UTF-8,
             because we needed to split up the list with svn_cstring_split. */
          const char *utf8_target = APR_ARRAY_IDX(known_targets,
                                                  i, const char *);
          APR_ARRAY_PUSH(input_targets, const char *) = utf8_target;
          /* ********************************
             known_targets is unpacked here and
             absorbed into input_targets.
             ******************************** */
        }
    }

  /* Step 2:  process each target.  */

  for (i = 0; i < input_targets->nelts; i++)
    {
      const char *utf8_target = APR_ARRAY_IDX(input_targets, i, const char *);
      const char *peg_start = NULL; /* pointer to the peg revision, if any */
      const char *target;      /* after all processing is finished */
      int j;

      /* Remove a peg revision, if any, in the target so that it can
         be properly canonicalized, otherwise the canonicalization
         does not treat a ".@BASE" as a "." with a BASE peg revision,
         and it is not canonicalized to "@BASE".  If any peg revision
         exists, it is appended to the final canonicalized path or
         URL.  Do not use svn_opt_parse_path() because the resulting
         peg revision is a structure that would have to be converted
         back into a string.  Converting from a string date to the
         apr_time_t field in the svn_opt_revision_value_t and back to
         a string would not necessarily preserve the exact bytes of
         the input date, so its easier just to keep it in string
         form. */
      for (j = (strlen(utf8_target) - 1); j >= 0; --j)
        {
          /* If we hit a path separator, stop looking.  This is OK
              only because our revision specifiers can't contain
              '/'. */
          if (utf8_target[j] == '/')
            break;
          if (utf8_target[j] == '@')
            {
              peg_start = utf8_target + j;
              break;
            }
        }
      if (peg_start)
        utf8_target = apr_pstrmemdup(pool,
                                     utf8_target,
                                     peg_start - utf8_target);

      /* URLs and wc-paths get treated differently. */
      if (svn_path_is_url(utf8_target))
        /* *******************************
           This only checks for the form
           (scheme)://(optional_stuff).
           ******************************* */
        {
          /* No need to canonicalize a URL's case or path separators. */

          /* Convert to URI. */
          target = svn_path_uri_from_iri(utf8_target, pool);
          /* ***************************
             This just does ordinary URI-encoding,
             for the UTF-8 (non-ASCII) part.
             *************************** */
          /* Auto-escape some ASCII characters. */
          target = svn_path_uri_autoescape(target, pool);
          /* ***************************
             This too just does ordinary URI-encoding,
             for the ASCII part.
             *************************** */

          /* The above doesn't guarantee a valid URI. */
          if (! svn_path_is_uri_safe(target))
            return svn_error_createf(SVN_ERR_BAD_URL, 0,
                                     _("URL '%s' is not properly URI-encoded"),
                                     utf8_target);

          /* Verify that no backpaths are present in the URL. */
          if (svn_path_is_backpath_present(target))
            return svn_error_createf(SVN_ERR_BAD_URL, 0,
                                     _("URL '%s' contains a '..' element"),
                                     utf8_target);

          /* strip any trailing '/' */
          target = svn_path_canonicalize(target, pool);
          /* ***************************
             One target, done.
             *************************** */
        }
      else  /* not a url, so treat as a path */
        {
          const char *apr_target;
          const char *base_name;
          char *truenamed_target; /* APR-encoded */
          apr_status_t apr_err;

          /* canonicalize case, and change all separators to '/'. */
          SVN_ERR(svn_path_cstring_from_utf8(&apr_target, utf8_target,
                                             pool));
          /* *************************************
             Convert to APR's internal representation.
             If that representation is UTF-8, this is just a copy.
             ************************************* */
          apr_err = apr_filepath_merge(&truenamed_target, "", apr_target,
                                       APR_FILEPATH_TRUENAME, pool);

          if (!apr_err)
            /* We have a canonicalized APR-encoded target now. */
            apr_target = truenamed_target;
          else if (APR_STATUS_IS_ENOENT(apr_err))
            /* It's okay for the file to not exist, that just means we
               have to accept the case given to the client. We'll use
               the original APR-encoded target. */
            ;
          else
            return svn_error_createf(apr_err, NULL,
                                     _("Error resolving case of '%s'"),
                                     svn_path_local_style(utf8_target,
                                                          pool));

          /* convert back to UTF-8. */
          SVN_ERR(svn_path_cstring_to_utf8(&target, apr_target, pool));
          /* *************************************
             Convert from APR's internal representation to UTF-8.
             If that representation is UTF-8, this is just a copy.
             ************************************* */
          target = svn_path_canonicalize(target, pool);
          /* ***************************
             One target, done.
             (There is still the skip handling below, though.)
             *************************** */

          /* If the target has the same name as a Subversion
             working copy administrative dir, skip it. */
          base_name = svn_path_basename(target, pool);
          /* FIXME:
             The canonical list of administrative directory names is
             maintained in libsvn_wc/adm_files.c:svn_wc_set_adm_dir().
             That list can't be used here, because that use would
             create a circular dependency between libsvn_wc and
             libsvn_subr.  Make sure changes to the lists are always
             synchronized! */
          if (0 == strcmp(base_name, ".svn")
              || 0 == strcmp(base_name, "_svn"))
            {
              err = svn_error_createf(SVN_ERR_RESERVED_FILENAME_SPECIFIED,
                                      err, _("'%s' ends in a reserved name"),
                                      target);
              continue;
            }
        }

      /* Append the peg revision back to the canonicalized target if
         there was a peg revision. */
      if (peg_start)
        target = apr_pstrcat(pool, target, peg_start, NULL);

      APR_ARRAY_PUSH(output_targets, const char *) = target;
      /* ***************************
         Register target in output_targets.
         *************************** */
    }


  /* kff todo: need to remove redundancies from targets before
     passing it to the cmd_func. */

  *targets_p = output_targets;
   /* ***************************
      targets is complete.
      *************************** */

  return err;
}


Now we understand how targets is built. In the end, in a
UTF-8 environment, a path that is NFD stays NFD and a
path that is NFC stays NFC.
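
One detail worth convincing myself of: the peg-revision scan in Step 2 of the function above walks the target byte by byte, not character by character. That is safe for UTF-8 in either normalization form, because every byte of a multi-byte UTF-8 sequence is 0x80 or above and so can never equal '/' or '@'. Here is a standalone sketch of that scan. The name find_peg_start is hypothetical, and the loop is restructured to use size_t rather than copied from the Subversion source.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the '@' that starts a peg
   revision in TARGET, or NULL if there is none.  The scan runs backwards
   from the end and gives up at the first '/', as the original loop does. */
static const char *
find_peg_start(const char *target)
{
  size_t i = strlen(target);

  while (i > 0)
    {
      --i;
      if (target[i] == '/')
        return NULL;          /* hit a path separator: no peg revision */
      if (target[i] == '@')
        return target + i;    /* peg revision starts here */
    }
  return NULL;
}
```

For "dir/\xe3\x81\x8c.txt@BASE" the result points at "@BASE", and the bytes before it are left exactly as they came in, NFC or NFD.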

add4 then receives this and does its processing.
The call to add4 was


              (svn_client_add4(target,
                               opt_state->depth,
                               opt_state->force, opt_state->no_ignore,
                               opt_state->parents, ctx, subpool),


Each element of the targets array built above is passed
in as target.

svn_client_add4 is declared as

svn_error_t *
svn_client_add4(const char *path,
                svn_depth_t depth,
                svn_boolean_t force,
                svn_boolean_t no_ignore,
                svn_boolean_t add_parents,
                svn_client_ctx_t *ctx,
                apr_pool_t *pool);

and its main work is done by

  err = add(path, depth, force, no_ignore, adm_access, ctx, pool);

The path argument is passed straight through as add's
path argument. The interface of add is

static svn_error_t *
add(const char *path,
    svn_depth_t depth,
    svn_boolean_t force,
    svn_boolean_t no_ignore,
    svn_wc_adm_access_t *adm_access,
    svn_client_ctx_t *ctx,
    apr_pool_t *pool);

and when the target of add is a file (only the file case
is followed here), the call

    err = add_file(path, ctx, adm_access, pool);

does the real work. Let us look at add_file.


static svn_error_t *
add_file(const char *path,
         svn_client_ctx_t *ctx,
         svn_wc_adm_access_t *adm_access,
         apr_pool_t *pool)
{
  apr_hash_t* properties;
  apr_hash_index_t *hi;
  const char *mimetype;
  svn_node_kind_t kind;
  svn_boolean_t is_special;

  /* Check to see if this is a special file. */
  SVN_ERR(svn_io_check_special_path(path, &kind, &is_special, pool));

  if (is_special)
    mimetype = NULL;
  else
    /* Get automatic properties */
    /* This may fail on write-only files:
       we open them to estimate file type.
       That's why we postpone the add until after this step. */
    SVN_ERR(svn_client__get_auto_props(&properties, &mimetype, path, ctx,
                                       pool));

  /* Add the file */
  SVN_ERR(svn_wc_add2(path, adm_access, NULL, SVN_INVALID_REVNUM,
                      ctx->cancel_func, ctx->cancel_baton,
                      NULL, NULL, pool));
  /* **************************************
    svn_wc_add2 is called here. This is the real work.
    *************************************** */

/* ... rest omitted ... */
}

So path is handed on unchanged, and this time
svn_wc_add2 is called.
Let us look at svn_wc_add2.


svn_error_t *
svn_wc_add2(const char *path,
            svn_wc_adm_access_t *parent_access,
            const char *copyfrom_url,
            svn_revnum_t copyfrom_rev,
            svn_cancel_func_t cancel_func,
            void *cancel_baton,
            svn_wc_notify_func2_t notify_func,
            void *notify_baton,
            apr_pool_t *pool)
{
  const char *parent_dir, *base_name;
  const svn_wc_entry_t *orig_entry, *parent_entry;
  svn_wc_entry_t tmp_entry;
  svn_boolean_t is_replace = FALSE;
  svn_node_kind_t kind;
  apr_uint64_t modify_flags = 0;
  svn_wc_adm_access_t *adm_access;

  SVN_ERR(svn_path_check_valid(path, pool));
  /* *******************************
     This only checks that path contains no control
     characters.
     ******************************* */

  /* Make sure something's there. */
  SVN_ERR(svn_io_check_path(path, &kind, pool));
  /* *******************************
     The body of svn_io_check_path is io_check_path.
     It applies svn_path_cstring_from_utf8 to path
     once and then calls
     apr_stat(&finfo, path_apr, flags, pool);
     Depending on the value of flags, apr_stat
     consults the file with lstat or stat.

     According to man 2 stat, nothing is said about
     the encoding of the fname argument.

     So even given UTF-8, whether the query should be
     made in NFC or in NFD is up to the OS. On OS X
     this path is NFD, so what is given to stat is
     also NFD. I do not know whether stat behaves
     correctly with that. I will try it out later.
     ******************************* */

/* ... omitted ... */

  if (adm_access)
    SVN_ERR(svn_wc_entry(&orig_entry, path, adm_access, TRUE, pool));
    /* *******************************
       Sets orig_entry to an svn_wc_entry_t object.
       For a file, this calls
       SVN_ERR(svn_wc__adm_retrieve_internal(&dir_access, adm_access, path, pool));

       svn_wc__adm_retrieve_internal walks adm_access
       and builds dir_access.

       if (associated->set)
         *adm_access = apr_hash_get(associated->set, path, APR_HASH_KEY_STRING);

       appears in it, so if the svn_wc_adm_access_t
       structure that associated points to has a set
       (hash), then... I do not really follow this part.
       I would need to understand how
       svn_wc_adm_access_t is used much better before
       I could, and understanding that feels like
       understanding all of Subversion. I cannot spend
       that much time. What to do.
       ******************************* */
  else
    orig_entry = NULL;

/* ... omitted ... */

  /* Split off the base_name from the parent directory. */
  svn_path_split(path, &parent_dir, &base_name, pool);
  /* *************************************
     base_name is extracted from path here.
     For a file, this is the file name.
     ************************************* */

/* ... omitted ... */

  /* Now, add the entry for this item to the parent_dir's
     entries file, marking it for addition. */
  SVN_ERR(svn_wc__entry_modify(parent_access, base_name, &tmp_entry,
                               modify_flags, TRUE, pool));
 /* *******************************
    The entries file is rewritten here, using
    base_name.
    ******************************* */

/* ... rest omitted ... */
}

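Quite a lot happens before we reach base_name, but no
encoding conversion appears to be performed. One thing
that bothers me is that there is a place that checks for
clashes with files already registered in entries, and I
have not worked out what is being compared with what
there. An encoding problem could conceivably arise in
that comparison.

Still, even on OS X, svn add of, say, "が.txt" itself
should have worked, so I will assume the problem does not
arise there. On OS X the trouble starts at svn status.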

Now,

  SVN_ERR(svn_wc__entry_modify(parent_access, base_name, &tmp_entry,
                               modify_flags, TRUE, pool));

needs to be examined.
The interface of svn_wc__entry_modify is

svn_error_t *
svn_wc__entry_modify(svn_wc_adm_access_t *adm_access,
                     const char *name,
                     svn_wc_entry_t *entry,
                     apr_uint64_t modify_flags,
                     svn_boolean_t do_sync,
                     apr_pool_t *pool);

and first,

  apr_hash_t *entries, *entries_nohidden;
  svn_boolean_t entry_was_deleted_p = FALSE;
  /* Load ADM_ACCESS's whole entries file. */
  SVN_ERR(svn_wc_entries_read(&entries, adm_access, TRUE, pool));
  SVN_ERR(svn_wc_entries_read(&entries_nohidden, adm_access, FALSE, pool));

it reads the entries file like this.

As for name,

  if (name == NULL)
    name = SVN_WC_ENTRY_THIS_DIR;

after this handling,

  /* If the entry wasn't just removed from the entries hash, fold the
     changes into the entry. */
  if (! entry_was_deleted_p)
    {
      fold_entry(entries, name, modify_flags, entry,
                 svn_wc_adm_access_pool(adm_access));
      if (entries != entries_nohidden)
        fold_entry(entries_nohidden, name, modify_flags, entry,
                   svn_wc_adm_access_pool(adm_access));
    }

entry is folded into entries under the name name by
fold_entry. As for what fold_entry is,

/* Update an entry NAME in ENTRIES, according to the combination of
   entry data found in ENTRY and masked by MODIFY_FLAGS. If the entry
   already exists, the requested changes will be folded (merged) into
   the entry's existing state.  If the entry doesn't exist, the entry
   will be created with exactly those properties described by the set
   of changes. Also cleanups meaningless fields combinations.

   POOL may be used to allocate memory referenced by ENTRIES.
 */
static void
fold_entry(apr_hash_t *entries,
           const char *name,
           apr_uint64_t modify_flags,
           svn_wc_entry_t *entry,
           apr_pool_t *pool);

as its comment describes. This function does not touch
the encoding of name.

And finally,

    SVN_ERR(svn_wc__entries_write(entries, adm_access, pool));

svn_wc__entries_write writes entries, now including
entry, out to the entries file.



So svn_wc__entries_write needs to be examined.

Let us look at svn_wc__entries_write.


svn_error_t *
svn_wc__entries_write(apr_hash_t *entries,
                      svn_wc_adm_access_t *adm_access,
                      apr_pool_t *pool)
{
  svn_error_t *err = SVN_NO_ERROR;
  svn_stringbuf_t *bigstr = NULL;
  apr_file_t *outfile = NULL;
  apr_hash_index_t *hi;
  svn_wc_entry_t *this_dir;

  SVN_ERR(svn_wc__adm_write_check(adm_access));

  /* Get a copy of the "this dir" entry for comparison purposes. */
  this_dir = apr_hash_get(entries, SVN_WC_ENTRY_THIS_DIR,
                          APR_HASH_KEY_STRING);

  /* If there is no "this dir" entry, something is wrong. */
  if (! this_dir)
    return svn_error_createf(SVN_ERR_ENTRY_NOT_FOUND, NULL,
                             _("No default entry in directory '%s'"),
                             svn_path_local_style
                             (svn_wc_adm_access_path(adm_access), pool));

  /* Open entries file for writing.  It's important we don't use APR_EXCL
   * here.  Consider what happens if a log file is interrupted, it may
   * leave a .svn/tmp/entries file behind.  Then when cleanup reruns the
   * log file, and it attempts to modify the entries file, APR_EXCL would
   * cause an error that prevents cleanup running.  We don't use log file
   * tags such as SVN_WC__LOG_MV to move entries files so any existing file
   * is not "valuable".
   */
  SVN_ERR(svn_wc__open_adm_file(&outfile,
                                svn_wc_adm_access_path(adm_access),
                                SVN_WC__ADM_ENTRIES,
                                (APR_WRITE | APR_CREATE),
                                pool));

  if (svn_wc__adm_wc_format(adm_access) > SVN_WC__XML_ENTRIES_VERSION)
    {
      apr_pool_t *subpool = svn_pool_create(pool);
      bigstr = svn_stringbuf_createf(pool, "%d\n",
                                     svn_wc__adm_wc_format(adm_access));  //#### The string buffer that holds the contents of the entries file is created here.
      /* Write out "this dir" */
      write_entry(bigstr, this_dir, SVN_WC_ENTRY_THIS_DIR, this_dir, pool); //#### Write the information for the starting directory.

      for (hi = apr_hash_first(pool, entries); hi; hi = apr_hash_next(hi)) //#### Process the entries (hash) passed as an argument one at a time.
        {
          const void *key;
          void *val;
          svn_wc_entry_t *this_entry;

          svn_pool_clear(subpool);

          /* Get the entry and make sure its attributes are up-to-date. */
          apr_hash_this(hi, &key, NULL, &val); //#### apr_hash_this fills in key. apr_hash_this needs looking into.
          this_entry = val;

          /* Don't rewrite the "this dir" entry! */
          if (! strcmp(key, SVN_WC_ENTRY_THIS_DIR ))
            continue;

          /* Append the entry to BIGSTR */
          write_entry(bigstr, this_entry, key, this_dir, subpool); //#### this_entry is written out as a string here. key is passed as an argument.
        }

      svn_pool_destroy(subpool);
    }
  else
    /* This is needed during cleanup of a not yet upgraded WC. */
    write_entries_xml(&bigstr, entries, this_dir, pool);

  SVN_ERR_W(svn_io_file_write_full(outfile, bigstr->data,
                                   bigstr->len, NULL, pool),  //#### bigstr is written to outfile here.
            apr_psprintf(pool,
                         _("Error writing to '%s'"),
                         svn_path_local_style
                         (svn_wc_adm_access_path(adm_access), pool)));

  err = svn_wc__close_adm_file(outfile,
                               svn_wc_adm_access_path(adm_access),
                               SVN_WC__ADM_ENTRIES, 1, pool);

  svn_wc__adm_access_set_entries(adm_access, TRUE, entries);
  svn_wc__adm_access_set_entries(adm_access, FALSE, NULL);

  return err;
}


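The gist of this function is that it takes entries out of
the hash one by one with apr_hash_this, writes each one
into a buffer with write_entry, and writes that buffer to
the file with svn_io_file_write_full.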

According to its documentation, apr_hash_this will
"Get the current entry's details from the iteration state."

So

apr_hash_this(hi, &key, NULL, &val);

simply hands key back as it is.


Next, let us look into write_entry.


/* Append a single entry ENTRY to the string OUTPUT, using the
   entry for "this dir" THIS_DIR for comparison/optimization.
   Allocations are done in POOL.  */
static void
write_entry(svn_stringbuf_t *buf,
            svn_wc_entry_t *entry,
            const char *name, //#### key is passed in here.
            svn_wc_entry_t *this_dir,
            apr_pool_t *pool)
{
  const char *valuestr;
  svn_revnum_t valuerev;
  svn_boolean_t is_this_dir = strcmp(name, SVN_WC_ENTRY_THIS_DIR) == 0;
  svn_boolean_t is_subdir = ! is_this_dir && (entry->kind == svn_node_dir);

  assert(name);

  /* Name. */
  write_str(buf, name, pool); //#### The key that was passed in goes straight to write_str.

  /* Kind. */
  switch (entry->kind)
    {
    case svn_node_dir:
      write_val(buf, SVN_WC__ENTRIES_ATTR_DIR_STR,
                 sizeof(SVN_WC__ENTRIES_ATTR_DIR_STR) - 1);
      break;

    case svn_node_none:
      write_val(buf, NULL, 0);
      break;

    case svn_node_file:
    case svn_node_unknown:
    default:
      write_val(buf, SVN_WC__ENTRIES_ATTR_FILE_STR,
                 sizeof(SVN_WC__ENTRIES_ATTR_FILE_STR) - 1);
      break;
    }

  /* Revision. */
  if (is_this_dir || (! is_subdir && entry->revision != this_dir->revision))
    valuerev = entry->revision;
  else
    valuerev = SVN_INVALID_REVNUM;
  write_revnum(buf, valuerev, pool);

  /* URL. */
  if (is_this_dir ||
      (! is_subdir && strcmp(svn_path_url_add_component(this_dir->url, name,
                                                        pool),
                             entry->url) != 0))
    valuestr = entry->url;
  else
    valuestr = NULL;
  write_str(buf, valuestr, pool); //#### For the URL too, entry->url goes straight to write_str.

  /* Repository root. */
  if (! is_subdir
      && (is_this_dir
          || (this_dir->repos == NULL
              || (entry->repos
                  && strcmp(this_dir->repos, entry->repos) != 0))))
    valuestr = entry->repos;
  else
    valuestr = NULL;
  write_str(buf, valuestr, pool);

  //... omitted ...

  /* Remove redundant separators at the end of the entry. */
  while (buf->len > 1 && buf->data[buf->len - 2] == '\n')
    buf->len--;

  svn_stringbuf_appendbytes(buf, "\f\n", 2);
}


It was write_str that handled name and entry->url.
So let us look into write_str.


/* If STR is non-null, append STR to BUF, terminating it with a
   newline, escaping bytes that needs escaping, using POOL for
   temporary allocations.  Else if STR is null, just append the
   terminating newline. */
static void
write_str(svn_stringbuf_t *buf, const char *str, apr_pool_t *pool)
{
  const char *start = str;
  if (str)
    {
      while (*str)
        {
          /* Escape control characters and | and \. */
          if (svn_ctype_iscntrl(*str) || *str == '\\')
            {
              svn_stringbuf_appendbytes(buf, start, str - start);
              svn_stringbuf_appendcstr(buf,
                                       apr_psprintf(pool, "\\x%02x", *str));
              start = str + 1;
            }
          ++str;
        }
      svn_stringbuf_appendbytes(buf, start, str - start); //#### Everything other than the escaped control characters is simply appended here with appendbytes.
    }
  svn_stringbuf_appendbytes(buf, "\n", 1);
}


Apart from that escaping, write_str just appends bytes with svn_stringbuf_appendbytes.
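
To check that this really leaves multi-byte UTF-8 alone, here is a standalone sketch of the same escaping rule writing into a fixed buffer. It is my own reconstruction, not the Subversion code: the name escape_entry_field is hypothetical, and I am assuming that "control character" means a byte below 0x20 or equal to 0x7f, so every byte of a multi-byte UTF-8 sequence (all 0x80 or above) passes through untouched.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy STR into OUT (capacity OUTLEN bytes), replacing
   each control byte and each backslash with the four characters \xNN.
   All other bytes, including UTF-8 lead and continuation bytes, are copied
   unchanged.  Returns 0 on success, -1 if OUT is too small. */
static int
escape_entry_field(const char *str, char *out, size_t outlen)
{
  const unsigned char *p;
  size_t used = 0;

  for (p = (const unsigned char *) str; *p; p++)
    {
      /* Assumption: "control" means ASCII C0 controls and DEL only. */
      int is_cntrl = (*p < 0x20 || *p == 0x7f);

      if (is_cntrl || *p == '\\')
        {
          if (used + 4 >= outlen)
            return -1;
          snprintf(out + used, outlen - used, "\\x%02x", (unsigned int) *p);
          used += 4;
        }
      else
        {
          if (used + 1 >= outlen)
            return -1;
          out[used++] = (char) *p;
        }
    }

  if (used >= outlen)
    return -1;
  out[used] = '\0';
  return 0;
}
```

Feeding it "\xe3\x81\x8c.txt" gives back the identical seven bytes, which matches the conclusion here: whatever normalization form the name arrived in is what ends up in the buffer.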

This branch ends in a leaf here. The next branch of the processing in
svn_wc__entries_write is svn_io_file_write_full.

Let us look at svn_io_file_write_full.


svn_error_t *
svn_io_file_write_full(apr_file_t *file, const void *buf,
                       apr_size_t nbytes, apr_size_t *bytes_written,
                       apr_pool_t *pool)
{
  apr_status_t rv = apr_file_write_full(file, buf, nbytes, bytes_written);

#ifdef WIN32
#define MAXBUFSIZE 30*1024
  if (rv == APR_FROM_OS_ERROR(ERROR_NOT_ENOUGH_MEMORY)
      && nbytes > MAXBUFSIZE)
    {
      apr_size_t bw = 0;
      *bytes_written = 0;

      do {
           rv = apr_file_write_full(file, buf, 
                                 nbytes > MAXBUFSIZE ? MAXBUFSIZE : nbytes, &bw); //#### On WIN32 this retry loop writes the file in chunks; otherwise the apr_file_write_full call at the top does the write.
        *bytes_written += bw;
        buf = (char *)buf + bw;
        nbytes -= bw;
      } while (rv == APR_SUCCESS && nbytes > 0);
    }
#undef MAXBUFSIZE
#endif

  return do_io_file_wrapper_cleanup
    (file, rv,
     N_("Can't write to file '%s'"),
     N_("Can't write to stream"),
     pool);
}


This turned out to be a wrapper around apr_file_write_full.
The documentation of apr_file_write_full reads:

------
apr_status_t apr_file_write_full( apr_file_t * thefile,
                                  const void * buf,
                                  apr_size_t nbytes,
                                  apr_size_t * bytes_written
     ) 
     
     Write data to the specified file, ensuring that all of the data is written before returning.

     Parameters:
     thefile The file descriptor to write to.
     buf The buffer which contains the data.
     nbytes The number of bytes to write.
     bytes_written If non-NULL, this will contain the number of bytes written.
------

so presumably it just writes out the bytes of buf.


So, what was svn_wc__entries_write, in the end?

  * svn_wc__entries_write writes the already-built entries
    (hash) out to the entries file, and along the way the
    names and other strings stored in entries (hash) are
    left unprocessed.


That was another long road, but here is what we now know.
When svn add adds a file to the working directory, the
file name written to the .svn/entries file follows the
command line. Given

$ svn add が.txt

if "が.txt" is NFD it is written as NFD, and if it is NFC
it is written as NFC.

Looking at the entries file after actually running
svn add が.txt on OS X:

$ od -t x1 -t c entries
... (beginning omitted) ...
0000360    36  34  31  61  63  61  64  66  33  0a  0c  0a  e3  81  8c  2e
           6   4   1   a   c   a   d   f   3  \n  \f  \n  が  **  **   .
0000400    74  78  74  0a  66  69  6c  65  0a  0a  0a  0a  61  64  64  0a
           t   x   t  \n   f   i   l   e  \n  \n  \n  \n   a   d   d  \n
0000420    0c  0a                                                        
          \f  \n                                                        
0000422
$ 

As shown, the name is in NFC (e3 81 8c). Does that mean
command-line arguments are handled as NFC on OS X?

Let us check with GDB.

int main(int argc, char *argv [])
{
     return (0);
}

Compiling this, running it with "が.txt" as the argument,
and inspecting it in GDB gives

(gdb) x/3cx argv[1]
0xbffff3c5: 0xe3 0x81 0x8c

It is indeed NFC. This could be something GDB itself is
doing, but I will let it pass. (I am worn out...)

This may be one of the causes of the NFD/NFC problem on
OS X.
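
A way to take GDB out of the picture would be to have the program dump the bytes itself. The sketch below is only that: a hypothetical helper, hex_dump, that renders a string as space-separated hex bytes. Calling it on argv[1] from main and printing the result would show directly what the process received.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of S into OUT as lowercase hex pairs
   separated by single spaces, e.g. "e3 81 8c".  OUT needs 3 bytes per input
   byte (1 byte for an empty string).  Returns 0 on success, -1 if OUT is
   too small. */
static int
hex_dump(const char *s, char *out, size_t outlen)
{
  size_t len = strlen(s);
  size_t need = len ? len * 3 : 1;
  size_t i;

  if (outlen < need)
    return -1;

  out[0] = '\0';
  for (i = 0; i < len; i++)
    {
      unsigned int byte = (unsigned char) s[i];

      if (i + 1 < len)
        snprintf(out + i * 3, 4, "%02x ", byte);   /* pair + separator */
      else
        snprintf(out + i * 3, 3, "%02x", byte);    /* last pair, no separator */
    }
  return 0;
}
```

An NFC "が.txt" comes out as "e3 81 8c 2e 74 78 74", while an NFD one would start with "e3 81 8b e3 82 99".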


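When a directory is read with readdir, file names come
back in NFD. When the same file name is given to svn add,
the argument is passed in NFC.
The picture is getting a good deal clearer.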
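  * Subversion's internal encoding is UTF-8. However, it
    never tracks whether a string is NFD or NFC; it
    handles each as it comes.

  * When Subversion does convert encodings, it converts
    paths received from outside to UTF-8, on the
    assumption that APR's internal encoding equals the
    environment's encoding.

  * The conversion engine is iconv.

  * On OS X, arguments passed from the command line to an
    application appear to be NFC. As noted above, readdir
    returns file names in NFD. I suspect this is where
    the mismatch arises.

  * I am also curious whether stat(2) expects NFD or NFC.

  * The patch that makes Subversion work properly on OS X
    makes svn_path_cstring_to_utf8 always go through a
    Core Foundation function that converts to NFC.

  * If everything is forced to NFC there, then presumably
    strings inside Subversion are guaranteed to be NFC at
    all times.

  * In that case, files on the file system, whether the
    files themselves or those under text-base/, are NFD,
    while seen through Subversion's eyes everything,
    including the contents of entries, presumably looks
    unified to NFC.

  * A different situation for svn without the CF patch:
    what happens if, say, an NFC "が.txt" is checked in
    on a Linux machine and then checked out on OS X?

  * Probably the OS X file-creation API automatically
    converts a file name supplied in NFC to NFD.

  * And the contents of entries are probably NFC.

  * That is where the mismatch arises, and out comes an
    unusable working directory.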
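
The stat(2) question in the list above could be settled by experiment rather than by reading more source. I have not run this on OS X; it is only a sketch of the test I have in mind. The names path_exists, NFC_NAME and NFD_NAME are my own: create a file under one spelling, then ask stat about both spellings.

```c
#include <sys/stat.h>

/* "ga.txt" with a precomposed U+304C (NFC). */
#define NFC_NAME "\xe3\x81\x8c.txt"
/* "ga.txt" as U+304B followed by combining U+3099 (NFD). */
#define NFD_NAME "\xe3\x81\x8b\xe3\x82\x99.txt"

/* Hypothetical helper: return 1 if stat(2) succeeds for PATH, 0 otherwise.
   The bytes of PATH are handed to the OS exactly as given. */
static int
path_exists(const char *path)
{
  struct stat st;

  return stat(path, &st) == 0 ? 1 : 0;
}
```

After creating a file named NFC_NAME in an empty directory, path_exists(NFC_NAME) and path_exists(NFD_NAME) can be compared. If both return 1, the file system treats the two spellings as the same name; if only the spelling used at creation returns 1, it compares names byte for byte.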
Now, how much further should I dig from here... Maybe this is enough...
In any case, I am completely drained... Will I even make it to Shibuya.lisp on Saturday...
Friday, February 27, 2009
