UTF-8: バイト列→文字列変換(C++&FFI版)

昨日作成したutf8-octets-to-string関数をclispでも試してみた。

;;;; データ準備
;; ファイルをバイト列として読み込む
(defun read-octets-file (path)
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let ((as (make-array (file-length in) :element-type '(unsigned-byte 8))))
      (read-sequence as in)
      as)))

;; 読込み   ※ 読み込むファイルのエンコードはUTF-8とする 
(defvar *kokoro* (read-octets-file "/path/to/kokoro"))

;;;; 計時
;; utf8-octets-to-string関数: 定義は昨日の記事を参照
> (time (dotimes (i 10) (utf8-octets-to-string *kokoro*)))
Real time: 2.829648 sec.        ; sbclより30倍程度遅い
Run time: 2.824177 sec.
Space: 228623504 Bytes
GC: 200, GC time: 0.456027 sec.

;; clispの組み込みの変換関数を使ってみる
> (time (dotimes (i 10) (ext:convert-string-from-bytes *kokoro* charset:utf-8)))
Real time: 0.03013 sec.         ; こっちはすごく速い: sb-ext:octest-to-stringの約6倍、utf8-octets-to-string(sbcl)の約3倍
Run time: 0.028002 sec.
Space: 7488624 Bytes
GC: 10, GC time: 0.012001 sec.

utf8-octets-to-string関数がclisp上で遅いことは予想通りなのだが、意外なことに、ついでに試したclispのext:convert-string-from-bytes関数はsbclのsb-ext:octets-to-string関数よりもだいぶ高速だった。
これはおそらくclispの変換関数がC言語で書かれているためだと思うが、この速度差は少し悔しいので、sbclでもC++の変換関数を用意して対抗してみた。

C++ & FFI版

以下がそのソースコード。

まずはC++の共有ライブラリを作成:
※ sbcl(1.0.28)の内部表現に大幅に依存している

// ファイル名: utf8_decode.cc
// コンパイル: g++ -fPIC -O2 -c utf8_decode.cc
// .so作成:    g++ -shared -Wl,-soname,libutf8_decode.so -o libutf8_decode.so utf8_decode.o

extern "C" {
#define BYTE(pos,mask,shift) ((s[pos]&mask)<<shift)
  unsigned utf8_to_ucs(const unsigned char*& s) {
    if     (!(*s&0x80)) return *s++;                                                                  // 0xxxxxxx
    else if(!(*s&0x40)) return s++,  0;      // 予想外の値: 0を返す                                   // 10xxxxxx
    else if(!(*s&0x20)) return s+=2, BYTE(-2,0x1F,6)|BYTE(-1,0x3F,0);                                 // 110xxxxx
    else if(!(*s&0x10)) return s+=3, BYTE(-3,0xF,12)|BYTE(-2,0x3F,6)| BYTE(-1,0x3F,0);                // 1110xxxx
    else if(!(*s&0x08)) return s+=4, BYTE(-4,0xF,18)|BYTE(-3,0x3F,12)|BYTE(-2,0x3F,6)|BYTE(-1,0x3F,0);// 11110xxx
    for(s++; *s&0x80; s++); // 範囲外の値: 0を返す
    return 0;
  }
#undef BYTE
  
  // src:  変換元のバイト列(simple-array (unsigned-byte 8))へのポインタ
  // dist: 変換後の文字列を格納するバッファ(simple-array character)へのポインタ
  unsigned utf8_octets_to_sbcl_string(const unsigned char* src, unsigned char *dist) {
    const unsigned size = *reinterpret_cast<const unsigned*>(src-4)>>2; // バイト列のサイズ
    const unsigned char* end = src+size;                                // バイト列の末尾

    unsigned *cur = reinterpret_cast<unsigned *>(dist); // 変換後の文字列を格納するバッファの開始位置: バッファ自体はlisp(sbcl)側で十分な量を確保済
    for(; src < end; cur++)
      *cur = utf8_to_ucs(src);
    return cur-reinterpret_cast<unsigned *>(dist);      // 変換後の文字列のサイズを返す
  }
}

次にそれを呼び出すlisp側のコード:

;; 共有ライブラリロード
(sb-alien:load-shared-object "libutf8_decode.so") 

;; 変換関数定義
(sb-alien:define-alien-routine 
  utf8_octets_to_sbcl_string 
  unsigned (src-octets int) (dist-string int)) ; XXX: ポインタ型をintで代用している (sizeof(void*)==sizeof(int)が前提)

(defun c-utf8-octets-to-string (octets)
  (declare (optimize (speed 3) (debug 0) (safety 0) (compilation-speed 0)) 
           (vector octets))
  (let* ((buf (make-string (length octets)))                ; 文字列バッファは多めに確保
         (len (utf8_octets_to_sbcl_string  ; 変換関数呼び出し
               (1+ (sb-kernel:get-lisp-obj-address octets)) ; バイト列のポインタ(int値)を渡す
               (1+ (sb-kernel:get-lisp-obj-address buf))))) ; バッファのポインタ(int値)を渡す: C++の関数内で破壊的に修正される
    (subseq buf 0 len)))  ; 必要な部分だけを取り出す

#|
;; 下の方法でも良い
;; 　これだとintではなくポインタ型を渡せるし、
;;　　sb-alien:sap-alienを通してポインタを取得しているので、GC的にも安全そう
;;    ただし、sb-alien:sap-alien関数(のアロケート処理?)は結構重く、
;;    呼び出し回数が多い場合は少なからず負担となるようなので、今回は採用しない
;; 補足: (sb-sys:vector-sap octets)は、(sb-sys:int-sap (1+ (sb-kernel:get-lisp-obj-address octets))と等価
(sb-alien:define-alien-routine utf8_octets_to_sbcl_string unsigned (src-octets (* t)) (dist-string (* t)))
(defun c-utf8-octets-to-string (octets)
  (declare (optimize (speed 3) (debug 0) (safety 0) (compilation-speed 0)) 
           (vector octets)
           (unmuffle-conditions compiler-note))
  (let* ((buf (make-string (length octets)))
         (len (utf8_octets_to_sbcl_string
               (sb-alien:sap-alien (sb-sys:vector-sap octets) (* t))    
               (sb-alien:sap-alien (sb-sys:vector-sap buf)    (* t))))) 
    (subseq buf 0 len)))
|#

計時:

> (defvar *kokoro* (read-octets-file "/path/to/kokoro"))

> (time (dotimes (i 10) (c-utf8-octets-to-string *kokoro*)))
Evaluation took:
  0.025 seconds of real time   ; clispの変換関数と同程度か少し速いくらい
  0.024002 seconds of total run time (0.016001 user, 0.008001 system)
  96.00% CPU
  80,011,099 processor cycles
  29,676,960 bytes consed

速度的には満足できるものが完成。
これくらいの時間で変換が行えるなら、sbclで大量の文字列処理を行う際の抵抗がだいぶ減る。
いろいろ危険そうなので、上記コードをそのまま簡単に使うわけにはいかないけど...。

;;;; おまけ
;; 多数の小さな文字列を対象とした場合の(sbclでの)処理速度計時
;; データには、IPADICの単語を使用  ※ IPADICはMeCabのサイト(http://mecab.sourceforge.net/)より入手可能
> (defvar *ipadic-words* ...)  ; 単語(バイト列)のリストを用意する: 手順は省略
> (length *ipadic-words*)
--> 392126  ; 約40万単語

> (float (/ (reduce #'+ (mapcar #'length *ipadic-words*)) (length *ipadic-words*)))
--> 10.551565  ; 一単語は平均11バイト

;;;;;;;;;;
;; sb-ext:octets-to-string
> (time 
   (dolist (w *ipadic-words*)
     (sb-ext:octets-to-string w)))
Evaluation took:
  1.268 seconds of real time  ; 1.27秒
  1.268080 seconds of total run time (1.268080 user, 0.000000 system)
  [ Run times consist of 0.016 seconds GC time, and 1.253 seconds non-GC time. ]
  100.00% CPU
  4,021,897,067 processor cycles
  80,566,208 bytes consed

;; utf8-octets-to-string
> (time
   (dolist (a *ipadic-words*)
     (utf8-octets-to-string a)))
Evaluation took:
  0.450 seconds of real time   ; 0.45秒
  0.452029 seconds of total run time (0.452029 user, 0.000000 system)
  [ Run times consist of 0.008 seconds GC time, and 0.445 seconds non-GC time. ]
  100.44% CPU
  1,429,158,644 processor cycles
  58,352,632 bytes consed

;; c-utf8-octets-to-string
> (time
   (dolist (a *ipadic-words*)
     (c-utf8-octets-to-string a)))
Evaluation took:
  0.052 seconds of real time    ; 0.05秒
  0.052003 seconds of total run time (0.052003 user, 0.000000 system)
  [ Run times consist of 0.012 seconds GC time, and 0.041 seconds non-GC time. ]
  100.00% CPU
  165,189,771 processor cycles
  33,256,928 bytes consed