UTF-8: バイト列→文字列変換

前々回に作成したURLデコード用の関数では、sb-ext:octets-to-string関数が処理のボトルネックとなっていた。
確かsbcl(1.0.28)はバイト列から文字列への変換には、UTF-8でもShift-JISでもEUC-JP(及びその他)でも出来るような汎用的な方法(枠組み?)*1を採用していたはずだが、(sbclでは)文字は内部的にはユニコード値として表現されているので、それを利用すれば(UTF-8に限れば)もっと効率的に変換できるはずだと思う。
今回はそれを試してみた。

以下がUTF-8バイト列をユニコード文字列に変換する関数。
入力のバイト列はsimple-arrayだと云うことが前提で、若干エラーチェックが不足している。
※ この関数はsbcl用に作られたものだが、文字の表現としてユニコードを採用している処理系なら一応動作するはず

;;;; 型定義および宣言
(deftype octet () '(unsigned-byte 8))  
(deftype simple-octets () '(simple-array octet))

(declaim (optimize (speed 3) (debug 0) (safety 0) (compilation-speed 0))
         (inline bit-off? bit-val utf8-octets-to-string))

;;;; 補助関数定義
;; 下からpos番目のbitが0ならtrue
(defun bit-off? (pos digit)
  (not (ldb-test (byte 1 pos) digit)))

;; 下からlength番目までの値を取り出し、必要ならシフトさせる
(defun bit-val (length digit &optional (shift 0))
  (ash (ldb (byte length 0) digit) shift))

;;;; 変換関数
;; posから始まるUTF8バイト列(octets)を、ユニコード一文字に変換してstringに追加する
;; ※ バイト列はsimple-octetsである必要がある
(defun utf8-to-ucs(octets pos string &aux (os octets))
  (declare (simple-octets os) (fixnum pos))
  (let* ((octet (aref os pos))
         (code
          (cond ((bit-off? 7 octet) ; #b0xxxxxxx
                 octet)
                ((bit-off? 6 octet) ; #b10xxxxxx
                 (error "Illegal UTF-8 character starting at byte position ~D." pos))
                ((bit-off? 5 octet) ; #b110xxxxx #b10xxxxxx
                 (+ (bit-val 5 octet 6)  
                    (bit-val 6 (aref os (incf pos)))))  ; XXX: 後続の値が適切かどうかのチェックは行っていない(以下同)
                ((bit-off? 4 octet) ; #b1110xxxx #b10xxxxxx #b10xxxxxx
                 (+ (bit-val 4 octet 12) 
                    (bit-val 6 (aref os (incf pos)) 6)
                    (bit-val 6 (aref os (incf pos)))))
                ((bit-off? 3 octet) ; #b11110xxx #b10xxxxxx #b10xxxxxx #b10xxxxxx 
                 (+ (bit-val 3 octet 18) 
                    (bit-val 6 (aref os (incf pos)) 12)
                    (bit-val 6 (aref os (incf pos)) 6)
                    (bit-val 6 (aref os (incf pos)))))
                ;; CHAR-CODE-LIMITの値が21bit以下であることが前提のコード
                (t (error "CHAR-CODE-LIMIT(~D) exceeded (position ~D)." char-code-limit pos)))))
    (vector-push (code-char code) string)
    (the fixnum (1+ pos))))  ; 次の文字が始まる位置を返す

;; UTF8バイト列をユニコード文字列に変換する  
;; フィルポインタ付きのstringを返す
(defun utf8-octets-to-fill-pointer-string (octets &aux (len (length octets)))
  (declare  (simple-octets octets))
  (let ((as (make-array len :element-type 'character :fill-pointer 0)))  ; バッファを多めに確保しておく
    (do ((i 0 (utf8-to-ucs octets i as))) 
        ((>= i len) as)       
      (declare (fixnum i)))))

;; utf8-octets-to-fill-pointer-stringの戻り値をsimple-stringに変換する
(defun utf8-octets-to-string (octets)
  (coerce (utf8-octets-to-fill-pointer-string octets) 'simple-string))

sb-ext:octets-to-string関数との比較結果。
予想通り、今回作成した関数の方が速かった(おおよそ二倍*2 )。

;;;;;;;;;;;;;;;
;;;; データ用意
;; ファイル読込み関数
(defun read-file (path)
  (sb-ext:octets-to-string
    (with-open-file (in path :element-type '(unsigned-byte 8))
      (let ((as (make-array (file-length in) :element-type '(unsigned-byte 8))))
	        (read-sequence as in)
	as))))

;; 夏目漱石の『心』のUTF-8バイト列
(defvar *kokoro* (sb-ext:string-to-octets (read-file "/path/to/kokoro") :external-format :utf8))

;;;;;;;;;
;;;; 計時
;; sb-ext:octets-to-string関数
> (time (dotimes (i 10) (sb-ext:octets-to-string *kokoro* :external-format :utf8)))
Evaluation took:
  0.181 seconds of real time
  0.164010 seconds of total run time (0.160010 user, 0.004000 system)
  90.61% CPU
  572,873,551 processor cycles
  28,440,864 bytes consed

;; utf8-octets-to-string関数
> (time (dotimes (i 10) (utf8-octets-to-string *kokoro*)))
Evaluation took:
  0.095 seconds of real time
  0.092007 seconds of total run time (0.084006 user, 0.008001 system)
  96.84% CPU
  303,277,449 processor cycles
  29,676,960 bytes consed

;; ついでにutf8-octets-to-fill-pointer-string関数も
> (time (dotimes (i 10) (utf8-octets-to-fill-pointer-string *kokoro*)))
Evaluation took:
  0.059 seconds of real time   ; (coerce ... 'simple-string)が不要な分高速
  0.056003 seconds of total run time (0.052003 user, 0.004000 system)
  [ Run times consist of 0.004 seconds GC time, and 0.053 seconds non-GC time. ]
  94.92% CPU
  187,361,375 processor cycles
  22,189,040 bytes consed

追記

冒頭で「sbclは汎用的な方法を採用しているから...」といったことを書いたが、今回作成した関数も、utf8-to-ucs関数の部分をeuc-jp-to-ucsとかshift-jis-to-ucsとかに置き換えてあげれば、他のエンコード用に使えるはず(ある意味同程度に汎用的)なので、そのことはあまり関係ないかもしれない。

*1:変換に必要ないくつかの関数があって、特定のエンコード用のそれらの関数を定義/用意してあげれば、どんなエンコード方式にも対応できる、とかそういった感じの方法

*2:例によって適当な比較しかしていないので、あくまでも一つの目安にしかならないが