ハフマン符号化(DoubleArray用)

DoubleArrayでの実験用にハフマン符号化を実装。 ※ここの続きでもある

ハフマン符号化は、本で良く目にしてはいたのだが、これまで実装したことはなかった。
実際にやってみると案外簡単で、大体以下の四ステップで実装できた。

1. 入力文字列中の各文字の出現頻度を数える。
2. 出現頻度が高い文字ほどルートに近くなるような二分木(trie)を作成する。
3. 上の木をもとに、各文字からビット列への対応テーブル(マッピング)を作成する。
4. 後は、そのテーブルに従って文字列をエンコード。

具体的な実装方法や詳細な説明は、いつもの通り、他の文献やサイトに任せるとして、以下にソースコードを載せる。
依存関数など: nlet、paring heap

準備

;;; 一応packageを作成
(defpackage :huffman
  (:use :cl)
  (:export :count-freq
           :make-huffman-tree
           :make-encode-table
           :*encode-table*
           :string-to-octets))
(in-package :huffman)

;;; 以前実装したheapを使う
(rename-package :pairing-heap :pairing-heap '(:heap))

ハフマン木作成・その他

;; ハフマン木のノード用の関数群
(defun make-node (frequency element) (cons frequency element))  
(defun frequency (node)              (car node)) 
(defun element   (node)              (cdr node)) ; element = character or list(sub-tree)
(defun freq<     (node1 node2)       (< (frequency node1) (frequency node2)))
(defun freq+     (node1 node2)       (+ (frequency node1) (frequency node2)))

;; 文字の出現頻度を数える
(defun count-freq (str &optional (map (make-hash-table)))
  (loop FOR c ACROSS str DO
    (incf (gethash c map 0)))
  map)

;; ハフマン木を作成
(defun make-huffman-tree (freq-map)
  (let ((hp (heap:make-heap :test #'freq<)))
    ;; 1] 初期heap作成
    (maphash (lambda (char freq)
               (heap:push (make-node freq char) hp))
             freq-map)

    ;; 2] heapが空になるまで、「頻度最小の二つのnodeを取り出し、併合して、heapに戻す」という操作を繰り返す
    (do ((node #1=(heap:pop hp) #1#))
        ((heap:empty? hp) node)    ; heapに最後に残ったnodeがハフマン木
      (let ((node2 (heap:pop hp)))
        (heap:push (make-node (freq+ node node2) (cons node node2)) hp)))))

;; 文字とbit配列の対応を保持するテーブルを作成
;; 典型的な(深さ優先の)木探索
(defun make-encode-table (tree)
  (let ((table (make-array char-code-limit :initial-element nil)))
    (nlet self ((tr tree) (bits '()) current-bit)
      (when tr
        (if (atom #1=(element tr))
            (setf (aref table (char-code #1#)) 
                  (coerce (reverse (cons current-bit bits)) 'bit-vector))
          (destructuring-bind (left . right) #1#
            (self left  (cons 0 bits) 0)
            (self right (cons 1 bits) 1)))))
    table))

文字列エンコード

;; 型定義とスペシャル変数
(deftype octet () '(unsigned-byte 8))
(defvar *buffer* (make-array 128 :element-type 'octet :adjustable t :fill-pointer 0))
(defvar *encode-table*)

;; (zerof a b c) -> (setf a 0 b 0 c 0)
(defmacro zerof (&rest args)
  `(setf ,@(mapcan (lambda (a) (list a 0)) args)))

;; 文字列エンコード関数
;;  入力文字列の各文字に対応するbit表現をエンコードテーブルから取得し、つなぎ、(vector octet)として返す
;;  文字列の終端を判別するために、エンコード後の各バイトの1bit目は、識別フラグとして利用(1なら終端)
;;  入力文字列中に、エンコードテーブルに含まれない文字があった場合は、nilを返す
(defun string-to-octets (str &optional (table *encode-table*))
  (let ((octets *buffer*)
        (n 0) (i 0))
    (zerof (fill-pointer octets))
    (loop FOR ch ACROSS str 
          FOR bits = (aref table (char-code ch)) DO
          (when (null bits)
            (return-from string-to-octets nil))

          (loop FOR bit ACROSS bits DO
            (setf n (+ (ash n 1) bit))
            (when (= (incf i) 7)
              (vector-push-extend n octets)
              (zerof n i))))
    (vector-push-extend (logior #b10000000 n) octets)  ; ※ 入力によっては1byte無駄に追加することになるが、このバージョンではこのままにしておく。
    (copy-seq octets)))

例

;; テキストファイルをエンコード(圧縮)して保存
;; ※ デコードは出来ない 
(defun encode-file (from to)
  (let ((*encode-table* (make-encode-table 
                         (make-huffman-tree 
                          (count-freq (read-file from))))))
  (with-open-file (in from)
    (with-open-file (out to :direction :output
                            :if-exists :supersede
                            :element-type 'octet)
      (a.while (read-line in nil nil)
        (write-sequence (string-to-octets it) out))))))

;;; 補助関数
(defun read-file (path)
  (with-output-to-string (out)
    (with-open-file (in path)
      (let ((io (make-echo-stream in out)))
        (unwind-protect
            (loop while (read-line io nil nil))
          (close io))))))

(defmacro a.while (expr &body body)
  `(loop for it = ,expr 
         while it
         do ,@body))

一応、完成。

これをベースに試してみて、もしDoubleArrayにエンコード(圧縮)方法の改良が有効だったら、圧縮(エンコード)アルゴリズムの再検討も含めて、しっかり作り直そう。

とりあえず、今日はDoubleArrayの初期バージョンに上のhuffman:string-to-octetsでエンコードした文字列を挿入した結果を載せておく。
諸々の条件などは、ここのものと同様。

(in-package :common-lisp-user)

;; octets作成関数を再定義
(defun eos-terminated-octets (octs) octs)  ; 末尾にEOSをつける必要はない

(defun make-octets (#1=string-or-octets)
  (if (stringp #1#) (huffman:string-to-octets #1#) #1#))


;;; 実行
(time
 (let ((freq-map (make-hash-table)))
   (with-open-file (in #1="/path/to/words")
     (a.while (read-line in nil nil)
       (huffman:count-freq it freq-map)))
   (let ((huffman:*encode-table* 
          (huffman:make-encode-table 
           (huffman:make-huffman-tree freq-map)))
         (da (make-da)))

     (with-open-file (in #1#)
       (a.while (read-line in nil nil)
         (insert it da)))
     da)))

;;; 結果
Evaluation took:
  474.757 seconds of real time
  473.661602 seconds of total run time (468.461277 user, 5.200325 system)
  [ Run times consist of 39.933 seconds GC time, and 433.729 seconds non-GC time. ]
  99.77% CPU
  1,506,112,788,879 processor cycles
  124,178,138,832 bytes consed
#<DOUBLE-ARRAY base:142464 check:142464 tail:330867>

圧縮用のアルゴリズムなのだから当たり前ではあるが、euc-jpなどのエンコード方式に比べても、性能は良くなっている。

また、*encode-table*の作成に掛かる時間は、0.1秒にも満たなかったので、ハフマン符号化のための(速度的な)コストはそれほど気にしないでもよさそうだ。

ただ、もともとのDoubleArrayの実装がEOS(=1)終端の文字列(octet列)を想定していたので、今回のようなエンコーディング方法に直接適用した場合、挿入が上手くいかないケース*1もいくつかあった(86/50000 -> 約0.2%)。
それは修正する必要があるが、今日やるのは面倒なので、また今度。

*1:正確には、挿入したはずなのに、検索したらnilが返ってくるケース。多分一旦は挿入されたものが、後続の処理のどこかで上書きされてしまっている。