All the common prefixes - sileのブログ

『Pearls of Functional Algorithm Design』の15章「All the common prefixes」に目が止まったので、その要件を満たすものをcommon lispで実装してみたメモ(かなり雑)。

以下、15章の冒頭から引用した「All the common prefixes」の概要。

Let llcp xs ys denote the length of the longest common prefix of two lists
xs and ys. For example llcp “common” “computing” = 3. Now consider the
function allcp, short for all the common prefixes, defined by
　allcp xs = map (llcp xs) (tails xs)
where tails xs returns the nonempty tails of xs. For example:
　　
xs a b a c a b a c a b
allcp xs 10 0 1 0 6 0 1 0 2 0

The first element of allcp xs is, of course, length xs. Executed directly, the
definition of allcp gives a quadratic-time algorithm. But can it be done in
linear time? Yes it can, and the aim of this pearl is to show how. The function
allcp is an important component of the Boyer–Moore algorithm for string
matching, a problem we will take up in the following pearl, so a linear-time
solution is of practical as well as theoretical interest.

まず性能は良くない(quadratic-time)けれど簡単な実装。

(defun allcp~ (s &aux (len (length s)))
  (loop FOR c ACROSS s
        FOR i FROM 0
    COLLECT (or (mismatch s s :start2 i) len) INTO rlt
    FINALLY (return (coerce rlt 'vector))))

次にlinear-timeな実装*1。
今回の実装は(おそらく)3Nの時間が掛かる*2けど、少し修正すれば2Nにすることが可能だと思う。※ N = 入力文字列の長さ

(defun allcp (s &aux (len (length s)))
  ;; 結果の配列の用意と最初の要素のセットを行う
  (let ((rlt (make-array len)))
    (when (plusp len)
      (setf (aref rlt 0) len)  ; 最初の要素は常に(length s)
      (allcp-impl s rlt len))
    rlt))

;; 各共通接頭辞の算出を実際に行う関数
(defun allcp-impl (s rlt len)
  (flet ((char-compare (x y)
           (and (< x len) (< y len) (char= (char s x) (char s y)))))
    (nlet recur ((i 1) (offset 0) (overlap-i 0) (overlap-count 0))
      (when (< i len)
        (cond ((< overlap-i overlap-count)
               (if (>= (aref rlt overlap-i) offset)
                   (recur i offset 0 0)  ; 完全に重複(包含)されていないので、この位置からカウントを再開する
                 (progn
                   (setf (aref rlt i) (aref rlt overlap-i))  ; 既に格納済みの値を利用
                   (recur (1+ i) (1- offset) (1+ overlap-i) overlap-count))))
              ((char-compare offset (+ i offset))
               (recur i (1+ offset) 0 0))  ; 共通接頭辞の長さを引き続きカウント 
              (t 
               (setf (aref rlt i) offset)  ; 共通接頭辞の長さを格納
               (if (< offset 2)
                   (recur (1+ i) 0 0 0)  ; 開始位置を進めて、新たにカウント
                 (recur (1+ i) (1- offset) 1 offset)))))))) ; 重複部分は、既に算出した値を再利用するようにする

実行例。

(allcp~ "abacabacab")
--> #(10 0 1 0 6 0 1 0 2 0)

(allcp "abacabacab")
--> #(10 0 1 0 6 0 1 0 2 0)

後者の実装は、基本的には以下の繰り返し。

位置iからoffsetを一ずつ増やして、共通接頭辞の数をカウントする。
s[offset] != s[i+offset]の位置に達したら、今度はiを、i+offsetに達するまで一つずつ増やす。
- rlt[i]〜rlt[i+offset]の値には、rlt[0]〜rlt[offset]の値を再利用することが(大抵の場合は)可能なので、走査しつつrltを埋めていく

各recur呼び出しは必ずoffsetかiのどちらかをインクリメントしていて(※例外あり。後述)、再帰呼び出しはi==(length s)になるまで続くので、処理ステップは大体2N。
ただし、一ヶ所offsetとiのどちらも変化させない関数呼び出し*3が含まれている、実際には3Nとなる。

この実装が適切なのかどうかは分からないけど、とりあえずこれが今日の成果。

*1:バグがあるかも・・・。

*2:ちゃんと調べて訳ではないので本当に3Nかどうかは不確か。

*3:allcp-impl関数の8行目