ham: Character N-grams and Byte N-grams

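ham (0.0.2).
An N-gram-based Bayesian filter.
I had meant to implement only what I needed and be done with it quickly, but a few things I wanted to try came up, so I'll keep going a little longer.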

Character encodings other than UTF-8

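In 0.0.1, ham supported only UTF-8 and used the character N-grams (N- to M-grams)*1 of the text as features.
In 0.0.2, I tried extending it so that it can also handle byte N-grams*2.
The nice thing about this approach is that there is no need to add separate support for each character encoding.
Below are the results of comparing byte N-grams with character N-grams.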
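
Before the comparison, a rough illustration of the two feature types may help. The sketch below is not taken from hamt.cc; the function names `char_ngrams` and `byte_ngrams` are made up for this example, and details such as ordering and de-duplication are assumptions. It only shows that the same text produces different feature sets depending on whether it is sliced by character or by byte.

```python
def char_ngrams(text: str, n: int, m: int) -> list[str]:
    """Return every character k-gram of `text` with n <= k <= m, by start position."""
    return [text[i:i + k]
            for i in range(len(text))
            for k in range(n, m + 1)
            if i + k <= len(text)]


def byte_ngrams(data: bytes, n: int, m: int) -> list[bytes]:
    """Return every byte k-gram of `data` with n <= k <= m, by start position."""
    return [data[i:i + k]
            for i in range(len(data))
            for k in range(n, m + 1)
            if i + k <= len(data)]


text = "日本語"
print(char_ngrams(text, 2, 3))  # ['日本', '日本語', '本語']

# The same text as Shift_JIS bytes (2 bytes per character here).
# Byte N-grams may start or end in the middle of a character.
for gram in byte_ngrams(text.encode("shift_jis"), 2, 3):
    print(gram.hex())
```

Because the byte version never decodes its input, it works unchanged on any encoding, which is the convenience described above.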

Comparison

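A comparison of byte N-grams and character N-grams.
First, prepare the data for the comparison.
Note: for the data retrieval scripts (download.rb, download_yahoo_blog.rb, half_cp.rb) and their details, see the previous post.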

#######################
## Aozora Bunko data ##
# Create the root directory
$ mkdir book
$ cd book

# Download the data
$ ruby -Ku download.rb http://www.aozora.gr.jp/index_pages/person1154.html kishida_kunio
$ ruby -Ku download.rb http://www.aozora.gr.jp/index_pages/person183.html makino_shinniti
$ ruby -Ku download.rb http://www.aozora.gr.jp/index_pages/person76.html okamoto_kanoko
$ ruby -Ku download.rb http://www.aozora.gr.jp/index_pages/person154.html tanaka_koutarou

# Create directories for character N-grams
$ mkdir -p {learn,test}_data/{ham,spam}

# Create the training/evaluation data
$ ruby half_cp.rb kishida_kunio learn_data/ham 0    # ham
$ ruby half_cp.rb okamoto_kanoko learn_data/spam 0  # spam
$ ruby half_cp.rb makino_shinniti learn_data/spam 0 # spam
$ ruby half_cp.rb tanaka_koutarou learn_data/spam 0 # spam

$ ruby half_cp.rb kishida_kunio test_data/ham 1    # ham
$ ruby half_cp.rb okamoto_kanoko test_data/spam 1  # spam
$ ruby half_cp.rb makino_shinniti test_data/spam 1 # spam
$ ruby half_cp.rb tanaka_koutarou test_data/spam 1 # spam

# Create directories for byte N-grams
# Note: Shift_JIS is used as the character encoding for the byte N-gram data
$ mkdir -p sjis_{learn,test}_data/{ham,spam}

# Convert the character encoding
$ for f in {learn,test}_data/*/*; do nkf -s "$f" > "sjis_${f}"; done

#
$ ls *_data/
learn_data/:
ham  spam

test_data/:
ham  spam

sjis_learn_data/:
ham  spam

sjis_test_data/:
ham  spam

$ cd ..

#######################
## Yahoo! Blog data  ##
# Create the root directory
$ mkdir blog
$ cd blog

# Download the data
$ mkdir keizai bunka
$ ruby -Ku download_yahoo_blog.rb ビジネスと経済 keizai  # "ビジネスと経済" (Business and Economy) category
$ ruby -Ku download_yahoo_blog.rb 生活と文化 bunka       # "生活と文化" (Life and Culture) category

# Create directories for character N-grams
$ mkdir -p {learn,test}_data/{ham,spam}

# Create the training/evaluation data
$ ruby half_cp.rb keizai learn_data/ham 0
$ ruby half_cp.rb bunka learn_data/spam 0
$ ruby half_cp.rb keizai test_data/ham 1
$ ruby half_cp.rb bunka test_data/spam 1

# Create directories for byte N-grams
$ mkdir -p sjis_{learn,test}_data/{ham,spam}

# Convert the character encoding
$ for f in {learn,test}_data/*/*; do nkf -s "$f" > "sjis_${f}"; done

#
$ ls *_data/
learn_data/:
ham  spam

test_data/:
ham  spam

sjis_learn_data/:
ham  spam

sjis_test_data/:
ham  spam

$ cd ..
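
The nkf loop in the transcript above re-encodes every file as Shift_JIS while mirroring the directory layout. For readers without nkf, a rough Python equivalent might look like the following. This is only a sketch: `convert_tree` is a hypothetical helper, it assumes the source files are UTF-8 (nkf itself guesses the input encoding), and replacing unmappable characters with `?` is a choice made for this example, not a claim about how nkf behaves.

```python
from pathlib import Path


def convert_tree(src_root: Path, dst_root: Path,
                 src_enc: str = "utf-8", dst_enc: str = "shift_jis") -> int:
    """Re-encode every file under src_root into dst_root; return the file count."""
    converted = 0
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Work on raw bytes so that newlines are not translated.
        text = src.read_bytes().decode(src_enc)
        # Assumption: characters with no Shift_JIS mapping become '?'.
        dst.write_bytes(text.encode(dst_enc, errors="replace"))
        converted += 1
    return converted


if __name__ == "__main__":
    # Mirrors the shell loop: {learn,test}_data -> sjis_{learn,test}_data
    for name in ("learn_data", "test_data"):
        if Path(name).is_dir():
            print(name, convert_tree(Path(name), Path("sjis_" + name)))
```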

Training and classification.
Note: the way hamp computes precision changed slightly between 0.0.1 and 0.0.2, so the results differ from those in the previous post.

##############
## Training ##
# Aozora Bunko: character 2- to 5-grams
$ hamt book/learn_data 2 5 > book-char.txt
$ hamc --lower-frequency-limit=4 book-char.idx < book-char.txt

# Aozora Bunko: byte 2- to 10-grams
$ hamt --octet book/sjis_learn_data 2 10 > book-octet.txt
$ hamc --lower-frequency-limit=4 book-octet.idx < book-octet.txt

# Yahoo! Blog: character 2- to 5-grams
$ hamt blog/learn_data 2 5 > blog-char.txt
$ hamc --lower-frequency-limit=4 blog-char.idx < blog-char.txt

# Yahoo! Blog: byte 2- to 10-grams
$ hamt --octet blog/sjis_learn_data 2 10 > blog-octet.txt
$ hamc --lower-frequency-limit=4 blog-octet.idx < blog-octet.txt

# Feature definition file sizes
$ ls -lh *.txt
-rw-r--r-- 1 user user  61M 2010-07-29 00:56 blog-char.txt
-rw-r--r-- 1 user user 196M 2010-07-29 00:56 blog-octet.txt  # byte N- to M-grams cover a wider range (M-N), so the file is larger
-rw-r--r-- 1 user user 100M 2010-07-29 00:55 book-char.txt
-rw-r--r-- 1 user user 376M 2010-07-29 00:55 book-octet.txt  # same here: a bit more than three times the size

$ wc -l *.txt
  2052870 blog-char.txt  # ~2.05 million lines
  7696562 blog-octet.txt # ~7.70 million lines
  3308562 book-char.txt  # ~3.31 million lines
 10361063 book-octet.txt # ~10.36 million lines

# Index sizes
$ ls -lh *.idx
-rw-r--r-- 1 user user 2.0M 2010-07-29 00:50 blog-char.idx
-rw-r--r-- 1 user user 4.5M 2010-07-29 00:52 blog-octet.idx  # low-frequency features are cut off, so the gap to character N-grams shrinks (about 2x instead of 3x)
-rw-r--r-- 1 user user 3.0M 2010-07-29 00:39 book-char.idx
-rw-r--r-- 1 user user 1.4M 2010-07-29 00:40 book-octet.idx  # this one is even smaller than the character N-gram index

####################
## Classification ##
# Aozora Bunko
$ hamp book-char.idx book/test_data  # character N-grams
In 'book/test_data/ham', 204 files were classified as HAM (total 234 files)
In 'book/test_data/spam', 154 files were classified as SPAM (total 154 files)

precision: 1.0
recall   : 0.871794871794872
f-measure: 0.931506849315069  # 93.2%

$ hamp book-octet.idx book/sjis_test_data  # byte N-grams
In 'book/sjis_test_data/ham', 209 files were classified as HAM (total 234 files)
In 'book/sjis_test_data/spam', 154 files were classified as SPAM (total 154 files)

precision: 1.0
recall   : 0.893162393162393
f-measure: 0.943566591422122  # 94.4%

# Yahoo! Blog
$ hamp blog-char.idx blog/test_data  # character N-grams
In 'blog/test_data/ham', 1058 files were classified as HAM (total 1221 files)
In 'blog/test_data/spam', 962 files were classified as SPAM (total 1689 files)

precision: 0.592717086834734
recall   : 0.866502866502866
f-measure: 0.703925482368596  # 70.3%

$ hamp blog-octet.idx blog/sjis_test_data  # byte N-grams
In 'blog/sjis_test_data/ham', 1020 files were classified as HAM (total 1221 files)
In 'blog/sjis_test_data/spam', 1003 files were classified as SPAM (total 1689 files)

precision: 0.5978898007034
recall   : 0.835380835380835
f-measure: 0.696959344038265  # 69.7%
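
The precision, recall, and F-measure figures in the transcript can be reproduced from the printed file counts if HAM is treated as the positive class. The sketch below is not hamp's code; `prf` is a hypothetical helper, and the choice of positive class is inferred only from the fact that the numbers match.

```python
def prf(ham_as_ham: int, ham_total: int,
        spam_as_spam: int, spam_total: int) -> tuple[float, float, float]:
    """Precision, recall and F-measure with HAM as the positive class."""
    tp = ham_as_ham                 # ham files classified as HAM
    fn = ham_total - ham_as_ham     # ham files classified as SPAM
    fp = spam_total - spam_as_spam  # spam files classified as HAM
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts copied from the hamp output above.
runs = {
    "book / char":  (204, 234, 154, 154),
    "book / octet": (209, 234, 154, 154),
    "blog / char":  (1058, 1221, 962, 1689),
    "blog / octet": (1020, 1221, 1003, 1689),
}
for name, counts in runs.items():
    p, r, f = prf(*counts)
    print(f"{name}: precision={p:.4f} recall={r:.4f} f-measure={f:.4f}")
```

Written this way, it is easy to see that the low precision on the blog data comes from the spam files misclassified as HAM (727 and 686 files respectively).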

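In the results above, there is almost no difference in classification performance between byte N-grams and character N-grams (F-measure 94.4% vs. 93.2% on the Aozora Bunko data, 69.7% vs. 70.3% on the Yahoo! Blog data).
The larger feature definition files are a drawback of byte N-grams*3, but considering generality, it might be fine to standardize on byte N-grams*4.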
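
Although the byte N-gram feature definition files are large, the indexes stay small because hamc was run with --lower-frequency-limit=4. The idea of such a cutoff can be sketched as follows. This is an assumption-laden illustration: `prune_features` is a hypothetical name, and whether hamc keeps features whose count equals the limit, or counts per class, is not stated in this post.

```python
from collections import Counter
from typing import Hashable, Iterable


def prune_features(features: Iterable[Hashable],
                   lower_frequency_limit: int) -> dict[Hashable, int]:
    """Count features and keep those seen at least `lower_frequency_limit` times.

    Assumption: the limit is inclusive; the real tool may differ.
    """
    counts = Counter(features)
    return {feat: c for feat, c in counts.items() if c >= lower_frequency_limit}


observed = ["ab"] * 4 + ["cd"] * 3 + ["ef"]
print(prune_features(observed, 4))  # {'ab': 4}
```

Long byte N-grams are mostly rare, so a cutoff like this removes a large share of them, which would be consistent with the index sizes listed earlier.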

*1: For extracting UTF-8 character N- to M-grams, see also here.

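*2: The only file changed for this was 'hamt.cc', which handles extracting features from the input texts. Once features have been extracted, character N-grams and byte N-grams are handled in the same way.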

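*3: In addition, I think byte N- to M-grams have the problem that it is hard to choose M appropriately. For example, even with M fixed at 12, the window covers 12 characters of ASCII text, 6 characters in Shift_JIS, and 4 characters in UTF-8 (for typical Japanese text), so the amount of information captured varies with the character encoding. Some mechanism for setting M appropriately, in a reasonably mechanical way, is probably needed.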

*4: But then users would no longer be able to edit the feature definition file by hand...
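
The point in footnote *3 can be checked directly. The following sketch counts how many whole characters of a sample text fit into an M-byte window under each encoding; `chars_covered` and the sample strings are made up for this illustration.

```python
def chars_covered(text: str, encoding: str, m: int) -> int:
    """Number of whole leading characters of `text` that fit in `m` bytes."""
    used = 0
    count = 0
    for ch in text:
        size = len(ch.encode(encoding))
        if used + size > m:
            break
        used += size
        count += 1
    return count


M = 12
ascii_text = "abcdefghijklmnop"
japanese_text = "日本語のテキストを分類する"

print(chars_covered(ascii_text, "ascii", M))         # 12
print(chars_covered(japanese_text, "shift_jis", M))  # 6
print(chars_covered(japanese_text, "utf-8", M))      # 4
```

One conceivable way to set M mechanically would be to pick a target number of characters and scale it by the average bytes per character measured on the training data, though nothing like that is implemented in ham 0.0.2.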