ham: A rough evaluation of classification performance

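This is a rough evaluation of ham's classification performance.
Nothing rigorous is attempted here.
ham essentially uses the spam-filtering method described in "Practical Common Lisp" (a kind of Bayesian filter) as-is.
Taking it as given that the classification logic of this Bayesian filter performs reasonably well, the aim here is to check whether problems in ham's actual implementation have badly degraded that performance.
There are two main concerns: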

Whether using N-grams as features has badly degraded performance
Whether a bug in the program has badly degraded performance
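
For reference, the kind of character N-gram feature extraction meant here can be sketched as follows. This is only an illustration, not ham's implementation: the function name `char_ngrams` is hypothetical, and how ham treats whitespace and line breaks is not covered.

```ruby
# Hypothetical helper: returns every character N-gram of length
# min_n..max_n found in text (not taken from ham's source).
def char_ngrams(text, min_n, max_n)
  chars = text.chars
  (min_n..max_n).flat_map do |n|
    # each_cons(n) slides a window of n characters over the text
    chars.each_cons(n).map(&:join)
  end
end

p char_ngrams("の日常", 1, 2)
```

A wider range yields many more features per document, which is why the index grows as the range is widened.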

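Ideally I would compare several implementation approaches and adopt whichever performs best*1, but as I wrote last time, I need a binary classifier on short notice (and do not want to spend much time on it), so anything whose accuracy is not terrible will do.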

Evaluation 1: Classifying large texts

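First evaluation.
Classify works (texts) by several authors taken from 青空文庫 (Aozora Bunko), i.e. decide whether or not a given work is by author A.

First, prepare the data.
Below are the download script and its invocation.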

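# File: download.rb
#
# Downloads every published work of an author from that author's
# work-list page on 青空文庫 (Aozora Bunko)
#
# The following items are stripped from the downloaded text
#  - Header
#    -- Author name
#    -- Title
#  - Body
#    -- HTML tags
#    -- Ruby (furigana) annotations
#  - Footer
#    -- Work information
#    -- Transcriber information
#    -- Other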

require 'rubygems'
require 'kconv'
require 'hpricot'
require 'open-uri'
require 'uri'

if ARGV.size != 2
  puts "Usage: ruby -Ku download.rb <author work-list page URL> <save directory>"
  exit 1
end

PersonURL = ARGV[0]
SaveDir   = ARGV[1]
`mkdir -p #{SaveDir}`

open(PersonURL) do |root|
  root_doc = Hpricot(Kconv.toutf8(root.read))
  (root_doc/"ol > li").each do |li|
    next if li.at(:a).nil?
    
    title = li.at(:a).inner_html
    puts "= #{title}"
    book_uri = root.base_uri.merge(li.at(:a)[:href])
    
    open(book_uri) do |book|
      book_doc = Hpricot(Kconv.toutf8(book.read))
      (book_doc/:a).each do |a|
        if (a[:href]||'') =~ /\/files\/.+html$/
          xhtml_uri = book.base_uri.merge(a[:href])
          puts "  == #{xhtml_uri}"
          
          begin
            filename = xhtml_uri.path.split("/")[-1].sub(/html$/,'txt')
            text = Hpricot(Kconv.toutf8(open(xhtml_uri).read)).at('div.main_text').inner_html
            open("#{SaveDir}/#{filename}",'w'){|f|
              f.write text.gsub(/<(rp|rt)>.*?<\/\1>/,'').gsub(/<.*?>/,'').gsub(/[\r\n]+/m,"\n")
            }
            break
          rescue => ex
            # Work XHTML pages come in several layouts; those that lack
            # <div class="main_text"> raise a nil error here.
            #
            # This only slightly reduces the number of works fetched, so it is ignored
            puts " == #{ex.class}: #{ex.message}"
          end
        end
      end
    end
  end
end

# Download works per author
# The authors were picked arbitrarily from those with a reasonably large number of works
$ ruby -Ku download.rb http://www.aozora.gr.jp/index_pages/person1154.html kishida_kunio
$ ruby -Ku download.rb http://www.aozora.gr.jp/index_pages/person183.html makino_shinniti
$ ruby -Ku download.rb http://www.aozora.gr.jp/index_pages/person76.html okamoto_kanoko
$ ruby -Ku download.rb http://www.aozora.gr.jp/index_pages/person154.html tanaka_koutarou

# Number of works
$ ls kishida_kunio | wc -l    # 岸田 国士 (this author is ham; everyone else is spam)
468

$ ls makino_shinniti | wc -l  # 牧野 信一
77

$ ls okamoto_kanoko | wc -l   # 岡本 かの子
102 

$ ls tanaka_koutarou | wc -l  # 田中 貢太郎
131

# File sizes
$ ls -lh kishida_kunio | head 
合計 6.7M
-rw-r--r-- 1 user user  11K 2010-07-27 02:40 43579_17337.txt
-rw-r--r-- 1 user user  17K 2010-07-27 02:40 43580_17339.txt
-rw-r--r-- 1 user user 5.2K 2010-07-27 02:40 43581_17340.txt
-rw-r--r-- 1 user user 9.4K 2010-07-27 02:40 43582_17341.txt
-rw-r--r-- 1 user user 3.6K 2010-07-27 02:40 43593_17335.txt
-rw-r--r-- 1 user user  14K 2010-07-27 02:40 43594_17336.txt
-rw-r--r-- 1 user user  27K 2010-07-27 02:40 43595_17338.txt
-rw-r--r-- 1 user user 7.5K 2010-07-27 02:40 44303_21829.txt
-rw-r--r-- 1 user user  16K 2010-07-27 02:39 44304_21827.txt

$ ls -lh makino_shinniti | head
合計 2.7M
-rw-r--r-- 1 user user  42K 2010-07-27 02:40 1890_19617.txt
-rw-r--r-- 1 user user  12K 2010-07-27 02:40 1891_7611.txt
-rw-r--r-- 1 user user  42K 2010-07-27 02:40 1892_7609.txt
-rw-r--r-- 1 user user  12K 2010-07-27 02:40 1893_7615.txt
-rw-r--r-- 1 user user 5.6K 2010-07-27 02:40 1895_22534.txt
-rw-r--r-- 1 user user 2.6K 2010-07-27 02:40 45207_28593.txt
-rw-r--r-- 1 user user  39K 2010-07-27 02:40 45214_18527.txt
-rw-r--r-- 1 user user  48K 2010-07-27 02:40 45216_23057.txt
-rw-r--r-- 1 user user  62K 2010-07-27 02:40 45217_23162.txt

# Contents
$ head -5 kishida_kunio/44304_21827.txt

　一代の人気女優、ド・リュジイ嬢は、給料の問題で、作者にも金を払はなければならないと云ふことを聞いて、「何だつて。一体作者なんて云ふものを、なしにするわけには行かないかね」と、やつつけた。ラ・カメラニイ夫人もまた、オペラ座の化粧部屋に納つて、「作者なんぞゐるうちは、芝居の繁昌するわけはない」と宣言した。それが、仏蘭西のことである。しかも、そんなに旧いことではない。
　モリエール、マリヴォオを先輩と仰ぐ仏蘭西劇作家である。それくらゐのことを云はれても腹は立てまい。まして、相手は男に非ず。たゞ、さう云はれながらも、書いたものが舞台に上り、舞台に上つたものが相当の金になり、金にならずとも、いくらか見てくれ手があればまだいゝのであるが、実際さうなるまでの手数が大変である。先づ脚本を書く、勿論傑作である。成る可くなら原稿はタイプライタアで打つ。人に打たせるなら、それは大抵、若い女だ。批評の悪からう筈はない。原稿は、自分で持つて行くよりも「彼女」に持つて行かせた方がいゝ。なぜなら、劇場の門番は、おほかた無名の天才に対して冷酷だからである。反感さへ持つてゐるらしい。「これをどうぞ」「今、大将は忙しいんだがなあ」「こつちは別に急ぎませんから」門番はにやりと笑ふのである。四十日経つと、返事を聞きに行くのである。勿論、原稿を返して貰ひに行くのと同じことである。「まだ見てなさうですが、もつとお預りして置きますか、それとも持つてお帰りになりますか」――持つて帰りますと云ふ元気ありや。「わが原稿は眠れり」――無名作家の嘆声である。脚本の原稿は劇場に持ち込むときまつてゐる。雑誌社なぞでは受けつけてくれない。（ミュッセは大方その戯曲を舞台に掛けないつもりで雑誌に発表した）
「あゝ、××さんですか、お作を拝見しました。結構だと思ひますが、あの第三幕ですがね」劇場主の注文が出る。

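As noted in the comment above, 岸田 国士 is treated as ham and everyone else as spam (that is, the task is to decide whether a given work was written by 岸田 国士).

Next, create the training data and the evaluation data.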

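# File: half_cp.rb
#
# Copies files from one directory into another
# Depending on the argument, only the even-numbered or only the odd-numbered files (counting from 0) are copied
# Note: used to split files into a training set and an evaluation set

require 'fileutils'

if ARGV.size != 3
  puts "Usage: ruby half_cp.rb <source directory> <destination directory> <0 or 1: 0 copies even-numbered files, 1 copies odd-numbered files>"
  exit 1
end

FROM_DIR = ARGV[0]
TO_DIR   = ARGV[1]
parity   = ARGV[2].to_i % 2
FileUtils.mkdir_p(TO_DIR)

# Sort so that the 0 run and the 1 run see the files in the same order
Dir.glob("#{FROM_DIR}/*").sort.each_with_index do |path, i|
  next unless i % 2 == parity
  FileUtils.cp(path, File.join(TO_DIR, File.basename(path)))
end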

# Prepare training data
$ mkdir learn_data
$ ruby half_cp.rb kishida_kunio learn_data/ham 0    # ham
$ ruby half_cp.rb okamoto_kanoko learn_data/spam 0  # spam
$ ruby half_cp.rb makino_shinniti learn_data/spam 0 # spam
$ ruby half_cp.rb tanaka_koutarou learn_data/spam 0 # spam

$ ls learn_data/ham | wc -l
234
$ du -hs learn_data/ham
3.4M	learn_data/ham

$ ls learn_data/spam/ | wc -l
156
$ du -hs learn_data/spam
3.7M	learn_data/spam

# Prepare evaluation data
$ mkdir test_data
$ ruby half_cp.rb kishida_kunio test_data/ham 1    # ham
$ ruby half_cp.rb okamoto_kanoko test_data/spam 1  # spam
$ ruby half_cp.rb makino_shinniti test_data/spam 1 # spam
$ ruby half_cp.rb tanaka_koutarou test_data/spam 1 # spam

Training.

## Build
# Bigrams
$ hamt learn_data 2 2 2>/dev/null | hamc --lower-frequency-limit=5 book-2-2-5.idx 2>/dev/null

# 1- to 4-grams
$ hamt learn_data 1 4 2>/dev/null | hamc --lower-frequency-limit=5 book-1-4-5.idx 2>/dev/null

# 3- to 6-grams
$ hamt learn_data 3 6 2>/dev/null | hamc --lower-frequency-limit=5 book-3-6-5.idx 2>/dev/null

## Sizes
$ ls -lh *.idx
-rw-r--r-- 1 user user 434K 2010-07-27 03:10 book-2-2-5.idx
-rw-r--r-- 1 user user 1.9M 2010-07-27 03:11 book-1-4-5.idx
-rw-r--r-- 1 user user 2.2M 2010-07-27 03:25 book-3-6-5.idx

##
$ hamt learn_data 2 6 2>/dev/null | head
000000ea 0000009c ==TOTAL==
00000001 00000000 が同じ卓
00000000 00000001 、寝呆けて
00000000 00000002 からそう
00000000 00000001 焼火
00000001 00000000 容を細
00000001 00000000 まふつてい
00000000 00000001 安府では五
0000000e 00000002 の日常
00000001 00000000 の母がゐ
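
The `hamt` output above looks like two hexadecimal counts followed by the feature. The `==TOTAL==` line (0xea = 234, 0x9c = 156) matches the number of ham and spam training files, so the first column is presumably the ham count and the second the spam count. Under that reading, the per-feature spam probability described in "Practical Common Lisp" can be sketched as below. Whether ham computes exactly this is an assumption; the function names and the defaults `assumed = 0.5`, `weight = 1.0` follow the book's description rather than ham's source.

```ruby
# Parse one hamt line: "<ham count (hex)> <spam count (hex)> <feature>".
# The column meaning is inferred from the ==TOTAL== line, not documented.
def parse_hamt_line(line)
  ham_hex, spam_hex, feature = line.chomp.split(" ", 3)
  { feature: feature, ham: ham_hex.hex, spam: spam_hex.hex }
end

# Smoothed spam probability of a single feature, after the formula in
# "Practical Common Lisp" (assumed, not confirmed, to match ham).
def spam_probability(ham, spam, total_ham, total_spam, assumed = 0.5, weight = 1.0)
  spam_freq = spam.to_f / [1, total_spam].max
  ham_freq  = ham.to_f / [1, total_ham].max
  basic = spam_freq / (spam_freq + ham_freq)
  data_points = ham + spam
  # Pull features with few observations toward the assumed probability
  (weight * assumed + data_points * basic) / (weight + data_points)
end

total = parse_hamt_line("000000ea 0000009c ==TOTAL==")
feat  = parse_hamt_line("0000000e 00000002 の日常")
p spam_probability(feat[:ham], feat[:spam], total[:ham], total[:spam])
```

A value well below 0.5, as for this feature (14 ham occurrences against 2 spam), means the feature counts as evidence for ham.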

Classification.

## With the default ham/spam boundary score (0.5)
# Bigrams
$ hamp book-2-2-5.idx test_data
In 'test_data/ham', 173 files were classified as HAM (total 234 files)
In 'test_data/spam', 154 files were classified as SPAM (total 154 files)

precision: 1.0
recall   : 0.739316239316239
f-measure: 0.85012285012285   # 85.0%

# 1- to 4-grams
$ hamp book-1-4-5.idx test_data
In 'test_data/ham', 201 files were classified as HAM (total 234 files)
In 'test_data/spam', 154 files were classified as SPAM (total 154 files)

precision: 1.0
recall   : 0.858974358974359
f-measure: 0.924137931034483  # 92.4%

# 3- to 6-grams
$ hamp book-3-6-5.idx test_data
In 'test_data/ham', 213 files were classified as HAM (total 234 files)
In 'test_data/spam', 153 files were classified as SPAM (total 154 files)

precision: 0.992916818016709
recall   : 0.91025641025641
f-measure: 0.949791521890201  # 95.0%

## With the boundary score set to a (near-)optimal value
## Ham seems to be misclassified as spam fairly often, so the ham range is widened
# Bigrams
$ hamp --min-spam-score=0.9 book-2-2-5.idx test_data
In 'test_data/ham', 214 files were classified as HAM (total 234 files)
In 'test_data/spam', 151 files were classified as SPAM (total 154 files)

precision: 0.979143145760295
recall   : 0.914529914529915
f-measure: 0.945734209544581  # 94.6%

# 1- to 4-grams
$ hamp --min-spam-score=0.9 book-1-4-5.idx test_data
In 'test_data/ham', 223 files were classified as HAM (total 234 files)
In 'test_data/spam', 149 files were classified as SPAM (total 154 files)

precision: 0.967053390403244
recall   : 0.952991452991453
f-measure: 0.959970928607369  # 96.0%

# 3- to 6-grams
$ hamp --min-spam-score=0.9 book-3-6-5.idx test_data
In 'test_data/ham', 227 files were classified as HAM (total 234 files)
In 'test_data/spam', 147 files were classified as SPAM (total 154 files)

precision: 0.955241009946442
recall   : 0.97008547008547
f-measure: 0.96260601387818  # 96.3%

##
$ head -5 kishida_kunio/44304_21827.txt | ham book-3-6-5.idx
0.092976	HAM
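
How a single score such as the one printed by `ham` can be obtained from many per-feature probabilities is sketched below, following the Fisher-method combination described in "Practical Common Lisp". This is a sketch under the assumption that ham follows the book here. All function names are hypothetical, and the comparison `score >= min_spam_score` is a guess at what `--min-spam-score` means (the post only states that the default boundary is 0.5).

```ruby
# Inverse chi-square, valid for an even number of degrees of freedom.
def inverse_chi_square(value, degrees_of_freedom)
  raise ArgumentError, "degrees of freedom must be even" unless degrees_of_freedom.even?
  m = value / 2.0
  term = Math.exp(-m)
  sum = term
  (1...(degrees_of_freedom / 2)).each do |i|
    term *= m / i
    sum += term
  end
  [sum, 1.0].min
end

# Fisher's method: combine independent probabilities into one value.
def fisher(probs)
  inverse_chi_square(-2.0 * probs.sum { |p| Math.log(p) }, 2 * probs.size)
end

# Combined score: close to 0.0 means ham, close to 1.0 means spam.
def combined_score(spam_probs)
  return 0.5 if spam_probs.empty?
  ham_probs = spam_probs.map { |p| 1.0 - p }
  h = 1.0 - fisher(spam_probs)
  s = 1.0 - fisher(ham_probs)
  ((1.0 - h) + s) / 2.0
end

# Assumed meaning of the boundary score: at or above it is SPAM.
def classify(score, min_spam_score = 0.5)
  score >= min_spam_score ? "SPAM" : "HAM"
end

score = combined_score([0.2, 0.1, 0.3])
puts "#{score}\t#{classify(score)}"
```

Raising the boundary, as with `--min-spam-score=0.9`, only changes the last step: more documents fall on the ham side while the scores themselves stay the same.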

With a suitable N-gram range and boundary score, it seems to classify reasonably well (an f-measure of about 96% here).
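
The precision, recall and f-measure values printed by `hamp` can be reproduced from the four counts in its report. Note that the printed precision only comes out right if it is computed from per-class rates (ham classified as HAM divided by total ham, spam classified as HAM divided by total spam) rather than from raw file counts. This is inferred from the numbers in this post, not from hamp's source, and the function name `hamp_metrics` is hypothetical.

```ruby
# Recompute hamp-style metrics from the counts in its report.
# Precision is normalized by class size (an inference from the output).
def hamp_metrics(ham_as_ham, ham_total, spam_as_spam, spam_total)
  recall = ham_as_ham.to_f / ham_total
  false_ham_rate = (spam_total - spam_as_spam).to_f / spam_total
  precision = recall / (recall + false_ham_rate)
  f_measure = 2 * precision * recall / (precision + recall)
  { precision: precision, recall: recall, f_measure: f_measure }
end

# Counts from the 3- to 6-gram run with --min-spam-score=0.9
m = hamp_metrics(227, 234, 147, 154)
printf("precision: %.6f\nrecall   : %.6f\nf-measure: %.6f\n",
       m[:precision], m[:recall], m[:f_measure])
```

Normalizing by class size keeps the precision comparable even when the ham and spam test sets differ in size, as they do in both evaluations here.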

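Evaluation 2: Classifying small texts

Second evaluation.
Using the article categories of Yahoo!ブログ, classify blog articles into 'ビジネスと経済' (business and economy) and '生活と文化' (life and culture).

Data download: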

# File: download_yahoo_blog.rb
#
# Fetches from Yahoo!ブログ the newest blog articles under the top-level category given as an argument

require 'open-uri'
require 'rubygems'
require 'hpricot'
require 'kconv'
require 'rss'
require 'cgi'

def each_category_id (category_name)
  open("http://blogs.yahoo.co.jp/FRONT/cat.html"){|top|
    doc = Hpricot(Kconv.toutf8(top.read))
    (doc/"h2 > a").each do |a|
      if a.inner_html == category_name
        pdoc = a.parent.next_sibling
        while pdoc.name == 'dl'
          (pdoc/"dd a").each do |dd_a|
            category_id = dd_a[:href].scan(/cid=(\d+)/)[0][0]
            yield category_id
          end
          pdoc = pdoc.next_sibling
        end
        break
      end
    end
  }
end

def each_rss_item(rss_url)
  open(rss_url) do |f|
    RSS::Parser.parse(f.read).items.each do |item|
      yield item
    end
  end
end

if ARGV.size != 2
  puts "Usage: ruby -Ku download_yahoo_blog.rb <top-level category> <save directory>"
  exit 1
end

CategoryName = ARGV[0]
SaveDir      = ARGV[1]

# Collect the bloggers (their RSS feeds) who wrote the newest articles in the category
puts "= root category: #{CategoryName}"
user_rss_set = Array.new
each_category_id(CategoryName) do |cid|
  puts " == category: #{cid}"
  each_rss_item("http://blogs.yahoo.co.jp/DIRECTORY/rss.xml?cid=#{cid}") do |item|
    user_rss_set << item.source.url.sub(/^.+\*/,'')
  end
end

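# Fetch and save the latest articles of the bloggers collected above
# Note: this assumes that the other articles of a blogger who wrote a new article in category A also belong to category A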
puts ""
puts "= user: #{user_rss_set.uniq.size}"
user_rss_set.uniq.each_with_index do |rss_url,i|
  puts " == #{i}"
  each_rss_item(rss_url) do |item|
    id = item.link.sub(/^.+http:\/\/blogs.yahoo.co.jp\//,'').gsub('/','_')
    title = CGI.unescapeHTML(item.title)
    desc  = CGI.unescapeHTML(item.description).gsub(/&nbsp;/,' ').gsub(/<[^>]*>/,'').gsub(/[\r\n]+/,"\n").gsub(/[ 　\t]/,' ')

    # puts "  === #{id}"
    open("#{SaveDir}/#{id}",'w'){|f|
      f.write "#{title}\n#{desc}"
    }
  end
end

# In a different directory from Evaluation 1

## Download the data
# Fetch blog articles in "ビジネスと経済"
$ mkdir keizai
$ ruby -Ku download_yahoo_blog.rb ビジネスと経済 keizai

# Fetch blog articles in "生活と文化"
$ mkdir bunka
$ ruby -Ku download_yahoo_blog.rb 生活と文化 bunka

# Counts and sizes
$ ls keizai | wc -l
2342
$ du -hs keizai
9.7M	keizai

$ ls bunka | wc -l
1175
$ du -hs bunka
4.9M	bunka

## Prepare training/evaluation data
# Training
$ ruby half_cp.rb keizai learn_data/ham 0
$ ruby half_cp.rb bunka learn_data/spam 0

# Evaluation
$ ruby half_cp.rb keizai test_data/ham 1
$ ruby half_cp.rb bunka test_data/spam 1

Training.

## Build
# Bigrams
$ hamt learn_data 2 2 2>/dev/null | hamc blog-2-2-2.idx 2>/dev/null

# 1- to 3-grams
$ hamt learn_data 1 3 2>/dev/null | hamc blog-1-3-2.idx 2>/dev/null

# 2- to 6-grams
$ hamt learn_data 2 6 2>/dev/null | hamc blog-2-6-2.idx 2>/dev/null

## Sizes
$ ls -lh *.idx
-rw-r--r-- 1 user user 550K 2010-07-27 05:49 blog-2-2-2.idx
-rw-r--r-- 1 user user 1.7M 2010-07-27 05:50 blog-1-3-2.idx
-rw-r--r-- 1 user user 4.0M 2010-07-27 05:50 blog-2-6-2.idx

Classification.

# Bigrams
$ hamp blog-2-2-2.idx test_data
In 'test_data/ham', 821 files were classified as HAM (total 1171 files)
In 'test_data/spam', 485 files were classified as SPAM (total 587 files)

precision: 0.801383177383603
recall   : 0.701110162254483
f-measure: 0.747900672436617  # 74.8%

# 1- to 3-grams
$ hamp blog-1-3-2.idx test_data
In 'test_data/ham', 835 files were classified as HAM (total 1171 files)
In 'test_data/spam', 487 files were classified as SPAM (total 587 files)

precision: 0.807161853946924
recall   : 0.713065755764304
f-measure: 0.757201716022128  # 75.7%

# 2- to 6-grams
$ hamp blog-2-6-2.idx test_data
In 'test_data/ham', 917 files were classified as HAM (total 1171 files)
In 'test_data/spam', 442 files were classified as SPAM (total 587 files)

precision: 0.76020161734508
recall   : 0.783091374893254
f-measure: 0.771476748377406  # 77.1%


## Changing the boundary score quickly skews the results heavily
$ hamp --min-spam-score=0.55 blog-2-6-2.idx test_data
In 'test_data/ham', 1146 files were classified as HAM (total 1171 files)
In 'test_data/spam', 91 files were classified as SPAM (total 587 files)

precision: 0.536651248725587
recall   : 0.97865072587532
f-measure: 0.693187421266993

$ hamp --min-spam-score=0.45 blog-2-6-2.idx test_data                        
In 'test_data/ham', 351 files were classified as HAM (total 1171 files)
In 'test_data/spam', 576 files were classified as SPAM (total 587 files)

precision: 0.941160617217406
recall   : 0.299743808710504
f-measure: 0.454679767625332
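
How sensitive the f-measure is to the boundary could be checked by sweeping the boundary over a list of (score, label) pairs. The sketch below uses made-up sample data purely for illustration. With real data the pairs would have to be collected by scoring each test file, and both the helper name and the rule "score below the boundary means HAM" are assumptions.

```ruby
# Hypothetical (score, is_ham) pairs; not measured data.
SAMPLES = [[0.10, true], [0.40, true], [0.52, true],
           [0.48, false], [0.60, false], [0.95, false]].freeze

# F-measure for ham at a given boundary, using per-class rates.
def f_measure_at(samples, min_spam_score)
  ham_total  = samples.count { |_, is_ham| is_ham }
  spam_total = samples.size - ham_total
  ham_as_ham  = samples.count { |score, is_ham| is_ham && score < min_spam_score }
  spam_as_ham = samples.count { |score, is_ham| !is_ham && score < min_spam_score }
  recall = ham_as_ham.to_f / ham_total
  return 0.0 if recall.zero?
  false_ham_rate = spam_as_ham.to_f / spam_total
  precision = recall / (recall + false_ham_rate)
  2 * precision * recall / (precision + recall)
end

[0.45, 0.50, 0.55].each do |boundary|
  printf("%.2f\t%.3f\n", boundary, f_measure_at(SAMPLES, boundary))
end
```

When most scores cluster near 0.5, as seems to be the case for the short blog articles, a small shift of the boundary moves many documents across it at once.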

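Partly because the quality of the training/evaluation data is not very good*2, precision and the other metrics are considerably lower than in the 青空文庫 case.
Still, a little over 70% of the articles are classified correctly*3, though I cannot tell whether that figure is good or bad.
It does not seem to be terribly bad, at least.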

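Conclusion

For now, the problems raised at the beginning do not seem to be present.
If time permits it might be worth looking into existing packages as well, but with some care over the training data, ham looks good enough for my purposes.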

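*1: Or rather, would it be better to just use an existing package?

*2: There are also quite a lot of splog-like (spam blog) articles

*3: The data was collected rather sloppily, and there is a fair chance that the training data (labeled data) is not correctly categorized to begin with, so this wording is questionable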