$Id: learn.html 65 2007-01-30 00:52:53Z taku-ku $;
MeCab can estimate its parameters (cost values) from a training corpus. Because MeCab itself is designed to be independent of any particular part-of-speech system, you can build an analyzer based on your own part-of-speech system, dictionary, and corpus. Parameter estimation uses Conditional Random Fields (CRF).
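[Figure: data flow diagram]

Parameter estimation consists of the following subtasks: preparing the seed dictionary, preparing the configuration files, preparing the training corpus, building the binary dictionary for training, training the CRF parameters, generating the distribution dictionary, building the binary dictionary for analysis, and evaluation. Each is explained in turn below.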
MeCab dictionaries are written in CSV. The seed dictionary and the distribution dictionary share essentially the same format.
The following are example dictionary entries.
    進学校,0,0,0,名詞,一般,*,*,*,*,進学校,シンガクコウ,シンガクコー
    梅暦,0,0,0,名詞,一般,*,*,*,*,梅暦,ウメゴヨミ,ウメゴヨミ
    気圧,0,0,0,名詞,一般,*,*,*,*,気圧,キアツ,キアツ
    水中翼船,0,0,0,名詞,一般,*,*,*,*,水中翼船,スイチュウヨクセン,スイチューヨクセン
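The first four columns are mandatory: the surface form (the word itself), the left connection state ID, the right connection state ID, and the cost. The left connection state ID, the right connection state ID, and the cost are not used in the seed dictionary, so set them to 0.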
Columns 5 and onward are items called "features". To keep the system general, MeCab does not distinguish between the kinds of information attached to a word, such as part of speech, conjugation, reading, and pronunciation; it treats them all as features. You can attach as many features as CSV allows. However, the meaning of each column must be consistent across entries (for example, column 5 is the part of speech and column 6 is the part-of-speech subcategory). Features are normally listed from the most general to the most specific, starting with the lowest feature number (for example: part of speech, part-of-speech subcategory, conjugation type, conjugation form, base form, reading, pronunciation).
Internally, features are handled as an array, and they are sometimes referred to as feature 0, feature 1, and so on. Keeping track of which feature number holds which kind of information (part of speech, reading, etc.) is your responsibility.
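The example above is taken from ipadic. Its feature sequence is defined as: part of speech, part-of-speech subcategories 1 to 3, conjugation type, conjugation form, base form, reading, and pronunciation.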
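The column layout described above can be sketched in a few lines of Python. This is only an illustration of the format, not MeCab's own parser; the `Entry` type and the `parse_entry` function are hypothetical names introduced for this example.

```python
import csv
from typing import List, NamedTuple


class Entry(NamedTuple):
    """One dictionary line: four mandatory columns plus the feature array."""
    surface: str
    left_id: int
    right_id: int
    cost: int
    features: List[str]


def parse_entry(line: str) -> Entry:
    """Split a CSV dictionary line into mandatory columns and features."""
    row = next(csv.reader([line]))
    if len(row) < 5:
        raise ValueError("expected 4 mandatory columns and at least one feature")
    surface, left_id, right_id, cost = row[:4]
    return Entry(surface, int(left_id), int(right_id), int(cost), row[4:])


entry = parse_entry("進学校,0,0,0,名詞,一般,*,*,*,*,進学校,シンガクコウ,シンガクコー")
print(entry.features[0])  # feature 0: part of speech
print(entry.features[7])  # feature 7: reading
```

Features are addressed by zero-based index here, matching the "feature 0, feature 1" naming used in the text.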
MeCab does not perform conjugation processing. For words that conjugate, you must expand the conjugated forms in the dictionary yourself.
    連れ出す,0,0,0,動詞,自立,*,*,五段・サ行,基本形,連れ出す,ツレダス,ツレダス
    連れ出さ,0,0,0,動詞,自立,*,*,五段・サ行,未然形,連れ出す,ツレダサ,ツレダサ
    連れ出そ,0,0,0,動詞,自立,*,*,五段・サ行,未然ウ接続,連れ出す,ツレダソ,ツレダソ
    連れ出し,0,0,0,動詞,自立,*,*,五段・サ行,連用形,連れ出す,ツレダシ,ツレダシ
    連れ出せ,0,0,0,動詞,自立,*,*,五段・サ行,仮定形,連れ出す,ツレダセ,ツレダセ
    連れ出せ,0,0,0,動詞,自立,*,*,五段・サ行,命令e,連れ出す,ツレダセ,ツレダセ
    連れ出しゃ,0,0,0,動詞,自立,*,*,五段・サ行,仮定縮約1,連れ出す,ツレダシャ,ツレダシャ
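Since the expansion has to be done outside MeCab, a small script is one way to generate the conjugated entries. The sketch below is an assumption about how such a helper might look; the table `SA_ROW_ENDINGS` and the function `expand` are hypothetical and cover only the sa-row godan forms shown in this document.

```python
# Hypothetical inflection table for a godan verb of the sa-row:
# (surface ending, conjugation form, reading ending)
SA_ROW_ENDINGS = [
    ("す", "基本形", "ス"),
    ("さ", "未然形", "サ"),
    ("そ", "未然ウ接続", "ソ"),
    ("し", "連用形", "シ"),
    ("せ", "仮定形", "セ"),
    ("せ", "命令e", "セ"),
    ("しゃ", "仮定縮約1", "シャ"),
]


def expand(stem, reading_stem, pos=("動詞", "自立", "*", "*"), conj_type="五段・サ行"):
    """Return one seed-dictionary CSV line per conjugated form."""
    base_form = stem + SA_ROW_ENDINGS[0][0]
    lines = []
    for ending, form, kana in SA_ROW_ENDINGS:
        reading = reading_stem + kana
        # IDs and cost are 0 because the seed dictionary does not use them.
        fields = [stem + ending, "0", "0", "0", *pos,
                  conj_type, form, base_form, reading, reading]
        lines.append(",".join(fields))
    return lines


entries = expand("連れ出", "ツレダ")
for line in entries:
    print(line)
```

Other conjugation classes would need their own ending tables; the reading and pronunciation columns are assumed to be identical here, which holds for this verb but not in general.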
dicrc is the file that specifies various aspects of the dictionary's behavior. The following is the minimum configuration.
    cost-factor = 800
    bos-feature = BOS/EOS,*,*,*,*,*,*,*,*
    eval-size = 6
    unk-eval-size = 4
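The file is a flat list of `key = value` pairs, so it is easy to inspect programmatically. The following is a minimal sketch, not MeCab's own loader; the function name `parse_dicrc` is hypothetical, and treating lines that start with `#` or `;` as comments is an assumption.

```python
def parse_dicrc(text):
    """Parse 'key = value' lines into a dict of strings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' or ';' are skipped.
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("not a key = value line: " + raw)
        settings[key.strip()] = value.strip()
    return settings


DICRC = """\
cost-factor = 800
bos-feature = BOS/EOS,*,*,*,*,*,*,*,*
eval-size = 6
unk-eval-size = 4
"""

settings = parse_dicrc(DICRC)
print(settings["cost-factor"])
print(settings["bos-feature"].split(","))
```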
char.def is the definition file for unknown-word processing. Japanese morphological analysis usually handles unknown words based on character type. In MeCab, you can specify in detail which characters belong to which character type, and also what kind of unknown-word processing is performed for each character type.
At the top of the file, define the category names and the unknown-word processing behavior of each category.

    CATEGORY_NAME TIMING(0/1) GROUPING(0/1) LENGTH(0,1,2...n)

Example:
    KANJI 0 0 2
    SYMBOL 1 1 0
    NUMERIC 1 1 0
    ALPHA 1 1 0
    HIRAGANA 0 1 2
Next, define which UCS2 code points each category covers.

    codepoint DEFAULT_CATEGORY COMPATIBLE_CATEGORY1 COMPATIBLE_CATEGORY2 ..

or

    low_codepoint..high_codepoint DEFAULT_CATEGORY COMPATIBLE_CATEGORY1 COMPATIBLE_CATEGORY2 ..

Example:
    0x0009 SPACE
    0x30A1..0x30FF KATAKANA
    0x30FC KATAKANA HIRAGANA # ー
Code points are written in UCS2 (Unicode) as hexadecimal numbers starting with 0x.
The first category is the default category of the code point. Compatible categories can be listed after it. In the example above, the long vowel mark 「ー」 is katakana by default but also has hiragana as a compatible category. During grouping, compatible categories are treated as belonging to the same group.
The following is a concrete example of char.def.
    DEFAULT 0 1 0 # DEFAULT is a mandatory category!
    SPACE 0 1 0
    KANJI 0 0 2
    SYMBOL 1 1 0
    NUMERIC 1 1 0
    ALPHA 1 1 0
    HIRAGANA 0 1 2
    KATAKANA 1 1 0
    KANJINUMERIC 1 1 0
    GREEK 1 1 0
    CYRILLIC 1 1 0

    # SPACE
    0x0020 SPACE # DO NOT REMOVE THIS LINE, 0x0020 is reserved for SPACE
    0x00D0 SPACE
    0x0009 SPACE
    0x000B SPACE
    0x000A SPACE

    # ASCII
    0x0021..0x002F SYMBOL
    0x0030..0x0039 NUMERIC
    ...

    # KATAKANA
    0x30A1..0x30FF KATAKANA
    0x31F0..0x31FF KATAKANA # Small KU .. Small RO
    0x30FC KATAKANA HIRAGANA # ー
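To make the two parts of char.def concrete, here is a minimal Python sketch that reads category definitions and code point mappings and then looks up the categories of a character. It is not MeCab's implementation; `parse_char_def` and `lookup` are hypothetical names, and the rule that a later mapping overrides an earlier one for the same code point is an assumption suggested by the 0x30FC line following the 0x30A1..0x30FF range.

```python
def parse_char_def(text):
    """Return (categories, ranges) parsed from char.def-style text."""
    categories = {}  # name -> (timing, grouping, length)
    ranges = []      # (low, high, [default category, compatible categories...])
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = line.split()
        if fields[0].startswith("0x"):
            low, _, high = fields[0].partition("..")
            lo = int(low, 16)
            hi = int(high, 16) if high else lo
            ranges.append((lo, hi, fields[1:]))
        else:
            name, timing, grouping, length = fields
            categories[name] = (int(timing), int(grouping), int(length))
    return categories, ranges


def lookup(ranges, ch):
    """Categories of one character; unmapped characters fall back to DEFAULT."""
    result = ["DEFAULT"]
    for lo, hi, names in ranges:
        if lo <= ord(ch) <= hi:
            result = names  # assumption: later definitions win
    return result


SAMPLE = """\
DEFAULT 0 1 0
NUMERIC 1 1 0
HIRAGANA 0 1 2
KATAKANA 1 1 0
0x0030..0x0039 NUMERIC
0x30A1..0x30FF KATAKANA
0x30FC KATAKANA HIRAGANA # long vowel mark
"""

categories, ranges = parse_char_def(SAMPLE)
print(lookup(ranges, "ア"))
print(lookup(ranges, "ー"))
```

The first name returned by `lookup` is the default category; any further names are the compatible categories described above.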
unk.def is the dictionary for unknown words.
    DEFAULT,0,0,0,記号,一般,*,*,*,*,*
    SPACE,0,0,0,記号,空白,*,*,*,*,*
    KANJI,0,0,0,名詞,一般,*,*,*,*,*
    KANJI,0,0,0,名詞,サ変接続,*,*,*,*,*
    HIRAGANA,0,0,0,名詞,一般,*,*,*,*,*
    HIRAGANA,0,0,0,名詞,サ変接続,*,*,*,*,*
    HIRAGANA,0,0,0,名詞,固有名詞,地域,一般,*,*,*
    ...
It is a dictionary file in which the surface-form column holds a category name defined in char.def. It defines which feature sequences are assigned to each category. A single category may define more than one feature sequence. After training, appropriate cost values are assigned automatically.
rewrite.def defines the mapping from the dictionary feature sequence to the internal-state feature sequences.
The CRF gathers statistics using three kinds of information: unigram, left-context bigram, and right-context bigram. For a word such as 「美しい」, three features are derived from the features defined in the dictionary: the unigram feature, the left-context feature (the feature of the morpheme as seen from its left), and the right-context feature (the feature of the morpheme as seen from its right). rewrite.def defines the mapping from the dictionary features to each of these internal features.
Concretely, behavior such as dropping or normalizing individual feature elements can be achieved by defining the mapping functions appropriately.
rewrite.def has three sections: [unigram rewrite], [left rewrite], and [right rewrite].
Each section header is followed by mapping rules, one per line. A mapping rule is written in the form

    match-pattern destination

Rules are scanned in order from the top, and the first one that matches is used.
Simple regular expressions can be used in match patterns; the example below uses * to match any value and (A|B) to match either A or B.
In the destination, the macros $1, $2, $3, .. refer to the contents of the individual feature elements (the elements as written in CSV).
Example:
    [unigram rewrite]
    # Drop the pronunciation; use POS 1,2,3,4, conjugation type, conjugation form, base form, and reading
    *,*,*,*,*,*,*,* $1,$2,$3,$4,$5,$6,$7,$8
    # Ignore the reading when there is none
    *,*,*,*,*,*,* $1,$2,$3,$4,$5,$6,$7,*

    [left rewrite]
    (助詞|助動詞),*,*,*,*,*,(ない|無い) $1,$2,$3,$4,$5,$6,無い
    (助詞|助動詞),終助詞,*,*,*,*,(よ|ヨ) $1,$2,$3,$4,$5,$6,よ
    ...

    [right rewrite]
    (助詞|助動詞),*,*,*,*,*,(ない|無い) $1,$2,$3,$4,$5,$6,無い
    (助詞|助動詞),終助詞,*,*,*,*,(よ|ヨ) $1,$2,$3,$4,$5,$6,よ
    ..
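The first-match rule lookup can be sketched as follows. This is an illustration under stated assumptions rather than MeCab's actual matcher: the function names are hypothetical, only `*`, `(A|B)` and literal fields are supported, and a rule is assumed to apply only when the feature array has at least as many elements as the pattern.

```python
import re


def parse_rule(line):
    """Split 'match-pattern destination' into two lists of CSV fields."""
    pattern, destination = line.split(None, 1)
    return pattern.split(","), destination.strip().split(",")


def field_matches(pattern, value):
    if pattern == "*":
        return True
    if pattern.startswith("(") and pattern.endswith(")"):
        return value in pattern[1:-1].split("|")
    return pattern == value


def expand_macros(field, features):
    # $1 refers to the first feature element, $2 to the second, and so on.
    return re.sub(r"\$(\d+)", lambda m: features[int(m.group(1)) - 1], field)


def rewrite(rules, features):
    """Apply the first matching rule; return None when no rule matches."""
    for pattern, destination in rules:
        if len(features) < len(pattern):  # assumption: too few elements, no match
            continue
        if all(field_matches(p, f) for p, f in zip(pattern, features)):
            return [expand_macros(d, features) for d in destination]
    return None


unigram_rules = [parse_rule(r) for r in (
    "*,*,*,*,*,*,*,* $1,$2,$3,$4,$5,$6,$7,$8",
    "*,*,*,*,*,*,* $1,$2,$3,$4,$5,$6,$7,*",
)]
left_rules = [parse_rule("(助詞|助動詞),*,*,*,*,*,(ない|無い) $1,$2,$3,$4,$5,$6,無い")]

noun = "名詞,一般,*,*,*,*,進学校,シンガクコウ,シンガクコー".split(",")
nai = "助動詞,*,*,*,特殊・ナイ,基本形,ない,ナイ,ナイ".split(",")
print(rewrite(unigram_rules, noun))
print(rewrite(left_rules, nai))
```

With these rules the noun keeps its first eight elements, and the auxiliary ない is normalized to 無い in its left-context feature.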
feature.def defines the templates used to extract CRF features from the internal-state feature sequences.
Each line corresponds to one template. Lines beginning with UNIGRAM are unigram templates; lines beginning with BIGRAM are connection (bigram) templates.
Each template can use macros such as %F[n], %F?[n], %t, %L[n], %L?[n], %R[n], and %R?[n].
Example:
    UNIGRAM W0:%F[6]
    UNIGRAM W1:%F[0]/%F[6]
    UNIGRAM W2:%F[0],%F?[1]/%F[6]
    UNIGRAM W3:%F[0],%F[1],%F?[2]/%F[6]
    UNIGRAM W4:%F[0],%F[1],%F[2],%F?[3]/%F[6]

    UNIGRAM T0:%t
    UNIGRAM T1:%F[0]/%t
    UNIGRAM T2:%F[0],%F?[1]/%t
    UNIGRAM T3:%F[0],%F[1],%F?[2]/%t
    UNIGRAM T4:%F[0],%F[1],%F[2],%F?[3]/%t

    BIGRAM B00:%L[0]/%R[0]
    BIGRAM B01:%L[0],%L?[1]/%R[0]
    BIGRAM B02:%L[0]/%R[0],%R?[1]
    BIGRAM B03:%L[0]/%R[0],%R[1],%R?[2]
    BIGRAM B04:%L[0],%L?[1]/%R[0],%R[1],%R?[2]
    BIGRAM B05:%L[0]/%R[0],%R[1],%R[2],%R?[3]
    BIGRAM B06:%L[0],%L?[1]/%R[0],%R[1],%R[2],%R?[3]
    ...
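A rough way to picture template expansion is plain string substitution. The sketch below rests on assumptions that this text does not spell out: %F[n], %L[n] and %R[n] are taken to mean element n of the unigram, left-context and right-context feature arrays, and the `?` variants are taken to suppress the whole template when the element is undefined (`*`). The %t macro is left untouched. `expand_template` is a hypothetical name.

```python
import re

MACRO = re.compile(r"%([FLR])(\??)\[(\d+)\]")


def expand_template(template, F=None, L=None, R=None):
    """Expand %F/%L/%R macros; return None if a '?' macro hits an undefined element."""
    sources = {"F": F, "L": L, "R": R}
    skipped = False

    def substitute(match):
        nonlocal skipped
        kind, optional, index = match.group(1), match.group(2), int(match.group(3))
        value = sources[kind][index]
        if optional and value == "*":
            skipped = True  # assumed meaning of '?': do not emit this template
        return value

    expanded = MACRO.sub(substitute, template)
    return None if skipped else expanded


F = "名詞,一般,*,*,*,*,進学校,シンガクコウ,シンガクコー".split(",")
print(expand_template("W1:%F[0]/%F[6]", F=F))
print(expand_template("W3:%F[0],%F[1],%F?[2]/%F[6]", F=F))
print(expand_template("B01:%L[0],%L?[1]/%R[0]", L=["助詞", "格助詞"], R=["名詞", "一般"]))
```

Under these assumptions, each surviving string would act as the name of one CRF feature, so two morphemes share a feature exactly when a template expands to the same string for both.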
The training data is written in the same format as MeCab's default output.
    太郎	名詞,固有名詞,人名,名,*,*,太郎,タロウ,タロー
    は	助詞,係助詞,*,*,*,*,は,ハ,ワ
    花子	名詞,固有名詞,人名,名,*,*,花子,ハナコ,ハナコ
    が	助詞,格助詞,一般,*,*,*,が,ガ,ガ
    好き	名詞,形容動詞語幹,*,*,*,*,好き,スキ,スキ
    だ	助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
    。	記号,句点,*,*,*,*,。,。,。
    EOS
    焼酎	名詞,一般,*,*,*,*,焼酎,ショウチュウ,ショーチュー
    好き	名詞,形容動詞語幹,*,*,*,*,好き,スキ,スキ
    の	助詞,連体化,*,*,*,*,の,ノ,ノ
    親父	名詞,一般,*,*,*,*,親父,オヤジ,オヤジ
    。	記号,句点,*,*,*,*,。,。,。
    EOS
    ...
The first tab-separated field is the surface string. It is followed by a string that represents the feature array in CSV. Each sentence ends with a line containing only EOS.
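A corpus in this format can be read with a short loop, which is handy for sanity checks before training. The following is a minimal sketch; `read_corpus` is a hypothetical helper, and it splits features on plain commas without handling CSV quoting.

```python
def read_corpus(lines):
    """Group 'surface<TAB>features' lines into sentences terminated by EOS."""
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "EOS":
            sentences.append(current)
            current = []
        elif line:
            surface, features = line.split("\t", 1)
            current.append((surface, features.split(",")))
    return sentences


SAMPLE = [
    "太郎\t名詞,固有名詞,人名,名,*,*,太郎,タロウ,タロー\n",
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n",
    "EOS\n",
    "焼酎\t名詞,一般,*,*,*,*,焼酎,ショウチュウ,ショーチュー\n",
    "EOS\n",
]

sentences = read_corpus(SAMPLE)
print(len(sentences))
print(sentences[0][0][0], sentences[0][0][1][0])
```

For a real file, pass an open file object instead of the list, for example `read_corpus(open("corpus", encoding="utf-8"))`; the encoding shown is an assumption and must match the dictionary's.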
Let WORK be the current working directory. Create two directories, seed and final, under WORK.
    cd $WORK
    mkdir seed final
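Copy the files described earlier into the seed directory: the seed dictionary (CSV files), the configuration files (dicrc, char.def, unk.def, rewrite.def, feature.def), and the training data (corpus).
Example: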
    % cd $WORK/seed
    % ls
    Adj.csv          Interjection.csv  Noun.name.csv    Noun.verbal.csv  Symbol.csv  rewrite.def
    Adnominal.csv    Noun.adjv.csv     Noun.number.csv  Others.csv       Verb.csv    unk.def
    Adverb.csv       Noun.adverbal.csv Noun.org.csv     Postp-col.csv    char.def
    Auxil.csv        Noun.csv          Noun.others.csv  Postp.csv        corpus
    Conjunction.csv  Noun.demonst.csv  Noun.place.csv   Prefix.csv       dicrc
    Filler.csv       Noun.nai.csv      Noun.proper.csv  Suffix.csv       feature.def
Run the following command to build the binary dictionary used for training.
    % cd $WORK/seed
    % /usr/local/libexec/mecab/mecab-dict-index

You can also use -d and -o as follows.

    % /usr/local/libexec/mecab/mecab-dict-index -d $WORK/seed -o $WORK/seed
Train the CRF parameters with mecab-cost-train.

    % cd $WORK/seed
    % /usr/local/libexec/mecab/mecab-cost-train -c 1.0 corpus model

You can also specify the dictionary with -d as follows.

    % /usr/local/libexec/mecab/mecab-cost-train -d $WORK/seed -c 1.0 $WORK/seed/corpus $WORK/seed/model
mecab-cost-train consumes a large amount of memory when it builds the binary model. You can reduce memory consumption by building the binary model in a separate process, as follows.
    % /usr/local/libexec/mecab/mecab-cost-train -y -c 1.0 corpus model
    % /usr/local/libexec/mecab/mecab-cost-train -b model.txt model
The hyperparameter C determines the "strength" of training. A larger C makes the model fit the training data as closely as possible, at the risk of overfitting. A smaller C avoids overfitting, at the risk of not learning enough. An appropriate C can only be found empirically, using a model selection method such as cross-validation. The default value is 1.0.
The -f option specifies the feature frequency threshold. For example, -f 3 uses only the features that occur three or more times in the training data. An appropriate threshold can likewise only be found empirically, using a model selection method such as cross-validation.
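Because C and the frequency threshold have to be tuned empirically, it can help to generate the candidate training commands in one place. The sketch below only builds argument lists from the -c and -f options described above and does not run anything; the function name and the model file naming scheme are hypothetical.

```python
from itertools import product

TRAINER = "/usr/local/libexec/mecab/mecab-cost-train"


def training_commands(c_values, f_values, corpus="corpus"):
    """Build one mecab-cost-train argument list per (C, threshold) pair."""
    commands = []
    for c, f in product(c_values, f_values):
        model = "model.c{}.f{}".format(c, f)  # hypothetical naming scheme
        commands.append([TRAINER, "-c", str(c), "-f", str(f), corpus, model])
    return commands


commands = training_commands([0.5, 1.0, 2.0], [1, 3])
for command in commands:
    print(" ".join(command))
```

Each list could be passed to `subprocess.run(command, check=True, cwd=...)` from the seed directory. Each resulting model would then be scored on held-out data with the evaluation procedure described later in this document.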
During training, information like the following is printed.
    reading corpus ...
    adding virtual node: 名詞,固有名詞,地域,一般,*,*,東日,トウニチ,トウニチ
    adding virtual node: 副詞,助詞類接続,*,*,*,*,かなり,カナリ,カナリ
    Number of sentences: 32
    Number of features: 47547
    eta: 0.00010
    freq: 1
    C(sigma^2): 1.00000

    iter=0 err=1.00000 F=0.41186 target=1691.68869 diff=1.00000
    iter=1 err=1.00000 F=0.68727 target=1077.14848 diff=0.36327
    iter=2 err=0.87500 F=0.81904 target=621.20311 diff=0.42329
    iter=3 err=0.81250 F=0.86354 target=384.72432 diff=0.38068
    iter=4 err=0.68750 F=0.93685 target=233.72722 diff=0.39248
    ..
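To follow convergence over many iterations, the per-iteration lines can be pulled out of the log with a regular expression. This is a small sketch that assumes the line layout shown in the sample output; `parse_training_log` is a hypothetical helper.

```python
import re

ITERATION = re.compile(
    r"iter=(\d+) err=([\d.]+) F=([\d.]+) target=([\d.]+) diff=([\d.]+)"
)


def parse_training_log(text):
    """Return one dict per 'iter=...' line found in the training output."""
    rows = []
    for m in ITERATION.finditer(text):
        rows.append({
            "iter": int(m.group(1)),
            "err": float(m.group(2)),
            "F": float(m.group(3)),
            "target": float(m.group(4)),
            "diff": float(m.group(5)),
        })
    return rows


LOG = """\
iter=0 err=1.00000 F=0.41186 target=1691.68869 diff=1.00000
iter=1 err=1.00000 F=0.68727 target=1077.14848 diff=0.36327
iter=2 err=0.87500 F=0.81904 target=621.20311 diff=0.42329
"""

rows = parse_training_log(LOG)
for row in rows:
    print(row["iter"], row["target"])
```

In the sample, target decreases from one iteration to the next while F increases, so the parsed rows give a quick check that training is still making progress.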
Generate the distribution dictionary with mecab-dict-gen.

    % cd $WORK/seed
    % /usr/local/libexec/mecab/mecab-dict-gen -o ../final -m model

You can also specify the dictionary with -d and -o as follows.

    % /usr/local/libexec/mecab/mecab-dict-gen -o $WORK/final -d $WORK/seed -m $WORK/seed/model
The distribution dictionary must be written to a directory other than the seed dictionary directory. Normally, the distribution dictionary directory final is archived and distributed to users.
Build the binary dictionary used for analysis.

    % cd $WORK/final
    % /usr/local/libexec/mecab/mecab-dict-index

You can also use -d and -o as follows.

    % /usr/local/libexec/mecab/mecab-dict-index -d $WORK/final -o $WORK/final
Now analyze some text with the dictionary you have just built.
    % mecab -d $WORK/final
    焼酎好きの親父。
    焼酎	名詞,一般,*,*,*,*,焼酎,ショウチュウ,ショーチュー
    好き	名詞,形容動詞語幹,*,*,*,*,好き,スキ,スキ
    の	助詞,連体化,*,*,*,*,の,ノ,ノ
    親父	名詞,一般,*,*,*,*,親父,オヤジ,オヤジ
    。	記号,句点,*,*,*,*,。,。,。
    EOS
Prepare the test data. The test data is written in the same format as MeCab's default output.
First, use mecab-test-gen to extract only the sentences (test.sen) from the test corpus (test).
% /usr/local/libexec/mecab/mecab-test-gen < test > test.sen
Analyze test.sen with the dictionary built earlier.
% mecab -d $WORK/final test.sen > test.result
Run the evaluation script mecab-system-eval. The first argument is the system output and the second argument is the file with the correct answers.
    % /usr/local/libexec/mecab/mecab-system-eval test.result test
                precision              recall                 F
    LEVEL 0:    98.6887(647112/655710) 98.9793(647112/653785) 98.8338
    LEVEL 1:    98.2163(644014/655710) 98.5055(644014/653785) 98.3607
    LEVEL 2:    97.2230(637501/655710) 97.5093(637501/653785) 97.3659
    LEVEL 4:    96.8367(634968/655710) 97.1218(634968/653785) 96.9791
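The three scores in each row follow from the counts printed in parentheses. As a sketch of the arithmetic (the function name `scores` is hypothetical, and the F value is assumed to be the harmonic mean of precision and recall), the LEVEL 0 row can be recomputed like this:

```python
def scores(correct, system_total, gold_total):
    """Return (precision, recall, F) as percentages."""
    precision = 100.0 * correct / system_total
    recall = 100.0 * correct / gold_total
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Counts taken from the LEVEL 0 row above.
precision, recall, f_value = scores(647112, 655710, 653785)
print("{:.4f} {:.4f} {:.4f}".format(precision, recall, f_value))
```

Precision divides the number of correct morphemes by the number the system produced, while recall divides it by the number in the correct-answer file.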
The -l option specifies which feature levels are used for the evaluation.