bact' is a name

different treatments of Mai Yamok in BEST Corpus

In the first release of the BEST Word Segmented Corpus (free registration required for corpus download), I found different segmentations for Mai Yamok (ๆ, the repetition mark):

|พร้อม|ๆ| |กับ|
|ร้อย|ๆ |ปี|
|ทั้งๆ ที่|
|ต่างๆ| |ดัง|
|ย่อ|ๆ| |ว่า|
|ย่อ|ๆ |ว่า|

(Real data, taken from encyclopedia_00005.txt. ‘|’ marks a word/token boundary.)

These may be intentional, or they may be inconsistencies. I'm not sure yet, so I will ask around.
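To see how often each treatment occurs, a rough sketch like the following could tally them. It assumes only what the samples above show: tokens are delimited by ‘|’ and spaces inside a token are significant. The function names and the three labels (`separate`, `with_space`, `inside_word`) are my own, not part of the BEST guidelines.

```python
from collections import Counter

MAI_YAMOK = "\u0e46"  # Thai repetition mark (ๆ)


def classify_token(token):
    """Label how one token treats Mai Yamok, or return None if it has none.

    The labels are hypothetical names for the three patterns seen in the samples.
    """
    if MAI_YAMOK not in token:
        return None
    if token == MAI_YAMOK:
        return "separate"      # |ๆ| stands alone as a token
    if token.strip() == MAI_YAMOK:
        return "with_space"    # |ๆ | is merged with an adjacent space
    return "inside_word"       # |ต่างๆ| is kept inside a longer token


def count_treatments(lines):
    """Count Mai Yamok treatments over lines of '|'-delimited tokens."""
    counts = Counter()
    for line in lines:
        # Do not strip tokens: a trailing space is exactly what we want to detect.
        for token in line.rstrip("\n").split("|"):
            label = classify_token(token)
            if label is not None:
                counts[label] += 1
    return counts


samples = [
    "|พร้อม|ๆ| |กับ|",
    "|ร้อย|ๆ |ปี|",
    "|ทั้งๆ ที่|",
    "|ต่างๆ| |ดัง|",
    "|ย่อ|ๆ| |ว่า|",
    "|ย่อ|ๆ |ว่า|",
]

if __name__ == "__main__":
    for label, n in sorted(count_treatments(samples).items()):
        print(label, n)
```

On the six samples above, each of the three treatments appears twice. Running `count_treatments` over the lines of a whole corpus file would show whether one pattern dominates or whether the mix is as even as it is here.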

BEST is an evaluation of the performance of Thai language processing algorithms and software. This year it is running a Thai word segmentation software competition as part of the 11th NSC. Anyone interested is welcome to join.

technorati tags:
BEST corpus,
word segmentation,
Thai

July 17, 2008

bact

Computational Linguistics

corpora, thai, word break

2 responses to “different treatments of Mai Yamok in BEST Corpus”

bact' says:

2008.07.19 at 20:06

Email reply from Dr. Chai, a member of the BEST development team:

2008/7/19 Chai Wutiwiwatchai: "As I understand it, certain kinds of text are specified to stay together, for example "ทั้งๆ ที่" as a single word. So some Mai Yamok may end up inside a word as well. Besides that, the database may contain some errors, but we try to keep them under 10%. Chai"

bact' says:

2008.07.22 at 01:54

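Email reply from Ajarn Wirote, head of the BEST team: "It is possible that the program segmented it that way because the dictionary it uses contains the words "ต่างๆ" and "ทั้ง ๆ ที่". For other words that are not in the dictionary, the program would probably split ๆ off as its own token. As for why ๆ is sometimes merged with a space, that is probably a program error. There is no specific criterion for ๆ; it is just that words that can be seen as a single unit, such as ทั้งๆที่, may be merged into one word. I'm not sure whether this answers the question."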
