bact' is a name

Category: Computational Linguistics

Corpus-Related Research

Research fields that can make use of text corpora include computational linguistics, cultural studies, and discourse analysis. In Linguistics of Political Argument: The Spin-Doctor and the Wolf-Pack at the White House [gbook], Alan Partington, associate professor of linguistics at the Faculty of Political Science, University of Camerino, Italy, examines the relationship between the White House and the press through a corpus-based linguistic analysis. The corpus consists of about 50 press briefings from the final years of the Clinton presidency, on topics ranging from Kosovo to the Clinton-Lewinsky affair. The work is unlike anything before it in that it shows how concordance technology (displaying a given word in its various contexts) and detailed linguistic evidence can be used to study the features of discourse, both in the text itself and in the speakers' communicative strategies. Tony McEnery and Andrew Wilson, Corpus Linguistics, Edinburgh…
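To make the idea of a concordance concrete, here is a minimal keyword-in-context (KWIC) sketch. It is only an illustration, not the tool used in the book: the class name `Concordance`, the method `kwic`, and the bracketed output format are all made up for this example, and it assumes whitespace-separated words, so it would not work on unsegmented Thai text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Concordance {
    /**
     * Returns one "left [keyword] right" line per occurrence of keyword,
     * with up to `window` words of context on each side.
     * (Hypothetical helper; assumes words are separated by whitespace.)
     */
    public static List<String> kwic(String text, String keyword, int window) {
        String[] words = text.trim().split("\\s+");
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            // Strip leading/trailing punctuation before comparing.
            String bare = words[i].replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!bare.equalsIgnoreCase(keyword)) {
                continue;
            }
            int from = Math.max(0, i - window);
            int to = Math.min(words.length, i + window + 1);
            String left = String.join(" ", Arrays.copyOfRange(words, from, i));
            String right = String.join(" ", Arrays.copyOfRange(words, i + 1, to));
            lines.add((left + " [" + words[i] + "] " + right).trim());
        }
        return lines;
    }

    public static void main(String[] args) {
        String sample = "the press asked and the press waited";
        for (String line : kwic(sample, "press", 1)) {
            System.out.println(line);
        }
    }
}
```

Reading the occurrences of one word side by side like this is what lets an analyst spot recurring patterns in how it is used.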

December 27, 2007
GATE experiment at KIND Lab, SIIT

Last weekend's experiment: yesterday I tried writing a wrapper around the Stanford Log-linear Part-Of-Speech Tagger to turn it into a plugin for GATE (after putting it off for a long time). The pipeline in the figure has 3 Processing Resources: a tokeniser, a splitter, and a tagger. The tokeniser is net.siit.gate.DictionaryBasedTokeniser, a plain word segmenter that uses a dictionary and is designed to pick the longest possible word (longest-matching). It works directly on GATE's AnnotationSet: it creates an AnnotationSet named “Token”. The splitter is the ANNIE Sentence Splitter, which splits sentences using rules (written in JAPE, which is essentially regular expressions over annotations)…
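The longest-matching strategy mentioned above can be sketched roughly as follows. This is not the code of net.siit.gate.DictionaryBasedTokeniser and does not touch GATE's AnnotationSet at all; the class name `LongestMatchTokeniser`, the `tokenise` method, and the single-character fallback for unknown text are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LongestMatchTokeniser {
    /**
     * Greedy longest-matching segmentation: at each position, take the
     * longest dictionary word that starts there. If nothing matches,
     * emit a single character (an assumed fallback) and move on.
     */
    public static List<String> tokenise(String text, Set<String> dictionary) {
        int maxLen = 1;
        for (String word : dictionary) {
            maxLen = Math.max(maxLen, word.length());
        }
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = pos + 1; // fallback: one character
            int longest = Math.min(maxLen, text.length() - pos);
            for (int len = longest; len >= 1; len--) {
                if (dictionary.contains(text.substring(pos, pos + len))) {
                    end = pos + len;
                    break;
                }
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("the", "there", "in", "rein"));
        // "there" wins over "the" because it is the longer match.
        System.out.println(tokenise("therein", dict));
    }
}
```

The greedy choice is simple and fast, but it commits to the first long match and never backtracks, so an ambiguous string can be split in a way a human reader would not choose.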

July 16, 2007
The 2nd School of Asian ANLP

The 2nd School of Asian ANLP, special topic on Morpho-Syntactic Analysis, at SIIT Bangkadi, organized by ADD. Tomorrow (12 Mar.) covers text mining and text summarization.

March 12, 2007
Newline in GATE

In GATE, if you want to know which newline character the document you are working on uses, you can read it from the feature named “docNewLineType”. docNewLineType is a String with four possible values: { “CR”, “LF”, “CRLF”, “LFCR” }. CR is Carriage Return, which moves the carriage back to the far left (\r in many programming languages); LF is Line Feed, which advances to the next line (\n). To start a new line, the UNIX family, such as Linux and Mac OS X, uses a single LF; Mac OS (up to version 9)…
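To illustrate what the four values mean, here is a small stand-alone sketch that classifies the first line break found in a string. It is not GATE's own detection code; the class `NewlineType` and the method `detect` are hypothetical names. Inside GATE the value would presumably be read from the document's feature map, along the lines of `doc.getFeatures().get("docNewLineType")`.

```java
public class NewlineType {
    /**
     * Looks at the first line break in the text and names it using the
     * same four labels as docNewLineType. Returns null when the text
     * contains no line break at all (an assumption of this sketch).
     */
    public static String detect(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != '\r' && c != '\n') {
                continue;
            }
            char next = (i + 1 < text.length()) ? text.charAt(i + 1) : '\0';
            if (c == '\r') {
                return next == '\n' ? "CRLF" : "CR";
            }
            return next == '\r' ? "LFCR" : "LF";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(detect("one\r\ntwo")); // Windows-style line break
        System.out.println(detect("one\ntwo"));   // UNIX-style line break
    }
}
```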

January 8, 2007
Using dictionary with ICU4J BreakIterator

Creating and using a word-segmentation dictionary in ICU4J. Notes on how to segment words with ICU4J's DictionaryBasedBreakIterator and how to build your own segmentation dictionary. (Hey! This is “Java”. Faint of heart? Fond of pretty things? .. Careful, it bites! You have been warned :P)

Creating the dictionary file for word segmentation: use the BuildDictionaryFile program to build the dictionary file. Usage:

BuildDictionaryFile input [encoding] [output] [list]

input = input: the dictionary file, a plain-text file with one word per line
encoding = character encoding of the dictionary file, e.g. TIS-620, UTF-8 (defaults to UTF-8 if omitted)
output = output: the result, a binary file (later used as input to the constructor of the DictionaryBasedBreakIterator class)
list = output: the list of words stored in the dictionary (output)…
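Once a break iterator exists, walking over its boundaries looks roughly like the sketch below. Note the assumptions: it uses the JDK's own java.text.BreakIterator so that it runs without extra jars, not ICU4J's DictionaryBasedBreakIterator, and it does not load a custom dictionary file, since the constructor signature is not given here. ICU4J's com.ibm.icu.text.BreakIterator offers the same-shaped calls (getWordInstance, setText, first, next, DONE). The class `WordBreaks` and the method `words` are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreaks {
    /**
     * Splits text at the word boundaries reported by a BreakIterator
     * and drops pieces that are only whitespace.
     */
    public static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("hello world", Locale.ENGLISH));
    }
}
```

For Thai text, whether the boundaries fall between words depends on the dictionary available to the iterator, which is the reason for building a custom one in the first place.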

December 27, 2006
Statistical Machine Translation lecture at Kasetsart University

Lecture: Statistical Machine Translation: What is possible today? by Philipp Koehn, University of Edinburgh, Scotland. Monday, December 18, 2006, 9:30-11:30 am, Room 204, Computer Engineering Building (Building 15), Kasetsart University, Bangkhen. Register…

December 13, 2006
Sanskrit and Artificial Intelligence

Knowledge Representation in Sanskrit and Artificial Intelligence by Rick Briggs, RIACS, NASA Ames Research Center (published in AI Magazine, Volume 6, Number 1, Spring 1985). There have been suggestions to use Sanskrit as a metalanguage for knowledge representation in, for example, machine translation and other areas of natural language processing, because of its highly regular structure…

July 22, 2006
The 1st School of Asian Applied NLP

If you are interested, try applying. They take about 30 people. I don't see any mention of fees (or is it free? :P) Asian Applied Natural Language Processing for Linguistics Diversity and Language Resource Development (ADD) is delighted to announce the call for participation for the First School of Asian Applied NLP (August 21 – September 1, 2006). More information, course outline and detailed schedule. Important dates Jul 21,…

July 2, 2006
KICSS 2006 : Knowledge, Information and Creativity Support Systems

The 1st International Conference on Knowledge, Information and Creativity Support Systems August 1-4, 2006 Ayutthaya, Thailand (abstract submission deadline: May 15, 2006) http://kind.siit.tu.ac.th/kicss2006/ The conference will cover a broad range of research topics in the fields of knowledge engineering and science, information technology, creativity support systems and complex system modeling. They include, but are not…

May 10, 2006
Character set detection in Java

jchardet — a Java port of Mozilla’s automatic charset detection algorithm.

April 11, 2006