Category: Computational Linguistics

  • Open Source HTML Parsers in Java

    Open Source HTML Parsers in Java, a list by Java-Source.net: NekoHTML, HTML Parser, Java HTML Parser, Jericho HTML Parser, JTidy, TagSoup, HotSax. As a bonus, Nux seems able to do all sorts of XML-related things (it is also a wrapper around other tools).

  • TIGER API 1.8 released

    TIGER API is a library that allows Java programmers to easily access the structure of any corpus given as a TIGER-XML file. oeze, one of the authors of TIGER API, left a message for us today: BTW, Tiger API has moved. This is the new URL: TIGER API. We have also included a section…

  • Looking for Structures

    keywords: semi-structured text, unstructured text, structure recognition. Retrieving Hierarchical Text Structure from Typeset Scientific Articles – a Prerequisite for E-Science Text Mining; Indexing Real-World Data using Semi-Structured Documents; Inferring Structure Information from Typography; Dr. Rolf Brugger; Modeling Documents for Structure Recognition Using Generalized N-Grams; A DTD Extension for Document Structure Recognition; Jedi: Extracting and…

  • Parsing Parsing

    Natural Language Parsing (course) @ Uni Heidelberg; The Program Transformation Wiki; ANTLR tutorial @ The University of Birmingham (+ many other Java-related tutorials). Parsing books by Dick Grune: Modern Compiler Design; Parsing Techniques – A Practical Guide; Parsing Techniques – 2nd Edition. Formalism / Tools: SDF – Modular Syntax Definition Formalism; TXL – The TXL…

  • (Better, Faster,) Lighter NLTK

    NLTK-Lite is a substantially simplified and streamlined version of NLTK (Natural Language Toolkit). NLTK is no longer supported. NLTK-Lite is a new collection of lightweight NLP modules designed for maximum simplicity and efficiency. NLTK-Lite only covers the simple variants of standard data structures and tasks. Simplicity and efficiency are valued over generality and extensibility. Key differences…

  • A Collection Of POS Taggers

    ACOPOST implements and extends well-known machine learning techniques for Part-of-Speech tagging, and (in the future) provides a uniform environment for testing.

  • quoting

    Can we consider “quoting” as a grasp of the whole idea, or as a grasp of distinguished points? If so, it may be interesting to look at a piece of text that is quoted quite a lot: can we use that to improve automatic summarization? Does “quoting” or “citing someone's words” count as pulling out the main idea…

  • Thai language processing

    Thai Junior Encyclopedia (สารานุกรมไทยสำหรับเยาวชน), Volume 25, Chapter 7: Applications of the Thai Language on Computers

  • Emdros – a database engine for annotated text

    Last night Vee (วีร์) mentioned that Emdros looks interesting for linguistic database work, so I went to have a look at its website. Emdros is: an open-source text database engine for storage and retrieval of analyzed or annotated text; applicable especially in corpus linguistics and computational linguistics; equipped with a powerful query language, MQL, based on the Extended MdF mathematical model of text. A short paper explaining Emdros. Above you can see the term Extended MdF, or, as the Emdros website calls it,…

  • First day in Potsdam

    Halo. Now in Haus 24, 1.82. Institut für Linguistik, Uni Potsdam. I’m going to work in the Project SUMMaR. From the project info page – “SUMMaR is part of the BMBF project PINK (‘Plattform fuer INtelligente Kollaborationsportale’), a consortium of companies and universities from Berlin-Brandenburg, funded in the framework ‘Innovative regionale Wachstumskerne’.” Travel info, this…