-
Open Source HTML Parsers in Java
Open Source HTML Parsers in Java, a list by Java-Source.net: NekoHTML, HTML Parser, Java HTML Parser, Jericho HTML Parser, JTidy, TagSoup, HotSax. As a bonus, Nux seems able to do all sorts of things with XML (it is also a wrapper around other libraries).
-
Looking for Structures
Keywords: semi-structured text, unstructured text, structure recognition. Retrieving Hierarchical Text Structure from Typeset Scientific Articles – a Prerequisite for E-Science Text Mining; Indexing Real-World Data using Semi-Structured Documents; Inferring Structure Information from Typography; Dr. Rolf Brugger; Modeling Documents for Structure Recognition Using Generalized N-Grams; A DTD Extension for Document Structure Recognition; Jedi: Extracting and…
-
invalid/partial HTML parsing
Jericho HTML Parser (Java); JavaScript libraries for various kinds of HTML parsing; LAPIS project | Detecting and Parsing Embedded Lightweight Structures (Java)
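To illustrate what parsing invalid or partial HTML means in practice, here is a minimal sketch using Python's standard-library `html.parser` rather than any of the libraries linked in this entry. The `LinkCollector` class, the `extract` helper, and the `BROKEN` sample are made up for this example. The point is only that a tolerant, event-based parser keeps reporting tags and text even when elements are unclosed or closing tags are stray.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values and visible text from possibly malformed HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


def extract(fragment):
    """Feed an HTML fragment to the parser and return (links, text chunks)."""
    parser = LinkCollector()
    parser.feed(fragment)
    parser.close()
    return parser.links, parser.text


# Hypothetical broken input: unclosed <a> and <li>, stray </b>.
BROKEN = '<p>Unclosed paragraph <a href="/docs">docs<li>stray item</p></b>'
links, text = extract(BROKEN)
```

The parser does not repair the tree; it only reports what it sees. Libraries such as the ones listed here typically go further, but their APIs are not shown in this post.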
-
Island Grammars / Parsing
Water of uncertainty. Islands of certainty. Island Grammars and Island Parsing + Document Structure Parsing: What is a Topic Map? (Durusau & O’Donnell, 2002); Semantic Role Parsing: Adding Semantic Structure to Unstructured Text (Pradhan, 2003); Adding Structure to Unstructured Text (Maletic & Collard, 2005); Island Parsing and Bidirectional Charts (Stock, 1988) (CiteSeer); Generating Robust Parsers…
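The "islands in water" idea can be sketched in a few lines. This is a deliberate simplification, not the method of any paper listed above: islands are described with regular expressions instead of grammar productions, and everything that matches no island is kept as water. The `ISLANDS` patterns and the `island_parse` function are hypothetical names chosen for illustration.

```python
import re

# Hypothetical island definitions: precise patterns for the constructs we care about.
ISLANDS = [
    ("DATE", r"\d{4}-\d{2}-\d{2}"),
    ("EMAIL", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
]

# One alternation with a named group per island kind.
ISLAND_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in ISLANDS)
)


def island_parse(text):
    """Split text into ("WATER", str) and (island_kind, str) segments, in order."""
    segments = []
    pos = 0
    for match in ISLAND_RE.finditer(text):
        if match.start() > pos:
            # Anything between two islands is water: kept, but not analysed.
            segments.append(("WATER", text[pos:match.start()]))
        segments.append((match.lastgroup, match.group()))
        pos = match.end()
    if pos < len(text):
        segments.append(("WATER", text[pos:]))
    return segments


segments = island_parse("Sent 2007-03-14 by user@example.org, thanks")
```

Joining the second element of every segment gives back the original text, so nothing is lost. The parser simply declines to interpret the water.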
-
Parsing Parsing
Natural Language Parsing (course) @ Uni Heidelberg; The Program Transformation Wiki; ANTLR tutorial @ The University of Birmingham (+ many other Java-related tutorials). Parsing books by Dick Grune: Modern Compiler Design; Parsing Techniques – A Practical Guide; Parsing Techniques – 2nd Edition. Formalism / Tools: SDF – Modular Syntax Definition Formalism; TXL – The TXL…
-
ANTLR for Ruby
ANTLR is a parser generator (think lex/yacc, but better). It can now generate Ruby source code.
-
Piccolo SAX Parser
According to the benchmarks here and here, the Piccolo SAX parser for Java is remarkably fast.
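For readers unfamiliar with SAX, the sketch below shows the event-driven model that parsers of this kind implement. It uses Python's standard-library `xml.sax`, not Piccolo, and says nothing about Piccolo's speed. The `TitleHandler` class, the `collect_titles` helper, and the `SAMPLE` document are assumptions made up for this example.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collect the text of every <title> element as parse events stream past."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None while we are outside a <title> element

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate until the end tag.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None


def collect_titles(xml_bytes):
    """Parse XML bytes and return the list of <title> texts in document order."""
    handler = TitleHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.titles


# Hypothetical sample document.
SAMPLE = (
    b"<feed><entry><title>First</title></entry>"
    b"<entry><title>Second &amp; last</title></entry></feed>"
)
titles = collect_titles(SAMPLE)
```

Because the handler only keeps the state it needs, the whole document never has to be held in memory as a tree, which is the usual reason SAX-style parsers are chosen for large inputs.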
-
Universal Feed Parser (Python)
Universal Feed Parser. “Parse RSS and Atom feeds in Python. 2000 unit tests. Open source.”
-
Project Log Analyzer
pok has written about how to apply tags to log file analysis. The write-up is very readable, detailed, and interesting 🙂 Project Log Analyzer #1, #2. It uses Commons Digester to help parse the XML files and ANTLR to parse queries. Consider me signed up as a fan of the blog 🙂
-
FreeLing
An open source C++ library providing language analysis services such as tokenization, sentence splitting, morphological analysis, named entity and date/number/currency recognition, PoS tagging, and shallow parsing. The software is released under the LGPL. Developed by the Natural Language Research Group, Technical University of Catalonia, Spain.