keywords: semi-structured text, unstructured text, structure recognition
Retrieving Hierarchical Text Structure from Typeset Scientific Articles – a Prerequisite for E-Science Text Mining
Indexing Real-World Data using Semi-Structured Documents
Inferring Structure Information from Typography
Dr. Rolf Brugger
Modeling Documents for Structure Recognition Using Generalized N-Grams
A DTD Extension for Document Structure Recognition
Jedi: Extracting and Synthesizing Information from the Web
MarkItUp! An incremental approach to document structure recognition
Water of uncertainty. Islands of certainty.
Island Grammars and Island Parsing
+ Document Structure Parsing
- What is a Topic Map? (Durusau & O’Donnell, 2002)
- Semantic Role Parsing: Adding Semantic Structure to Unstructured Text (Pradhan, 2003)
- Adding Structure to Unstructured Text (Maletic & Collard, 2005)
- Island Parsing and Bidirectional Charts (Stock, 1988) (CiteSeer)
- Generating Robust Parsers using Island Grammars (Moonen, 2001) (CiteSeer)
- Implementation Strategies for Island Grammars (van der Leek, 2005)
- Redundancy-free Island Parsing of Word Graphs (Kiefer, 2005)
- A Prolog based Information Extraction System (Emms, 2001) (CiteSeer)
Parsing Spoken Phrases Despite Missing Words
ANTLR tutorial @ The University of Birmingham (+ many other Java-related tutorials)
Formalism / Tools
SDF – Modular Syntax Definition Formalism
TXL – The TXL Source Transformation Language, A Language for Programming Language Tools and Applications
Packrat Parsing + Parsing Expression Grammars
Universal Feed Parser.
“Parse RSS and Atom feeds in Python. 2000 unit tests. Open source.”
An open source C++ library providing language analysis services such as tokenization, sentence splitting, morphological analysis, named entity and date/number/currency recognition, PoS tagging, and shallow parsing.
The software is released under LGPL.
Developed by the Natural Language Research Group, Technical University of Catalonia, Spain.