Unsupervised Partial Parsing

Unsupervised partial parsing (UPP) is a simple but effective approach to unsupervised constituency parsing from raw text. The idea is to focus on partial parsing, that is, chunking raw text into non-overlapping multiword constituents. Our models therefore start by finding low-level constituents, such as noun phrases, and build up constituent tree structures from there.
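To make the "chunk first, then build up" idea concrete, here is a minimal Python sketch. It is not the upparse code or its API: `apply_chunks`, `cascade` and especially `toy_chunker` are hypothetical names invented for illustration, and the hand-written rules in `toy_chunker` merely stand in for a chunker that would be learned from raw text.

```python
from typing import Callable, List, Sequence, Tuple, Union

Node = Union[str, tuple]   # a word, or a tuple of sub-nodes (a constituent)
Span = Tuple[int, int]     # half-open [start, end) indices into a node list


def apply_chunks(items: Sequence[Node], spans: Sequence[Span]) -> List[Node]:
    """Group items[start:end] into one constituent per non-overlapping span."""
    out: List[Node] = []
    pos = 0
    for start, end in sorted(spans):
        # Chunks must not overlap and must cover at least two items.
        if start < pos or end > len(items) or end - start < 2:
            raise ValueError(f"invalid span {(start, end)}")
        out.extend(items[pos:start])
        out.append(tuple(items[start:end]))
        pos = end
    out.extend(items[pos:])
    return out


def cascade(words: Sequence[str],
            chunker: Callable[[Sequence[Node]], List[Span]],
            max_levels: int = 10) -> tuple:
    """Chunk repeatedly, treating each chunk found so far as a single unit."""
    items: List[Node] = list(words)
    for _ in range(max_levels):
        spans = chunker(items)
        if not spans:          # nothing left to group: stop
            break
        items = apply_chunks(items, spans)
    return tuple(items)


def toy_chunker(items: Sequence[Node]) -> List[Span]:
    """Illustrative stand-in only: determiner + word, preposition + chunk."""
    spans: List[Span] = []
    i = 0
    while i < len(items) - 1:
        cur, nxt = items[i], items[i + 1]
        if cur in ("the", "a") and isinstance(nxt, str):
            spans.append((i, i + 2))
            i += 2
        elif cur in ("on", "in") and isinstance(nxt, tuple):
            spans.append((i, i + 2))
            i += 2
        else:
            i += 1
    return spans


tree = cascade("the cat sat on the mat".split(), toy_chunker)
print(tree)
```

With these toy rules, the first pass groups "the cat" and "the mat", the second pass groups "on" with the chunk "the mat", and the third pass finds nothing new, so the loop stops. The point is the shape of the process, not the rules themselves.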

As an unsupervised chunker, this system performs well: when evaluated on base noun phrase extraction, using the Penn Treebank WSJ annotations as a gold standard, it achieves over 75% precision and over 75% recall.
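For readers unfamiliar with chunk-level precision and recall, the sketch below shows one common way to compute them in Python. It assumes exact-match scoring over spans written as `(sentence_id, start, end)`; that representation, the function name `precision_recall`, and the example spans are illustrative assumptions, not the evaluation script used for the paper.

```python
from typing import Iterable, Tuple

ChunkSpan = Tuple[int, int, int]  # (sentence_id, start, end), end exclusive


def precision_recall(predicted: Iterable[ChunkSpan],
                     gold: Iterable[ChunkSpan]) -> Tuple[float, float]:
    """Exact-match precision and recall over sets of chunk spans."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)
    # Precision: share of predicted chunks that are in the gold standard.
    precision = correct / len(pred_set) if pred_set else 0.0
    # Recall: share of gold-standard chunks that were predicted.
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


# Made-up spans for illustration: 2 of 3 predictions match, 2 of 4 gold found.
predicted = [(0, 0, 2), (0, 4, 6), (1, 1, 3)]
gold = [(0, 0, 2), (0, 3, 6), (1, 1, 3), (1, 4, 6)]
p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Under exact matching, a predicted chunk with even one boundary off counts as wrong, which is why `(0, 4, 6)` earns no credit against the gold span `(0, 3, 6)`.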

As an unsupervised parser, this system is at or near the state of the art for short sentences, and competitive with systems that require part-of-speech annotations (a common assumption in unsupervised parsing research) rather than raw text. For sentences of all lengths, upparse is the current best model for unsupervised parsing that we know of.

For more information, see the paper:

Elias Ponvert, Jason Baldridge and Katrin Erk (2011), “Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, June 2011.

<a href=http://www.aclweb.org/anthology/P/P11/P11-1108.pdf><img src=http://www.adobe.com/images/pdficon_small.png> PDF</a> from the ACL Anthology

Slides from the presentation and the thesis defense are up on Slideshare:

Unsupervised Partial Parsing: Thesis defense

Simple unsupervised grammar induction from raw text with cascaded finite state models: ACL 2011 talk

Also, here are the slides from a recent presentation about this at the Austin Machine Learning Meetup:

<img src="images/ml_group_slides.png" width=500>

The software, upparse, is available for download and use under an Apache license. The version of upparse used for this paper is included below. It is written in Java and Python. For the latest updates, and to contribute, check out the source on GitHub at eponvert/upparse.

The song goes: You down with UPP? YEAH YOU KNOW ME!