Here's something cool you can do with HMMs

Unsupervised parsing: solving a (relatively) difficult NLP problem with (relatively) simple machine learning and hacks

Elias Ponvert

People Pattern Corporation

Data Science

Hello Austin ML

What this talk is about

  • Unsupervised parsing = considered a pretty hard problem
  • We achieved new state-of-the-art results mostly by hacking
  • With a little ML (which is why I'm here)

What you'll learn

  • A little about HMMs
  • Strong baselines are hard to beat
  • Don't stop hacking

Hidden Markov models

What? Why?
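
A quick sketch of the "what": an HMM scores a sequence of hidden tags against the observed words, and Viterbi decoding finds the single best tag sequence. Everything below is a toy under stated assumptions: the states "A"/"B", the symbols "x"/"y"/"z", and all the probabilities are made up for illustration and are not the models from this talk. It also assumes every probability is nonzero, since it works in log space.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for `obs`."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    # back[t][s]: the predecessor state on that best path
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical toy parameters (invented numbers, for illustration only).
STATES = ["A", "B"]
START_P = {"A": 0.6, "B": 0.4}
TRANS_P = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
EMIT_P = {
    "A": {"x": 0.5, "y": 0.4, "z": 0.1},
    "B": {"x": 0.1, "y": 0.3, "z": 0.6},
}

decoded = viterbi(["x", "y", "z"], STATES, START_P, TRANS_P, EMIT_P)
```

The "why": decoding costs time linear in sentence length, which is what makes a tagger this simple practical on whole corpora.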


Constituent Context Model

Klein & Manning 2002


(here's where the bodies are buried)

The Competition

Datasets for evaluation

What CCL Parser was getting right

So we created a new evaluation


Worst. Name. Ever.

Anyway, we found CCL Parser did great at this evaluation

         Overall precision   Clumps precision
WSJ                   53.9               65.4
Negra                 33.4               38.3
CTB                   38.6               31.5

A new baseline

Right-branching trees, except for the clumps
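
One way to picture this baseline, as a minimal sketch: collapse each clump into a single unit, then bracket the units right-branching. The function names, the (start, end) span format, and the choice to leave each clump flat inside are assumptions made for illustration, not details taken from the talk.

```python
def right_branching(units):
    """Nest a non-empty list of units into a right-branching binary bracketing."""
    if not units:
        raise ValueError("need at least one unit")
    if len(units) == 1:
        return units[0]
    return (units[0], right_branching(units[1:]))


def baseline_tree(words, clumps):
    """Right-branching tree over `words`, treating each clump as one unit.

    `clumps` holds half-open (start, end) word spans, sorted and non-overlapping.
    """
    units = []
    i = 0
    for start, end in clumps:
        units.extend(words[i:start])           # words outside any clump stay single leaves
        units.append(tuple(words[start:end]))  # a clump becomes one flat constituent
        i = end
    units.extend(words[i:])                    # trailing words after the last clump
    return right_branching(units)


words = "the cat sat on the mat".split()
tree = baseline_tree(words, [(0, 2), (4, 6)])
```

With no clumps at all, this falls back to the plain right-branching tree.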

Our first approach

Simplest. Model. Ever.

It sorta worked

Quick Recap

  • Unsupervised parsing is considered kinda hard
  • The vast majority of approaches curiously assume gold standard POS
  • CCL Parser does not, and performed well
  • But we discovered its real strength was in predicting low-level constituents
  • Which we inexcusably named "clumps"

Back to the HMMs

Unsupervised Partial Parsing

The same thing with a cooler name

Chunking as a tagging problem
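
A minimal sketch of the encoding idea: give every word one tag, so that a sequence model such as an HMM can predict chunks. The B/I/O tag names (begin, inside, outside) and the helper names are assumptions for illustration; the talk does not spell out the tag set used here.

```python
def spans_to_tags(n_words, chunks):
    """Encode half-open (start, end) chunk spans as one tag per word."""
    tags = ["O"] * n_words
    for start, end in chunks:
        tags[start] = "B"              # first word of the chunk
        for k in range(start + 1, end):
            tags[k] = "I"              # remaining words of the chunk
    return tags


def tags_to_spans(tags):
    """Decode a tag sequence back into (start, end) chunk spans."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        # Any tag other than "I" closes the chunk that is currently open.
        if tag != "I" and start is not None:
            spans.append((start, i))
            start = None
        if tag == "B":
            start = i
        # An "I" with no open chunk is ignored in this sketch.
    if start is not None:
        spans.append((start, len(tags)))
    return spans


tags = spans_to_tags(6, [(0, 2), (4, 6)])
spans = tags_to_spans(tags)
```

Because the two functions are inverses on well-formed input, predicting tags is enough to recover the chunks.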

Learning Partial Parsing

Here are the models



How to predict trees

Hint: it's a hack


Full length sentences

<10 Word Sentences

Yeah, I didn't do any of that stuff

Still the best

Lots more details