Bootstrapping Dependency Grammar Inducers from Incomplete Sentence Fragments via Austere Models

Valentin I. Spitkovsky, Hiyan Alshawi, Daniel Jurafsky
Proceedings of the Eleventh International Conference on Grammatical Inference, PMLR 21:189-194, 2012.

Abstract

Modern grammar induction systems often employ curriculum learning strategies that begin by training on a subset of all available input that is considered simpler than the full data. Traditionally, filtering has been at granularities of whole input units, e.g., discarding entire sentences with too many words or punctuation marks. We propose instead viewing inter-punctuation fragments as atoms, initially, thus making some simple phrases and clauses of complex sentences available to training sooner. Splitting input text at punctuation in this way improved our state-of-the-art grammar induction pipeline. We observe that resulting partial data, i.e., mostly incomplete sentence fragments, can be analyzed using reduced parsing models which, we show, can be easier to bootstrap than more nuanced grammars. Starting with a new, bare dependency-and-boundary model (DBM-0), our grammar inducer attained 61.2% directed dependency accuracy on Section 23 (all sentences) of the Wall Street Journal corpus: more than 2% higher than previous published results for this task.
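The fragment-splitting idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`is_punct`, `split_at_punctuation`, `fragments_up_to`), the punctuation test, and the length threshold are all assumptions made for illustration, and the input is assumed to be pre-tokenized.

```python
import string


def is_punct(token: str) -> bool:
    """Crude punctuation test (an assumption): every character is ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)


def split_at_punctuation(tokens):
    """Split one tokenized sentence into inter-punctuation fragments.

    Punctuation tokens act as separators and are dropped; empty
    fragments (e.g. between adjacent punctuation marks) are skipped.
    """
    fragments, current = [], []
    for tok in tokens:
        if is_punct(tok):
            if current:
                fragments.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        fragments.append(current)
    return fragments


def fragments_up_to(sentences, max_len):
    """Hypothetical curriculum filter: keep only fragments of at most max_len words."""
    kept = []
    for sent in sentences:
        for frag in split_at_punctuation(sent):
            if len(frag) <= max_len:
                kept.append(frag)
    return kept


if __name__ == "__main__":
    sentence = "The bill , which passed , now goes to the Senate .".split()
    for fragment in split_at_punctuation(sentence):
        print(" ".join(fragment))
```

Under a whole-sentence length filter, a long sentence like the one above would be discarded entirely; treating its fragments as separate units lets its short pieces ("The bill", "which passed") enter training early.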

Cite this Paper


BibTeX
@InProceedings{pmlr-v21-spitkovsky12a,
  title = {Bootstrapping Dependency Grammar Inducers from Incomplete Sentence Fragments via Austere Models},
  author = {Valentin I. Spitkovsky and Hiyan Alshawi and Daniel Jurafsky},
  booktitle = {Proceedings of the Eleventh International Conference on Grammatical Inference},
  pages = {189--194},
  year = {2012},
  editor = {Jeffrey Heinz and Colin Higuera and Tim Oates},
  volume = {21},
  series = {Proceedings of Machine Learning Research},
  address = {University of Maryland, College Park, MD, USA},
  month = {05--08 Sep},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v21/spitkovsky12a/spitkovsky12a.pdf},
  url = {http://proceedings.mlr.press/v21/spitkovsky12a.html},
  abstract = {Modern grammar induction systems often employ curriculum learning strategies that begin by training on a subset of all available input that is considered simpler than the full data. Traditionally, filtering has been at granularities of whole input units, e.g., discarding entire sentences with too many words or punctuation marks. We propose instead viewing inter-punctuation fragments as atoms, initially, thus making some simple phrases and clauses of complex sentences available to training sooner. Splitting input text at punctuation in this way improved our state-of-the-art grammar induction pipeline. We observe that resulting partial data, i.e., mostly incomplete sentence fragments, can be analyzed using reduced parsing models which, we show, can be easier to bootstrap than more nuanced grammars. Starting with a new, bare dependency-and-boundary model (DBM-0), our grammar inducer attained 61.2% directed dependency accuracy on Section 23 (all sentences) of the Wall Street Journal corpus: more than 2% higher than previous published results for this task.}
}
Endnote
%0 Conference Paper
%T Bootstrapping Dependency Grammar Inducers from Incomplete Sentence Fragments via Austere Models
%A Valentin I. Spitkovsky
%A Hiyan Alshawi
%A Daniel Jurafsky
%B Proceedings of the Eleventh International Conference on Grammatical Inference
%C Proceedings of Machine Learning Research
%D 2012
%E Jeffrey Heinz
%E Colin Higuera
%E Tim Oates
%F pmlr-v21-spitkovsky12a
%I PMLR
%J Proceedings of Machine Learning Research
%P 189--194
%U http://proceedings.mlr.press
%V 21
%W PMLR
%X Modern grammar induction systems often employ curriculum learning strategies that begin by training on a subset of all available input that is considered simpler than the full data. Traditionally, filtering has been at granularities of whole input units, e.g., discarding entire sentences with too many words or punctuation marks. We propose instead viewing inter-punctuation fragments as atoms, initially, thus making some simple phrases and clauses of complex sentences available to training sooner. Splitting input text at punctuation in this way improved our state-of-the-art grammar induction pipeline. We observe that resulting partial data, i.e., mostly incomplete sentence fragments, can be analyzed using reduced parsing models which, we show, can be easier to bootstrap than more nuanced grammars. Starting with a new, bare dependency-and-boundary model (DBM-0), our grammar inducer attained 61.2% directed dependency accuracy on Section 23 (all sentences) of the Wall Street Journal corpus: more than 2% higher than previous published results for this task.
RIS
TY  - CPAPER
TI  - Bootstrapping Dependency Grammar Inducers from Incomplete Sentence Fragments via Austere Models
AU  - Valentin I. Spitkovsky
AU  - Hiyan Alshawi
AU  - Daniel Jurafsky
BT  - Proceedings of the Eleventh International Conference on Grammatical Inference
PY  - 2012/08/16
DA  - 2012/08/16
ED  - Jeffrey Heinz
ED  - Colin Higuera
ED  - Tim Oates
ID  - pmlr-v21-spitkovsky12a
PB  - PMLR
SP  - 189
DP  - PMLR
EP  - 194
L1  - http://proceedings.mlr.press/v21/spitkovsky12a/spitkovsky12a.pdf
UR  - http://proceedings.mlr.press/v21/spitkovsky12a.html
AB  - Modern grammar induction systems often employ curriculum learning strategies that begin by training on a subset of all available input that is considered simpler than the full data. Traditionally, filtering has been at granularities of whole input units, e.g., discarding entire sentences with too many words or punctuation marks. We propose instead viewing inter-punctuation fragments as atoms, initially, thus making some simple phrases and clauses of complex sentences available to training sooner. Splitting input text at punctuation in this way improved our state-of-the-art grammar induction pipeline. We observe that resulting partial data, i.e., mostly incomplete sentence fragments, can be analyzed using reduced parsing models which, we show, can be easier to bootstrap than more nuanced grammars. Starting with a new, bare dependency-and-boundary model (DBM-0), our grammar inducer attained 61.2% directed dependency accuracy on Section 23 (all sentences) of the Wall Street Journal corpus: more than 2% higher than previous published results for this task.
ER  -
APA
Spitkovsky, V. I., Alshawi, H., & Jurafsky, D. (2012). Bootstrapping Dependency Grammar Inducers from Incomplete Sentence Fragments via Austere Models. Proceedings of the Eleventh International Conference on Grammatical Inference, in PMLR 21:189-194.
