ContextGen: Targeted Data Generation for Low Resource Domain Specific Text Classification

Lukas Fromme, Jasmina Bogojeska, Jonas Kuhn
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:3016-3027, 2022.

Abstract

To address challenging low-resource, non-topical text classification problems in domain-specific settings, we introduce ContextGen, a novel approach that uses targeted text generation without fine-tuning to augment a small annotated dataset. It first adapts the powerful GPT-2 text generation model to produce samples relevant to the domain by using carefully designed context text as generation input. It then assigns class labels to the newly generated samples, which are added to the initial training set. We demonstrate the superior performance of a state-of-the-art text classifier trained on the augmented labelled dataset for four different non-topical tasks in the low-resource setting, three of which come from specialized domains.
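
As a rough illustration of the pipeline described in the abstract, the following Python sketch conditions an off-the-shelf GPT-2 model (via the Hugging Face transformers library) on a context text and attaches labels to the generated samples. The prompt format, sampling hyperparameters, and label-assignment heuristic are illustrative assumptions, not the paper's exact procedure.

# Minimal sketch of the ContextGen idea: generate in-domain samples by
# conditioning GPT-2 on context text (no fine-tuning), then label them
# and add them to the small training set.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def generate_samples(context_text, num_samples=5, max_new_tokens=60):
    """Generate candidate samples conditioned on domain-specific context."""
    inputs = tokenizer(context_text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,            # sampling yields diverse candidates
        top_p=0.95,                # illustrative hyperparameter choice
        max_new_tokens=max_new_tokens,
        num_return_sequences=num_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    # Keep only the newly generated continuation, not the context prompt.
    return [
        tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)
        for seq in outputs
    ]

# Hypothetical usage: seed generation with labelled in-domain text and
# reuse that label for the generated samples (one possible assignment
# heuristic; the paper's labelling step may differ).
context = "Customer complaint: The device stopped charging after two days."
augmented = [(sample, "complaint") for sample in generate_samples(context)]

The augmented pairs would then be appended to the original labelled data before training the classifier.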

Cite this Paper


BibTeX
@InProceedings{pmlr-v151-fromme22a,
  title     = {ContextGen: Targeted Data Generation for Low Resource Domain Specific Text Classification},
  author    = {Fromme, Lukas and Bogojeska, Jasmina and Kuhn, Jonas},
  booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
  pages     = {3016--3027},
  year      = {2022},
  editor    = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
  volume    = {151},
  series    = {Proceedings of Machine Learning Research},
  month     = {28--30 Mar},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v151/fromme22a/fromme22a.pdf},
  url       = {https://proceedings.mlr.press/v151/fromme22a.html},
  abstract  = {To address the challenging low-resource non-topical text classification problems in domain specific settings we introduce ContextGen – a novel approach that uses targeted text generation with no fine tuning to augment the available small annotated dataset. It first adapts the powerful GPT-2 text generation model to generate samples relevant for the domain by using properly designed context text as input for generation. Then it assigns class labels to the newly generated samples after which they are added to the initial training set. We demonstrate the superior performance of a state-of-the-art text classifier trained with the augmented labelled dataset for four different non-topical tasks in the low resource setting, three of which are from specialized domains.}
}
Endnote
%0 Conference Paper
%T ContextGen: Targeted Data Generation for Low Resource Domain Specific Text Classification
%A Lukas Fromme
%A Jasmina Bogojeska
%A Jonas Kuhn
%B Proceedings of The 25th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2022
%E Gustau Camps-Valls
%E Francisco J. R. Ruiz
%E Isabel Valera
%F pmlr-v151-fromme22a
%I PMLR
%P 3016--3027
%U https://proceedings.mlr.press/v151/fromme22a.html
%V 151
%X To address the challenging low-resource non-topical text classification problems in domain specific settings we introduce ContextGen – a novel approach that uses targeted text generation with no fine tuning to augment the available small annotated dataset. It first adapts the powerful GPT-2 text generation model to generate samples relevant for the domain by using properly designed context text as input for generation. Then it assigns class labels to the newly generated samples after which they are added to the initial training set. We demonstrate the superior performance of a state-of-the-art text classifier trained with the augmented labelled dataset for four different non-topical tasks in the low resource setting, three of which are from specialized domains.
APA
Fromme, L., Bogojeska, J. & Kuhn, J. (2022). ContextGen: Targeted Data Generation for Low Resource Domain Specific Text Classification. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 151:3016-3027. Available from https://proceedings.mlr.press/v151/fromme22a.html.