ContextGen: Targeted Data Generation for Low Resource Domain Specific Text Classification
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:3016-3027, 2022.
To address challenging low-resource, non-topical text classification problems in domain-specific settings, we introduce ContextGen, a novel approach that uses targeted text generation, with no fine-tuning, to augment a small annotated dataset. It first steers the GPT-2 text generation model toward domain-relevant samples by supplying carefully designed context text as the input for generation. It then assigns class labels to the newly generated samples and adds them to the initial training set. We show that a state-of-the-art text classifier trained on the augmented labelled dataset achieves superior performance on four non-topical tasks in the low-resource setting, three of which come from specialized domains.
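The generate, label, and augment steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `augment`, `generate`, and `assign_label` are hypothetical, and because the abstract does not specify how labels are assigned, the labeller here is an assumed callable that may abstain. The toy generator and labeller are stand-ins so the example runs offline; in practice `generate` might wrap a pretrained GPT-2 model prompted with the context text, without any fine-tuning.

```python
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, str]  # (text, label)


def augment(
    seed: List[Example],
    contexts: List[str],
    generate: Callable[[str], List[str]],
    assign_label: Callable[[str], Optional[str]],
) -> List[Example]:
    """Return the seed set plus newly generated, labelled samples.

    `generate` maps a context text to candidate samples (assumed interface).
    `assign_label` returns a class label, or None to discard the sample
    (assumed interface; the abstract does not describe the labelling rule).
    """
    seen = {text for text, _ in seed}
    augmented = list(seed)
    for context in contexts:
        for sample in generate(context):
            sample = sample.strip()
            if not sample or sample in seen:
                continue  # skip empty strings and duplicates
            label = assign_label(sample)
            if label is None:
                continue  # the labeller abstained
            seen.add(sample)
            augmented.append((sample, label))
    return augmented


# Toy stand-ins, for illustration only. A real generator could call a
# pretrained GPT-2 model with `context` as the prompt.
def toy_generate(context: str) -> List[str]:
    return [context + " and it worked well.", context + " and it broke twice."]


def toy_label(text: str) -> Optional[str]:
    if "worked" in text:
        return "positive"
    if "broke" in text:
        return "negative"
    return None


seed = [("The device worked well.", "positive")]
result = augment(seed, ["I used the device for a week"], toy_generate, toy_label)
```

Passing the generator and labeller as callables keeps the sketch independent of any particular model library; the augmented list can then be used to train whichever classifier is being evaluated.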