Word Complexity Prediction Through ML-Based Contextual Analysis
Proceedings of the UK AI Conference 2024, PMLR 295:53-61, 2025.
Abstract
This paper presents a comparative evaluation of two approaches for predicting word complexity using
contextual sentence information, a challenge that traditional methods often struggle to address. The first approach
combines XLNet word embeddings with a Random Forest classifier, processing both sentence and word embeddings to predict
complexity levels. The second approach employs a dual Bidirectional Encoder Representations from Transformers (BERT)
architecture, consisting of two separate models: one for sentence-level complexity and another for word-level complexity,
with their predictions combined for a more context-sensitive result. A diverse dataset covering religious, biomedical, and
parliamentary texts was used; it is pre-categorised into five complexity levels (Very-easy, Easy, Medium, Hard, Very-hard).
To ensure balanced class representation, data augmentation techniques were applied. Evaluation revealed that the XLNet-based
model performed slightly better than the dual-BERT method, achieving a macro-average F1-score of 0.79 and excelling
particularly at identifying highly complex words (F1-score = 0.95). In comparison, the dual-BERT approach achieved a
macro-average F1-score of 0.78.
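The first approach described above can be sketched as follows. This is a minimal, hypothetical illustration: random vectors stand in for the XLNet word and sentence embeddings the paper actually uses, and the sample count, embedding dimension, and classifier settings are illustrative assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, emb_dim = 200, 32  # illustrative sizes, not the paper's

# Stand-ins for XLNet embeddings: one vector per target word and one per
# containing sentence (in the paper these would come from an XLNet encoder).
word_emb = rng.normal(size=(n_samples, emb_dim))
sent_emb = rng.normal(size=(n_samples, emb_dim))

# Concatenate word and sentence embeddings into a joint feature vector,
# so the classifier sees both the word and its context.
X = np.concatenate([word_emb, sent_emb], axis=1)

# Five complexity levels, mirroring the dataset's categories
# (Very-easy, Easy, Medium, Hard, Very-hard), encoded 0-4 here.
y = rng.integers(0, 5, size=n_samples)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
preds = clf.predict(X)
```

Concatenating the two embeddings is one straightforward way to let a tree ensemble weigh word-level and sentence-level signals jointly; the sketch makes no claim about the paper's exact feature construction.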