Word Complexity Prediction Through ML-Based Contextual Analysis

Muhammad Uzzam, Amal Htait
Proceedings of the UK AI Conference 2024, PMLR 295:53-61, 2025.

Abstract

This paper presents a comparative evaluation of two approaches for predicting word complexity using contextual sentence information, a challenge that traditional methods often struggle to address. The first approach combines XLNet word embeddings with a Random Forest classifier, processing both sentence and word embeddings to predict complexity levels. The second approach employs a dual Bidirectional Encoder Representations from Transformers (BERT) model, consisting of two separate models: one for sentence-level complexity and another for word-level complexity, with their predictions combined for a more context-sensitive result. A diverse dataset covering religious, biomedical, and parliamentary texts was used; it is pre-categorised into five complexity levels (Very-easy, Easy, Medium, Hard, Very-hard), and data augmentation techniques were applied to ensure balanced class representation. Evaluation revealed that the XLNet-based model performed slightly better than the dual-BERT method, achieving a macro-average F1-score of 0.79 and excelling particularly at identifying highly complex words (F1-score = 0.95). In comparison, the dual-BERT method achieved a macro-average F1-score of 0.78.
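
As a rough illustration of the first approach, the sketch below pairs XLNet embeddings (via Hugging Face transformers) with a scikit-learn Random Forest, concatenating a sentence embedding with a target-word embedding as the input features. The mean pooling, example sentences, and labels are illustrative assumptions, not details taken from the paper.

# Minimal sketch of the XLNet + Random Forest approach (assumptions noted above).
import torch
from transformers import XLNetTokenizer, XLNetModel
from sklearn.ensemble import RandomForestClassifier

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
xlnet = XLNetModel.from_pretrained("xlnet-base-cased")

def embed(text: str) -> list[float]:
    # Mean-pool the final hidden states into one fixed-size vector (assumed pooling).
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = xlnet(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0).tolist()

# Hypothetical training examples: (sentence, target word, complexity label).
train = [
    ("The act received royal assent.", "assent", "Hard"),
    ("The cat sat on the mat.", "cat", "Very-easy"),
]
X = [embed(sentence) + embed(word) for sentence, word, _ in train]
y = [label for _, _, label in train]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Predict the complexity of a new word in its sentence context.
print(clf.predict([embed("The committee deliberated at length.") + embed("deliberated")]))

The dual-BERT approach would instead fine-tune two sequence classifiers (for example, BertForSequenceClassification), one on sentences and one on target words, and merge their per-class predictions; since the paper's combination rule is not given here, it is left out of the sketch.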

Cite this Paper


BibTeX
@InProceedings{pmlr-v295-uzzam25a,
  title     = {Word Complexity Prediction Through ML-Based Contextual Analysis},
  author    = {Uzzam, Muhammad and Htait, Amal},
  booktitle = {Proceedings of the UK AI Conference 2024},
  pages     = {53--61},
  year      = {2025},
  editor    = {Benford, Alistair and Cabrera, Christian and Kiden, Sarah and Salili-James, Arianna and Zakka, Vincent Gbouna},
  volume    = {295},
  series    = {Proceedings of Machine Learning Research},
  month     = {05 Aug},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v295/main/assets/uzzam25a/uzzam25a.pdf},
  url       = {https://proceedings.mlr.press/v295/uzzam25a.html}
}
Endnote
%0 Conference Paper
%T Word Complexity Prediction Through ML-Based Contextual Analysis
%A Muhammad Uzzam
%A Amal Htait
%B Proceedings of the UK AI Conference 2024
%C Proceedings of Machine Learning Research
%D 2025
%E Alistair Benford
%E Christian Cabrera
%E Sarah Kiden
%E Arianna Salili-James
%E Vincent Gbouna Zakka
%F pmlr-v295-uzzam25a
%I PMLR
%P 53--61
%U https://proceedings.mlr.press/v295/uzzam25a.html
%V 295
APA
Uzzam, M. & Htait, A. (2025). Word Complexity Prediction Through ML-Based Contextual Analysis. Proceedings of the UK AI Conference 2024, in Proceedings of Machine Learning Research 295:53-61. Available from https://proceedings.mlr.press/v295/uzzam25a.html.
