Word Complexity Prediction Through ML-Based Contextual Analysis

Muhammad Uzzam, Amal Htait
Proceedings of the UK AI Conference 2024, PMLR 295:53-61, 2025.

Abstract

This paper presents a comparative evaluation of two approaches for predicting word complexity using contextual sentence information, a challenge that traditional methods often struggle to address. The first approach combines XLNet word embeddings with a Random Forest classifier, processing both sentence and word embeddings to predict complexity levels. The second approach employs a dual Bidirectional Encoder Representations from Transformers (BERT) model, consisting of two separate models: one for sentence-level complexity and another for word-level complexity, with their predictions combined for a more context-sensitive result. A diverse dataset covering religious, biomedical, and parliamentary texts was used; it is pre-categorised into five complexity levels (Very-easy, Easy, Medium, Hard, Very-hard), and data augmentation techniques were applied to ensure balanced class representation. Evaluation revealed that the XLNet-based model performed slightly better than the dual-BERT method, achieving a macro-average F1-score of 0.79 and excelling particularly at identifying highly complex words (F1-score = 0.95). In comparison, the dual-BERT method achieved a macro-average F1-score of 0.78.
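
As a rough illustration of the first approach, the sketch below pairs XLNet embeddings (via Hugging Face transformers) with a scikit-learn Random Forest, concatenating a sentence embedding with a target-word embedding as the input features. The mean pooling, example sentences, and labels are illustrative assumptions, not details taken from the paper.

# Minimal sketch of the XLNet + Random Forest approach (assumptions noted above).
import torch
from transformers import XLNetTokenizer, XLNetModel
from sklearn.ensemble import RandomForestClassifier

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
xlnet = XLNetModel.from_pretrained("xlnet-base-cased")

def embed(text: str) -> list[float]:
    # Mean-pool the final hidden states into one fixed-size vector (assumed pooling).
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = xlnet(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0).tolist()

# Hypothetical training examples: (sentence, target word, complexity label).
train = [
    ("The act received royal assent.", "assent", "Hard"),
    ("The cat sat on the mat.", "cat", "Very-easy"),
]
X = [embed(sentence) + embed(word) for sentence, word, _ in train]
y = [label for _, _, label in train]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Predict the complexity of a new word in its sentence context.
print(clf.predict([embed("The committee deliberated at length.") + embed("deliberated")]))

The dual-BERT approach would instead fine-tune two sequence classifiers (for example, BertForSequenceClassification), one on sentences and one on target words, and merge their per-class predictions; since the paper's combination rule is not given here, it is left out of the sketch.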

Cite this Paper


BibTeX
@InProceedings{pmlr-v295-uzzam25a,
  title     = {Word Complexity Prediction Through ML-Based Contextual Analysis},
  author    = {Uzzam, Muhammad and Htait, Amal},
  booktitle = {Proceedings of the UK AI Conference 2024},
  pages     = {53--61},
  year      = {2025},
  editor    = {Benford, Alistair and Cabrera, Christian and Kiden, Sarah and Salili-James, Arianna and Zakka, Vincent Gbouna},
  volume    = {295},
  series    = {Proceedings of Machine Learning Research},
  month     = {05 Aug},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v295/main/assets/uzzam25a/uzzam25a.pdf},
  url       = {https://proceedings.mlr.press/v295/uzzam25a.html}
}
Endnote
%0 Conference Paper
%T Word Complexity Prediction Through ML-Based Contextual Analysis
%A Muhammad Uzzam
%A Amal Htait
%B Proceedings of the UK AI Conference 2024
%C Proceedings of Machine Learning Research
%D 2025
%E Alistair Benford
%E Christian Cabrera
%E Sarah Kiden
%E Arianna Salili-James
%E Vincent Gbouna Zakka
%F pmlr-v295-uzzam25a
%I PMLR
%P 53--61
%U https://proceedings.mlr.press/v295/uzzam25a.html
%V 295
APA
Uzzam, M. & Htait, A. (2025). Word Complexity Prediction Through ML-Based Contextual Analysis. Proceedings of the UK AI Conference 2024, in Proceedings of Machine Learning Research 295:53-61. Available from https://proceedings.mlr.press/v295/uzzam25a.html.
