Research on Chinese Text Similarity by Fusing Deep and Shallow Features
Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing, PMLR 278:808-818, 2025.
Abstract
Existing Chinese text similarity methods typically focus on a single dimension, so they integrate information poorly and struggle to combine semantic, feature, and structural information. To address this issue, this paper proposes a Chinese text similarity model that integrates deep and shallow similarities. The model first uses a Siamese neural network to obtain dynamic vector representations of the texts, then extracts further features and computes a deep semantic similarity. Next, building on the traditional edit distance algorithm, it introduces an improved component-weighted edit distance that tokenizes the texts and assigns weights to different parts of speech, so that the lexical-level shallow features and structural information of the texts are reflected more accurately. Finally, the deep semantic similarity and the shallow feature similarity are fused by linear weighting to give a more comprehensive similarity score. On the Chinese STS-B and Chinese SICK datasets, the Spearman correlation coefficients improve by 4.34 and 3.76, respectively, over the baseline model Siamese-RoBERTa. The model effectively improves Chinese short-text similarity calculation and aligns better with the expression habits of Chinese text.
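The shallow-similarity and fusion steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the part-of-speech weights in `POS_WEIGHTS`, the default weight, the substitution cost, the normalization by total token weight, and the fusion coefficient `alpha` are all assumptions chosen for illustration, since the abstract does not give them. Inputs are assumed to be already segmented and tagged as `(word, tag)` pairs, and the deep semantic similarity is taken as a precomputed number rather than produced by a Siamese network.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag), assumed pre-tokenized

# Hypothetical weights: content words cost more to edit than function words.
POS_WEIGHTS: Dict[str, float] = {"n": 1.0, "v": 1.0, "a": 0.8, "d": 0.5, "u": 0.2}
DEFAULT_WEIGHT = 0.5  # assumed weight for tags not listed above


def weight(token: Token) -> float:
    """Edit cost of one token, looked up by its part-of-speech tag."""
    return POS_WEIGHTS.get(token[1], DEFAULT_WEIGHT)


def weighted_edit_distance(a: List[Token], b: List[Token]) -> float:
    """Word-level edit distance where each edit costs the token's weight."""
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + weight(a[i - 1])  # delete all of a[:i]
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + weight(b[j - 1])  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1][0] == b[j - 1][0]:
                sub = 0.0  # same word: no cost
            else:
                # Assumed substitution cost: the heavier of the two tokens.
                sub = max(weight(a[i - 1]), weight(b[j - 1]))
            dp[i][j] = min(
                dp[i - 1][j] + weight(a[i - 1]),  # deletion
                dp[i][j - 1] + weight(b[j - 1]),  # insertion
                dp[i - 1][j - 1] + sub,           # match or substitution
            )
    return dp[m][n]


def shallow_similarity(a: List[Token], b: List[Token]) -> float:
    """Map the weighted distance into [0, 1]; 1.0 means identical token lists."""
    total = sum(weight(t) for t in a) + sum(weight(t) for t in b)
    if total == 0:
        return 1.0  # two empty texts are treated as identical
    return 1.0 - weighted_edit_distance(a, b) / total


def fuse(deep_sim: float, shallow_sim: float, alpha: float = 0.7) -> float:
    """Linear fusion of the two scores; alpha is an assumed coefficient."""
    return alpha * deep_sim + (1.0 - alpha) * shallow_sim
```

With these assumed weights, dropping a particle such as 的 changes the distance far less than swapping a noun, which is the intuition behind weighting edits by part of speech. In practice `alpha` would be tuned on a validation set.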