Research on Chinese Text Similarity by Fusing Deep and Shallow Features

Chengfang Lu, Gang Li, Linjie Hou
Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing, PMLR 278:808-818, 2025.

Abstract

Existing Chinese text similarity calculation methods typically focus on a single dimension, so they integrate information insufficiently and struggle to combine semantic, feature, and structural information comprehensively. To address this issue, a Chinese text similarity calculation model that integrates deep and shallow similarities is proposed. The model first uses a Siamese neural network to obtain dynamic vector representations of the texts, from which it extracts features and computes a deep semantic similarity. Next, building on the traditional edit distance algorithm, an improved component-weighted edit distance is designed: the texts are tokenized and different parts of speech are assigned different weights, so that lexical-level shallow features and structural information are reflected more accurately. Finally, the deep semantic similarity and the shallow feature similarity are fused by linear weighting to give a more comprehensive similarity evaluation. Experiments on the Chinese STS-B and Chinese SICK datasets show that the Spearman correlation coefficients improve by 4.34 and 3.76, respectively, over the baseline model Siamese-RoBERTa. The model effectively improves the performance of Chinese short text similarity calculation and better matches the expression habits of Chinese texts.
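
The pipeline described in the abstract can be made concrete with a short Python sketch. This is not the authors' implementation: jieba.posseg is assumed as the tokenizer and POS tagger, and the POS weight table, the distance normalization, and the fusion coefficient alpha are placeholder choices used only to illustrate the three steps (cosine similarity over Siamese sentence embeddings, component-weighted edit distance, linear fusion).

# Minimal, self-contained sketch of the two similarity components and their fusion.
# Not the authors' code: the jieba.posseg tokenizer/POS tagger, the POS weight table,
# the normalization, and the fusion coefficient alpha are illustrative assumptions.

import numpy as np
import jieba.posseg as pseg  # assumed Chinese tokenizer + POS tagger

# Hypothetical part-of-speech weights: content words (nouns, verbs) cost more to edit
# than function words, so changing them lowers the shallow similarity more.
POS_WEIGHTS = {"n": 1.0, "v": 1.0, "a": 0.8, "d": 0.5, "p": 0.3, "u": 0.2}
DEFAULT_WEIGHT = 0.6


def _weighted_tokens(sentence: str):
    """Tokenize and attach a weight to each token based on its POS tag."""
    return [(word, POS_WEIGHTS.get(flag[:1], DEFAULT_WEIGHT)) for word, flag in pseg.cut(sentence)]


def weighted_edit_distance(sent_a: str, sent_b: str) -> float:
    """Token-level edit distance where insert/delete/substitute operations are
    weighted by the part of speech of the tokens involved."""
    a, b = _weighted_tokens(sent_a), _weighted_tokens(sent_b)
    m, n = len(a), len(b)
    dp = np.zeros((m + 1, n + 1))
    for i in range(1, m + 1):
        dp[i, 0] = dp[i - 1, 0] + a[i - 1][1]          # delete a[i-1]
    for j in range(1, n + 1):
        dp[0, j] = dp[0, j - 1] + b[j - 1][1]          # insert b[j-1]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1][0] == b[j - 1][0]:
                dp[i, j] = dp[i - 1, j - 1]            # identical token, no cost
            else:
                dp[i, j] = min(
                    dp[i - 1, j] + a[i - 1][1],                        # delete
                    dp[i, j - 1] + b[j - 1][1],                        # insert
                    dp[i - 1, j - 1] + max(a[i - 1][1], b[j - 1][1]),  # substitute
                )
    return float(dp[m, n])


def shallow_similarity(sent_a: str, sent_b: str) -> float:
    """Map the weighted edit distance to a [0, 1] score (illustrative normalization
    by the larger total token weight)."""
    w_a = sum(w for _, w in _weighted_tokens(sent_a))
    w_b = sum(w for _, w in _weighted_tokens(sent_b))
    dist = weighted_edit_distance(sent_a, sent_b)
    return max(0.0, 1.0 - dist / max(w_a, w_b, 1e-9))


def deep_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    """Cosine similarity between two sentence embeddings, e.g. produced by a Siamese
    encoder; the encoder itself is outside the scope of this sketch."""
    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b) + 1e-9
    return float(np.dot(vec_a, vec_b) / denom)


def fused_similarity(vec_a, vec_b, sent_a, sent_b, alpha: float = 0.7) -> float:
    """Linear weighted fusion of deep semantic and shallow lexical similarity.
    alpha is an illustrative coefficient, not a value taken from the paper."""
    return alpha * deep_similarity(vec_a, vec_b) + (1.0 - alpha) * shallow_similarity(sent_a, sent_b)

In this sketch, identical sentences receive a shallow score of 1, and differences in heavily weighted content words pull the score down faster than differences in function words; the fusion step lets the deep semantic score dominate while the lexical score corrects cases where the encoder overlooks surface-level structure.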

Cite this Paper


BibTeX
@InProceedings{pmlr-v278-lu25a,
  title     = {Research on Chinese Text Similarity by Fusing Deep and Shallow Features},
  author    = {Lu, Chengfang and Li, Gang and Hou, Linjie},
  booktitle = {Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing},
  pages     = {808--818},
  year      = {2025},
  editor    = {Zeng, Nianyin and Pachori, Ram Bilas and Wang, Dongshu},
  volume    = {278},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v278/main/assets/lu25a/lu25a.pdf},
  url       = {https://proceedings.mlr.press/v278/lu25a.html},
  abstract  = {Existing Chinese text similarity calculation methods typically focus on a single dimension, resulting in insufficient information integration and difficulty in comprehensively merging semantic, feature, and structural information. To address this issue, a Chinese text similarity calculation model that integrates deep and shallow similarities has been proposed. The model first utilizes a Siamese neural network to obtain dynamic vector representations of the texts, further extracting features and calculating deep semantic similarity. Next, based on traditional edit distance algorithms, an improved component-weighted edit distance algorithm is designed by introducing tokenization and assigning weights to different parts of speech, to more accurately reflect the lexical-level shallow features and structural information of the texts. Finally, by linearly weighting and fusing deep semantic similarity with shallow feature similarity, a more comprehensive text similarity evaluation is achieved. Experimental results show that in experiments based on Chinese STS-B and Chinese SICK datasets, the Spearman correlation coefficients improved by 4.34 and 3.76, respectively, compared to the baseline model Siamese-RoBERTa. This model effectively enhances the performance of Chinese short text similarity calculation and better aligns with the expression habits of Chinese texts.}
}
Endnote
%0 Conference Paper
%T Research on Chinese Text Similarity by Fusing Deep and Shallow Features
%A Chengfang Lu
%A Gang Li
%A Linjie Hou
%B Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing
%C Proceedings of Machine Learning Research
%D 2025
%E Nianyin Zeng
%E Ram Bilas Pachori
%E Dongshu Wang
%F pmlr-v278-lu25a
%I PMLR
%P 808--818
%U https://proceedings.mlr.press/v278/lu25a.html
%V 278
%X Existing Chinese text similarity calculation methods typically focus on a single dimension, resulting in insufficient information integration and difficulty in comprehensively merging semantic, feature, and structural information. To address this issue, a Chinese text similarity calculation model that integrates deep and shallow similarities has been proposed. The model first utilizes a Siamese neural network to obtain dynamic vector representations of the texts, further extracting features and calculating deep semantic similarity. Next, based on traditional edit distance algorithms, an improved component-weighted edit distance algorithm is designed by introducing tokenization and assigning weights to different parts of speech, to more accurately reflect the lexical-level shallow features and structural information of the texts. Finally, by linearly weighting and fusing deep semantic similarity with shallow feature similarity, a more comprehensive text similarity evaluation is achieved. Experimental results show that in experiments based on Chinese STS-B and Chinese SICK datasets, the Spearman correlation coefficients improved by 4.34 and 3.76, respectively, compared to the baseline model Siamese-RoBERTa. This model effectively enhances the performance of Chinese short text similarity calculation and better aligns with the expression habits of Chinese texts.
APA
Lu, C., Li, G. & Hou, L. (2025). Research on Chinese Text Similarity by Fusing Deep and Shallow Features. Proceedings of 2025 2nd International Conference on Machine Learning and Intelligent Computing, in Proceedings of Machine Learning Research 278:808-818. Available from https://proceedings.mlr.press/v278/lu25a.html.
