Classifying Copy Number Variations Using State Space Modeling of Targeted Sequencing Data: A Case Study in Thalassemia

Austin Talbot, Alex V. Kotlar, Lavanya Rishishwar, Yue Ke
Proceedings of the 10th Machine Learning for Healthcare Conference, PMLR 298, 2025.

Abstract

Thalassemia, a blood disorder and one of the most prevalent hereditary genetic disorders worldwide, is often caused by copy number variations (CNVs) in the hemoglobin genes. This disorder has incredible diversity, with a large number of distinct profiles corresponding to alterations of different regions in the genes. Correctly classifying an individual’s profile is critical as it impacts treatment, prognosis, and genetic counseling. However, genetic classification is challenging due to the large number of profiles worldwide, and often requires a large number of sequential tests. Targeted next generation sequencing (NGS), which characterizes segments of an individual’s genome, has the potential to dramatically reduce the cost of testing and increase accuracy. In this work, we introduce a probabilistic state space model for profiling thalassemia from targeted NGS data, which naturally characterize the spatial ordering of the genes along the chromosome. We then use decision theory to choose the best profile among the different options. Due to our use of Bayesian methodology, we are also able to detect low-quality samples to be excluded from consideration, an important component of clinical screening. We evaluate our model on a dataset of 57 individuals, including both controls and cases with a variety of thalassemia profiles. Our model has a sensitivity of $0.99$ and specificity of $0.93$ for thalassemia detection, and accuracy of $91.5%$ for characterizing subtypes. Furthermore, the specificity and accuracy rise to $0.96$ and $93.9%$ when low-quality samples are excluded using our automated quality control method. This approach outperforms alternative methods, particularly in specificity, and is broadly applicable to other disorders.

Cite this Paper


BibTeX
@InProceedings{pmlr-v298-talbot25a, title = {Classifying Copy Number Variations Using State Space Modeling of Targeted Sequencing Data: A Case Study in Thalassemia}, author = {Talbot, Austin and Kotlar, Alex V. and Rishishwar, Lavanya and Ke, Yue}, booktitle = {Proceedings of the 10th Machine Learning for Healthcare Conference}, year = {2025}, editor = {Agrawal, Monica and Deshpande, Kaivalya and Engelhard, Matthew and Joshi, Shalmali and Tang, Shengpu and Urteaga, Iñigo}, volume = {298}, series = {Proceedings of Machine Learning Research}, month = {15--16 Aug}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v298/main/assets/talbot25a/talbot25a.pdf}, url = {https://proceedings.mlr.press/v298/talbot25a.html}, abstract = {Thalassemia, a blood disorder and one of the most prevalent hereditary genetic disorders worldwide, is often caused by copy number variations (CNVs) in the hemoglobin genes. This disorder has incredible diversity, with a large number of distinct profiles corresponding to alterations of different regions in the genes. Correctly classifying an individual’s profile is critical as it impacts treatment, prognosis, and genetic counseling. However, genetic classification is challenging due to the large number of profiles worldwide, and often requires a large number of sequential tests. Targeted next generation sequencing (NGS), which characterizes segments of an individual’s genome, has the potential to dramatically reduce the cost of testing and increase accuracy. In this work, we introduce a probabilistic state space model for profiling thalassemia from targeted NGS data, which naturally characterize the spatial ordering of the genes along the chromosome. We then use decision theory to choose the best profile among the different options. Due to our use of Bayesian methodology, we are also able to detect low-quality samples to be excluded from consideration, an important component of clinical screening. We evaluate our model on a dataset of 57 individuals, including both controls and cases with a variety of thalassemia profiles. Our model has a sensitivity of $0.99$ and specificity of $0.93$ for thalassemia detection, and accuracy of $91.5%$ for characterizing subtypes. Furthermore, the specificity and accuracy rise to $0.96$ and $93.9%$ when low-quality samples are excluded using our automated quality control method. This approach outperforms alternative methods, particularly in specificity, and is broadly applicable to other disorders.} }
Endnote
%0 Conference Paper %T Classifying Copy Number Variations Using State Space Modeling of Targeted Sequencing Data: A Case Study in Thalassemia %A Austin Talbot %A Alex V. Kotlar %A Lavanya Rishishwar %A Yue Ke %B Proceedings of the 10th Machine Learning for Healthcare Conference %C Proceedings of Machine Learning Research %D 2025 %E Monica Agrawal %E Kaivalya Deshpande %E Matthew Engelhard %E Shalmali Joshi %E Shengpu Tang %E Iñigo Urteaga %F pmlr-v298-talbot25a %I PMLR %U https://proceedings.mlr.press/v298/talbot25a.html %V 298 %X Thalassemia, a blood disorder and one of the most prevalent hereditary genetic disorders worldwide, is often caused by copy number variations (CNVs) in the hemoglobin genes. This disorder has incredible diversity, with a large number of distinct profiles corresponding to alterations of different regions in the genes. Correctly classifying an individual’s profile is critical as it impacts treatment, prognosis, and genetic counseling. However, genetic classification is challenging due to the large number of profiles worldwide, and often requires a large number of sequential tests. Targeted next generation sequencing (NGS), which characterizes segments of an individual’s genome, has the potential to dramatically reduce the cost of testing and increase accuracy. In this work, we introduce a probabilistic state space model for profiling thalassemia from targeted NGS data, which naturally characterize the spatial ordering of the genes along the chromosome. We then use decision theory to choose the best profile among the different options. Due to our use of Bayesian methodology, we are also able to detect low-quality samples to be excluded from consideration, an important component of clinical screening. We evaluate our model on a dataset of 57 individuals, including both controls and cases with a variety of thalassemia profiles. Our model has a sensitivity of $0.99$ and specificity of $0.93$ for thalassemia detection, and accuracy of $91.5%$ for characterizing subtypes. Furthermore, the specificity and accuracy rise to $0.96$ and $93.9%$ when low-quality samples are excluded using our automated quality control method. This approach outperforms alternative methods, particularly in specificity, and is broadly applicable to other disorders.
APA
Talbot, A., Kotlar, A.V., Rishishwar, L. & Ke, Y.. (2025). Classifying Copy Number Variations Using State Space Modeling of Targeted Sequencing Data: A Case Study in Thalassemia. Proceedings of the 10th Machine Learning for Healthcare Conference, in Proceedings of Machine Learning Research 298 Available from https://proceedings.mlr.press/v298/talbot25a.html.

Related Material