Bayesian Torrent Classification by File Name and Size Only
; Proceedings of the Eighth International Conference on Probabilistic Graphical Models, PMLR 52:136-146, 2016.
Torrent traffic, much of which is assumed to be illegal downloads of copyrighted content, accounts for up to 35% of internet downloads. Yet, the process of classification and identification of these downloads is unclear, and original data for such studies is often unavailable. Many torrent items lack supporting description or meta-data, in which case only file name and size are available. We describe a novel Bayesian network based classifier system that predicts medium category, pornographic content and risk of fakes and malware based on torrent name and size, optionally supplemented with external databases of titles and actors. We show that our method outperforms a commercial benchmark system and has the potential to rival human classifiers.