Using Grammatical Inference to Build Privacy Preserving Data-sets of User Logs
Proceedings of the Fifteenth International Conference on Grammatical Inference, PMLR 153:176-190, 2021.
In many web applications, user logs are extracted to build a user model, which can then feed further development, recommendation systems, or personalization. This is the case for education platforms like X5GON. In order to obtain community collaboration, these logs should be shared, but legal privacy issues arise. In this work, we propose to build a user model from a data-set of logs: this will be a timed and probabilistic $k$-testable automaton, which can then be used to generate a new data-set with statistically close characteristics, yet in which the original sequences have been sufficiently chunked that the original logs can no longer be identified. Following ideas from Differential Privacy, we provide a second algorithm that eliminates any strings whose influence would be too great. Experiments validate the approach.
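The abstract's core idea can be illustrated with a minimal sketch (not the authors' algorithm): a probabilistic $k$-testable model over log events behaves like a $(k{-}1)$-gram chain, so we can count the symbol following each length-$(k{-}1)$ context and then sample synthetic sequences from those counts. The function names and the untimed, chunk-based simplification here are assumptions for illustration; the paper's model additionally handles timing and a Differential-Privacy-style filtering step.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"

def train_k_testable(sequences, k=3):
    # Count, for each (k-1)-length context, how often each next
    # symbol follows it -- a simple probabilistic k-gram chain
    # standing in for the paper's timed k-testable automaton.
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        padded = [START] * (k - 1) + list(seq) + [END]
        for i in range(len(padded) - k + 1):
            ctx = tuple(padded[i:i + k - 1])
            counts[ctx][padded[i + k - 1]] += 1
    return counts

def sample(counts, k=3, rng=random):
    # Generate one synthetic sequence by walking the chain from the
    # start context until the end marker is drawn.  Each emitted
    # symbol depends only on the last k-1 symbols, so long original
    # sequences are effectively recombined in k-sized chunks.
    ctx = tuple([START] * (k - 1))
    out = []
    while True:
        dist = counts[ctx]
        symbols, weights = zip(*dist.items())
        sym = rng.choices(symbols, weights=weights)[0]
        if sym == END:
            return out
        out.append(sym)
        ctx = ctx[1:] + (sym,)
```

For example, training on event sequences such as `[["login", "search", "view"], ["login", "view"]]` and calling `sample(model)` yields new sequences whose local $k$-gram statistics match the originals without reproducing any full log verbatim.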