Compressing Tabular Data via Latent Variable Estimation

Andrea Montanari; Eric Weiner

Compressing Tabular Data via Latent Variable Estimation

Andrea Montanari, Eric Weiner

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:25174-25208, 2023.

Abstract

Data used for analytics and machine learning often take the form of tables with categorical entries. We introduce a family of lossless compression algorithms for such data that proceed in four steps: (i) Estimate latent variables associated to rows and columns; (ii) Partition the table in blocks according to the row/column latents; (iii) Apply a sequential (e.g. Lempel-Ziv) coder to each of the blocks; (iv) Append a compressed encoding of the latents. We evaluate this approach on several benchmark datasets, and study optimal compression in a probabilistic model for tabular data, whereby latent values are independent and table entries are conditionally independent given the latent values. We prove that the model has a well defined entropy rate and satisfies an asymptotic equipartition property. We also prove that classical compression schemes such as Lempel-Ziv and finite-state encoders do not achieve this rate. On the other hand, the latent estimation strategy outlined above achieves the optimal rate.

Cite this Paper

BibTeX


@InProceedings{pmlr-v202-montanari23a,
  title = 	 {Compressing Tabular Data via Latent Variable Estimation},
  author =       {Montanari, Andrea and Weiner, Eric},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {25174--25208},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/montanari23a/montanari23a.pdf},
  url = 	 {https://proceedings.mlr.press/v202/montanari23a.html},
  abstract = 	 {Data used for analytics and machine learning often take the form of tables with categorical entries. We introduce a family of lossless compression algorithms for such data that proceed in four steps: (i) Estimate latent variables associated to rows and columns; (ii) Partition the table in blocks according to the row/column latents; (iii) Apply a sequential (e.g. Lempel-Ziv) coder to each of the blocks; (iv) Append a compressed encoding of the latents. We evaluate this approach on several benchmark datasets, and study optimal compression in a probabilistic model for tabular data, whereby latent values are independent and table entries are conditionally independent given the latent values. We prove that the model has a well defined entropy rate and satisfies an asymptotic equipartition property. We also prove that classical compression schemes such as Lempel-Ziv and finite-state encoders do not achieve this rate. On the other hand, the latent estimation strategy outlined above achieves the optimal rate.}
}

Endnote

%0 Conference Paper
%T Compressing Tabular Data via Latent Variable Estimation
%A Andrea Montanari
%A Eric Weiner
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett	
%F pmlr-v202-montanari23a
%I PMLR
%P 25174--25208
%U https://proceedings.mlr.press/v202/montanari23a.html
%V 202
%X Data used for analytics and machine learning often take the form of tables with categorical entries. We introduce a family of lossless compression algorithms for such data that proceed in four steps: (i) Estimate latent variables associated to rows and columns; (ii) Partition the table in blocks according to the row/column latents; (iii) Apply a sequential (e.g. Lempel-Ziv) coder to each of the blocks; (iv) Append a compressed encoding of the latents. We evaluate this approach on several benchmark datasets, and study optimal compression in a probabilistic model for tabular data, whereby latent values are independent and table entries are conditionally independent given the latent values. We prove that the model has a well defined entropy rate and satisfies an asymptotic equipartition property. We also prove that classical compression schemes such as Lempel-Ziv and finite-state encoders do not achieve this rate. On the other hand, the latent estimation strategy outlined above achieves the optimal rate.

APA


Montanari, A. & Weiner, E.. (2023). Compressing Tabular Data via Latent Variable Estimation. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:25174-25208 Available from https://proceedings.mlr.press/v202/montanari23a.html.

Compressing Tabular Data via Latent Variable Estimation

Abstract

Cite this Paper

Related Material