OAK: Enriching Document Representations using Auxiliary Knowledge for Extreme Classification

Shikhar Mohan, Deepak Saini, Anshul Mittal, Sayak Ray Chowdhury, Bhawna Paliwal, Jian Jiao, Manish Gupta, Manik Varma
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:36012-36028, 2024.

Abstract

The objective in eXtreme Classification (XC) is to find relevant labels for a document from an exceptionally large label space. Most XC application scenarios have rich auxiliary data associated with the input documents, e.g., frequently clicked webpages for search queries in sponsored search. Unfortunately, most of the existing XC methods do not use any auxiliary data. In this paper, we propose a novel framework, Online Auxiliary Knowledge (OAK), which harnesses auxiliary information linked to the document to improve XC accuracy. OAK stores information learnt from the auxiliary data in a knowledge bank and, during a forward pass, retrieves relevant auxiliary knowledge embeddings for a given document. An enriched embedding is obtained by fusing these auxiliary knowledge embeddings with the document’s embedding, thereby enabling much more precise candidate label selection and final classification. OAK training involves three stages: (1) training a linker module to link documents to relevant auxiliary data points; (2) learning an embedding for documents enriched using linked auxiliary information; and (3) using the enriched document embeddings to learn the final classifiers. OAK outperforms current state-of-the-art XC methods by up to $\sim 5\%$ on academic datasets, and by $\sim 3\%$ on an auxiliary data-augmented variant of the LF-ORCAS-800K dataset in Precision@1. OAK also demonstrates statistically significant improvements in sponsored search metrics when deployed on a large-scale search engine.
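The retrieve-then-fuse step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the function names `retrieve` and `fuse`, the dot-product retrieval, the softmax-weighted additive fusion, and the toy three-dimensional vectors are all hypothetical, and the paper's linker and fusion modules may work quite differently.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def retrieve(doc_emb, knowledge_bank, k):
    """Hypothetical retrieval: the k bank entries most similar to the document."""
    ranked = sorted(knowledge_bank, key=lambda e: dot(doc_emb, e), reverse=True)
    return ranked[:k]

def fuse(doc_emb, aux_embs):
    """Hypothetical fusion: add a softmax-weighted sum of the auxiliary
    embeddings to the document embedding, then re-normalize."""
    if not aux_embs:
        return normalize(doc_emb)
    scores = [dot(doc_emb, e) for e in aux_embs]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    pooled = [
        sum(w * e[i] for w, e in zip(weights, aux_embs))
        for i in range(len(doc_emb))
    ]
    return normalize([d + p for d, p in zip(doc_emb, pooled)])

# Toy knowledge bank of auxiliary embeddings (illustrative values only).
knowledge_bank = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
doc = [0.9, 0.1, 0.0]

aux = retrieve(doc, knowledge_bank, k=2)
enriched = fuse(doc, aux)
```

In this sketch the enriched vector would then replace the plain document embedding when shortlisting candidate labels and scoring them with the classifiers.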

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-mohan24a,
  title     = {{OAK}: Enriching Document Representations using Auxiliary Knowledge for Extreme Classification},
  author    = {Mohan, Shikhar and Saini, Deepak and Mittal, Anshul and Ray Chowdhury, Sayak and Paliwal, Bhawna and Jiao, Jian and Gupta, Manish and Varma, Manik},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {36012--36028},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/mohan24a/mohan24a.pdf},
  url       = {https://proceedings.mlr.press/v235/mohan24a.html},
  abstract  = {The objective in eXtreme Classification (XC) is to find relevant labels for a document from an exceptionally large label space. Most XC application scenarios have rich auxiliary data associated with the input documents, e.g., frequently clicked webpages for search queries in sponsored search. Unfortunately, most of the existing XC methods do not use any auxiliary data. In this paper, we propose a novel framework, Online Auxiliary Knowledge (OAK), which harnesses auxiliary information linked to the document to improve XC accuracy. OAK stores information learnt from the auxiliary data in a knowledge bank and, during a forward pass, retrieves relevant auxiliary knowledge embeddings for a given document. An enriched embedding is obtained by fusing these auxiliary knowledge embeddings with the document’s embedding, thereby enabling much more precise candidate label selection and final classification. OAK training involves three stages: (1) training a linker module to link documents to relevant auxiliary data points; (2) learning an embedding for documents enriched using linked auxiliary information; and (3) using the enriched document embeddings to learn the final classifiers. OAK outperforms current state-of-the-art XC methods by up to $\sim 5\%$ on academic datasets, and by $\sim 3\%$ on an auxiliary data-augmented variant of the LF-ORCAS-800K dataset in Precision@1. OAK also demonstrates statistically significant improvements in sponsored search metrics when deployed on a large-scale search engine.}
}
Endnote
%0 Conference Paper
%T OAK: Enriching Document Representations using Auxiliary Knowledge for Extreme Classification
%A Shikhar Mohan
%A Deepak Saini
%A Anshul Mittal
%A Sayak Ray Chowdhury
%A Bhawna Paliwal
%A Jian Jiao
%A Manish Gupta
%A Manik Varma
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-mohan24a
%I PMLR
%P 36012--36028
%U https://proceedings.mlr.press/v235/mohan24a.html
%V 235
%X The objective in eXtreme Classification (XC) is to find relevant labels for a document from an exceptionally large label space. Most XC application scenarios have rich auxiliary data associated with the input documents, e.g., frequently clicked webpages for search queries in sponsored search. Unfortunately, most of the existing XC methods do not use any auxiliary data. In this paper, we propose a novel framework, Online Auxiliary Knowledge (OAK), which harnesses auxiliary information linked to the document to improve XC accuracy. OAK stores information learnt from the auxiliary data in a knowledge bank and, during a forward pass, retrieves relevant auxiliary knowledge embeddings for a given document. An enriched embedding is obtained by fusing these auxiliary knowledge embeddings with the document’s embedding, thereby enabling much more precise candidate label selection and final classification. OAK training involves three stages: (1) training a linker module to link documents to relevant auxiliary data points; (2) learning an embedding for documents enriched using linked auxiliary information; and (3) using the enriched document embeddings to learn the final classifiers. OAK outperforms current state-of-the-art XC methods by up to $\sim 5\%$ on academic datasets, and by $\sim 3\%$ on an auxiliary data-augmented variant of the LF-ORCAS-800K dataset in Precision@1. OAK also demonstrates statistically significant improvements in sponsored search metrics when deployed on a large-scale search engine.
APA
Mohan, S., Saini, D., Mittal, A., Ray Chowdhury, S., Paliwal, B., Jiao, J., Gupta, M. & Varma, M. (2024). OAK: Enriching Document Representations using Auxiliary Knowledge for Extreme Classification. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:36012-36028. Available from https://proceedings.mlr.press/v235/mohan24a.html.
