Hierarchical Information Criterion for Variable Abstraction

Mark Mirtchouk, Bharat Srikishan, Samantha Kleinberg
Proceedings of the 6th Machine Learning for Healthcare Conference, PMLR 149:440-460, 2021.

Abstract

Large biomedical datasets can contain thousands of variables, creating challenges for machine learning tasks such as causal inference and prediction. Feature selection and ranking methods have been developed to reduce the number of variables and determine which are most important. However, in many cases, such as in classification from diagnosis codes, ontologies, and controlled vocabularies, we must choose not only which variables to include but also at what level of granularity. ICD-9 codes, for example, are arranged in a hierarchy, and a user must decide at what level codes should be analyzed. Thus it is currently up to a researcher to decide whether to use any diagnosis of diabetes or whether to distinguish between specific forms, such as Type 2 diabetes with renal complications versus without mention of complications. No existing method can automatically make this determination, and methods for feature selection do not exploit this hierarchical information, which is found in other areas including nutrition (hierarchies of foods) and bioinformatics (hierarchical relationships of genes). To address this, we propose a novel Hierarchical Information Criterion (HIC) that builds on mutual information and allows fully automated abstraction of variables. Using HIC, we can rank hierarchical features and select the ones with the highest score. We show that this significantly improves performance, with an average AUROC gain of 0.053 over traditional feature selection methods and hand-crafted features on two mortality prediction tasks using MIMIC-III ICU data. Our method also improves on the state of the art (Fu et al., 2019), increasing AUROC from 0.819 to 0.887.
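To make the core idea concrete: the abstract describes ranking a feature at different levels of a code hierarchy by its mutual information with the outcome. The sketch below is illustrative only (it is not the paper's HIC, which adds more than raw mutual information); the toy ICD-9-style codes, labels, and the `abstract_code` helper are invented for demonstration.

```python
# Illustrative sketch (not the paper's HIC): score each abstraction level
# of a toy ICD-9-style code hierarchy by mutual information with a label,
# then keep the level with the highest score.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X;Y) in bits, estimated from two paired discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def abstract_code(code, level):
    """level 0 = full code (e.g. '250.40'); level 1 = 3-digit parent ('250')."""
    return code if level == 0 else code.split(".")[0]

# Toy patient records: (diagnosis code, in-hospital mortality label).
records = [("250.40", 1), ("250.40", 1), ("250.00", 0),
           ("250.00", 0), ("401.9", 0), ("401.9", 1)]
labels = [y for _, y in records]

# Rank the candidate granularities and pick the most informative one.
scores = {level: mutual_information([abstract_code(c, level)
                                     for c, _ in records], labels)
          for level in (0, 1)}
best_level = max(scores, key=scores.get)
```

Here the fine-grained codes separate the outcomes (diabetes with renal complications vs. without) while the collapsed 3-digit category does not, so level 0 wins; on other variables the coarser level can score higher, which is what makes the abstraction choice data-driven.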

Cite this Paper


BibTeX
@InProceedings{pmlr-v149-mirtchouk21a,
  title     = {Hierarchical Information Criterion for Variable Abstraction},
  author    = {Mirtchouk, Mark and Srikishan, Bharat and Kleinberg, Samantha},
  booktitle = {Proceedings of the 6th Machine Learning for Healthcare Conference},
  pages     = {440--460},
  year      = {2021},
  editor    = {Jung, Ken and Yeung, Serena and Sendak, Mark and Sjoding, Michael and Ranganath, Rajesh},
  volume    = {149},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--07 Aug},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v149/mirtchouk21a/mirtchouk21a.pdf},
  url       = {https://proceedings.mlr.press/v149/mirtchouk21a.html},
  abstract  = {Large biomedical datasets can contain thousands of variables, creating challenges for machine learning tasks such as causal inference and prediction. Feature selection and ranking methods have been developed to reduce the number of variables and determine which are most important. However, in many cases, such as in classification from diagnosis codes, ontologies, and controlled vocabularies, we must choose not only which variables to include but also at what level of granularity. ICD-9 codes, for example, are arranged in a hierarchy, and a user must decide at what level codes should be analyzed. Thus it is currently up to a researcher to decide whether to use any diagnosis of diabetes or whether to distinguish between specific forms, such as Type 2 diabetes with renal complications versus without mention of complications. No existing method can automatically make this determination, and methods for feature selection do not exploit this hierarchical information, which is found in other areas including nutrition (hierarchies of foods) and bioinformatics (hierarchical relationships of genes). To address this, we propose a novel Hierarchical Information Criterion (HIC) that builds on mutual information and allows fully automated abstraction of variables. Using HIC, we can rank hierarchical features and select the ones with the highest score. We show that this significantly improves performance, with an average AUROC gain of 0.053 over traditional feature selection methods and hand-crafted features on two mortality prediction tasks using MIMIC-III ICU data. Our method also improves on the state of the art (Fu et al., 2019), increasing AUROC from 0.819 to 0.887.}
}
Endnote
%0 Conference Paper
%T Hierarchical Information Criterion for Variable Abstraction
%A Mark Mirtchouk
%A Bharat Srikishan
%A Samantha Kleinberg
%B Proceedings of the 6th Machine Learning for Healthcare Conference
%C Proceedings of Machine Learning Research
%D 2021
%E Ken Jung
%E Serena Yeung
%E Mark Sendak
%E Michael Sjoding
%E Rajesh Ranganath
%F pmlr-v149-mirtchouk21a
%I PMLR
%P 440--460
%U https://proceedings.mlr.press/v149/mirtchouk21a.html
%V 149
%X Large biomedical datasets can contain thousands of variables, creating challenges for machine learning tasks such as causal inference and prediction. Feature selection and ranking methods have been developed to reduce the number of variables and determine which are most important. However, in many cases, such as in classification from diagnosis codes, ontologies, and controlled vocabularies, we must choose not only which variables to include but also at what level of granularity. ICD-9 codes, for example, are arranged in a hierarchy, and a user must decide at what level codes should be analyzed. Thus it is currently up to a researcher to decide whether to use any diagnosis of diabetes or whether to distinguish between specific forms, such as Type 2 diabetes with renal complications versus without mention of complications. No existing method can automatically make this determination, and methods for feature selection do not exploit this hierarchical information, which is found in other areas including nutrition (hierarchies of foods) and bioinformatics (hierarchical relationships of genes). To address this, we propose a novel Hierarchical Information Criterion (HIC) that builds on mutual information and allows fully automated abstraction of variables. Using HIC, we can rank hierarchical features and select the ones with the highest score. We show that this significantly improves performance, with an average AUROC gain of 0.053 over traditional feature selection methods and hand-crafted features on two mortality prediction tasks using MIMIC-III ICU data. Our method also improves on the state of the art (Fu et al., 2019), increasing AUROC from 0.819 to 0.887.
APA
Mirtchouk, M., Srikishan, B. & Kleinberg, S. (2021). Hierarchical Information Criterion for Variable Abstraction. Proceedings of the 6th Machine Learning for Healthcare Conference, in Proceedings of Machine Learning Research 149:440-460. Available from https://proceedings.mlr.press/v149/mirtchouk21a.html.
