Subtle Topic Models and Discovering Subtly Manifested Software Concerns Automatically

Mrinal Das; Suparna Bhattacharya; Chiranjib Bhattacharyya; Gopinath Kanchi

Proceedings of the 30th International Conference on Machine Learning, PMLR 28(2):253-261, 2013.

Abstract

In a recent pioneering approach, LDA was used to discover cross-cutting concerns (CCCs) automatically from software codebases. LDA, though successful in detecting prominent concerns, fails to detect many useful CCCs, including ones that may be heavily executed but elude discovery because they do not have a strong prevalence in source code. We pose this problem as that of discovering topics that rarely occur in individual documents, which we refer to as subtle topics. Recently an interesting approach, namely focused topic models (FTM), was proposed for detecting rare topics. FTM, though successful in detecting topics that occur prominently in very few documents, is unable to detect subtle topics. Discovering subtle topics thus remains an important open problem. To address this issue we propose subtle topic models (STM). STM uses a generalized stick-breaking process (GSBP) as a prior for defining multiple distributions over topics. This hierarchical structure on topics allows STM to discover rare topics beyond the capabilities of FTM. The associated inference is non-standard and is solved by exploiting the relationship between the GSBP and the generalized Dirichlet distribution. Empirical results show that STM is able to discover subtle CCCs in two benchmark codebases, a feat beyond the scope of existing topic models, thus demonstrating the potential of the model in automated concern discovery, a known difficult problem in software engineering. Furthermore, it is observed that even in general text corpora STM outperforms the state of the art in discovering subtle topics.
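
For readers unfamiliar with stick-breaking priors, the following is a minimal, generic sketch of a truncated stick-breaking construction in which each break fraction is drawn from its own Beta(a_k, b_k) distribution. It is an illustration of the general idea only and is not the STM model or its inference procedure; the function name `stick_breaking_weights`, the finite truncation, and the parameter values are assumptions introduced here, not taken from the paper.

```python
import random


def stick_breaking_weights(a, b, seed=0):
    """Draw mixture weights from a truncated stick-breaking construction.

    Hypothetical helper for illustration: v_k ~ Beta(a[k], b[k]) and
    pi_k = v_k * prod_{j<k} (1 - v_j); the leftover stick becomes a
    final extra weight so the result sums to one.
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick not yet broken off
    for a_k, b_k in zip(a, b):
        v_k = rng.betavariate(a_k, b_k)  # fraction of the remaining stick
        weights.append(v_k * remaining)
        remaining *= 1.0 - v_k
    weights.append(remaining)  # leftover mass goes to a last component
    return weights


# Example: five breaks with assumed Beta(1, 2) parameters give six weights.
weights = stick_breaking_weights([1.0] * 5, [2.0] * 5)
```

With a finite number of breaks and per-component Beta parameters, weights built this way are commonly described as following a generalized Dirichlet distribution, which is the kind of relationship the abstract says the inference exploits.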

Cite this Paper

BibTeX


@InProceedings{pmlr-v28-das13,
  title = 	 {Subtle Topic Models and Discovering Subtly Manifested Software Concerns Automatically},
  author = 	 {Das, Mrinal and Bhattacharya, Suparna and Bhattacharyya, Chiranjib and Kanchi, Gopinath},
  booktitle = 	 {Proceedings of the 30th International Conference on Machine Learning},
  pages = 	 {253--261},
  year = 	 {2013},
  editor = 	 {Dasgupta, Sanjoy and McAllester, David},
  volume = 	 {28},
  number =       {2},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Atlanta, Georgia, USA},
  month = 	 {17--19 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v28/das13.pdf},
  url = 	 {https://proceedings.mlr.press/v28/das13.html},
  abstract = 	 {In a recent pioneering approach LDA was used to discover cross cutting concerns(CCC) automatically from software codebases. LDA though successful in detecting prominent concerns, fails to detect many useful CCCs including ones that may be heavily executed but elude discovery because they do not have a strong prevalence in source-code. We pose this problem as that of discovering topics that rarely occur in individual documents, which we will refer to as subtle topics. Recently an interesting approach, namely focused topic models(FTM) was proposed for detecting rare topics. FTM, though successful in detecting topics which occur prominently in very few documents, is unable to detect subtle topics. Discovering subtle topics thus remains an important open problem. To address this issue we propose subtle topic models(STM). STM uses a generalized stick breaking process(GSBP) as a prior for defining multiple distributions over topics. This hierarchical structure on topics allows STM to discover rare topics beyond the capabilities of FTM. The associated inference is non-standard and is solved by exploiting the relationship between GSBP and generalized Dirichlet distribution. Empirical results show that STM is able to discover subtle CCC in two benchmark code-bases, a feat which is beyond the scope of existing topic models, thus demonstrating the potential of the model in automated concern discovery, a known difficult problem in Software Engineering. Furthermore it is observed that even in general text corpora STM outperforms the state of art in discovering subtle topics.}
}

Endnote

%0 Conference Paper
%T Subtle Topic Models and Discovering Subtly Manifested Software Concerns Automatically
%A Mrinal Das
%A Suparna Bhattacharya
%A Chiranjib Bhattacharyya
%A Gopinath Kanchi
%B Proceedings of the 30th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2013
%E Sanjoy Dasgupta
%E David McAllester
%F pmlr-v28-das13
%I PMLR
%P 253--261
%U https://proceedings.mlr.press/v28/das13.html
%V 28
%N 2
%X In a recent pioneering approach LDA was used to discover cross cutting concerns(CCC) automatically from software codebases. LDA though successful in detecting prominent concerns, fails to detect many useful CCCs including ones that may be heavily executed but elude discovery because they do not have a strong prevalence in source-code. We pose this problem as that of discovering topics that rarely occur in individual documents, which we will refer to as subtle topics. Recently an interesting approach, namely focused topic models(FTM) was proposed for detecting rare topics. FTM, though successful in detecting topics which occur prominently in very few documents, is unable to detect subtle topics. Discovering subtle topics thus remains an important open problem. To address this issue we propose subtle topic models(STM). STM uses a generalized stick breaking process(GSBP) as a prior for defining multiple distributions over topics. This hierarchical structure on topics allows STM to discover rare topics beyond the capabilities of FTM. The associated inference is non-standard and is solved by exploiting the relationship between GSBP and generalized Dirichlet distribution. Empirical results show that STM is able to discover subtle CCC in two benchmark code-bases, a feat which is beyond the scope of existing topic models, thus demonstrating the potential of the model in automated concern discovery, a known difficult problem in Software Engineering. Furthermore it is observed that even in general text corpora STM outperforms the state of art in discovering subtle topics.

RIS


TY  - CPAPER
TI  - Subtle Topic Models and Discovering Subtly Manifested Software Concerns Automatically
AU  - Mrinal Das
AU  - Suparna Bhattacharya
AU  - Chiranjib Bhattacharyya
AU  - Gopinath Kanchi
BT  - Proceedings of the 30th International Conference on Machine Learning
DA  - 2013/05/13
ED  - Sanjoy Dasgupta
ED  - David McAllester
ID  - pmlr-v28-das13
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 28
IS  - 2
SP  - 253
EP  - 261
L1  - http://proceedings.mlr.press/v28/das13.pdf
UR  - https://proceedings.mlr.press/v28/das13.html
AB  - In a recent pioneering approach LDA was used to discover cross cutting concerns(CCC) automatically from software codebases. LDA though successful in detecting prominent concerns, fails to detect many useful CCCs including ones that may be heavily executed but elude discovery because they do not have a strong prevalence in source-code. We pose this problem as that of discovering topics that rarely occur in individual documents, which we will refer to as subtle topics. Recently an interesting approach, namely focused topic models(FTM) was proposed for detecting rare topics. FTM, though successful in detecting topics which occur prominently in very few documents, is unable to detect subtle topics. Discovering subtle topics thus remains an important open problem. To address this issue we propose subtle topic models(STM). STM uses a generalized stick breaking process(GSBP) as a prior for defining multiple distributions over topics. This hierarchical structure on topics allows STM to discover rare topics beyond the capabilities of FTM. The associated inference is non-standard and is solved by exploiting the relationship between GSBP and generalized Dirichlet distribution. Empirical results show that STM is able to discover subtle CCC in two benchmark code-bases, a feat which is beyond the scope of existing topic models, thus demonstrating the potential of the model in automated concern discovery, a known difficult problem in Software Engineering. Furthermore it is observed that even in general text corpora STM outperforms the state of art in discovering subtle topics.
ER  -

APA


Das, M., Bhattacharya, S., Bhattacharyya, C. & Kanchi, G. (2013). Subtle Topic Models and Discovering Subtly Manifested Software Concerns Automatically. Proceedings of the 30th International Conference on Machine Learning, in Proceedings of Machine Learning Research 28(2):253-261. Available from https://proceedings.mlr.press/v28/das13.html.
