
- title: 'Preface to the 1st ECML/PKDD workshop on Statistically Sound Data Mining'
  abstract: 'These proceedings contain the papers accepted to the 1st ECML/PKDD Workshop on Statistically Sound Data Mining, which took place at the French Institute for Computer Science (INRIA) in Nancy, at the opening of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD) on the 15th of September 2014.'
  volume: 47
  URL: https://proceedings.mlr.press/v47/preface.html
  PDF: http://proceedings.mlr.press/v47/preface.pdf
  edit: https://github.com/mlresearch//v47/edit/gh-pages/_posts/2015-11-27-preface.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Workshop on Statistically Sound Data Mining at ECML/PKDD'
  publisher: 'PMLR'
  author: 
  - given: Wilhelmiina
    family: Hämäläinen
  - given: François
    family: Petitjean
  - given: Geoffrey I.
    family: Webb
  editor: 
  - given: Wilhelmiina
    family: Hämäläinen
  - given: François
    family: Petitjean
  - given: Geoffrey I.
    family: Webb
  address: Nancy, France
  page: 1-2
  id: preface
  issued:
    date-parts: 
      - 2015
      - 11
      - 27
  firstpage: 1
  lastpage: 2
  published: 2015-11-27 00:00:00 +0000
- title: 'Look before you leap: Some insights into learner evaluation with cross-validation'
  abstract: 'Machine learning is largely an experimental science, of which the evaluation of predictive models is an important aspect. These days, cross-validation is the most widely used method for this task. There are, however, a number of important points that should be taken into account when using this methodology. First, one should clearly state what they are trying to estimate. Namely, a distinction should be made between the evaluation of a model learned on a single dataset, and that of a learner trained on a random sample from a given data population. Each of these two questions requires a different statistical approach and should not be confused with each other. While this has been noted before, the literature on this topic is generally not very accessible. This paper tries to give an understandable overview of the statistical aspects of these two evaluation tasks. We also pose that because of the often limited availability of data, and the difficulty of selecting an appropriate statistical test, it is in some cases perhaps better to abstain from statistical testing, and instead focus on an interpretation of the immediate results.'
  volume: 47
  URL: https://proceedings.mlr.press/v47/vanwinckelen14a.html
  PDF: http://proceedings.mlr.press/v47/vanwinckelen14a.pdf
  edit: https://github.com/mlresearch//v47/edit/gh-pages/_posts/2015-11-27-vanwinckelen14a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Workshop on Statistically Sound Data Mining at ECML/PKDD'
  publisher: 'PMLR'
  author: 
  - given: Gitte
    family: Vanwinckelen
  - given: Hendrik
    family: Blockeel
  editor: 
  - given: Wilhelmiina
    family: Hämäläinen
  - given: François
    family: Petitjean
  - given: Geoffrey I.
    family: Webb
  address: Nancy, France
  page: 3-20
  id: vanwinckelen14a
  issued:
    date-parts: 
      - 2015
      - 11
      - 27
  firstpage: 3
  lastpage: 20
  published: 2015-11-27 00:00:00 +0000
- title: 'A Critical View on Automatic Significance-Filtering in Pattern Mining'
  abstract: 'Statistically sound validation of results plays an important role in modern data mining. In this context, it has been advocated to disregard patterns that cannot be automatically confirmed as statistically valid by the available data. In this short position paper, we argue against a mandatory automatic significance filtering of results.'
  volume: 47
  URL: https://proceedings.mlr.press/v47/lemmerich14a.html
  PDF: http://proceedings.mlr.press/v47/lemmerich14a.pdf
  edit: https://github.com/mlresearch//v47/edit/gh-pages/_posts/2015-11-27-lemmerich14a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Workshop on Statistically Sound Data Mining at ECML/PKDD'
  publisher: 'PMLR'
  author: 
  - given: Florian
    family: Lemmerich
  - given: Frank
    family: Puppe
  editor: 
  - given: Wilhelmiina
    family: Hämäläinen
  - given: François
    family: Petitjean
  - given: Geoffrey I.
    family: Webb
  address: Nancy, France
  page: 21-27
  id: lemmerich14a
  issued:
    date-parts: 
      - 2015
      - 11
      - 27
  firstpage: 21
  lastpage: 27
  published: 2015-11-27 00:00:00 +0000
- title: 'Statistically significant subgraphs for genome-wide association study'
  abstract: 'Genome-wide association studies (GWAS) have been widely used for understanding the associations of single-nucleotide polymorphisms (SNPs) with a disease. GWAS data are often combined with known biological networks, and they have been analyzed using graph-mining techniques toward a systems understanding of the biological changes caused by the SNPs. To determine which subgraphs are associated with the disease, a statistical test on each subgraph needs to be conducted. However, no statistically significant results were found because multiple testing correction causes an extremely small corrected significance level. We introduce a method called gLAMP to enumerate subgraphs having statistically significant associations with a diagnosis. gLAMP integrates the Limitless Arity Multiple-testing Procedure (LAMP) with a graph-mining algorithm called COmmon Itemset Network mining (COIN). LAMP gives us the smallest possible Bonferroni factor, and COIN provides us with efficient enumeration of testable subgraphs. Theoretical results of their combination show the potential to enumerate subgraphs statistically significantly associated with a disease.'
  volume: 47
  URL: https://proceedings.mlr.press/v47/sese14a.html
  PDF: http://proceedings.mlr.press/v47/sese14a.pdf
  edit: https://github.com/mlresearch//v47/edit/gh-pages/_posts/2015-11-27-sese14a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Workshop on Statistically Sound Data Mining at ECML/PKDD'
  publisher: 'PMLR'
  author: 
  - given: Jun
    family: Sese
  - given: Aika
    family: Terada
  - given: Yuki
    family: Saito
  - given: Koji
    family: Tsuda
  editor: 
  - given: Wilhelmiina
    family: Hämäläinen
  - given: François
    family: Petitjean
  - given: Geoffrey I.
    family: Webb
  address: Nancy, France
  page: 29-36
  id: sese14a
  issued:
    date-parts: 
      - 2015
      - 11
      - 27
  firstpage: 29
  lastpage: 36
  published: 2015-11-27 00:00:00 +0000
- title: 'U-statistics on network-structured data with kernels of degree larger than one'
  abstract: 'Most analysis of U-statistics assumes that data points are independent or stationary. However, when we analyze network data, these two assumptions do not hold any more. We first define the problem of weighted U-statistics on networked data by extending previous work. We analyze their variance using Hoeffding’s decomposition and also give exponential concentration inequalities. Two efficiently solvable linear programs are proposed to find estimators with minimum worst-case variance or with tighter concentration inequalities.'
  volume: 47
  URL: https://proceedings.mlr.press/v47/wang14a.html
  PDF: http://proceedings.mlr.press/v47/wang14a.pdf
  edit: https://github.com/mlresearch//v47/edit/gh-pages/_posts/2015-11-27-wang14a.md
  series: 'Proceedings of Machine Learning Research'
  container-title: 'Proceedings of the Workshop on Statistically Sound Data Mining at ECML/PKDD'
  publisher: 'PMLR'
  author: 
  - given: Yuyi
    family: Wang
  - given: Christos
    family: Pelekis
  - given: Jan
    family: Ramon
  editor: 
  - given: Wilhelmiina
    family: Hämäläinen
  - given: François
    family: Petitjean
  - given: Geoffrey I.
    family: Webb
  address: Nancy, France
  page: 37-48
  id: wang14a
  issued:
    date-parts: 
      - 2015
      - 11
      - 27
  firstpage: 37
  lastpage: 48
  published: 2015-11-27 00:00:00 +0000
