Evaluation and Experimental Design in Data Mining and Machine Learning


Special Issue


A vital part of proposing new machine learning and data mining approaches is evaluating them empirically to allow an assessment of their capabilities. Numerous choices go into setting up such experiments: how to choose the data, whether and how to preprocess them, which potential problems are associated with the selection of datasets, which other techniques to compare to (if any), which metrics to evaluate, and, last but not least, how to present and interpret the results. Learning how to make those choices on the job, often by copying the evaluation protocols used in the existing literature, can easily lead to the development of problematic habits. Numerous, albeit scattered, publications have called attention to those questions [1-5] and have occasionally called into question published results or the usability of published methods. At a time of intense discussions about a reproducibility crisis in the natural, social, and life sciences, and with conferences such as SIGMOD, KDD, and ECML/PKDD encouraging researchers to make their work as reproducible as possible, we feel that it is important to bring researchers together and discuss those issues on a fundamental level.

An issue directly related to the first choice mentioned above is the following: even the best-designed experiment carries only limited information if the underlying data are lacking. We therefore also want to discuss questions related to the availability of data: whether they are reliable and diverse, and whether they correspond to realistic and/or challenging problem settings.


In this workshop series, we mainly solicit contributions that discuss those questions on a fundamental level, take stock of the state of the art, offer theoretical arguments, or take well-argued positions. We also welcome actual evaluation papers that offer new insights, e.g., by questioning published results or by shining a spotlight on the characteristics of existing benchmark data sets.

As such, topics include, but are not limited to



  1. Basaran, Daniel, Eirini Ntoutsi, and Arthur Zimek. "Redundancies in Data and their Effect on the Evaluation of Recommendation Systems: A Case Study on the Amazon Reviews Datasets." Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 2017.
  2. Kovács, Ferenc, Csaba Legány, and Attila Babos. "Cluster validity measurement techniques." 6th International Symposium of Hungarian Researchers on Computational Intelligence. 2005.
  3. Kriegel, Hans-Peter, Erich Schubert, and Arthur Zimek. "The (black) art of runtime evaluation: Are we comparing algorithms or implementations?" Knowledge and Information Systems 52.2 (2017): 341-378.
  4. Nijssen, Siegfried, and Joost Kok. "Frequent subgraph miners: runtimes don't say everything." Proceedings of the Workshop on Mining and Learning with Graphs. 2006.
  5. Zheng, Zijian, Ron Kohavi, and Llew Mason. "Real world performance of association rule algorithms." Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2001.