1st Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning (EDML 2019)
Workshop at the SIAM International Conference on Data Mining (SDM19), May 2‑4, 2019
A vital part of proposing new machine learning and data mining approaches is evaluating them empirically to allow an assessment of their capabilities. Numerous choices go into setting up such experiments: how to choose the data, how to preprocess them (or not), potential problems associated with the selection of datasets, what other techniques to compare to (if any), what metrics to evaluate, and, last but not least, how to present and interpret the results. Learning to make those choices on the job, often by copying the evaluation protocols used in the existing literature, can easily lead to the development of problematic habits. Numerous, albeit scattered, publications have called attention to those questions [1-5] and have occasionally called into question published results or the usability of published methods. At a time of intense discussion about a reproducibility crisis in the natural, social, and life sciences, and with conferences such as SIGMOD, KDD, and ECML/PKDD encouraging researchers to make their work as reproducible as possible, we feel that it is important to bring researchers together and discuss those issues on a fundamental level.
An issue directly related to the first choice mentioned above is the following: even the best-designed experiment carries only limited information if the underlying data are lacking. We therefore also want to discuss questions related to the availability of data, whether they are reliable, diverse, and whether they correspond to realistic and/or challenging problem settings.
In this workshop, we mainly solicit contributions that discuss those questions on a fundamental level, take stock of the state of the art, offer theoretical arguments, or take well-argued positions, as well as actual evaluation papers that offer new insights, e.g. by questioning published results or by shining a spotlight on the characteristics of existing benchmark datasets. As such, topics include, but are not limited to:
- Benchmark datasets for data mining tasks: are they diverse/realistic/challenging?
- Impact of data quality (redundancy, errors, noise, bias, imbalance, ...) on qualitative evaluation
- Propagation/amplification of data quality issues on the data mining results (also interplay between data and algorithms)
- Evaluation of unsupervised data mining (dilemma between novelty and validity)
- Evaluation measures
- (Automatic) data quality evaluation tools: What are the aspects one should check before starting to apply algorithms to given data?
- Issues around runtime evaluation (algorithm vs. implementation, dependency on hardware, algorithm parameters, dataset characteristics)
- Design guidelines for crowd-sourced evaluations
The workshop will feature a mix of invited speakers, accepted presentations with ample time for questions (since those contributions will be less technical and more philosophical in nature), and a panel discussion on the current state of evaluation, the areas that most urgently need improvement, and recommendations for achieving those improvements. An important objective of this workshop is a document synthesizing these discussions, which we intend to publish at a prominent venue.
Submission
Papers should be submitted as PDF, using the SIAM conference proceedings style, available at https://www.siam.org/Portals/0/Publications/Proceedings/soda2e_061418.zip?ver=2018-06-15-102100-887. Submissions should be limited to nine pages and submitted via EasyChair at https://easychair.org/conferences/?conf=edml19.
- Submission deadline: February 15, 2019
- Notification: March 15, 2019
- SDM pre-registration deadline: April 2, 2019
- Camera ready: April 15, 2019
- Conference dates: May 2-4, 2019
- Workshop date: May 4, 2019
The EDML workshop takes place on Saturday, May 4, in Imperial 2 & 3 (but check the SDM program for schedule and location updates).
9:30 AM: invited presentation
Ricardo J. G. B. Campello, University of Newcastle
Evaluation of Unsupervised Learning Results: Making the Seemingly Impossible Possible
When labels are not available, evaluating the final or intermediate results produced by a learning algorithm is usually not simple. In cluster analysis and unsupervised outlier detection, evaluation matters in many different respects: it is a crucial task, e.g., for model selection, model validation, and the assessment of ensemble member accuracy and diversity, among others. In cluster analysis this task has been investigated for decades, but it is relatively well understood only under certain oversimplified model assumptions. In outlier detection, unsupervised evaluation is still in its infancy. Even when labels are available in the form of a ground truth, such as in controlled benchmarking experiments, evaluation can still be challenging because the semantics described by the ground-truth labels may not be properly captured by the data as represented in the given feature space. In this talk, I intend to discuss some particular aspects of outlier and clustering evaluation, focusing on a few recent results as well as some challenges for future research.
Alexandru Mara, Jefrey Lijffijt and Tijl De Bie
EvalNE: A Framework for Evaluating Network Embeddings on Link Prediction
Network embedding (NE) methods aim to learn low-dimensional representations of network nodes as vectors, typically in Euclidean space. These representations are then validated on a variety of downstream prediction tasks. Link prediction is one of the most popular choices for assessing the performance of NE methods. However, the complexity of link prediction requires a carefully designed evaluation pipeline in order to provide consistent, reproducible, and comparable results. We argue this has not been considered sufficiently in many recent works. The main goal of this paper is to overcome difficulties associated with evaluation pipelines and the reproducibility of results. We introduce EvalNE, an evaluation framework to transparently assess and compare the performance of NE methods on link prediction. EvalNE provides automation and abstraction for tasks such as hyper-parameter tuning, model validation, edge sampling, and computation of edge embeddings. The framework integrates efficient procedures for edge and non-edge sampling and can be used to easily evaluate any off-the-shelf embedding method. The framework is freely available as a Python toolbox. Finally, demonstrating the usefulness of EvalNE in practice, we conduct an empirical study in which we try to replicate and analyse the experimental sections of several influential papers.
Martin Aumüller and Matteo Ceccarello
Benchmarking Nearest Neighbor Search: Influence of Local Intrinsic Dimensionality and Result Diversity in Real-World Datasets
This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concept of local intrinsic dimensionality (LID) makes it possible to choose query sets with a wide range of difficulty for real-world datasets. Moreover, the effect of different LID distributions on the running time performance of implementations is empirically studied. To this end, different visualization concepts are introduced that give a more fine-grained overview of the inner workings of nearest neighbor search principles. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well.
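For readers unfamiliar with LID: one standard way to estimate it per query is the maximum-likelihood estimator computed from the query's k nearest-neighbour distances. A minimal sketch (the function name is ours, and this is an illustration of the estimator, not the paper's code):

```python
import numpy as np

def lid_mle(knn_dists):
    """Maximum-likelihood estimate of local intrinsic dimensionality
    from a query point's k nearest-neighbour distances (ascending,
    all strictly positive):

        LID = -( (1/k) * sum_i ln(d_i / d_k) )^(-1)

    Larger values indicate locally harder, "higher-dimensional" queries."""
    d = np.asarray(knn_dists, dtype=float)
    return -1.0 / np.mean(np.log(d / d[-1]))
```

On neighbourhoods where the data is locally uniform in m dimensions, the estimate concentrates around m; computing it per query is what allows query sets of controlled difficulty to be assembled.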
Feras Batarseh and Ajay Kulkarni
Context-Driven Data Mining through Bias Removal and Incompleteness Mitigation
The results of data mining endeavors are largely driven by data quality. Throughout these deployments, serious show-stopper problems remain unresolved, such as data collection ambiguities, data imbalance, hidden biases in data, the lack of domain information, and data incompleteness. This paper is based on the premise that context can aid in mitigating these issues. In a traditional data science lifecycle, context is not considered. The Context-Driven Data Science Lifecycle (C-DSL), the main contribution of this paper, is developed to address these challenges. Two case studies (using datasets from sports events) are developed to test C-DSL. Results from both case studies are evaluated using common data mining metrics, such as the coefficient of determination (R²) and confusion matrices. The work presented in this paper aims to re-define the lifecycle and introduce tangible improvements to its outcomes.
12:00 PM: lunch break
1:30 PM: invited presentation (slides)
Kate Smith-Miles, University of Melbourne
Instance Spaces for Objective Assessment of Algorithms and Benchmark Test Suites
Objective assessment of algorithm performance is notoriously difficult, with conclusions often inadvertently biased towards the chosen test instances. Rather than reporting average performance of algorithms across a set of chosen instances, we discuss a new methodology to enable the strengths and weaknesses of different algorithms to be compared across a broader generalised instance space. Initially developed for combinatorial optimisation, the methodology has recently been extended for machine learning classification, and to ask whether the UCI repository and OpenML are sufficient as benchmark test suites. Results will be presented to demonstrate: (i) how pockets of the instance space can be found where algorithm performance varies significantly from the average performance of an algorithm; (ii) how the properties of the instances can be used to predict algorithm performance on previously unseen instances with high accuracy; (iii) how the relative strengths and weaknesses of each algorithm can be visualized and measured objectively; and (iv) how new test instances can be generated to fill the instance space and offer greater insights into algorithmic power.
Sevvandi Kandanaarachchi, Mario Munoz and Kate Smith-Miles
Instance Space Analysis for Unsupervised Outlier Detection
The success of various methods for unsupervised outlier detection depends on how well their definition of an outlier matches the dataset properties. Each method has strengths and weaknesses, making it a challenge to understand how measurable properties of a dataset can be used to predict which method is best suited to it. In this paper, we conduct the first instance space analysis for unsupervised outlier detection methods. The construction of suitable meta-features enables us to predict the most appropriate outlier detection method for a given problem, visualize the datasets and the methods' performance across a 2-d instance space, and gain insights into the strengths and weaknesses of outlier detection methods.
3:00 PM: coffee break
3:45 PM: invited presentation
Bart Goethals, University of Antwerp
Lessons learned from the FIMI workshops
In 2003 and 2004, Mohammed Zaki and I organised the Frequent Itemset Mining Implementations (FIMI) workshops in an attempt to characterize and understand the algorithmic performance space of frequent itemset mining algorithms. In the years before, a huge number of algorithms had been developed to efficiently solve the frequent itemset mining (FIM) problem, and several new algorithms were shown by their authors to run faster than previously existing ones. Unfortunately, the performance behavior of several of these algorithms was not always as claimed by their authors when tested on different datasets. Our goal was to understand precisely why and under what conditions one algorithm would outperform another. In this talk, I will present some concrete examples of experimental evaluations that motivated the FIMI workshops, discuss the workshops' results and whether we reached our goals, and give my personal view on the evaluation of data mining methods.
4:45 PM: invited presentation (slides)
Miloš Radovanović, University of Novi Sad
Clustering Evaluation in High-Dimensional Data
Clustering evaluation plays an important role in unsupervised learning systems, as it is often necessary to automatically quantify the quality of generated cluster configurations. This is especially useful for comparing the performance of different clustering algorithms as well as determining the optimal number of clusters in clustering algorithms that do not estimate it internally. Many clustering quality indexes have been proposed over the years and different indexes are used in different contexts. There is no unifying protocol for clustering evaluation, so it is often unclear which quality index to use in which case. In this talk, we review existing clustering quality measures and evaluate them in the challenging context of high-dimensional data clustering. High-dimensional data is sparse and distances tend to concentrate, possibly affecting the applicability of various clustering quality indexes. We analyze the stability and discriminative power of a set of standard clustering quality measures with increasing data dimensionality. Our evaluation shows that the curse of dimensionality affects different clustering quality indexes in different ways, and that some are to be preferred when determining clustering quality in many dimensions.
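As a concrete example of the kind of internal clustering quality index under discussion, here is a minimal NumPy implementation of the mean silhouette coefficient, one of the standard indexes. This is our own illustrative sketch, not the talk's code.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette coefficient: for each point i,
        s(i) = (b_i - a_i) / max(a_i, b_i),
    where a_i is the mean distance to i's own cluster (assumed to
    contain at least 2 points) and b_i the mean distance to the
    nearest other cluster. Values near 1 indicate compact,
    well-separated clusters; values near 0 or below indicate
    overlapping or misassigned clusters."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    s = np.empty(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        own[i] = False                      # exclude the point itself
        a = D[i, own].mean()
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()
```

Since the index is built entirely from pairwise distances, the distance concentration in high-dimensional data mentioned above directly affects it: as a_i and b_i approach each other, the index loses discriminative power.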
5:15 PM: discussion and closing
Online presentation (Vimeo), paper:
Christian Lezcano and Marta Arias
Characterizing Transactional Databases for Frequent Itemset Mining
This paper presents a study of the characteristics of transactional databases used in frequent itemset mining. Such characterizations have typically been used to benchmark and understand the data mining algorithms working on these databases. The aim of our study is to give a picture of how diverse and representative these benchmarking databases are, both in general and in the context of particular empirical studies found in the literature. Our proposed list of metrics contains many of the existing metrics found in the literature, as well as new ones. Our study shows that this list is able to capture much of the datasets' inner complexity and thus provides a good basis for the characterization of transactional datasets. Finally, based on our characterization, we provide a set of representative datasets that may safely be used as a benchmark.
Proceedings
The proceedings are available at http://ceur-ws.org/Vol-2436/.
- Eirini Ntoutsi
Leibniz University Hannover & L3S Research Center, Germany
Eirini Ntoutsi is an Associate Professor in Intelligent Systems at the Faculty of Electrical Engineering and Computer Science at the Leibniz Universität Hannover (LUH) and a member of the L3S Research Center. Prior to joining LUH, she was a postdoctoral researcher at the Ludwig-Maximilians-University (LMU) in Munich, Germany. She obtained her PhD from the University of Piraeus, Greece, and holds a diploma and a master's degree in computer science from the Computer Engineering & Informatics Department (CEID), University of Patras, Greece. Her research can be summarized as learning over complex data (e.g., high-dimensional, multi-view, with limited labels, ...) and data streams. She has published more than 70 papers at international data mining and machine learning venues. She has organized several workshops, including a Dagstuhl Perspectives Workshop, and served as publicity co-chair for ICDM 2017.
- Erich Schubert
Technical University Dortmund
Erich Schubert is associate professor at the Technical University Dortmund, Germany, in the Artificial Intelligence group. He joined TU Dortmund in 2018 after previous positions at Heidelberg University, Germany, and Ludwig-Maximilians-Universität München (LMU Munich), Germany, where he obtained his PhD with Prof. Kriegel. His research interests include unsupervised learning, in particular clustering and outlier detection, along with index-based acceleration techniques for these approaches and evaluation methods. He has published over 38 peer-reviewed papers at international conferences and in journals, served as proceedings chair for the SISAP 2016 conference, and assisted with the GIR’17 workshop in Heidelberg.
- Arthur Zimek
University of Southern Denmark
Arthur Zimek is Associate Professor and Head of the Section "Data Science and Statistics" in the Department of Mathematics and Computer Science (IMADA) at the University of Southern Denmark (SDU) in Odense, Denmark. He joined SDU in 2016 after previous positions at Ludwig-Maximilians-University Munich (LMU), Germany, Technical University Vienna, Austria, and the University of Alberta, Edmonton, Canada. Arthur holds master-level degrees in bioinformatics, philosophy, and theology, involving studies at universities in Germany (TUM, HfPh, LMU Munich, and JGU Mainz) as well as Austria (LFU Innsbruck). His research interests include ensemble techniques for unsupervised learning, clustering, outlier detection, and high-dimensional data, developing data mining methods as well as evaluation methodology. He has published more than 80 papers at peer-reviewed international conferences and in international journals. He co-organized several workshops and mini-symposia at SDM, KDD, and Shonan and served as workshop co-chair for SDM 2017.
- Albrecht Zimmermann
University of Caen Normandy, France
Albrecht Zimmermann is associate professor at the University of Caen Normandy, France, in the CoDaG (Constraints, Data Mining, and Graphs) group. He joined the group in 2015 after previous stays at INSA Lyon, France, the Catholic University of Leuven, Belgium, and the Albert-Ludwigs-University Freiburg, Germany. His research interests include pattern and pattern set mining and their applications to bio- and chemoinformatics, result verification of unsupervised data mining methods, data generation, and sports analytics. He has co-organized four editions of the “Machine Learning and Data Mining for Sports Analytics” workshop at ECML/PKDD (MLSA 2013, 2015, 2017, 2018), as well as three editions of the invitation-only “Spring Workshop on Mining and Learning” (SMiLe 2008, 2010, 2012). He served as registration chair of ICDM 2012 and as workshop and tutorial co-chair of ECML/PKDD 2016.
Program Committee
The following colleagues agreed to serve on the program committee of this workshop:
- Martin Aumüller, IT University of Copenhagen
- James Bailey, University of Melbourne
- Roberto Bayardo, Google
- Christian Borgelt, University of Salzburg
- Ricardo Campello, University of Newcastle
- Sarah Cohen-Boulakia, Université Paris-Sud
- Ryan R. Curtin, Symantec Corporation
- Tijl De Bie, University of Gent
- Marcus Edel, Freie Universität Berlin
- Bart Goethals, University of Antwerp
- Markus Goldstein, Ulm University of Applied Sciences
- Nathalie Japkowicz, American University
- Daniel Lemire, University of Quebec
- Philippe Lenca, IMT Atlantique
- Helmut Neukirchen, University of Iceland
- Jürgen Pfeffer, Technical University Munich
- Miloš Radovanović, University of Novi Sad
- Protiva Rahman, Ohio State University
- Mohak Shah, LG Electronics
- Kate Smith-Miles, University of Melbourne
- Joaquin Vanschoren, Eindhoven University of Technology
- Ricardo Vilalta, University of Houston
- Mohammed Zaki, Rensselaer Polytechnic Institute
References
- [1] Basaran, Daniel, Eirini Ntoutsi, and Arthur Zimek. "Redundancies in Data and their Effect on the Evaluation of Recommendation Systems: A Case Study on the Amazon Reviews Datasets." Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 2017.
- [2] Kovács, Ferenc, Csaba Legány, and Attila Babos. "Cluster validity measurement techniques." 6th International Symposium of Hungarian Researchers on Computational Intelligence. 2005.
- [3] Kriegel, Hans-Peter, Erich Schubert, and Arthur Zimek. "The (black) art of runtime evaluation: Are we comparing algorithms or implementations?" Knowledge and Information Systems 52.2 (2017): 341-378.
- [4] Nijssen, Siegfried, and Joost Kok. "Frequent subgraph miners: runtimes don't say everything." Proceedings of the Workshop on Mining and Learning with Graphs. 2006.
- [5] Zheng, Zijian, Ron Kohavi, and Llew Mason. "Real world performance of association rule algorithms." Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2001.