Spring 2016 / DM843
Unsupervised Learning

General Information

One trend can be observed across almost all fields of informatics: we have to cope with an ever-increasing amount of available data of all kinds. This amount of data makes it impossible to inspect a dataset "by hand", or even to deduce knowledge from it, without sophisticated computer-aided help. In this course we will discuss one of the most common mechanisms of unsupervised machine learning for investigating datasets: clustering. Clustering separates a given dataset into groups of similar objects, the clusters, and thus allows for a better understanding of the data and their structure. We discuss a number of clustering methods and their application to various fields such as biology, economics, and sociology.
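
As a small illustration of what "separating a dataset into groups of similar objects" means, the following sketch runs k-means (one of the methods listed in the exam topics) on six one-dimensional points. It is written in plain Python purely for illustration; the course exercises use R. The function name kmeans_1d, the toy data, and the starting centers are assumptions made for this example and are not taken from the course material.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Minimal k-means (Lloyd's algorithm) on 1-D data.

    Returns the final cluster centers and one cluster label per point.
    """
    centers = list(centers)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous center.
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break  # converged: the centers no longer move
        centers = new_centers
    return centers, labels


# Toy data (hypothetical): two visibly separated groups around 1 and 5.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, labels = kmeans_1d(points, centers=[0.0, 6.0])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)  # two centers, close to 1.0 and 5.0
```

The result depends on the starting centers and on the chosen number of clusters, which is one reason cluster validation gets its own lecture.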

Lectures

#	Date	Content	Slides	Readings
1	Mon, 04.04.2016	Introduction	here
2	Tue, 05.04.2016	Mathematical Foundations	here
3	Mon, 11.04.2016	Detecting Clusters Graphically	here
4	Tue, 12.04.2016	Dimensionality Reduction (PCA, PCoA)	here
5	Mon, 18.04.2016	Proximity Measures (updated 19.04)	here
6	Tue, 19.04.2016	Hierarchical Clustering	here
7	Mon, 25.04.2016	Optimization Based Clustering	here
8	Tue, 26.04.2016	Gaussian Mixture Models & Expectation Maximization	here
9	Mon, 02.05.2016	Evaluating a Cluster Analysis	here
10	Tue, 03.05.2016	Subspace, Ensemble & Co-clustering	here
11	Mon, 09.05.2016	Student Talks	--
12	Tue, 10.05.2016	Student Talks	--
13	Tue, 17.05.2016	no lectures	--	--
14	Wed, 18.05.2016	no lectures	--	--

Exercises

#	Date	Questions	Download
1	Wed, 13.04.2016	Getting familiar with R. Good introduction: here	here
2	Wed, 20.04.2016	PCA & PCoA	Exercise Sheet British Food Consumption City distances DK City distances DE
3	Wed, 27.04.2016	Hierarchical Clustering	Exercise Sheet Dataset
4	Wed, 04.05.2016	Gap statistic & Expectation Maximization	Exercise Sheet Dataset
5	Wed, 11.05.2016	Q&A Session
6	Wed, 19.05.2016	--

Procedure of the oral exam

The exam will last about 25 minutes. At the beginning, one topic from the list below will be drawn at random. For each topic, the examinee should be prepared to give a short presentation of about 8 minutes. You may bring one page of hand-written notes (DIN A4 or US Letter, one-sided) for each topic. The examinee will have 2 minutes to briefly study the notes for the drawn topic before the presentation. The notes may be consulted during the exam if needed, but doing so will negatively affect the evaluation of the examinee's performance. During the presentation, only the blackboard may be used (overhead transparencies, for instance, are not allowed).

After the short presentation, additional questions will be asked about the presentation's topic and about other topics in the curriculum.

Below is the list of possible topics with some suggested content. The listed content is only a suggestion: it is not necessarily complete, nor must everything be covered in the short presentation. It is the responsibility of the examinee to gather and select the relevant information for each topic from the course material. On the course website you can find suggested readings for each of these topics.

Topics for the Oral Exam:

Graphical Analysis of Clusters

Histograms, Scatter Plots
Density Estimation (Parzen Windows, Kernel vs. kNN)
Principal Component Analysis
Principal Coordinate Analysis

Proximity Measures

Different data types
Common measures for various data types
Metrics
Similarities for structural data

Hierarchical Clustering

Working principles / linkage functions
Dendrograms
Comparison to crisp clusterings
Working principle of BIRCH

Optimization Based Clustering

Decomposition of the covariance matrix (W and B)
Cluster criteria
K-means
Gap statistic
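
For orientation, the decomposition into W and B mentioned in this topic is usually written as follows. This is a sketch in common textbook notation, which may differ from the notation on the lecture slides; the symbols $n_k$, $C_k$, $\bar{x}_k$ and $\bar{x}$ are assumptions introduced here. With $g$ clusters, where cluster $C_k$ contains $n_k$ objects and has mean $\bar{x}_k$, and $\bar{x}$ is the overall mean, the total scatter matrix $T$ splits into a within-cluster part $W$ and a between-cluster part $B$:

```latex
\begin{align*}
T &= \sum_{k=1}^{g} \sum_{i \in C_k} (x_i - \bar{x})(x_i - \bar{x})^{\top} \\
W &= \sum_{k=1}^{g} \sum_{i \in C_k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^{\top} \\
B &= \sum_{k=1}^{g} n_k \, (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^{\top} \\
T &= W + B
\end{align*}
```

The identity follows by writing $x_i - \bar{x} = (x_i - \bar{x}_k) + (\bar{x}_k - \bar{x})$; the cross terms vanish because $\sum_{i \in C_k} (x_i - \bar{x}_k) = 0$. Since $T$ is fixed for a given dataset, cluster criteria such as minimizing $\operatorname{trace}(W)$ are equivalent to maximizing $\operatorname{trace}(B)$.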

Cluster Analysis

Steps of a cluster analysis
Data preprocessing
Cluster Validation (internal vs. external)

Expectation Maximization

Working principle
Gaussian Mixture Models
Maximum Likelihood Estimators
Similarity to k-means

Advanced Clustering

Subspace clustering
Ensemble clustering
Co-clustering

Student Talks

You will receive one or two scientific papers about the clustering tool assigned to you. You are expected to prepare a short presentation of 15-20 minutes. It is important that you stay within this time frame! In your talk, you should cover the following aspects:

Present the underlying idea and the algorithm of the tool.
How was the tool evaluated (e.g., on what datasets, compared against what other tools, formal proofs of certain properties, etc.)?
What is your opinion on the pros and cons of the algorithm?

The paper is only a starting point: use it as the basis for your presentation, but feel free to dig up any other sources.

The talks will take place on the 9th and 10th of May. Please note that giving a talk is mandatory to be eligible for the exam. If you have not yet registered for a talk, please write me an email and I will assign you a topic.

#	Date (tentative)	Topic	Name	Papers
1	09.05.2016	Markov Clustering	Jakob	Paper
2	09.05.2016	Spectral Clustering	Bastian	Paper
3	09.05.2016	ClusterDP	Jonas	Paper Supplement
4	09.05.2016	Self Organizing Maps	Dan	Introduction SOMS for Clustering 1 SOMS for Clustering 2 SOMS in R
5	09.05.2016	Transitivity Clustering	Jon	Paper Supplement FORCE Algorithm
6	09.05.2016	DBSCAN	Troels	Paper
7	09.05.2016	Affinity Propagation	Kristine	Paper Supplement

Materials

All lecture slides are relevant for the exam.
All readings noted in the lecture list are relevant for the exam.
Brian S. Everitt, Sabine Landau, Morven Leese, Daniel Stahl, Cluster Analysis, 5th Edition, ISBN: 9780470749913
A good introduction to R: here
