Spring 2017 / DM843/DM856
Unsupervised Learning

General Information

One trend can be observed over almost all fields of informatics: we have to cope with an ever-increasing amount of available data of all kinds. This amount of data renders it impossible to inspect the dataset "by hand", or even deduce knowledge from the given data, without sophisticated computer aided help. In this course we will discuss one of the most common mechanism of unsupervised machine learning for investigating datasets: Clustering. Clustering separates a given dataset into groups of similar objects, the clusters, and thus allows for a better understanding of the data and their structure. We discuss a number of clustering methods and their application to various different fields such as biology, economics or sociology.

Lectures

#	Date	Content	Slides	Comments
1	Wed, 01.02.2017	Introduction	here
2	Thu, 02.02.2017	Mathematical Foundations	here
3	Wed, 08.02.2017	Graphical Analysis & Density Estimation	here
4	Thu, 09.02.2017	PCA & PCoA	here
5	Wed, 15.02.2017	Proximity Calculation	here
6	Thu, 16.02.2017	Hierarchical Clustering	here
7	Wed, 22.02.2017	Optimization Based Clustering	here
8	Thu, 23.02.2017	Model-based Clustering	here
9	Wed, 01.03.2017	Cluster Evaluation	here
10	Thu, 02.03.2017	Advanced Topics in Clustering	here
11	Wed, 08.03.2017	Student Talks	--	--
12	Tue, 09.03.2017	Student Talks	--	--
13	Wed, 15.03.2017	Student Talks	--	--
14	Thu, 16.03.2017	--	--	--

Exercises

#	Date	Questions	Download	Solutions
1	Tue, 07.02.2017	Introduction to R	here
2	Tue, 14.02.2017	PCA & PCoA	here British Food Consumption City distances DK City distances DE	PCA PCoA
3	Tue, 21.02.2017	Hierarchical Clustering	here dataset	here
4	Tue, 28.02.2017	GAP & GMM	here dataset	GAP GMM EM
5	Tue, 07.03.2017	--	--
6	Tue, 14.03.2017	--	--

Procedure of the oral exam

The exam will last about 20 minutes. At the beginning, one topic from the list below will be drawn randomly. For each topic the examinee should be prepared to make a short presentation of about 5 minutes. It is allowed to bring one page of hand-written notes (DIN A4 or US-Letter, one-sided) for each of the topics. The examinee will have 2 minutes to briefly study the notes for the drawn topic before the presentation. The notes may be consulted during the exam if needed but it will negatively influence the evaluation of the examinee's performance. During the presentation, only the blackboard can be used (you cannot use overhead transparencies, for instance).

After the short presentation, additional question about the presentation's topic but also about other topics in the curriculum will be asked.

Below is the list of possible topics and some suggested content. The listed content are only suggestions and is not necessarily complete nor must everything be covered in the short presentation (Except the small project. This topic might only pop up in the later part of the exam.). It is the responsibility of the examinee to gather and select among all relevant information for each topic from the course material.

Topics for the Oral Exam:

Graphical Analysis of Clusters

Histograms, Scatter Plots
Density Estimation (Parzen Windows, Kernel vs. kNN)
Principal Component Analysis
Principal Coordinate Analysis

Proximity Measures

Different datatypes
Common measures for various data types
Metrics
Similarities for structural data

Hierarchical Clustering

Function principles / linking functions
Dendrograms
Comparison to crisp clusterings
Function principle of BIRCH

Optimization Based Clustering

Dissection of the co-variance matrix (W and B)
Cluster criteria
K-means
GAP statistic

Cluster Analysis

Steps of a cluster analysis
Data preprocessing
Cluster Validation (internal vs. external)

Expectation Maximization

Function principle
Gaussian Mixture Models
Maximum Likelihood Estimators
Similarity to k-means

Advanced Clustering

Subspace clustering
Ensemble clustering
Co-clustering

Small Project (not a presentation topic)

Describe what tools you have decided to use

Why this choice? How did you try to optimize the parameters?

How did you evaluate your clustering? How did you incorporate your partial gold standard?

Student Talks

You are allowed to team-up with a fellow student in order to prepare and give the presentation. You will receive one or two scientific papers about the clustering tool assigned to you. You are supposed to create a small presentation which should be 15-20 minutes. It is important that you stay within this time frame! In your small talk, you should cover the following aspects (if possible):

Present the underlying idea and the algorithm of the tool
How was the tool evaluated (e.g., on what datasets, compared against what other tools, formal proofs of certain properties, etc.)
What is your opinion about the pros and the cons of the algorithm

The paper is only the starting point and you should use this as the basis for your presentation, but feel free to dig-up any other sources.

The talks will be on the 8th and 9th of April. Please take into account, that this talk is mandatory to be eligible for the exam. In case you have not yet registered for a talk, please write me an email and I will assign you a topic. Please find some further information on the project here

If you have not yet a topic assigned, please either talk to your class mates and join a team or selected one of the still available topics listed below (first come, first serve). The list will be updated accordingly. Let me know your decision as soon as possible.

MCODE
ClusterDP
DBSCAN
AMR
OptiGrid

List of so far assigned teams and topics:

#	Date (tentative)	Topic	Team	Papers
1	08.03.2016	Fuzzy c-means	Wojciech & Simon	Paper 1 Paper 2 Validity
2	08.03.2016	Markov Clustering	Casper	here
3	08.03.2016	Self organizing maps	Isabella	Introduction SOMS for Clustering 1 SOMS for Clustering 2 SOMS in R
4	08.03.2016	CLARANS	Simone	here
5	09.03.2016	Affinity Propagation	Peter	paper supplement
6	09.03.2016	clusterOne	Tsiamis & Dominika	paper supplement
7	09.03.2016	densityCut	Joseph & Nasteco	paper
8	09.03.2016	Transitivity Clustering	Gleb	paper supplement FORCE
9	15.03.2016	clusterDP	Ann-Sophie & Christian	paper supplement
10	15.03.2016	Spectral Clustering	Nath	here

Project

Project Material

Please find the project desciption here. If there is anything unclear or you think I didn't have captured the protocol as we discussed, please let me know. If you have any further questions, don't hesitate to write me an email or stop by at my office.
Protein sequences here. And the partial gold standard here.
A very short description how to perform a BLAST all-vs.-all run: https://www.biostars.org/p/6541/ Don't forget to set the E-value cut-off to 100.
You can download BLAST from the NCBI website: here. There you find the source code as well as precompiled versions for all usual operating systems.

Some Remarks regarding BLAST

If you run blastp including the parameter -outfmt 6 (as stated in the little tutorial) you will receive a file with 12 columns. The meaning of them is for example described here.

If you look at the link, you will see that you have to use the 11th column (the second last) in oder to get an E-value. With the -log(E-value) conversion, we are calculating a similarities. Now it might happen that some E-values are zero, that means the proteins are extremely similar and thus should also receive a very high similarity. Nevertheless, -log(0) equals infinity which is technically very hihg but not really practical. This can be repaired, e.g., by one of the following ways:

add some small pseudo count to each similarity (e.g., the smallest real float or double value > 0 before you calculate the log)
set all zeros to the lowest otherwise observed value >0.

With that you will receive a valid similarity matrix. If your tool requires distances, you need to convert these similarities to distances in a second preprocessing step.

Summary of the Project

In this project, we aim to separate a set of proteins into their protein families (it is not of utmost importance to know what exactly a protein family is). Without any prior biological knowledge, it is very hard to identify suitable parameters in order to identify protein families. Luckily, we have a partial gold-standard dataset consisting of 27 families, all belonging to one superfamily. The overall dataset consists of several superfamilies with an unknown number of families.

After selecting three clustering methods, we will use the gold standard in order to train the parameter sets. The best clustering is determined by using the discussed evaluation protocol of last class. These parameters are used in order to cluster the entire dataset. The results of clustering the entire dataset are also evaluated according to the protocol defined in class. At the end, you are supposed to summarize all your findings in a protocol, convincing me to favor one tool over another.

Reading Material on Cluster Evaluation

One ot the most important tasks of your project is to evaluate your clustering in order to conclude the best apporach to the problem. You will find (besides the lecture slides) some additional recommended readings on cluster evaluation.

Rendón, E. et al. (2011). Internal versus External cluster validation indexes. International Journal of computers and communications, 5(1), 27-34. download
Handl, J. et al. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15), 3201-3212. download
(Optional) Halkidi, M. et al. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(2-3), 107-145. download
(Optional) Xu, Rui, et al. Survey of clustering algorithms. Neural Networks, IEEE Transactions on, 2005, 16. Jg., Nr. 3, S. 645-678. download

Available Clustering Tools

As already announced in the lecture, you have to choose three of the tools listed below. If available, you can download the corresponding paper (either the original publication, or a good overview article) of the tool/method. This is only a suggsetion. You can also use any of the tools discussed by your fellow students during the presentaitons or use any other clustering tool you deem appropriate.

Any variant of k-means.
Any variant of hierarchical clustering.
Spectral clustering. publication
Affinity propagation. publication
Transitivity clustering / FORCE. publication
Markov clustering. publication

Details & Deadline

You are allowed to work in teams of up to 4 students. You will receive feedback on your report and your chosen strategy. Be aware that questions regarding the project might be part of the oral exam. Do not hesitate to contact me in case you have any additional questions.

Upload your report not later than Friday, 7th of April, 23:59:59 Danish Time

Materials

All lecture slides are relevant for the exams.
All readings noted in the lecture list are relevant for the exam.
Brian S. Everitt, Sabine Landau, Morven Leese, Daniel Stahl, Cluster Analysis, 5th Edition, ISBN: 9780470749913
A good introduction to R: here
R Cheat-Sheets: here

Spring 2017 / DM843/DM856 Unsupervised Learning