Master Thesis / ISA Suggestions
Here, you can find a list of proposed projects that you can conduct either as an ISA or, preferably, as an MSc thesis. Please feel free to contact me at any time. Furthermore, if you have an idea of your own and think it might fit the scope of our group, do not hesitate to contact us. The suggested projects can and will be adjusted to personal needs and preferences.
Federated Machine Learning
Description
Vast amounts of data are collected and stored every day. Nevertheless, when looking at the most pressing problems of our time, how come complex machine learning models, for instance for cancer treatment, are trained on only a few thousand samples? The core of this issue is that the data is available but must not be shared due to legal restrictions and bureaucratic hurdles, in particular when it comes to sensitive data.
In this project, you will help to alleviate this problem by utilizing an approach called federated machine learning. Here, the data remains at the secure site where it is stored, and the machine learning comes to the data. A small local model is trained at each site, and the local models are then combined into a global model while only anonymous model parameters are transmitted. With that, we will be able to train high-quality models without compromising data security and privacy.
The project is part of FeatureCloud, a large consortium project. You will implement federated solutions for specific problems which integrate seamlessly into the FeatureCloud platform. Potential projects are:
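To make the idea concrete, here is a minimal sketch of the "train locally, combine globally" pattern. It is an illustration under stated assumptions, not FeatureCloud code: the function names `train_local_model` and `aggregate` are hypothetical, the "model" is just a per-feature mean, and the combination step is a plain sample-size-weighted average of the local parameters.

```python
from typing import List, Sequence, Tuple

LocalResult = Tuple[List[float], int]  # (model parameters, number of local samples)


def train_local_model(samples: Sequence[Sequence[float]]) -> LocalResult:
    """Runs at one site. The toy 'model' is the per-feature mean of the local data.

    Only the parameters and the sample count are returned; the raw samples
    never leave this function (hypothetical stand-in for a real local model).
    """
    n = len(samples)
    dims = len(samples[0])
    params = [sum(row[d] for row in samples) / n for d in range(dims)]
    return params, n


def aggregate(local_results: Sequence[LocalResult]) -> List[float]:
    """Runs at the coordinator. Combines the local parameters into a global
    model by averaging them, weighted by each site's sample count."""
    total = sum(n for _, n in local_results)
    dims = len(local_results[0][0])
    return [
        sum(params[d] * n for params, n in local_results) / total
        for d in range(dims)
    ]


if __name__ == "__main__":
    # Two hypothetical sites with different amounts of data.
    site_a = [[1.0, 2.0], [3.0, 4.0]]
    site_b = [[5.0, 6.0]]
    global_model = aggregate([train_local_model(site_a), train_local_model(site_b)])
    print(global_model)  # [3.0, 4.0], identical to the mean over the pooled data
```

For this toy model the weighted average reproduces exactly what training on the pooled data would give. For more complex models this is generally not the case, which is part of what makes federated algorithms an interesting research topic.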
- Federated Clustering Algorithms
- Federated Similarity Calculation
- Federated Cluster Evaluation
- Federated Network Enrichment
Impact of fibre, red and processed meat on risk of chronic inflammatory diseases: a prospective UK Biobank cohort study on prognostic factors and personalised medicine
Description
UK Biobank is an incredibly large data source that has collected data from 500,000 British individuals, enabling the investigation of gene-lifestyle interactions in relation to the development of diseases. The data is already available, and coping with the sheer amount of it requires advanced data science skills.
The aim of this project is to program statistical models that investigate gene-diet interactions in an efficient way. You will learn to perform observational and advanced computer analyses such as interaction studies, case-control studies and case-only studies. Further, you will learn how to do network and pathway analyses.
You will be supervised by Richard Röttger and Vibeke Andersen, professor, Research Leader, Research Unit for Molecular Diagnostics and Clinical Research, University Hospital of Southern Denmark.
Practical Information. The study is designed as a Master Thesis project and can also be done in a team: two students who each work on their own project but collaborate on methods, interpretation, manuscript writing, etc., are preferred. You will be responsible for writing a manuscript under supervision. In-depth knowledge of medicine or biology is not crucial. The project could form the basis for a PhD study, if desired.
GWAS on the impact of red and processed meat on chronic diseases
Description
We have access to an impressive wealth of data concerning human well-being. It is apparent that even among people sharing a common lifestyle, the history of diseases and the susceptibility to, for instance, cancer or chronic diseases differ vastly. Here, genetic differences might play a crucial role in determining which factors lead to one person living a happy life while another suffers from a chronic disease. To this end, so-called genome-wide association studies (GWAS) are performed, which seek to statistically link outcomes (e.g., diseases) with genetic variants.
In this project, we seek to understand the impact of lifestyle, including diet, on the risk of developing chronic inflammatory diseases by investigating gene-environment interactions. We will work on the massive data of UK Biobank, which requires careful study design, efficient, distributed implementations and the use of large cluster computers or even the ABACUS supercomputer.
The project will be done in collaboration with Prof. Vibeke Andersen and entails the following steps:
- Familiarization with the provided data
- Design and specification of the GWAS
- Efficient implementation and deployment of the study onto a cluster computer
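As a rough illustration of what "statistically linking outcomes with genetic variants" means at the level of a single variant, the sketch below runs a basic allelic chi-square test (1 degree of freedom) on a 2x2 table of allele counts in cases and controls, and then scans a set of variants. This is only the simplest textbook association test, not the gene-environment interaction analysis the project targets. The function names, variant names and counts are hypothetical; the threshold of 5e-8 is the conventional genome-wide significance level.

```python
import math
from typing import Dict, Tuple

# Conventional genome-wide significance threshold.
GENOME_WIDE_ALPHA = 5e-8

# (case_alt, case_ref, control_alt, control_ref) allele counts
AlleleCounts = Tuple[int, int, int, int]


def allelic_chi_square(case_alt: int, case_ref: int,
                       control_alt: int, control_ref: int) -> Tuple[float, float]:
    """Pearson chi-square test on a 2x2 allele-count table.

    Returns (chi-square statistic, p-value). The p-value uses the identity
    P(X > x) = erfc(sqrt(x / 2)) for a chi-square variable with 1 degree
    of freedom.
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty: the test is undefined, report no signal.
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def scan(variants: Dict[str, AlleleCounts],
         alpha: float = GENOME_WIDE_ALPHA) -> Dict[str, float]:
    """Test every variant independently and keep those with p < alpha."""
    hits = {}
    for name, counts in variants.items():
        _, p_value = allelic_chi_square(*counts)
        if p_value < alpha:
            hits[name] = p_value
    return hits


if __name__ == "__main__":
    # Hypothetical variants with made-up counts, for illustration only.
    variants = {
        "variant_weak": (30, 70, 10, 90),        # chi-square = 12.5
        "variant_strong": (900, 100, 100, 900),
    }
    print(sorted(scan(variants)))
```

Because every variant is tested independently, the scan parallelises trivially across variants, which is what makes deployment on a cluster computer attractive for data of this size.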
Machine Learning - Can we trust it?
Description
In recent years, we have seen a tremendous growth of impressive examples of the possibilities of machine learning: from Go-playing supercomputers to self-driving cars, the development has been breathtaking. Nevertheless, machine learning is not a one-size-fits-all procedure and requires a considerable amount of knowledge when it comes to applying the right hammer to the right problem. In many fields, standard machine learning techniques are used by non-experts in order to make predictions, learn models and infer knowledge. In biomedicine, machine learning is used in nearly every field, with consequences for research and patients. The question is: Can we trust the results? Do we discover a signal or rather spurious noise? How sensitive are these methods to outliers, parameter settings, etc.? Can we explain why and how the machine came up with a certain result?
In this project, we want to automate the execution and evaluation of machine learning approaches in order to facilitate a large-scale comparative analysis of common techniques on common tasks with respect to their performance, their sensitivity to noise and their stability. The ultimate goal is to provide practitioners with solid criteria for choosing which method with which parameters to use for a concrete problem.
The project phases might be as follows:
- Familiarize yourself with the topic and select tools and problems
- Design and implement the automation of machine learning tasks
- Design the exact parameters of the study
- Conduct and evaluate the study
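The kind of automated sensitivity study described above could, in its smallest form, look like the following sketch. It is an assumption-laden toy, not the project's design: the classifier is a simple nearest-centroid rule, the perturbation is Gaussian noise added to the training features, and all function names are hypothetical. The harness trains on repeatedly perturbed data and reports the mean test accuracy per noise level.

```python
import random
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[List[float], int]  # (feature vector, class label)


def fit_nearest_centroid(train: Sequence[Sample]) -> Dict[int, List[float]]:
    """Compute one centroid (per-feature mean) per class label."""
    by_label: Dict[int, List[List[float]]] = {}
    for features, label in train:
        by_label.setdefault(label, []).append(features)
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in by_label.items()}


def predict(centroids: Dict[int, List[float]], features: Sequence[float]) -> int:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda lab: sum((x - c) ** 2
                                   for x, c in zip(features, centroids[lab])))


def accuracy(centroids: Dict[int, List[float]], data: Sequence[Sample]) -> float:
    return mean(1.0 if predict(centroids, x) == y else 0.0 for x, y in data)


def add_gaussian_noise(data: Sequence[Sample], sigma: float,
                       rng: random.Random) -> List[Sample]:
    """Perturb every feature with zero-mean Gaussian noise; labels stay intact."""
    return [([v + rng.gauss(0.0, sigma) for v in x], y) for x, y in data]


def noise_sensitivity(train: Sequence[Sample], test: Sequence[Sample],
                      sigmas: Sequence[float], repeats: int = 20,
                      seed: int = 0) -> Dict[float, float]:
    """Mean test accuracy per noise level, averaged over repeated perturbations."""
    rng = random.Random(seed)  # fixed seed keeps the study reproducible
    results = {}
    for sigma in sigmas:
        scores = [accuracy(fit_nearest_centroid(add_gaussian_noise(train, sigma, rng)),
                           test)
                  for _ in range(repeats)]
        results[sigma] = mean(scores)
    return results


if __name__ == "__main__":
    train = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([5.0, 5.0], 1), ([5.0, 6.0], 1)]
    test = [([0.5, 0.5], 0), ([5.5, 5.5], 1)]
    for sigma, acc in noise_sensitivity(train, test, [0.0, 1.0, 10.0]).items():
        print(f"sigma={sigma}: mean accuracy={acc:.2f}")
```

Swapping in other methods, datasets or perturbations (outliers, label noise, parameter changes) only requires replacing the corresponding function, which is the main point of automating the study.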
Parameter-free hierarchical clustering
Description
Clustering is a standard technique in unsupervised machine learning that partitions the objects of a dataset into groups of similar objects. Clustering is widely applied in a plethora of different fields, from astronomy, social sciences and economics to bioinformatics. Even though clustering is a standard technique, the process itself is quite complicated and many mistakes can be made. In particular, as the ground truth is unknown, it is hard to decide whether a result is good or bad. The practitioner has to answer many different questions, for example how to pre-process the data or which tool to use. Even once a tool is chosen, every clustering approach requires at least one parameter to be set, which defines whether we want many small or few large clusters. Furthermore, the clusters in a dataset might not all share the same properties, so that you would actually require multiple parameters for the same dataset.
In this project, we aim to extend our existing Transitivity Clustering to a parameter-free hierarchical clustering tool. The idea is to build a hierarchical cluster structure and to compare the quality of the clustering at each node with that of a clustering with similar properties but without structure, in order to define the quality of the split. This will allow for a dynamic tree-cut that derives the best possible clustering for the whole dataset without user interference.
The project will be structured as follows:
- Familiarize yourself with Transitivity Clustering
- Extend Transitivity Clustering to a hierarchical approach
- Design and implement the relative quality measure
- Implement the dynamic tree-cut into the hierarchical clustering
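One possible reading of the relative quality measure and the dynamic tree-cut is sketched below. It does not use Transitivity Clustering and is not the group's method: as a stand-in, it recursively bisects one-dimensional data, measures the quality of a split as the relative reduction in within-cluster squared error, and accepts the split only if it beats the best split found on structureless (uniform random) data of the same size. All function names, the minimum node size and the number of reference draws are assumptions made for this illustration.

```python
import random
from typing import List, Sequence, Tuple

MIN_NODE_SIZE = 4      # assumption: smaller nodes are never split
REFERENCE_DRAWS = 20   # assumption: number of structureless reference datasets


def sse(values: Sequence[float]) -> float:
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split_gain(sorted_values: Sequence[float]) -> Tuple[float, int]:
    """Best two-way split of sorted 1-D data.

    Returns (relative reduction in squared error, split index). The relative
    gain is invariant to shifting and scaling the data.
    """
    total = sse(sorted_values)
    if total == 0.0:
        return 0.0, 0  # all values identical: nothing to split
    best_gain, best_idx = 0.0, 0
    for i in range(1, len(sorted_values)):
        gain = (total - sse(sorted_values[:i]) - sse(sorted_values[i:])) / total
        if gain > best_gain:
            best_gain, best_idx = gain, i
    return best_gain, best_idx


def reference_gain(n: int, rng: random.Random) -> float:
    """Largest split gain observed on uniform random data of the same size."""
    return max(best_split_gain(sorted(rng.random() for _ in range(n)))[0]
               for _ in range(REFERENCE_DRAWS))


def hierarchical_cut(values: Sequence[float], rng: random.Random) -> List[List[float]]:
    """Split recursively; stop at a node when its split is no better than chance."""
    ordered = sorted(values)
    if len(ordered) < MIN_NODE_SIZE:
        return [ordered]
    gain, idx = best_split_gain(ordered)
    if idx == 0 or gain <= reference_gain(len(ordered), rng):
        return [ordered]  # dynamic tree-cut: keep this node as one cluster
    return hierarchical_cut(ordered[:idx], rng) + hierarchical_cut(ordered[idx:], rng)


if __name__ == "__main__":
    data = [100.1, 0.0, 100.0, 0.2, 0.1, 100.2]
    print(hierarchical_cut(data, random.Random(0)))
```

The user never supplies a density or cluster-count parameter; the decision to split is made locally at each node, so different branches of the tree can end at different depths.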