DM825 - Introduction to Machine Learning Sheet 7, Spring 2011

Exercise 1 Neural Networks for Time Series Prediction.

A common data analysis task is time series prediction, where we have data showing how some quantity varies over time, and we want to predict how it will vary in the future. Examples are stock markets, river levels and house prices.

The data set PNoz.dat contains daily measurements of the thickness of the ozone layer above Palmerston North in New Zealand between 1996 and 2004. Ozone thickness is measured in Dobson units; one Dobson unit corresponds to a layer 0.01 mm thick at 0 degrees Celsius and 1 atmosphere of pressure. The reduction in stratospheric ozone is partly responsible for global warming and the increased incidence of skin cancer. The thickness of the ozone layer varies naturally over the year, as you can see from the plot. (There are four fields in the data; the ozone level is the third.)

 K <- read.table("PNoz.dat")
 names(K) <- c("year", "day", "ozone.level", "sulphur.dioxide.level")
 plot(K$ozone.level, xlab = "Time (Days)", ylab = "Ozone (Dobson units)",
      pch = ".", cex = 1.5)

Your task is to use the multi-layer perceptron to predict the ozone levels into the future and see if you can detect an overall drop in the mean ozone level. Plot 400 predicted values together with the actual values.

The following is a reminder of the steps to carry out in the analysis:

• Select inputs and outputs for your problem and consequently the input and output nodes for the network.
• Normalize the data by rescaling.
• Split the data into training, validation and test (use the rule 50/25/25 if enough data or use cross validation with little data).
• Identify the main parameters to configure, e.g., the network architecture and others.
• Train the network and compare results for different parameter settings.
• Assess the performance on the test data.
• Analyse the bias-variance trade-off.
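The steps above can be sketched in R as follows. This is a minimal sketch, assuming the nnet package is installed; the lag width (2 days), the number of hidden units (size = 3), the weight decay and the iteration limit are illustrative choices that you should vary yourself, not prescribed values.

```r
## Sketch of the MLP pipeline; lag, size, decay and maxit are
## illustrative assumptions, to be tuned on the validation set.
library(nnet)

K <- read.table("PNoz.dat")
names(K) <- c("year", "day", "ozone.level", "sulphur.dioxide.level")

## Normalize the ozone series by rescaling it to [0, 1].
oz <- (K$ozone.level - min(K$ozone.level)) /
      (max(K$ozone.level) - min(K$ozone.level))

## Build lagged inputs: predict oz[t] from the previous 'lag' days.
lag <- 2
n <- length(oz)
X <- sapply(1:lag, function(i) oz[i:(n - lag + i - 1)])
y <- oz[(lag + 1):n]

## 50/25/25 split into training, validation and test sets.
m <- nrow(X)
idx.train <- 1:floor(0.50 * m)
idx.valid <- (floor(0.50 * m) + 1):floor(0.75 * m)
idx.test  <- (floor(0.75 * m) + 1):m

## Train one candidate network; repeat over 'size' and 'decay'
## and keep the model with the lowest validation error.
net <- nnet(X[idx.train, ], y[idx.train], size = 3, linout = TRUE,
            decay = 1e-4, maxit = 500)
mse <- function(idx) mean((predict(net, X[idx, ]) - y[idx])^2)
cat("validation MSE:", mse(idx.valid), "test MSE:", mse(idx.test), "\n")

## Plot 400 predicted values together with the actual values.
pred <- predict(net, X[idx.test, ])[1:400]
plot(y[idx.test][1:400], type = "l", xlab = "Time (days)",
     ylab = "Ozone (rescaled)")
lines(pred, lty = 2)
```

To predict further than one day ahead, feed the network's own predictions back in as inputs and watch how the error grows with the horizon.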

Exercise 2 Probability theory.

In class we often used the rule:

 p(x_i | x_1, …, x_{i-1}, x_{i+1}, …, x_N) = p(x_1, …, x_N) / ∫ p(x_1, …, x_N) dx_i

Derive this rule from the product rule and sum rule.
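As a starting point, here is a sketch of the two ingredients, writing x_{\setminus i} for all variables except x_i (this notation is an assumption of the sketch, not from the sheet):

```latex
% Product rule: factor the joint over x_i and the remaining variables.
p(x_1,\dots,x_N) = p(x_i \mid x_{\setminus i})\, p(x_{\setminus i})
% Sum rule: marginalise the joint over x_i.
p(x_{\setminus i}) = \int p(x_1,\dots,x_N)\, \mathrm{d}x_i
% Solving the first line for p(x_i | x_{\setminus i}) and substituting
% the second into the denominator yields the rule stated above.
```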

Exercise 3 Naive Bayes.

Consider the binary classification problem of spam email, in which a binary label Y ∈ {0, 1} is to be predicted from a feature vector X = (X1, X2, …, Xn), where Xi = 1 if word i is present in the email and 0 otherwise. Consider a naive Bayes model, in which the components Xi are assumed mutually conditionally independent given the class label Y.

1. Draw a directed graphical model corresponding to the naive Bayes model.
2. Find a mathematical expression for the posterior class probability p(Y = 1 | x), in terms of the prior class probability p(Y = 1) and the class-conditional densities p(xi | y).
3. Now make explicit the parameters of the Bernoulli distributions for Y and the Xi; call them µ and θi, respectively. Assume a beta distribution as the prior for these parameters and show how to learn them from a set of training data (y_j, x_j), j = 1, …, m, using a Bayesian approach. Compare this solution with the one developed in class via maximum likelihood.
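As a reminder, here is a sketch of the Beta-Bernoulli conjugacy that the Bayesian treatment of µ rests on; the prior parameters a and b are assumptions of this sketch, and the same update applies to each θi within the emails of one class.

```latex
% Beta(a, b) prior on the Bernoulli parameter mu:
p(\mu) \propto \mu^{a-1} (1-\mu)^{b-1}
% Posterior after observing k ones among m draws (k = number of
% training emails with y_j = 1):
p(\mu \mid \mathcal{D}) = \mathrm{Beta}(\mu \mid a + k,\; b + m - k)
% Posterior mean, usable as the predictive probability of Y = 1:
\mathbb{E}[\mu \mid \mathcal{D}] = \frac{a + k}{a + b + m}
% Maximum likelihood instead gives k / m, which the posterior mean
% approaches as m grows.
```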