# DM825 - Introduction to Machine Learning Sheet 4, Spring 2011

Exercise 1 Generalized Linear Models and Neural Networks.

1. Show that the multinomial distribution is a member of the exponential family, determining the canonical response function, called softmax or normalized exponential, and its inverse, the link function. Take into account the fact that the parameters $\theta_k$ of the multinomial distribution are not independent because $\sum_{k=1}^{K} \theta_k = 1$.

Solution: The solution of this exercise is developed on pages 114 and 115 of [B1].
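As a reminder of how the argument goes, here is a brief sketch of the standard derivation. It assumes a single trial with one-of-$K$ coding $x$ (so $\sum_k x_k = 1$); the symbols $x$ and $\eta_k$ are notation chosen for this sketch and are not taken from the sheet.

```latex
\begin{align*}
p(x \mid \theta) &= \prod_{k=1}^{K} \theta_k^{x_k}
  = \exp\Big\{ \sum_{k=1}^{K} x_k \ln \theta_k \Big\} \\
 &= \exp\Big\{ \sum_{k=1}^{K-1} x_k \ln \frac{\theta_k}{\theta_K} + \ln \theta_K \Big\},
 \qquad \theta_K = 1 - \sum_{j=1}^{K-1} \theta_j \\
\eta_k &= \ln \frac{\theta_k}{\theta_K} \quad \text{(link)}, \qquad
\theta_k = \frac{\exp(\eta_k)}{1 + \sum_{j=1}^{K-1} \exp(\eta_j)} \quad \text{(response, softmax)}
\end{align*}
```

The second line uses $x_K = 1 - \sum_{k<K} x_k$ to eliminate the dependent parameter, which leaves $K-1$ free natural parameters $\eta_k$.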

2. Fisher’s `iris` data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are “Iris setosa”, “versicolor” and “virginica”. [sheet4_1b.R]

Use a multiple logistic model (i.e., multinomial) to predict the species, using generalized linear models to fit the parameters. Because the multinomial response is multivariate, we cannot use the `glm` function in R. An alternative function is `multinom` from the package `nnet`.

Solution:

    head(iris)
    str(iris)
    library(nnet)

    # fit the multinomial model
    res <- multinom(Species ~ ., data=iris)
    summary(res)

    # The estimated class probabilities for the training data can be found in
    # res$fitted.values
    # the classification assigns the object to the class with maximal
    # estimated probability
    res$fitted.values
    pr <- predict(res, iris)
    table(iris$Species, pr)

    #
    # We can plot the estimated class probabilities as a function of
    # Petal.Width, for Sepal.Length 5.8, Sepal.Width 3 and Petal.Length 4.35
    #
    x1 <- seq(0.1, 2.5, 0.1)
    n <- length(x1)
    b <- summary(res)
    beta2 <- b$coefficients[1,]   # versicolor vs. setosa (baseline)
    beta3 <- b$coefficients[2,]   # virginica vs. setosa (baseline)
    p <- matrix(0, n, 3)
    for (i in 1:n){
      x <- c(1, 5.80, 3, 4.35, x1[i])
      e2 <- exp(beta2 %*% x)
      e3 <- exp(beta3 %*% x)
      et <- 1 + e2 + e3
      p[i,1] <- 1/et
      p[i,2] <- e2/et
      p[i,3] <- e3/et
    }
    # the x axis is Petal.Width, the variable stored in x1
    plot(x1, p[,1], type="l", ylim=c(0,1), xlab="Petal.Width", ylab="Class probability")
    lines(x1, p[,2], lty=2)
    lines(x1, p[,3], lty=3)
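
The link with the softmax of the first point can be checked numerically. The following is a minimal sketch, assuming the same model as in the solution: it rebuilds the class probabilities from the fitted coefficients and compares them with what `predict` reports. The names `B`, `X`, `eta` and `P` are introduced here for illustration only.

```r
library(nnet)

# Same model as in the solution; trace = FALSE only silences the optimizer log.
res <- multinom(Species ~ ., data = iris, trace = FALSE)

B <- coef(res)                          # 2 x 5: rows versicolor and virginica, setosa is the baseline
X <- cbind(1, as.matrix(iris[, 1:4]))   # design matrix with an intercept column

eta <- cbind(0, X %*% t(B))             # linear predictors, baseline class fixed at 0
P <- exp(eta) / rowSums(exp(eta))       # softmax applied row by row

# largest absolute difference from the probabilities returned by predict()
max_diff <- max(abs(P - predict(res, iris, type = "probs")))
max_diff
```

If the reconstruction is right, `max_diff` should be zero up to floating point error.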
3. Neural networks provide a flexible non-linear extension of multinomial regression. In R the function `nnet` from the package `nnet` provides an implementation to fit single-hidden-layer neural networks, possibly with skip-layer connections (i.e., links from the input nodes directly to the output nodes). Check the examples in the help page of this function. Compare its results with those of the GLM from the previous point and comment.

Solution:

    head(iris)
    str(iris)
    library(nnet)

    # fit the network on a training set of half of the data
    # (25 flowers per species; rows 51:100 are versicolor)
    samp <- c(sample(1:50,25), sample(51:100,25), sample(101:150,25))
    res <- nnet(Species ~ ., data=iris, subset=samp, size=2, maxit=1000)
    summary(res)

    # The estimated class probabilities for the training data can be found in
    # res$fitted.values
    res$fitted.values

    # the classification assigns the object to the class with maximal
    # estimated probability
    # Let's see the performance on the test set
    pr <- predict(res, iris[-samp,], type="class")
    table(iris$Species[-samp], pr)

    # comparison with multinomial regression:
    res <- multinom(Species ~ ., data=iris, subset=samp)
    pr <- predict(res, iris[-samp,])
    table(iris$Species[-samp], pr)
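
To support the comment asked for in the exercise, the two confusion tables can be condensed into a single test accuracy per model. This is only a sketch: the seed is arbitrary, the helper `accuracy` is hypothetical, and the numbers depend on the random split and on the initial weights of the network.

```r
library(nnet)
set.seed(1)  # arbitrary seed, only to make the split and the initial weights reproducible

samp <- c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25))
test <- iris[-samp, ]

# hypothetical helper: fraction of test flowers assigned to the right species
accuracy <- function(pred) mean(as.character(pred) == as.character(test$Species))

nn_fit  <- nnet(Species ~ ., data = iris, subset = samp,
                size = 2, maxit = 1000, trace = FALSE)
glm_fit <- multinom(Species ~ ., data = iris, subset = samp, trace = FALSE)

acc <- c(nnet     = accuracy(predict(nn_fit, test, type = "class")),
         multinom = accuracy(predict(glm_fit, test)))
acc
```

Repeating the comparison over several seeds gives a fairer picture than a single split.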

Exercise 2 Perceptron. This exercise asks you to implement the perceptron algorithm and plot its result. As a data set we use a simplified binary classification problem derived from the `iris` data. [sheet4_2.R]

Solution

Perceptron animation
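
A minimal sketch of what such an implementation might look like is given below. It is not the content of sheet4_2.R: the choice of classes (setosa against versicolor), the two petal features, the learning rate `eta` and the function name `perceptron` are all assumptions made for this illustration.

```r
# Binary problem (an assumption of this sketch): setosa (+1) vs. versicolor (-1),
# using only the two petal measurements.
d <- iris[iris$Species != "virginica", ]
X <- cbind(1, d$Petal.Length, d$Petal.Width)   # the leading 1 is the bias input
y <- ifelse(d$Species == "setosa", 1, -1)

perceptron <- function(X, y, eta = 1, max_epochs = 1000) {
  w <- rep(0, ncol(X))
  for (epoch in seq_len(max_epochs)) {
    mistakes <- 0
    for (i in seq_len(nrow(X))) {
      # a point is misclassified when y_i * (w . x_i) <= 0
      if (y[i] * sum(w * X[i, ]) <= 0) {
        w <- w + eta * y[i] * X[i, ]   # move the boundary towards the point
        mistakes <- mistakes + 1
      }
    }
    if (mistakes == 0) break   # a full pass without updates: stop
  }
  list(w = w, epochs = epoch, converged = (mistakes == 0))
}

fit <- perceptron(X, y)
pred <- ifelse(drop(X %*% fit$w) > 0, 1, -1)
table(y, pred)

# data and decision boundary w0 + w1*x1 + w2*x2 = 0
plot(d$Petal.Length, d$Petal.Width, pch = ifelse(y == 1, 1, 3),
     xlab = "Petal.Length", ylab = "Petal.Width")
if (fit$w[3] != 0) abline(a = -fit$w[1] / fit$w[3], b = -fit$w[2] / fit$w[3], lty = 2)
```

The two classes chosen here are linearly separable in the petal measurements, so the loop stops after finitely many updates; on a non-separable problem it would simply run until `max_epochs`.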

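Exercise 3 Neural Networks. In the derivation of the backward propagation procedure we used the fact that the partial derivative of the error with respect to the activation of the output units is given by

$$\frac{\partial \mathrm{Err}}{\partial a_k} = \hat{y}_k - y_k \qquad (1)$$

where $\hat{y}_k = f_k(x, \theta)$ is the output of unit $k$ and $y_k$ the corresponding target.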

For a single output:

1. Show that this fact holds true for the squared error function.
2. Using a probabilistic interpretation of the network output, show that (1) holds true for any of the following conditional distributions and output activation functions: $y \mid x, \theta \sim \mathcal{N}(f(x, \theta), \beta^{-1})$ and identity (for regression), $y \mid x, \theta \sim \mathrm{Bern}(f(x, \theta))$ and logistic sigmoid (for binary classification), $y_k \mid x, \theta \sim \mathrm{Bern}(f_k(x, \theta))$ and softmax function (for multinomial classification).
3. Show that (1) is a general result of assuming a conditional distribution for the target variable from the exponential family, along with a corresponding choice for the activation function (or for the canonical link function).
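
As a starting point, the first two cases can be sketched as follows for a single output. The notation is an assumption of this sketch: $\hat{y}$ denotes the network output, $y$ the target, $a$ the activation of the output unit and $\sigma$ the logistic sigmoid; the error in the Bernoulli case is taken to be the negative log-likelihood.

```latex
\begin{align*}
\text{Squared error, identity output } (\hat{y} = a):\quad
  \mathrm{Err} &= \tfrac{1}{2}(\hat{y} - y)^2
  &&\Rightarrow\quad \frac{\partial \mathrm{Err}}{\partial a} = \hat{y} - y \\[4pt]
\text{Bernoulli, sigmoid output } (\hat{y} = \sigma(a)):\quad
  \mathrm{Err} &= -\big[\, y \ln \hat{y} + (1 - y) \ln (1 - \hat{y}) \,\big] \\
  \frac{\partial \mathrm{Err}}{\partial a}
  &= \Big( -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} \Big)\, \hat{y}\,(1 - \hat{y})
   = \hat{y} - y
\end{align*}
```

The second case uses $\sigma'(a) = \sigma(a)\,(1 - \sigma(a))$. For the Gaussian case the negative log-likelihood equals the squared error up to the factor $\beta$ and an additive constant, so the same form is obtained up to that factor.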