Model Selection Assignment

 

1)   Assume that the simple linear model Yi = b0 + b1 Xi + N(0, s^2) creates data with

a)    b0 = 0.

b)   b1 = 1.

c)    s = 4.

d)   N(m, s^2) is a Gaussian distribution with mean m and variance s^2.

e)    Xi = i, i = 1..10.

f)    50 identical simulated subjects.
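The generating process in item 1 can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the code that produced the handout's numbers: the function name simulate_subjects and the random seed are hypothetical, so the averages it prints will not match the data listed in item 2.

```python
import numpy as np


def simulate_subjects(n_subjects=50, b0=0.0, b1=1.0, s=4.0, seed=0):
    """Return an (n_subjects, 10) array of simulated Y values.

    The seed is an arbitrary choice made for this sketch.
    """
    rng = np.random.default_rng(seed)
    x = np.arange(1, 11)  # Xi = i, i = 1..10
    # rng.normal takes the standard deviation s, not the variance s^2.
    noise = rng.normal(0.0, s, size=(n_subjects, x.size))
    return b0 + b1 * x + noise


y = simulate_subjects()
total_mean = y.mean(axis=0)             # averaged across all 50 subjects
calibration_mean = y[:25].mean(axis=0)  # subjects 1-25
validation_mean = y[25:].mean(axis=0)   # subjects 26-50
print(total_mean.round(4))
```

Averaging across subjects shrinks the noise, which is why the averaged data follow the line Y = X much more closely than any single simulated subject would.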

2)   The data are:

a)    The total data averaged across all 50 subjects are:

i)     0.9344    1.9569    3.0837    3.9141    4.9887    6.0183    7.0098    8.0336    8.9962    9.9516

b)   The calibration data averaged across subjects 1-25 are:

i)     1.6125    4.0449    3.4481    5.0825    3.5781    5.5630    7.1484    8.1263    8.9737    9.2795

c)    The validation data averaged across subjects 26-50 are:

i)     2.4833    1.2259    3.1703    2.9547    6.3008    5.6370    8.2899    9.0172    9.6241    9.1766

3)   Assume the two models:

a)    Model 1

i)     Y_hat_i = b0 + b1 Xi + N(0, s^2)

ii)   b0 is free to vary.

iii) b1 is free to vary.

iv) fixed s = 4.

v) Note that the noise term is included to be consistent with regression notation and so that likelihood can be calculated; you don't need to simulate noise.

b)   Model 2

i)     Y_hat_i = b0 + b1 Xi + N(0, s^2)

ii)   b0 is free to vary.

iii) fixed b1 = 1.

iv) fixed s = 4.

v)   Note that this model is nested in the former one.

4)   Equations to know:

a)    The log likelihood (monotonically related to likelihood) of these models is given by: log L = -(n/2) log(2p) - (n/2) log(s^2) - 1/(2 s^2) S (Yi - b0 - b1 Xi)^2.

b)   n = the number of subjects that contributed to the data.  Note that this changes depending on which data you use.

c)    p is the number pi.

d)   S is the summation over i.

e)    You may assume throughout that we have fixed s = 4.

f)    log is the natural log.

g)    The standard order of operations applies: powers first, then multiplication and division, then addition and subtraction, each evaluated from left to right.

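h)    The maximum likelihood estimators for b1 and b0 are the same as the least-squares (SSE) estimators:

i)     b1 = S (Xi - X_bar)(Yi - Y_bar) / S (Xi - X_bar)^2.

ii)   b0 = Y_bar - b1 X_bar.

iii) A bar indicates the mean of a variable (X_bar, Y_bar).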
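The equations in item 4 translate fairly directly into code. The following is a minimal sketch, assuming NumPy and the standard least-squares estimators for a straight line; the function names (log_likelihood, fit_model1, fit_model2) and the noise-free toy data are hypothetical and are not the handout's data.

```python
import numpy as np

S_FIXED = 4.0  # s is fixed at 4 in both models


def log_likelihood(x, y, b0, b1, n, s=S_FIXED):
    """Log likelihood from item 4a; n is the number of contributing subjects."""
    sse = np.sum((y - b0 - b1 * x) ** 2)
    return -n / 2 * np.log(2 * np.pi) - n / 2 * np.log(s**2) - sse / (2 * s**2)


def fit_model1(x, y):
    """Model 1: b0 and b1 both free (least-squares estimators)."""
    dx = x - x.mean()
    b1 = np.sum(dx * (y - y.mean())) / np.sum(dx**2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1


def fit_model2(x, y, b1=1.0):
    """Model 2: b1 is fixed, so only the intercept b0 is estimated."""
    b0 = y.mean() - b1 * x.mean()
    return b0, b1


# Toy, noise-free data (hypothetical): Y = 2 + 3 X.
x = np.arange(1, 11, dtype=float)
y_toy = 2.0 + 3.0 * x

b0_1, b1_1 = fit_model1(x, y_toy)
b0_2, b1_2 = fit_model2(x, y_toy)
ll_1 = log_likelihood(x, y_toy, b0_1, b1_1, n=50)
ll_2 = log_likelihood(x, y_toy, b0_2, b1_2, n=50)
print(b0_1, b1_1, ll_1)
print(b0_2, b1_2, ll_2)
```

Because Model 2 is nested in Model 1, the maximized log likelihood of Model 1 on the data it was fit to can never be lower than that of Model 2. Swapping y_toy for one of the averaged data vectors in item 2, with n set to the matching number of subjects, gives the fits asked for in item 5.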

5)   To turn in

a)    Max log likelihood

i)     Find the max log likelihood parameter values for Models 1 & 2 for the total data.

(1) Report the fits and best fitting parameter values.

(2) Which model fit better?

(3) Does this make sense given how the data were created?

b)   AIC

i)     How many parameters does each model have?

ii)   Compare Model 1 and Model 2 via AIC (AIC = -2 log L + 2K, where K is the number of free parameters).

iii) Give the adjusted likelihood values, i.e., the AIC values.

iv) Which model fit better with AIC?

v)   Are these results different from those obtained with max log likelihood?

vi) Why might they be?

vii) Which gave "better" results, max log likelihood or AIC?
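The AIC penalty is simple arithmetic, and a short sketch shows how it can reverse a ranking based on log likelihood alone. The log likelihood values below are made up for illustration and are not computed from the handout's data; the parameter counts assume that only parameters marked "free to vary" count toward K.

```python
def aic(log_l, k):
    """AIC = -2 log L + 2K; lower values indicate the preferred model."""
    return -2.0 * log_l + 2.0 * k


# Hypothetical log likelihoods, chosen only to illustrate the penalty.
log_l_model1, k_model1 = -100.0, 2  # assumed: b0 and b1 free
log_l_model2, k_model2 = -100.5, 1  # assumed: only b0 free

aic_model1 = aic(log_l_model1, k_model1)
aic_model2 = aic(log_l_model2, k_model2)
print(aic_model1, aic_model2)
```

In this made-up case Model 1 has the higher log likelihood, yet Model 2 has the lower AIC, because the extra parameter raises the log likelihood by less than 1.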

c)    Cross-validation

i)     Find and give the max log likelihood parameter values for Models 1 & 2 for the calibration data.

ii)   Using these parameter values (5.c.i), what is the log likelihood on the validation set?

iii) Which model fit better on the calibration set?

iv) Which model fit better on the validation set when using the calibration set parameters?

v)   Are they different?

vi) Why might they be?

vii) Did the "right" model win?
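The cross-validation procedure in 5.c can be sketched as follows: fit each model on the calibration averages, then score those same parameter values on the validation averages. This is one possible layout rather than a required solution; it assumes NumPy, n = 25 for both halves, and the least-squares estimators for a straight line. The helper names fit and log_likelihood are hypothetical.

```python
import numpy as np

S_FIXED = 4.0
N_SUBJECTS = 25  # each half averages 25 subjects
x = np.arange(1, 11, dtype=float)

# Averaged data copied from items 2b and 2c.
calibration = np.array([1.6125, 4.0449, 3.4481, 5.0825, 3.5781,
                        5.5630, 7.1484, 8.1263, 8.9737, 9.2795])
validation = np.array([2.4833, 1.2259, 3.1703, 2.9547, 6.3008,
                       5.6370, 8.2899, 9.0172, 9.6241, 9.1766])


def log_likelihood(y, b0, b1, n=N_SUBJECTS, s=S_FIXED):
    """Log likelihood from item 4a."""
    sse = np.sum((y - b0 - b1 * x) ** 2)
    return -n / 2 * np.log(2 * np.pi) - n / 2 * np.log(s**2) - sse / (2 * s**2)


def fit(y, fixed_b1=None):
    """Least-squares fit; pass fixed_b1 to hold the slope constant (Model 2)."""
    if fixed_b1 is None:
        dx = x - x.mean()
        b1 = np.sum(dx * (y - y.mean())) / np.sum(dx**2)
    else:
        b1 = fixed_b1
    b0 = y.mean() - b1 * x.mean()
    return b0, b1


results = {}
for name, fixed_b1 in [("Model 1", None), ("Model 2", 1.0)]:
    b0, b1 = fit(calibration, fixed_b1)  # parameters come from calibration only
    results[name] = {
        "b0": b0,
        "b1": b1,
        "calibration_logL": log_likelihood(calibration, b0, b1),
        "validation_logL": log_likelihood(validation, b0, b1),
    }

for name, r in results.items():
    print(name, r)
```

The key design point is that the validation data are never used for fitting, so a model that wins on the calibration set only by absorbing noise has no built-in advantage on the validation set.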