************************************************************; * Logistic Regression * in comparison with (Linear) Discriminant Analysis * * The idea here is that we have a trait (Y=1 or Y=0) whose * distribution depends on covariates X (in general vector-valued) * As in Linear Discriminant Analysis, we want to predict a rule * * p(Y=1|X) = p_X = g(a+bX) * * where a+bX stands for a linear regression term. * * Our first guess might be to generalize a classical regression * * Y = a + bX + error * * However, since Y=1 or Y=0, there is no sensible way to define * the error distribution. A second guess might be to view the * model for a classical regression as * * E(Y|X) = a + bX, Y is Normal(mu,sigma^2) for some mu * * That characterizes the Normal variable Y = Normal(a+bX,sigma^2) * in terms of a parameter E(Y|X)=a+bX, with sigma^2 to be estimated * as part of the parameter estimation. * * For Y=1 or Y=0, this suggests the model * * P(Y=1|X) = p_X = a + bX * * since P(Y=1) is the only parameter for a Bernoulli random * variable. However 0 < p_X < 1 while a+bX should be free to * roam along the entire real line. Setting * * P(Y=1|X) = p_X = exp(a + bX) * * at least predicts p_X > 0. As a further extension, * * p(Y=1|X) = p_X = exp(a+bX)/(1 + exp(a+bX)) (*) * * maps the entire real line (a+bX) into values of p, 0P(Y=0|X), which is equivalent to * * p(Y=1|X)/P(Y=0|X) = exp(a+bX) > 1 or * * log[p(Y=1|X)/P(Y=0|X)] = a+bX > 0 * * Thus a logistic regression leads to a linear discriminant function * exactly as in Fisher's method. * * The only difference between the two methods is that, with logistic * regression, we make no model assumptions (and no prior assumptions) * and estimate the hyperplane a+bX=0 from the likelihood (***). * * In contrast, Fisher's LDA assumes that the data for the two groups * are distributed as data from two different multivariate normal * distributions with the same covariance function, then (essentially) * uses maximum likelihood to estimate the parameters of the normal * distributions. * * Which method is superior would depend on the strength of your belief * that the covariate data for the two groups are normally * distributed. If in fact they are, then Fisher's method should * give better results. Conversely, if for example, most of the * covariates X are binary (i.e., X_a=0 or 1), then a logistic * regression might be better. * ************************************************************; title 'LOGISTIC REGRESSION - Ferns in two meadows'; title2 'TWO CLASSES with 4 covariates'; title3 'SEE PDISCRIM.SAS FOR VIEWS OF DATASET'; options ps=60 ls=75 pageno=1 nocenter; data ferns; input Subj y1-y4 @@; if Subj le 14 then do; yy=1; type='XX'; end; else do; yy=0; type='OO'; end; datalines; 1 57 66 18 34 15 52 61 32 79 2 45 95 60 75 16 25 64 33 97 3 36 91 68 66 17 27 59 56 113 4 33 97 45 93 18 47 64 41 108 5 54 75 39 49 19 68 89 55 75 6 45 83 47 65 20 37 79 36 83 7 52 84 45 57 21 46 55 54 71 8 40 74 42 72 22 47 48 31 61 9 51 96 35 64 23 37 50 36 65 10 49 89 51 72 24 34 67 56 66 11 63 82 34 66 25 33 82 50 93 12 48 100 28 57 26 30 67 41 95 13 52 99 28 72 27 67 111 44 93 14 52 84 41 95 28 51 62 44 72 29 61 72 43 101 30 30 49 54 81 31 31 80 43 86 32 47 73 50 80 33 40 65 51 78 34 36 88 36 90 35 35 73 47 87 36 44 56 41 57 run; proc sort; by Subj; run; proc print; title4 'THE DATA AS SAS SEES IT'; id Subj; * Don't repeat the observation number twice!; run; *****************************************************; * An oddity of `proc logistic' in SAS is that, by default, it models * the probility P(Y=0|X) and not P(Y=1|X). To reverse this and * model P(Y=1|X) instead, you need the option `descending'. The * only difference that this makes in the logistic function is * that the signs of the coefficients a and b are reversed. * (Exercise: Prove this.) * * Exactly as before, `proc logistic' finds y1 and y3 unconvincing * (that is, not significant) while y2 and y4 are significant * with effects of opposite sign. *****************************************************; proc logistic data=ferns descending outest=regdat; title3 'LOGISTIC REGRESSION FOR Y1-Y4'; model yy = y1-y4; run; *****************************************************; * The option `outest=regdat' writes the estimated regression * parameters to a SAS dataset. We can then use `proc iml' to do a * Resubstitution analysis and compare the results with the results * of the resubstitution analysis for LDA. We leave the more * demanding Crossvalidation analysis as an exercise in the use of * SAS macros and `proc iml'. *****************************************************; proc print data=regdat; title4 'OUTEST DATA (PARTIAL VARIABLE LIST)'; id _NAME_; var _TYPE_ Intercept y1-y4; run; *****************************************************; * A second dataset with data to be classified ; *****************************************************; data moredat; input Subj y1-y4; datalines; 1 34 99 15 50 2 45 70 49 86 3 50 82 51 104 4 46 119 43 79 ; title3 'NOW ENTERING PROC IML'; proc iml; * Define matrices and column vectors for use in proc iml; use ferns; read all var { y1 y2 y3 y4 } into ff; read all var { yy } into yy; read all var { Subj } into Subj; read all var { type } into type; nn=nrow(ff); * Number of observations; * Use `proc iml' operations to add an initial column of ones to ff; * In general, j(a,b) is an axb matrix composed of all ones; ff = j(nn,1) || ff; print "Fern Data and classification", ff " " type yy [format=3.]; * Since regdat has one record, beta is a row vector; use regdat; read all var { Intercept y1 y2 y3 y4 } into beta; print "Beta (coefficients) are", beta; print "Doing a Resubstitution Analysis:"; * For each observation X, define Ydot=a+bX and Pr(Y=1|X) * Note that most of the following operations are vector operations; ydot = ff*beta`; * ff is nx5, beta is 1x5, ydot is nx1; prex1 = exp(ydot)/(1+exp(ydot)); * prex1 is nx1; * Compare the vector yy (=1,0) with predicted values (ydot>0,ydot<0); * This gives a column vectors of errors (errflag[i]=1) and correctly * classified ferns (errflag[i]=0); * Recall that A#B means componentwise multiplication (not matrix * multiplication) of vectors or matrices; * (ydot>0) and (yy=0) evaluate to a column vector of 0s (for FALSE) * and 1s (for TRUE); errflag = (ydot>0)#(yy=0) + (ydot<0)#(yy=1); * Find the predicted type (XX or OO) from ydot[] ; ctype=type; do i=1 to nn; if (ydot[i]>0) then ctype[i]='XX'; else ctype[i]='OO'; end; * Multiply the subject numbers of errors by 100 to make them stand out; errname = (errflag # Subj) * 100; * Note that all of the following are nx1 column vectors ; print "Resubstitution analysis for coefficients:", " TYPE is the original type - CTYPE is the inferred type", " PREX1 is Logistic Prob(TYPE=XX)", Subj " " ydot [format=8.4] " " type prex1 [format=9.4] " " ctype errname; errsum=sum(errflag); print "Number of misclassifications: " errsum; * Now apply the derived rule to new data; use moredat; read all var { y1 y2 y3 y4 } into moredat; read all var { Subj } into Msubj; mm=nrow(moredat); moredat = j(mm,1) || moredat; * Add an initial column of 1s; mdot = moredat*beta`; mprex1 = exp(mdot)/(1+exp(mdot)); mtype=type[1:mm,]; do i=1 to mm; if (mdot[i]>0) then mtype[i]='XX'; else mtype[i]='OO'; end; mdat=moredat[1:mm,2:5]; print "Classification of test data:", Msubj " " mdat [format=4.0] " " mdot [format=7.4] mprex1 [format=9.4] " " mtype; quit;