************************************************************; * (Linear) Discriminant Analysis: * * Idea: We know that m=14 ferns in one meadow have a given property * (they contain a valuable antibiotic?) and n=22 ferns in another * are known NOT to have this property. We collect 4 measurements * Y1-Y4 on each of the 36 ferns. * * Given this data another fern, what is our best guess as to * which type the new fern belongs to, based on the values Y1-Y4 * for the new fern? * * In general, how can we find a good rule for classifying individuals * into two groups on the basis of covariates Y1-Y4, either for * understanding the the relationship between Y1-Y4 and a trait, * or in case we need to classify a new individual? * * We derive a rule using Bayesian methods: Bayesian procedures * begin with choosing a ``prior distribution'': In this case, * probabilities p1 for Type 1 and p2 for Type 2 such that p_1>0, * p_2>0, and p_1+p_2=1. These are our ``prior beliefs'' about the * likelihood of Types 1 and 2 in the absence of any observations. * * Possible ``prior belief'' probabilities in this case are * * (a) p_1 = p_2 = 1/2 (equal prior belief for the two * types, also called a ``uniform prior'') and * * (b) p_1 = m/(m+n), p_2 = n/(m+n) * (priors proportional to sample proportions) * * Prior (b) might be reasonable if we started with a random * sample of N=m+n ferns and then classified them using expert * advice. Since Type 1 appears in m ferns out of N, a guess for * the overall likelihood of Type 1 would be p_1=m/N. * * The second step in the Bayesian procedure is to choose probability * distributions to model the distribution of Y1-Y4 within each * sample. Let f_1(X) be the probability distribution of X=Y1-Y4 * for Type 1 and f_2(X) the probability distribution for Type 2. * * Now suppose we observe X=Y1-Y4 for a new fern. Bayes' rule * for conditional probabilities for events A and B is * * Pr(A|B) = Pr(A and B)/Pr(B) = Pr(B|A)Pr(A)/Pr(B) * * = Pr(B|A)Pr(A)/(Pr(B|A)P(A) + Pr(B|A^c)Pr(A^c)) * * for A^c={not A}, since Pr(B) = Pr(B and A) + Pr(B and A^c). * * In this case Pr(X|Type i)=f_i(X) and p_i=P(Type i), so by * Bayes' rule * * p_i(X) = Prob(Type i | X) = Prob(X and i)/Prob(X) * * = p_i f_(X)/(p_1 f_1(X) + p_2 f_2(X)) * * The probabilities p_i(X) given X are called the ``posterior * distribution'' of Type given the observation X. In general, * a Bayesian analyis is an analysis in which someone ``updates'' * a prior distribution (here p1,p2) to form a posterior * distribution (here p_1(X),p_2(X)) using Bayes' Rule, as above. * * Given covariates X=Y1-Y4 for a new fern, we decide the type of the * fern as the type that has the highest probability in the * posterior distribution. This is somewhat akin to the * ``preponderance of evidence'' rule in civil trials, as opposed * to the ``beyond the shadow of a doubt'' rule in criminal trials. * The latter is more like a decision based on a null hypothesis * H_0 and a P-value. * * Thus, we choose Type 1 if p_1(X) > p_2(X) and choose Type 2 if * p_1(2) > p_1(X). In other words, we choose Type 1 if and only if * p_1(X)/p_2(X) > 1. Since the expressions for p_i(X) above have * the same denominator, this is equivalent to * * (*) Choose Type 1 iff f_1(X)/f_2(X) > p_2/p_1 * * in terms of the within-group densities f_i(X). * * A SECOND WAY to derive the decision rule (*) is as follows. * A decision rule can be thought of as equivalent to a * function c(X) of X with 0 <= c(X) <= 1, with the * interpretation * (a) If c(X)=1, we choose Type 1 * (b) If c(X)=0, we choose Type 2 * (c) If 0 p_2 f_2(x) * * = 0 if p_2 f_2(x) > p_1 f_2(x) * * Thus finding the decision rule c(x) that makes Pr(Error) the * smallest is exactly equivalent to the decision rule based * on maximizing the Bayesian posterior probability. * * For continuous multivariate data X=Y1-Y4, the obvious choice for * f_1(X) and f_2(X) are multivariate Gaussian or multivariate normal * distributions. We write a multivariate Gaussian distribution as * N(mu,A) for a mean vector mu and covariance matrix A. Given * a sample of m individuals known to be Type 1 and a second sample * of n individuals known to be Type 2, a reasonable guess for the * mean vectors mu_1,mu_2 and covariance matrices A_1,A_2 of * f_1(X),f_2(X) are the sample mean vectors and sample covariance * matrices of the two samples. The sample covariance matrices * A_1,A_2 are 4x4 positive semidefinite matrices for Y1-Y4 based * on the 36 ferns (in this case). * * Two different ways to estimate A_1,A_2 are: * * POOLED COVARIANCE ESTIMATOR: Here * * A_1 = A_2 = S = pooled sample covariance matrix * * Define two within-type covariance sums as the 4x4 matrices * * (Su_1)_ab = Sum_j (Y_1j - Mu_1)_a (Y_1j - Mu_1)_b * * where Mu_1 is the sample mean of the Type 1 sample and * * (Su_2)_ab = Sum_j (Y_2j - Mu_2)_a (Y_2j - Mu_2)_b * * for the Type 2 sample. The pooled sample covariance matrix is * * S = (1/(m+n-2)(Su_1 + Su_2) * * This is analogous to the pooled covariance estimate in the * traditional two-sample t-test. * * Define * * (**) L_i(X) = X'S^{-1}Mu_i - 1/2 Mu_i'S^{-1}Mu_i + log p_i * * If f_i(X) are multivariate Gaussian densities with parameters * N(mu_i,A_i), then one can show that f_1(X)/f_2(X) > p_2/p_1 * if and only if * * L(X) = L_1(X) - L_2(X) > 0 * * That is, there exists a linear function * * L(X) = C0 + C1*Y1 + C2*Y2 + C3*Y3 + C4*Y4 * * such that Type 1 has the higher posterior probability) if and * only if L(X) > 0. * * The functions L(X), L_1(X), and L_2(X) are called * LINEAR DISCRIMINANT FUNCTIONS or LDFs, in this case for * distinguishing Type 1 from Type 2. In fact, * * L(X) = X'S^{-1}(Mu_1 - Mu_2) * + 1/2 Mu_2'S^{-1}Mu_2 - 1/2 Mu_1'S^{-1} Mu_1 * + log(p1/p2) * * Note that the prior probabilities (p1,p2) affect the * LDFs L(X),L_1(X),L_2(X) only in the constant term C0. * * UNPOOLED COVARIANCE ESTIMATORS: In this case, we assume that * the covariance matrix of f_i(X) is the sample covariance * matrix for the observations in Sample i. That is, f_1(X) * is N(mu_1,A_1) and f_2(X) is N(mu_2,A_2), where mu_1,mu_2 * are the sample means Mu_1,Mu_2 and * * A_1 = S_1 = (1/(m-1))Su_1, A_2 = S_2 = (1/(n-1))Su_2 * * Now set * * L_i(X) = - 1/2 X'S_i^{-1}X * + X'S_i^{-1}Mu_i - 1/2 Mu_i'S_i^{-1}Mu_i + log p_i * * Then, again, (*) is equivalent to L(X) = L_1(X)-L_2(X) > 0. * Now L(X) is a QUADRATIC function of X, which is then called * a QUADRATIC DISCRIMINANT FUNCTION (QDF). * * TESTING A DISCRIMINANT FUNCTION OR CLASSIFICATION PROCEDURE: * * In either case, the basic idea is that we use the 36=14+22 * observations whose types are known as a ``training set'' * to determine the discriminant function or classification * run and then (perhaps) apply it to classify future * observations. If we have several additional ferns whose * types are known that were not used in the training set, * we can use them as a ``test set'' to help validate the * rule before applying it to new, unknown ferns. * * If we haven't set aside a test data set, there are two * possible procedures that we can carry out to test the * discrimination rule just using the training data set: * * (i) apply the rule to the 36 known ferns and see how many * of the known ferns are correctly classified. * * This is called a (simple) ``RESUBSTITUTION ANALYSIS''. For a * more stringent test, we can * * (ii) for each of the 36 known ferns, REDERIVE the rule for * the 35 ferns other than that fern, and then see how often * the left-out fern is correctly classified, using the rule * that doesn't use that fern. * * This is called a ``CROSSVALIDATION ANALYSIS''. Since the * fern that is being checked is not allowed to influence the * test rule in its favor, (ii) is a fairer (and more * conservative) test. * * K TYPES INSTEAD OF 2 TYPES * * If we had k > 2 types instead of 2 types, then the decision * rule would still be to choose the type that is most probable * in the posterior distribution. This means choose Type i if * * (***) p_i(X) = max(j) p_j(X) * * If densities f_i(X) are defined using joint-Gaussian densities * and POOLED covariances, and if the single-type LDFs L_i(X) * are defined as in (**), then (***) is equivalent to * * L_i(X) = max(j) L_j(X) * * If there are k=2 types, then this rule is equivalent to * choosing Type 1 if L(X) = L_1(X) - L_2(X) > 0. * * USING SAS TO DO DISCRIMINANT ANALYSIS: * * The basic SAS procedure for discriminant analysis is called * `proc discrim'. The default is to assume joint-Gaussian * distributions within each type, using sample means for the * mean vectors and pooled covariances, with a uniform prior. * * The coefficients for the within-type LDFs (**) are in the SAS * output. For k=2 states, two discriminant functions will be * listed in the SAS output, and you have to subtract the * coefficients yourself if you want a single LDF L(X). * * In the following SAS program, we * * (i) define data sets for the data and also for four other * observations (`data moredat') that we want to apply the * derived rule to, * * (ii) carry out several procedures to get an idea of the * distribution of Y1-Y4 among the two groups. The covariates * Y2 and Y4 seem to distinguish the groups the strongest, * with Type 1 high for Y2 and low for Y4. * * (iii) apply PROC DISCRIM to the main data, using both * resubstitution and crossvalidation to test the rule. There * are 4 errors in the resubstition analysis and 6 errors in * the crossvalidation analysis. We also apply the rule to the * four additional observations in the data set ``moredat''. * * (iv) apply ``proc discrim' again with Y2 and Y4 only and * see if the resubstitution and crossvalidation results are * the same. There are 4 resubstitution errors (as before) and * also 4 crossvalidation errors. Thus, using fewer covariates * leads to fewer crossvalidation errors! * * This suggests that Y1 and Y3 did more harm than good in the * original analysis, presumably by adding noise to the LDFs. * * NOTE: SAS has a procedure to do a `stepwise' discriminant * analysis: This starts with no covariates and then adds the * covariate that has the highest relative P-value for a * multivariate analog of a regression of type on Y1-Y4, as * long as that P-value is significant. It then looks for the * best remaining covariate in the sense of highest relative * P-value, as long as that is significant. At any stage, * all covariates that are being kept are check for their * relative P-values, and are dropped if these are not * significant. When the procedure has stopped, there are * fewer but presumably more decisive covariates than before. * The syntax for this stepwise method is * * proc stepdisc data=ferns; * title2 'STEPWISE DISCRIMINANT ANALYSIS'; * class type; * var y1-y4; * run; * * In this case, `proc stepdisc', starting from Y1-Y4, chooses * Y2 and then Y4 and then stops. This means that SAS thinks * that Y2 and Y4 are sufficient. As mentioned above, the * discriminant analysis with fewer variables made fewer * errors in the crossvalidation analysis, and so provided * a better discrimination rule. * ************************************************************; title 'DISCRIMINANT ANALYSIS - Ferns in two meadows'; title2 'TWO CLASSES with 4 covariates'; options ps=60 ls=75 pageno=1 nocenter; data ferns; input subj y1-y4 @@; if subj le 14 then type='XX'; else type='OO'; datalines; 1 57 66 18 34 15 52 61 32 79 2 45 95 60 75 16 25 64 33 97 3 36 91 68 66 17 27 59 56 113 4 33 97 45 93 18 47 64 41 108 5 54 75 39 49 19 68 89 55 75 6 45 83 47 65 20 37 79 36 83 7 52 84 45 57 21 46 55 54 71 8 40 74 42 72 22 47 48 31 61 9 51 96 35 64 23 37 50 36 65 10 49 89 51 72 24 34 67 56 66 11 63 82 34 66 25 33 82 50 93 12 48 100 28 57 26 30 67 41 95 13 52 99 28 72 27 67 111 44 93 14 52 84 41 95 28 51 62 44 72 29 61 72 43 101 30 30 49 54 81 31 31 80 43 86 32 47 73 50 80 33 40 65 51 78 34 36 88 36 90 35 35 73 47 87 36 44 56 41 57 run; * Make sure that rows are sorted by subject; proc sort; by subj; run; **********************************************; * First, some preliminary pictures of the data:; **********************************************; proc means mean stddev; title2 'A FIRST LOOK AT THE DATA'; title3 'NOTE THAT THE MEAN DIFFERENCES ARE GREATEST FOR Y2 and Y4'; class type; var y1-y4; run; proc plot; title3 'TYPE 1 SEEMS TO HAVE HIGH Y2 and LOW Y4'; plot y2*y4 = type / vpos=30; run; proc ttest data=ferns; title2 'TWO-SAMPLE T-TESTS'; title3 'ONLY Y2 and Y4 ARE SIGNIFICANTLY DIFFERENT'; class type; var y1-y4; run; **********************************************; * Another dataset with test data **********************************************; data moredat; input subj y1-y4; datalines; 1 34 99 15 50 2 45 70 49 86 3 50 82 51 104 4 46 119 43 79 ; ************************************************************; * Run `proc discrim'. The options are: * data=ferns: Use the main SAS data set, not moredat. * list: Show posterior probabilities for all ferns in the * resubstitution analysis * listerr: The above only for misclassified ferns * crossvalidate: Check the LDF by crossvalidation * crosslisterr: Show posterior probabilities for ferns * misclassified in the crossvalidation analysis * testdata=: Apply the LDF to additional observations * testlist: Like `list' for the test data ************************************************************; ************************************************************; * By default, `proc discrim' uses a uniform prior for type, * or p_1 = p_2 = 1/2 in this case. To specify * proportional priors, say ``priors proportional'' within * the body of the procedure. ************************************************************; proc discrim data=ferns list crossvalidate crosslisterr testdata=moredat testlist; title2 'LINEAR DISCRIMINANT ANALYSIS'; class type; var y1-y4; run; ************************************************************; * Recall that only Y2 and Y4 were significantly different * using two-sample t tests. ************************************************************; proc discrim data=ferns listerr crossvalidate crosslisterr; title2 'LINEAR DISCRIMINANT ANALYSIS USING ONLY Y2 AND Y4'; title3 'THERE ARE FEWER CROSSVALIDATION ERRORS!'; class type; var y2 y4; run;