************************************************************; * (Linear) Discriminant Analysis: * * Idea: Suppose that we know that m=14 particular ferns have a given * property (they contain a valuable antibiotic?) and n=22 * different ferns are known NOT to have this property. We collect * 4 measurements Y1-Y4 on each of the 14+22=36 ferns. * * Given another fern, what is our best guess as to which of the * two types of fern the new fern belongs to, based on the values * Y1-Y4 for the new fern? (In other words, what is the best way * to CLASSIFY a new fern on the basis of these measurements?) * * In general, how can we find a good rule for classifying * individuals into two groups on the basis of covariates Y1-Y4, * either for understanding the the relationship between Y1-Y4 and * a trait, or in case we need to classify a new individual? * * We derive a rule using BAYESIAN methods: In general, BAYESIAN * procedures begin with choosing a ``prior distribution'' that * a new, unknown fern belongs to one of the two classes, here * p1 for Type 1 and p2 for Type 2, such that p_1>0, p_2>0, and * p_1+p_2=1. These are our ``prior beliefs'' about the * likelihood of Types 1 and 2 in the absence of any data. * Possible ``prior belief'' probabilities in this case are * * (a) p_1 = p_2 = 1/2 (equal prior belief for the two * types, also called a ``uniform prior'') and * * (b) p_1 = m/(m+n), p_2 = n/(m+n) * (priors proportional to sample proportions) * * Prior (b) might be reasonable if we started with a random * sample of N=m+n ferns and then classified them using expert * advice. Since Type 1 appears in m ferns out of N, a basic * guess for the overall likelihood of Type 1 would be p_1=m/N. * * The second step in the Bayesian procedure is to have probability * distributions for X=(Y_1,Y_2,Y_3,Y_4) for each sample. * Let f_1(X) be the probability distribution of X=Y1-Y4 * for Type 1 and f_2(X) the probability distribution for Type 2. * * Now suppose we observe X=Y1-Y4 for a new fern. BAYES' RULE * for conditional probabilities for events A and B is * * Pr(A|B) = Pr(A and B)/Pr(B) = Pr(B|A)Pr(A)/Pr(B) * * = Pr(B|A)Pr(A) / (Pr(B|A)P(A) + Pr(B|A^c)Pr(A^c)) * * for A^c={not A}, since Pr(B) = Pr(B and A) + Pr(B and A^c). * * In this case Pr(X|Type i)=f_i(X) and p_i=P(Type i), so by * Bayes' rule, assuming that X is known, * * (*) p_i(X) = Prob(Type i | X) = Prob(X and i)/Prob(X) * * = p_i f_i(X) / (p_1 f_1(X) + p_2 f_2(X)) * * = p_i f_i(X) / f_T(X), f_T(X)=p_1f_1(X)+p_2f_2(X) * * The probabilities p_i(X) given X are called the ``posterior * distribution'' of Types=1,2 given the observation X. * * Given the posterior probabilities p_1(X),p_2(X), we choose Type 1 * if p_1(X) > p_2(X) and choose Type 2 if p_2(X) > p_1(X). That is, * given X, we choose that Type with the highest posterior * probability. This is somewhat akin to the ``preponderance * of evidence'' rule in civil trials, as opposed to the ``beyond * the shadow of a doubt'' rule in criminal trials. The latter is * more like a decision based on a null hypothesis H_0 and a P-value. * * This means that we choose Type 1 if and only if p_1(X)/p_2(X) > 1. * Since the expressions in (*) have the same denominator f_T(X), * this is equivalent to * * (**) Choose Type 1 iff f_1(X)/f_2(X) > p_2/p_1 * * in terms of the within-group densities f_i(X). * * A SECOND WAY to derive the DECISION RULE (**) is as follows. * A decision rule can be thought of as equivalent to a function * c(X) of X with 0 <= c(X) <= 1, with the interpretation * (a) If c(X)=1, we choose Type 1 * (b) If c(X)=0, we choose Type 2 * (c) If 0 p_2 f_2(x) * * = 0 if p_2 f_2(x) > p_1 f_2(x) * * Thus finding the decision rule c(x) that makes Pr(Error) the * smallest is exactly equivalent to the decision rule based * on maximizing the Bayesian posterior probability. * * CONTINUOUS MULTIVARIATE DATA: For multivariate data X=Y1-Y4, * an obvious choices for f_1(X) and f_2(X) are multivariate * normal distributions. Specifically, assume that X for Type 1 * is N(Mu_1,Sig_1) and X for Type 2 is N(Mu_2,Sig_2). * * We estimate the means Mu_1,Mu_2 by the sample means of X=Y1-Y4 * from the two samples. Two different ways to estimate * Sig_1,Sig_2 are: * * POOLED COVARIANCE ESTIMATOR: Here we estimate * * Sig_1 = Sig_2 = S = pooled sample covariance matrix * * If * * (Su_1)_ab = Sum(i=1,m) (Y_1i - Mu_1)_a (Y_1i - Mu_1)_b * * (Su_2)_ab = Sum(j=1,n) (Y_2j - Mu_2)_a (Y_2j - Mu_2)_b * * then pooled sample covariance matrix is * * S = (1/(m+n-2)(Su_1 + Su_2) * * This the same as the pooled covariance matrix in the * two-sample Hotelling T^2 test. * * The ratio of the Gaussian densities f_1(X)/f_2(X) can then * be expressed in terms of Mu_1,Mu_2, and S, which can be * written as follows. * * Define * * (***) L_i(X) = X'S^{-1}Mu_i - 1/2 Mu_i'S^{-1}Mu_i + log p_i * * where Mu_1 is the sample mean for the initial Type 1 individuals * and Mu_2 is the sample mean for the Type 2 individuals. * If f_i(X) are multivariate Gaussian densities with parameters * N(mu_i,Sig_i), then one can show that f_1(X)/f_2(X) > p_2/p_1 * if and only if * * L(X) = L_1(X) - L_2(X) > 0 * * That is, there exists a linear function * * L(X) = C0 + C1*Y1 + C2*Y2 + C3*Y3 + C4*Y4 * * such that Type 1 has the higher posterior probability) if and * only if L(X) > 0. * * Note that the set of X for which we choose Type 1 is then the * half-space L(X)>0. However, if we are trying to distinguish * three or more Types, it is more useful to consider (***). * * The functions L(X), L_1(X), and L_2(X) are called * LINEAR DISCRIMINANT FUNCTIONS or LDFs, in this case for * distinguishing Type 1 from Type 2. In fact, * * L(X) = X'S^{-1}(Mu_1 - Mu_2) * + 1/2 Mu_2'S^{-1}Mu_2 - 1/2 Mu_1'S^{-1} Mu_1 * + log(p1/p2) * * Note that the prior probabilities (p1,p2) affect the * LDFs L(X),L_1(X),L_2(X) only in the constant term C0. * * UNPOOLED COVARIANCE ESTIMATORS: In this case, we assume that * the covariance matrix of f_i(X) is the sample covariance * matrix for the observations in Sample i. That is, f_1(X) * is N(mu_1,Sig_1) and f_2(X) is N(mu_2,Sig_2), where mu_1,mu_2 * are the sample means Mu_1,Mu_2 and * * Sig_1 = S_1 = (1/(m-1))Su_1, Sig_2 = S_2 = (1/(n-1))Su_2 * * Now set * * L_i(X) = - 1/2 X'S_i^{-1}X * + X'S_i^{-1}Mu_i - 1/2 Mu_i'S_i^{-1}Mu_i + log p_i * * Then, again, (***) is equivalent to L(X) = L_1(X)-L_2(X) > 0. * Now L(X) is a QUADRATIC function of X, which is then called * a QUADRATIC DISCRIMINANT FUNCTION (QDF). In this case, the * set of X for which we choose Type 1 can be a fairly general * region bounded by higher-dimensional ellipses, parabolas, * or hyperbolas. * * TRAINING SETS AND STATISTICAL LEARNING: * * In either case, the basic idea is that we use the 36=14+22 * observations whose types are known as a ``training set'' * to determine the discriminant function or classification * rule and then (perhaps) apply it to classify future * observations. Before proceeding to classify real data, * we should first think about how reliable our rule might be. * * If we have several additional ferns whose types are known that * were not used in the training set, we can use them as a * ``test set'' to help validate the rule before applying it to * new, unknown ferns. * * If we haven't set aside a test data set, there are two * possible procedures that we can carry out to test the * discrimination rule just using the training data set: * * (i) apply the rule to the 36 known ferns and see how many * of the known ferns are correctly classified. * * This is called a (simple) ``RESUBSTITUTION ANALYSIS''. For a * more stringent test, we can * * (ii) for each of the 36 known ferns, REDERIVE the rule for * the 35 ferns other than that fern, and then see how often * the left-out fern is correctly classified, using the rule * that doesn't use that fern. * * It is more typical to split up the original data into around * 10 subsets, and see how often ferns are correctly classified * after rederiving the rule for that subset being dropped. * * These are called ``CROSSVALIDATION ANALYSES''. Since the * fern that is being checked is not allowed to influence the * test rule in its favor, (ii) is a fairer (and more * conservative) test. * * K TYPES INSTEAD OF 2 TYPES * * If we had k > 2 types instead of 2 types, then the decision * rule would still be to choose the type that is most probable * in the posterior distribution. This means choose Type i if * * (****) p_i(X) = max(j) p_j(X) * * If densities f_i(X) are defined using joint-Gaussian densities * and POOLED covariances, and if the single-type LDFs L_i(X) * are defined as in (***), then (****) is equivalent to * * L_i(X) = max(j) L_j(X) * * If there are k=2 types, then this rule is equivalent to * choosing Type 1 if L(X) = L_1(X) - L_2(X) > 0. * * USING SAS TO DO DISCRIMINANT ANALYSIS: * * The basic SAS procedure for discriminant analysis is called * `proc discrim'. The default is to assume joint-Gaussian * distributions within each type, using sample means for the * mean vectors and pooled covariances, with a uniform prior. * * The coefficients for the within-type LDFs (**) are in the SAS * output. For k=2 states, two discriminant functions will be * listed in the SAS output, and you have to subtract the * coefficients yourself if you want a single LDF L(X). * * In the following SAS program, we * * (i) define data sets for the data and also for four other * observations (`data moredat') that we want to apply the * derived rule to, * * (ii) carry out several procedures to get an idea of the * distribution of Y1-Y4 among the two groups. The covariates * Y2 and Y4 seem to distinguish the groups the strongest, * with Type 1 high for Y2 and low for Y4. * * (iii) apply PROC DISCRIM to the main data, using both * resubstitution and crossvalidation to test the rule. There * are 4 errors in the resubstition analysis and 6 errors in * the crossvalidation analysis. We also apply the rule to the * four additional observations in the data set ``moredat''. * * (iv) apply ``proc discrim' again with Y2 and Y4 only and * see if the resubstitution and crossvalidation results are * the same. There are 4 resubstitution errors (as before) and * also 4 crossvalidation errors. Thus, using fewer covariates * leads to fewer crossvalidation errors! * * This suggests that Y1 and Y3 did more harm than good in the * original analysis, presumably by adding noise to the LDFs. * * NOTE: SAS has a procedure to do a `stepwise' discriminant * analysis: This starts with no covariates and then adds the * covariate that has the highest relative P-value for a * multivariate analog of a regression of type on Y1-Y4, as * long as that P-value is significant. It then looks for the * best remaining covariate in the sense of highest relative * P-value, as long as that is significant. At any stage, * all covariates that are being kept are check for their * relative P-values, and are dropped if these are not * significant. When the procedure has stopped, there are * fewer but presumably more decisive covariates than before. * The syntax for this stepwise method is * * proc stepdisc data=ferns; * title2 'STEPWISE DISCRIMINANT ANALYSIS'; * class type; * var y1-y4; * run; * * In this case, `proc stepdisc', starting from Y1-Y4, chooses * Y2 and then Y4 and then stops. This means that SAS thinks * that Y2 and Y4 are sufficient. As mentioned above, the * discriminant analysis with fewer variables made fewer * errors in the crossvalidation analysis, and so provided * a better discrimination rule. * ************************************************************; title 'DISCRIMINANT ANALYSIS - Ferns in two meadows'; title2 'TWO CLASSES with 4 covariates'; options ps=60 ls=75 pageno=1 nocenter; ************************************************************; * SAS Note: * y1-y4 is the same as y1 y2 y3 y4 in SAS statements ************************************************************; data ferns; input Subj Type y1-y4; datalines; 1 1 57 66 18 34 2 1 45 95 60 75 3 1 36 91 68 66 4 1 33 97 45 93 5 1 54 75 39 49 6 1 45 83 47 65 7 1 52 84 45 57 8 1 40 74 42 72 9 1 51 96 35 64 10 1 49 89 51 72 11 1 63 82 34 66 12 1 48 100 28 57 13 1 52 99 28 72 14 1 52 84 41 95 15 2 52 61 32 79 16 2 25 64 33 97 17 2 27 59 56 113 18 2 47 64 41 108 19 2 68 89 55 75 20 2 37 79 36 83 21 2 46 55 54 71 22 2 47 48 31 61 23 2 37 50 36 65 24 2 34 67 56 66 25 2 33 82 50 93 26 2 30 67 41 95 27 2 67 111 44 93 28 2 51 62 44 72 29 2 61 72 43 101 30 2 30 49 54 81 31 2 31 80 43 86 32 2 47 73 50 80 33 2 40 65 51 78 34 2 36 88 36 90 35 2 35 73 47 87 36 2 44 56 41 57 run; proc print data=ferns; title3 'THE DATA AS SAS SEES IT'; id Subj; run; **********************************************; * First, some preliminary pictures of the data:; **********************************************; proc means mean stddev; title3 'A FIRST LOOK AT THE DATA'; title4 'NOTE THAT THE MEAN DIFFERENCES ARE GREATEST FOR Y2 and Y4'; class Type; var y1-y4; run; proc ttest data=ferns; title3 'TWO-SAMPLE T-TESTS'; title4 'ONLY Y2 and Y4 ARE SIGNIFICANTLY DIFFERENT'; class Type; var y1-y4; run; proc plot; title3 'A PLOT FOR Y2 Vs Y4:'; title4 'TYPE 1 SEEMS TO HAVE HIGH Y2 and LOW Y4'; plot y2*y4 = Type / vpos=30; run; **********************************************; * A dataset with 4 new ferns to test **********************************************; data moredat; input Subj y1-y4; datalines; 1 34 99 15 50 2 45 70 49 86 3 50 82 51 104 4 46 119 43 79 ; ************************************************************; * Run SAS's `proc discrim'. The options are: * data=ferns: Use the main SAS data set, not moredat. * list: Show posterior probabilities for all ferns in the * resubstitution analysis * listerr: The above only for misclassified ferns * crossvalidate: Check the LDF by crossvalidation * crosslisterr: Show posterior probabilities for ferns * misclassified in the crossvalidation analysis * testdata=: Apply the LDF to additional observations * testlist: Like `list' for the test data ************************************************************; ************************************************************; * By default, `proc discrim' uses a uniform prior for type, * or p_1 = p_2 = 1/2 in this case. To specify * proportional priors, say ``priors proportional'' within * the body of the procedure. ************************************************************; proc discrim data=ferns list crossvalidate crosslisterr testdata=moredat testlist; title3 'LINEAR DISCRIMINANT ANALYSIS'; class type; var y1-y4; run; ************************************************************; * Recall that only Y2 and Y4 were significantly different * using two-sample t tests. ************************************************************; proc discrim data=ferns listerr crossvalidate crosslisterr; title3 'LINEAR DISCRIMINANT ANALYSIS USING ONLY Y2 AND Y4'; title4 'THERE ARE FEWER CROSSVALIDATION ERRORS!'; class type; var y2 y4; run; ************************************************************; * Plot Type 1, Type 2, and `Moredat' records on the same Y2,Y4 plot * First, combine the two datasets: ************************************************************; data jointdat; * `Jointdat' has 36 Type 1 and 2 records * followed by 4 Mordat records; set ferns moredat; * Mordat records have no Type: SAS sets Type=. * (meaning unassigned) for these records. Change to Type=3; if Type=. then Type=3; run; proc print; title3 'THE JOINT DATA AS SAS SEES IT:'; run; ************************************************************; * Now do the plot: ************************************************************; proc plot; title3 'A PLOT FOR Y2 Vs Y4:'; title4 'WHERE DO THE POINTS WITH TYPE=3 BELONG?'; plot y2*y4 = Type / vpos=30; run;